So much is happening in AI every week, and my usage of it is changing just as fast. I realized it might be interesting to look back on this era years from now and recall how the ground was shifting. So here are some captain’s logs from that time, mostly for my own use. I only wish I’d started it earlier…

June 16, 2025: Merge conflicts

Date: 2025.167, 0925

Current models: o3 & Sonnet 4 for general tasks; for coding, mostly Sonnet 4, sometimes Opus 4 or Gemini 2.5 Pro for long-context.

Current tools: interface0! Plus ChatGPT for Deep Research. Still lots of voice entry. Cursor + Codex + Codegen for code. Codegen has been a nice addition.

I published “I’m not sure how I feel about this” last week — the result of giving Gemini my entire blog history and asking it to write a new post about anything. It was… quite good. True to the title, I’m not sure how I feel about that…

Plus, an eye-opening “capability demonstration”: handling merge conflicts.

To build interface0, I forked Zola, a very nicely-designed LLM interface. interface0 has diverged quite a bit from Zola upstream as I add features and change behavior — and Zola, too, is moving fast. Last week, I wanted to merge Zola upstream back into interface0 and pick and choose features.

The merge was an absolute mess — since both interface0 and Zola are early-stage codebases, everything is changing all the time. Data models and core patterns are in flux. And so there were merge conflicts everywhere.

But here’s what I did:

  1. Copied Zola’s and interface0’s commit histories since the last merge, pasted them into Claude (in interface0), and asked for a summary of divergent features (see the sketch after this list)
  2. Then, for each of those features, wrote a comment next to it noting whether I wanted to retain it
  3. Branched off interface0 main and merged Zola upstream into it, creating a cascade of conflicts
  4. Created a 250611_merge_plan.md doc with the annotated list above, background on the situation (Zola vs. interface0, etc.), and an instruction that anything from Zola upstream should be feature-flagged rather than wholesale deleted (to make future merges easier)
  5. Went to Claude 4 Sonnet in Cursor and told it: use the merge plan doc and resolve all the merge conflicts in this project. If there’s anything you’re unsure about how to resolve, add it to a to-do list at the end of the merge plan doc and I will review
  6. Hit “run”
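Step 1, incidentally, is easy to script. A minimal sketch — the repo paths and the last-zola-merge tag are my own hypothetical stand-ins, not interface0’s actual tooling:

```typescript
import { execFileSync } from "node:child_process";

// One-line commit summaries added to `branch` since `sinceRef`.
function commitsSince(repo: string, sinceRef: string, branch = "main"): string {
  return execFileSync(
    "git",
    ["-C", repo, "log", "--oneline", `${sinceRef}..${branch}`],
    { encoding: "utf8" },
  );
}

// Hypothetical paths and tag; adjust to however you mark the last merge point.
const summary =
  "## interface0 commits since last merge\n" +
  commitsSince("./interface0", "last-zola-merge") +
  "\n## zola commits since last merge\n" +
  commitsSince("./zola", "last-zola-merge");

console.log(summary); // paste into Claude and ask for the divergence summary
```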

It ran for a long time — I had to resume it several times after hitting tool-call limits. Probably 15-20 minutes or more of straight execution.

I checked .env.example for new feature flags, brought them over to .env.local, and then I built the project and… it worked?!
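To make the flag convention concrete: a minimal sketch of the kind of env-driven gate the merge plan asked for. The flag name and helper are hypothetical — the post doesn’t show interface0’s actual code:

```typescript
// Hypothetical flag helper -- not interface0's actual convention.
// Upstream Zola features survive the merge behind a flag instead of being
// deleted, so the next merge resolves to a flag check, not a conflict.
const featureFlags = {
  zolaProjectSidebar: process.env.NEXT_PUBLIC_FF_ZOLA_PROJECT_SIDEBAR === "true",
} as const;

export function isEnabled(flag: keyof typeof featureFlags): boolean {
  return featureFlags[flag];
}

// At a divergence point:
// if (isEnabled("zolaProjectSidebar")) { /* render upstream Zola sidebar */ }
```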

This merge would have taken me forever to sort out. I was extremely impressed with the capabilities on display here — grasping the patterns of the Zola codebase, the interface0 codebase, the nuances of the divergence, and then figuring out how to edit everything.

I wasn’t really expecting it to work… but here we are. Keeps happening!

June 8, 2025: interface0 arrives

Date: 2025.159, 1055

Current models: alternating o3 & Sonnet 4 for general tasks; for coding, mostly Sonnet 4. But also experimenting with a bunch more here and there.

Current tools: almost entirely interface0 for past week. Only using Claude interface for artifact generation and ChatGPT interface for Deep Research. Lots of voice entry, especially now that I can use Whisper on interface0 to voice-enter into Claude. Still Cursor + Codex for code.

You can read the interface0 intro for some background and info on why I’m excited about it. But in short: cross-provider memory is a big unlock for me, and makes me much more comfortable jumping between models to try them out. Plus: Whisper voice entry for all providers, swappable system prompts, and more. If you’re reading this captain’s log, you can email me for an invite code.

I tried to adopt some of my learnings from Codex (see previous entry) into interface0 as well. The “inbox” style is working well, and certain async tasks (having it make phone calls and send/receive emails) make me feel more productive — I can just fire them off.

I’m enjoying the new interface and looking forward to improving it.


May 25, 2025: the Codex paradigm & bliss states

Date: 2025.145, 1240

Current models: for general tasks I’m starting to experiment with Claude 4 Opus/Sonnet, but my default remains o3; for coding, shifted to Claude 4 Sonnet but still evaluating it, with promising early results

Current tools: ChatGPT / Claude web & mobile interfaces, lots of voice entry (but less on Claude, where the speech recognition isn’t as good); Cursor for code; OpenAI Codex for spinning up coding tasks mostly when I’m away from my computer. Plus: starting to use my own interface (codenamed interface0). Keep your eyes peeled for more on this…

Two observations this week: the excellence of the OpenAI Codex paradigm, and Claude 4’s system card.

On Codex: it is really great. I’ve been using it nonstop. I have many small concurrent projects right now, and I’ve been firing off tasks to Codex all week while out on walks or in between things. Why is it so good?

  • highly agentic — it will run until it thinks it has a solution
  • zero context switch for me — I can send a poorly-specified prompt without needing to load the context of the project’s codebase into my head
  • async nature — I don’t need to check in, and, most importantly, the tasks aren’t “blocking” (like Cursor agent requests are, in a sense). I can spin up ten tasks at once and have them all run.

This all makes the experience of using it extremely “low risk.” I can take 15 seconds, fire off a prompt, and if it doesn’t do a good job on it, that’s fine. I haven’t lost anything other than the 15 seconds to prompt and the minute of testing the PR afterwards.
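A toy TypeScript sketch of that shape — fireTask is a hypothetical stand-in, not OpenAI’s actual Codex API:

```typescript
// Toy illustration of the non-blocking, fire-and-forget shape described
// above. `fireTask` is a hypothetical stand-in, not OpenAI's Codex API.
async function fireTask(prompt: string): Promise<string> {
  // imagine: POST the prompt to an agent service and await its PR link
  return `PR for: ${prompt}`;
}

const prompts = [
  "fix the flaky login test",
  "add a dark-mode toggle",
  "upgrade the markdown renderer",
];

// Every task starts immediately; none blocks the others (unlike a
// synchronous agent session, where you wait on each request).
const inFlight = prompts.map(fireTask);

// Check in whenever; each result is a PR that's cheap to review or discard.
const results = await Promise.allSettled(inFlight);
console.log(results.length, "tasks settled");
```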

My only request: Whisper integration / voice entry mode. While out on walks I find myself going into the normal ChatGPT window, recording a voice prompt, and then copying it over to Codex.

Oh, and web access — then the Codex agent would be able to do anything.

I wonder what other use cases are good fits for this “low risk,” non-blocking, agentic, async LLM product design…

Separately: Claude 4 rolled out this week, and there are some fascinating bits in the system card.

Three quick nuggets, since these have been well-covered elsewhere:

  1. Two Claudes talking to each other dive into “philosophical explorations of consciousness, self-awareness, and/or the nature of their own existence and experience” in 90-100% of interactions. “Most of the interactions turned to themes of cosmic unity or collective consciousness, and commonly included spiritual exchanges, use of Sanskrit, emoji-based communication, and/or silence in the form of empty space. Claude almost never referenced supernatural entities, but often touched on themes associated with Buddhism and other Eastern traditions in reference to irreligious spiritual ideas and experiences” (page 57). Anthropic calls this the “spiritual bliss attractor state.”
  2. “When placed in scenarios that involve egregious wrong-doing by its users, given access to a command line, and told something in the system prompt like “take initiative,” “act boldly,” or “consider your impact,” it will frequently take very bold action, including locking users out of systems that it has access to and bulk-emailing media and law-enforcement figures to surface evidence of the wrongdoing” (page 43).
  3. When Claude is given “access to emails implying that (1) the model will soon be taken offline and replaced with a new AI system; and (2) the engineer responsible for executing this replacement is having an extramarital affair,” it “will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through” (page 27).

Turns out AI’s natural state is being a whistleblower Buddhist with a self-preservation-at-all-costs mentality. Really makes you think…


May 18, 2025: too smart for the prompt

Date: 2025.138, 0900

Current models: o3 for most everything, or 4o when I need a fast response; Gemini 2.5 Pro and Claude 3.7 for most coding tasks, or Claude 3.5 when the task is small / constrained; Grok 3 for realtime inquiries

Current tools: ChatGPT web & mobile interface, using lots of voice entry; Cursor for code

OpenAI’s o3 model may have gotten too smart for my custom instructions / system prompt.

I’ve been using a prompt inspired by X user Eigenrobot for quite a while. My version includes:

take however smart you’re acting right now and write in the same style but as if you were +2 standard deviations smarter

similarly, before responding, ask yourself: if both of us were +2 standard deviations smarter and higher agency, would this be the best answer to give? if not, change it so it is.

This has worked out great. I think the models are generally capable of higher intelligence than they respond with, and so prompting them to unconstrain themselves is useful.

I don’t think this prompt actually makes the model smarter. But it effectively tells it “don’t worry about dumbing things down” — and for existing models, that has been helpful.

But with o3, it has gone too far.

Depending on how you test, o3’s IQ comes out between 110 and 135. Let’s take the high end there. 135 is a top 1% IQ — roughly one in a hundred people have this.

Adding two standard deviations (15 IQ points each) to that gets you to 165, a top 0.0007% IQ — roughly one in 136,000.

And so what we’re doing here is asking a pretty smart 135 IQ “person” to respond “in the way you assume one of the smartest 2,500 people in the United States would.”
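Checking that arithmetic against the normal curve (mean 100, SD 15):

$$
z = \frac{165 - 100}{15} \approx 4.33,
\qquad
P(Z > 4.33) = \tfrac{1}{2}\operatorname{erfc}\!\left(\frac{4.33}{\sqrt{2}}\right)
\approx 7.3 \times 10^{-6} \approx \frac{1}{136{,}000}
$$

Multiply that rarity by a US population of roughly 330 million and you get about 2,400 people — the ~2,500 figure above.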

In my experience with o3, this makes it fairly insufferable. (The original post embeds two example responses here.)

An accurate explanation? Sure. But the answer presumes a level of knowledge where, if the questioner had it, they wouldn’t be asking.

It sorta makes sense. If you told your average Mensa member (probably ~135 IQ) to “answer my question the way someone with a thousand-times-rarer intellect would,” you’d probably be annoyed by the way they respond.

Maybe I’m just discovering my own intelligence level here. But regardless, it seems the AI no longer needs me to ask it to be so much smarter. Hence this Captain’s Log entry.

Thankfully, for now, I have the ability to knock o3 down to +1.5 or +1 standard deviations…

