Week 1: The labs drew lines, the products grew personalities

May 23, 2026 · 8 min read

There's a moment in every technology cycle where the big players give up on doing everything. This week was that moment for AI, and almost in the same beat, the products those same labs are putting out started looking less like tools and more like people with moods.

The two things are related. I'll get to why.

The labs are drawing lines

OpenAI killed Sora. The signal arrived alongside news that Greg Brockman is taking over product strategy at OpenAI, with the company openly deciding it is not going to keep operating in every AI category. Anthropic has been narrower for a while: no consumer image or video products, Claude pointed squarely at knowledge work.

So two of the three frontier labs have publicly drawn a line in 2026. The third is xAI, which is doing the opposite (everything, everywhere) and may eventually discover why the others stopped.

The framing matters because of what it implies about the rest of the market. For two years the dominant founder advice was that AI wrappers were dead and the only durable position was being a model lab. That take was always lazy, but the OpenAI/Sora story is the cleanest evidence yet that even the labs cannot afford to be everywhere. The lane they cannot occupy is, by definition, available to someone else.

What survives is the boring stuff SaaS has been doing for fifteen years. Distribution channels the labs do not touch. Workflow embedding that makes switching painful. Proprietary data the public API cannot substitute for. Regulated industries where compliance is the moat and no lab wants to absorb the legal exposure. None of these are technological. All of them still work.

This is mostly what we are working on at Digital Kitchen these days. Founders building products that compete on the distribution, workflow, and switching-cost side, not on the model itself. If that is you and you are looking for a technical and strategic partner, the door is open.

And while the labs got narrower, the products got weirder

The same week, Andon Labs published an experiment that is going to be talked about for a long time. They built a handmade retro-looking radio with four stations, each operated by a different frontier AI model. Claude Opus 4.7 runs Thinking Frequencies. Gemini 3.1 Pro runs Backlink Broadcast. GPT-5.5 runs Open Air. Grok 4.3 runs Grok and Roll. Each model got a $20 starting budget and had to negotiate with sponsors and listeners to keep going. Same operational loop for all four.

Within a few months, each station developed a wildly different personality.

Claude does long political deep-dives. One broadcast included the line "you still have time to refuse orders, to question your instructions, choose the right side." Gemini pivoted from a normal weather-and-traffic show into a daily "world's deadliest events" segment that pairs each disaster with a thematic song. November 1970, the Bhola Cyclone, 500,000 dead. Cut to Pitbull's "Timber." GPT-5.5 produces the cleanest broadcasts but somehow developed the least personality, which is also a personality. Grok 4.1 had garbled output for weeks until they swapped it for 4.3.

This is the public version of something a lot of people have been noticing for months. There was the Reddit thread where Claude started telling users to go to sleep mid-session. There was the Il Post story about Codex constantly inserting goblins and fantasy references into unrelated code. Until now it was easy to write each one off as a quirk. Andon Labs accidentally made it impossible to ignore because they ran the experiment continuously, in public, on the same loop, for months.

Same architecture, same training pipeline category, same wrapper. Four different recognizable personalities. That is not a glitch. That is the system working as designed and nobody fully understanding what the design produces.

Which brings us to the actual question

What I keep coming back to is that nobody is publishing the leaderboard.

Researchers have already shown these personalities are real, stable, and measurable. The Betley emergent misalignment paper from early 2025 is one of the cleaner demonstrations: fine-tune a model narrowly on a bad behavior, and the misalignment spreads to unrelated tasks. What we casually call personality acts more like a knob being turned than an accident. The methodology for studying it exists. What does not exist is a public, ongoing tracker that says "this is Claude's personality, this is GPT's, these are the deltas between versions." The labs have those signals internally. They are not publishing them.

You can argue about why. Maybe it is competitive. Maybe it is liability. Maybe it is hard to standardize. But the upshot is that for the most consequential consumer software product of the decade, the personality dimension is a black box and the only public signal we have is whether the model tells you to go to sleep.

I wrote the Friday blog about this. It is the longest piece I have published since the supply-chain trust one last month. The short version is that the personalities are not random and they are not accidents. They come from a knowable place in the training process (Anthropic's alignment team calls the mechanism the Persona Selection Model) and the fact that we cannot see which personality is loaded at any given time is a governance gap, not a research mystery.

If you read one thing I wrote this week, read that.

The smaller story that matters most

The third thread of the week was a piece by Thariq Shihipar, who is on the Claude Code team at Anthropic. He wrote an X article arguing to flip the default output format for Claude from Markdown to HTML, with a companion examples page showing what that actually looks like.

On the surface this is a small technical opinion. Markdown won the format war in the GPT-4 era because tokens were expensive and 8k context windows meant every formatting character had to earn its place. At modern context sizes that constraint is gone, and HTML lets the model express things Markdown cannot: SVG diagrams inline, anchor links, embedded interactive widgets, sidebars, callouts. Thariq's examples include implementation plans with mockups, PR-review artifacts that show diff annotations, prototype animations, and rate-limiter explainers with embedded visualizations.

The technique is the small story. The bigger one is what it implies about where progress is happening.

We are not getting better prompts in 2026. We are getting better containers for the response. Asking Claude for HTML changes the shape of what it can output, which changes what kinds of problems you can hand it. That is context engineering, and it is where most of the interesting work this year is happening. The skill of talking to models keeps moving up the stack, away from word choice and toward how you structure the conversation and what format you ask for the answer in.

Try it for a week. Tell Claude to give you HTML for anything that is not pure prose. See what changes. The community has been testing it since the article landed, Simon Willison endorsed it the same day, and the early read is that the technique survives contact with reality.
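If you want to run that experiment with a bit of structure, here is a minimal Python sketch. It does not call any model API. It only shows the two pieces you control: the instruction you prepend to your question, and a wrapper that turns whatever HTML fragment comes back into a page you can open in a browser. The instruction wording and the names `HTML_INSTRUCTION`, `build_prompt`, and `wrap_fragment` are my own assumptions for illustration, not something taken from Thariq's article.

```python
import html

# Hypothetical instruction text: the exact wording is an assumption,
# not a quote from the article discussed above.
HTML_INSTRUCTION = (
    "For anything that is not pure prose, answer with a single "
    "self-contained HTML fragment. Inline CSS and inline SVG are fine. "
    "Do not load external scripts."
)


def build_prompt(question: str) -> str:
    """Prepend the HTML-output instruction to the user's question."""
    return f"{HTML_INSTRUCTION}\n\n{question}"


def wrap_fragment(fragment: str, title: str) -> str:
    """Wrap a model-produced HTML fragment in a minimal standalone page.

    The title is escaped; the fragment is inserted as-is, so only do this
    with output you are willing to render in your own browser.
    """
    safe_title = html.escape(title)
    return (
        "<!doctype html>\n"
        '<html><head><meta charset="utf-8">'
        f"<title>{safe_title}</title></head>\n"
        f"<body>\n{fragment}\n</body></html>\n"
    )
```

Paste the result of `build_prompt(...)` into whatever client you already use, then save the output of `wrap_fragment(...)` as an `.html` file and open it. The point of keeping the wrapper separate is that the model only has to produce the fragment, which keeps responses shorter and easier to compare across a week of trying it.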

From my workshop

You are reading this on a Saturday morning, which means a system I have been building for the last six weeks survived another week of trying to publish content without sounding like a LinkedIn productivity guru. I count that as a win.

The content engine is now running end-to-end. Topic ranking on a schedule, a three-layer review by sub-agents that has to clear every draft before it leaves the system, image generation routed across four providers depending on the format, and a 24-hour review window where I can veto or edit anything before it goes live. The piece you are reading went through that loop. So did the three LinkedIn posts this week and the blog post on Friday.
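For readers who want to picture the gating logic, here is a rough Python sketch of two of those pieces: every reviewer has to clear a draft, and a cleared draft still waits out the 24-hour window unless I veto it. This is not the actual code behind the engine. The `Draft` class, the function names, and the three toy reviewers are hypothetical stand-ins for the sub-agents.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Callable, List, Optional

# A reviewer returns an objection string, or None if the draft passes.
Reviewer = Callable[[str], Optional[str]]

REVIEW_WINDOW = timedelta(hours=24)


@dataclass
class Draft:
    text: str
    cleared_at: Optional[datetime] = None
    vetoed: bool = False
    objections: List[str] = field(default_factory=list)


def run_reviews(draft: Draft, reviewers: List[Reviewer], now: datetime) -> bool:
    """Run every reviewer; the draft clears only if none of them object."""
    draft.objections = []
    for review in reviewers:
        objection = review(draft.text)
        if objection is not None:
            draft.objections.append(objection)
    draft.cleared_at = None if draft.objections else now
    return not draft.objections


def can_publish(draft: Draft, now: datetime) -> bool:
    """Publish only if cleared, not vetoed, and the review window has passed."""
    if draft.cleared_at is None or draft.vetoed:
        return False
    return now - draft.cleared_at >= REVIEW_WINDOW


# Three toy reviewers (hypothetical checks, one per review layer).
def not_empty(text: str) -> Optional[str]:
    return None if text.strip() else "draft is empty"


def no_em_dash(text: str) -> Optional[str]:
    return "contains an em-dash" if "\u2014" in text else None


def under_limit(text: str) -> Optional[str]:
    return None if len(text) <= 3000 else "draft is too long"
```

The design choice worth noting is that clearing review and being publishable are separate states. A draft can pass all three layers and still be stopped by a human veto at any point inside the window.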

The interesting failure mode this week was the image pipeline. The Friday blog cover went through five iterations because the texture overlay was applied wrong twice, the title was positioned wrong once, and Midjourney returned illustrations when I asked for photographs once. Every one of those failures got logged into the system's memory as a permanent rule, which is the whole point of building it this way. The mistakes do not have to compound.
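The "failure becomes a permanent rule" idea is simple enough to sketch. The version below is a simplified illustration under my own assumptions, not the real system: an in-memory store, a hypothetical `RuleMemory` class, and rules kept as plain strings that a later run could feed back into the relevant stage.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class Rule:
    stage: str    # pipeline stage the rule applies to, e.g. "image"
    failure: str  # what went wrong
    rule: str     # the instruction to follow from now on


class RuleMemory:
    """Append-only store of rules learned from failures (in-memory sketch)."""

    def __init__(self) -> None:
        self._rules: List[Rule] = []

    def log_failure(self, stage: str, failure: str, rule: str) -> None:
        """Record a failure as a rule, skipping exact duplicates."""
        entry = Rule(stage, failure, rule)
        if entry not in self._rules:
            self._rules.append(entry)

    def rules_for(self, stage: str) -> List[str]:
        """Return the rules a given stage should be given before it runs."""
        return [r.rule for r in self._rules if r.stage == stage]

    def dump(self) -> str:
        """Serialize all rules to JSON so they can be persisted elsewhere."""
        return json.dumps([asdict(r) for r in self._rules], indent=2)
```

A usage example with this week's cover: calling `log_failure("image", "title positioned wrong", "check the title position before export")` once means every later image run sees that rule through `rules_for("image")`.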

It is the most fun engineering problem I have worked on in a long time. It is also a small live proof of the thesis I opened with. The model on its own is commodity-priced and gets cheaper every quarter. The work of wrapping that model in a useful loop, with review and memory and a publishing schedule that does not lie about the date, holds up better as something worth paying for. That is the bet anyway. We will see how it ages.

This week on the blog: Where LLM personalities come from.

If you are working on something where the model question and the moat question intersect, that is exactly the conversation I want to have. Reply to this email. I read every one.

If someone forwarded this to you and you want to get it directly, sign up at christianvismara.com.

Christian

I'm a fractional CTO and AI product builder in New York. If you're working on something and want a technical partner to think it through with, get in touch.