GPT-5.5 Made Codex Feel Like a Small Team

Codex is hard to explain because I'm not using it as a place to ask for answers anymore.

It feels more like I can tell the computer what I'm trying to do, and it can go do a bunch of it.

The old AI loop was: ask a model to make an output.

Write this paragraph. Make this image. Build this tiny app. Answer this question.

Cool. Useful. I still do that.

What feels new right now is that GPT-5.5 is good enough at staying with a messy problem that I can start treating it like a small team across a bunch of different projects.

The actual shape is: one lane is on the website, one lane is on analytics, one lane is on image variants, one lane is on a remote machine, one lane is on a script packet, one lane is checking whether the fleet is healthy.

And sometimes it's not even different tasks. Sometimes it's copies of the same problem.

Three agents try three versions of the same visual idea.

Four agents research four angles for the same episode.

One agent designs the pipeline, another critiques it, another tries to break it, another turns it into something I can actually ship.

Then I look at what happened and make the call.

That's the part that feels different.

I'm not just asking AI for outputs anymore.

I'm starting to command the work.

How I Got Here

I talked about the learning curve in my Village Global interview, and looking back it's kind of obvious how I ended up here.

At the time, a lot of my job was SQL, Python, dashboards, and analysis. Then GPT-4 came out and I had this weird moment of, wait, for the practical version of analytics, this is better than me.

Not theoretically better.

Like, I would ask it for something, come up with my own answer, look at what it did, and be like, no way, that's more elegant.

That was when I was like, okay, I can't really pretend my old job is the same job anymore.

Then I started asking ChatGPT to teach me how to build small programs. The first program was basically, teach me how to build a GPT-powered chatbot. And it was like 20 lines of code. I was like, okay, this is very meta. I can ask ChatGPT to build another program that uses GPT.

Then AutoGPT and BabyAGI showed up, and BabyAGI was the one where I was like, wait, if you configure a chatbot as an agent, how far does it go? You give it a job, it breaks the job down, it works in a loop toward the goal.

It didn't work perfectly.

But the idea was obvious.

Then Open Interpreter showed the next piece. Let the model use the computer. Let it write shell commands. Let it write Python. Let it move through the machine faster than I can click around the screen.

That was radioactive in 2023 because nobody was really sandboxing it properly, but as a concept it was like, oh, this is way cooler than the model giving me a paragraph.

Then came Cursor, Claude Code, Codex.

If you want the timestamp ladder, it is basically all in that interview: GPT-4 makes me rethink analytics around 4:18, BabyAGI makes the agent loop click around 13:35, Open Interpreter makes computer use obvious around 15:52, and the fleet setup shows up around 29:03.

When I look back at that whole run, the tools being imperfect is almost beside the point.

What mattered is that I kept asking the model to do bigger chunks of the job.

First the model helps you write.

Then it helps you code.

Then it uses tools.

Then it keeps state.

Then it works for longer.

Then it starts coordinating with other agents.

That's why the new Codex matters to me. It doesn't feel like some random new app. It feels like the latest version of a pattern I've been circling for three years.

And Karpathy saw the product path before most people were even thinking this way.

In his 2022 Lex Fridman interview, the useful timestamp is 2:48:53. The take is not "autocomplete is cool." The take is: what if the product that starts as GitHub Copilot keeps getting better until it becomes a digital worker?

That is the idea I keep coming back to.

Copilot matters because it was the baby version of this. It starts by finishing the line. Then the function. Then the task. Then, eventually, the whole messy job.

BabyAGI matters because it showed the loop.

Open Interpreter matters because it showed the computer-use part.

Codex is where those threads start turning into something I can use every day.

What GPT-5.5 Changes For Me

With GPT-5.5, it's not, oh cool, nicer answers.

It's that it can stay with a longer chain of work.

Most useful work is not one prompt.

It's more like: figure out where we are, notice the constraints, make a move, check what happened, fix whatever is next, and leave the project in a better state than you found it.

That sounds simple, but no, that's where all the mess is.

Most real jobs get messy immediately. The repo has old decisions in it. The site has images and routing and analytics and weird frontmatter. The browser is logged in. The draft has a Substack version. The media pipeline has folders of half-good renders. The remote machine has a process running. There is some generated file you should not touch. There is some private file you should definitely not read out loud.

That's why when people say "AI writes code," I'm like, kind of, but that's too small.

Code is just the first place where this gets obvious because code talks back. You can change something, run the check, see what broke, and keep going.
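A minimal sketch of that change, check, keep-going loop, for anyone who wants to see the shape. Everything here is a toy stand-in: work_until_green, fix_one_bug, and check are hypothetical names, not anything Codex exposes, and the "project" is just a counter of remaining bugs.

```python
def work_until_green(make_move, run_check, max_rounds=5):
    """Alternate one move and one check until the check passes or rounds run out."""
    history = []
    for round_number in range(1, max_rounds + 1):
        make_move(round_number)          # change something
        passed, message = run_check()    # let the work "talk back"
        history.append(message)
        if passed:
            return True, history
    return False, history


# Toy stand-ins: a "project" with two bugs and a check that counts them.
state = {"bugs": 2}


def fix_one_bug(round_number):
    state["bugs"] = max(0, state["bugs"] - 1)


def check():
    ok = state["bugs"] == 0
    return ok, "green" if ok else f"{state['bugs']} bug(s) left"


done, log = work_until_green(fix_one_bug, check)
print(done, log)  # True ['1 bug(s) left', 'green']
```

The max_rounds cap is the important part: the loop stops and reports back instead of grinding forever on a check that never passes.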

But the same shape shows up everywhere.

Moving a website. Trying three versions of a feature in three different copies. Turning a folder of renders into a post. Looking at analytics after a redesign. Making a dashboard I can reuse next time. Writing four script packets in parallel. Checking whether a remote fleet is still alive. Cleaning up a machine. Taking a draft and making it Substack-readable. Using a desktop app because there is no nice API for the job.

To me, that's all the same kind of work.

And once one agent is good enough to move one piece of structured work forward, the next obvious question is: why am I running only one?

Why I Run More Than One

When I say fleet, I don't mean chaos.

I mean a set of agents where each one has a job, or a copy of the same job.

One reads the situation.

One tries a version.

One checks the result.

One researches a different angle.

One looks at the browser or the remote computer.

And now the fleet is not just flat.

My Codex config has multi-agent mode on. The relevant settings are max_threads = 16 and max_depth = 3.

So, in normal English: Codex can keep up to 16 agent threads open, and spawned agents can spawn their own subagents up to three levels from the root session.
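To make those two limits concrete, here is a small toy model of a bounded agent tree. It is a sketch, not how Codex implements anything: Fleet, Agent, and SpawnLimitError are hypothetical names, and counting the root session as one of the 16 threads is an assumption made for the example.

```python
from dataclasses import dataclass, field

MAX_THREADS = 16  # mirrors max_threads from the config above
MAX_DEPTH = 3     # mirrors max_depth from the config above


class SpawnLimitError(RuntimeError):
    """Raised when a spawn would exceed the thread or depth limit."""


@dataclass
class Agent:
    job: str
    depth: int = 0
    children: list = field(default_factory=list)


class Fleet:
    def __init__(self, root_job, max_threads=MAX_THREADS, max_depth=MAX_DEPTH):
        self.max_threads = max_threads
        self.max_depth = max_depth
        self.root = Agent(root_job)
        self.open_threads = 1  # assumption: the root session counts as a thread

    def spawn(self, parent, job):
        """Add a child agent under parent, or refuse if a limit would be broken."""
        if parent.depth + 1 > self.max_depth:
            raise SpawnLimitError(f"depth limit {self.max_depth} reached")
        if self.open_threads >= self.max_threads:
            raise SpawnLimitError(f"thread limit {self.max_threads} reached")
        child = Agent(job, depth=parent.depth + 1)
        parent.children.append(child)
        self.open_threads += 1
        return child


fleet = Fleet("lead")
worker = fleet.spawn(fleet.root, "try version A")
checker = fleet.spawn(worker, "check version A")
leaf = fleet.spawn(checker, "inspect one screenshot")
print(leaf.depth, fleet.open_threads)  # 3 4

try:
    fleet.spawn(leaf, "one level too deep")
except SpawnLimitError as error:
    print(error)  # depth limit 3 reached
```

The point of the toy is the refusal path: a fourth level or a seventeenth thread gets an error, not a new agent.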

That matters because the shape of work changes.

It's not just me giving five assistants five tasks. It can be a tree of jobs with limits. A lead agent can say: go try this version, go research this angle, go check this machine, go test this artifact, go look at this other copy of the project. Then it pulls the results back together.

But the limits are the whole reason it doesn't become nonsense.

I'm not trying to use all 16 threads or go three levels deep every time.

The limits are there so the fleet doesn't run away.

If every agent starts wandering around and spawning more agents because it feels powerful, you don't get leverage. You get fog.

So my boring rule is still:

One agent, one clear job, one surface to own.

The reason it works is that the jobs are boring and specific.

The reading agent is not also editing. The editing agent is not pretending it tested the work. The agent trying version A is not messing with version B. The verification agent is not rewriting the project. The browser agent is not opining on the whole architecture.
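One way to keep that rule honest is to check, before any agent starts, that no two agents own overlapping surfaces. The sketch below assumes a "surface" is a folder an agent is allowed to write to. The lane names and paths are made up for the example, and find_conflicts is a hypothetical helper, not a Codex feature.

```python
from pathlib import PurePosixPath


def overlaps(a, b):
    """True when two surfaces are the same path or one contains the other."""
    a, b = PurePosixPath(a), PurePosixPath(b)
    return a == b or a in b.parents or b in a.parents


def find_conflicts(assignments):
    """Return every pair of agents whose write surfaces overlap."""
    conflicts = []
    items = list(assignments.items())
    for i, (agent, surface) in enumerate(items):
        for other_agent, other_surface in items[i + 1:]:
            if overlaps(surface, other_surface):
                conflicts.append((agent, other_agent))
    return conflicts


# Hypothetical lanes and folders, invented for this example.
lanes = {
    "editor": "site/pages",
    "variant-a": "variants/a",
    "variant-b": "variants/b",
    "mobile-fix": "site/pages/pricing",
}
print(find_conflicts(lanes))  # [('editor', 'mobile-fix')]
```

Version A and version B pass because their folders are siblings. The editor and the mobile fix get flagged because one folder sits inside the other, which is exactly the case where two agents end up stepping on each other.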

When the model is weak, a fleet multiplies confusion.

When the model is strong, a small bounded fleet starts becoming useful.

That's what GPT-5.5 changes for me. The individual agent is finally strong enough that the delegation starts paying off.

The Computer Stuff Matters

My work does not live in one app.

It lives across code projects, websites, notes, image tools, video tools, browser sessions, dashboards, terminal windows, remote machines, and random desktop apps.

This is why computer use is a huge deal to me.

Some work does not have a clean official connection. You have to open the app, pick the file, upload the asset, click through the form, check what rendered, and make sure the right result happened.

It's boring work, but it's a massive amount of real work.

If the model can reason through the task and safely operate the interface, the boundary of automation moves again.

So, again, I don't really buy the frame of "AI writes code."

The more interesting part is that AI can inspect the project, use the browser, help with the desktop app, coordinate with the remote machine, check the artifact, and hand the result to the next agent.

That feels less like a chatbot and more like a command layer over the computer.

What I Actually Use It For

I scanned my Codex session history because I wanted to see what I actually use this for, not what I say I use it for.

The pattern was obvious. It was not one repo. It was not one product. It was more than a thousand threads across a lot of working folders.

I use Codex when the work has state.

By state I mean: there is a real situation already sitting there. Files, branches, screenshots, browser tabs, transcripts, generated assets, old decisions, weird config, an error message, a half-finished project on some computer somewhere.

"Write me code" does not describe that.

The more accurate prompt is:

Look at the real situation and move it forward.

That shows up in a few buckets.

There is site and publishing work: redesign a page, fix the mobile version, add videos, tune rankings, pull analytics, turn the result into a dashboard, make the newsletter version shorter.

There is media and show work: character sheets, concept art, prompt variations, transcripts, script packets, episode premises, render paths, critique loops, upload flows, final exports.

There is variant work: make copies of the problem, send different agents in different directions, compare what came back, keep the best one.

There is fleet work: check the Mac minis, inspect tmux panes, keep Hermes agents alive, run recurring scans, figure out why one lane is stuck, route work to the right machine.

There is research and eval work: not "summarize this for me," but turn the research into a list, scorecard, dashboard, brief, or reusable test.

There is local machine work: change the terminal setup, inspect config, install or remove tools, clean up old panes, explain what some weird setting means before changing it.

There is browser and app work: upload this, download that, check the dashboard, open the file, take a screenshot, compare the result.

Those are all the same kind of problem. It's not a blank page. There is already a messy situation, and I need it moved forward.
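The variant bucket is the easiest one to sketch in code: fan the same problem out in several directions, collect what comes back, and rank it for a human to choose from. This is a toy under stated assumptions. run_variants, attempt, and the ratings table are hypothetical, and threads stand in for what would really be separate agents working in separate copies.

```python
from concurrent.futures import ThreadPoolExecutor


def run_variants(problem, directions, attempt, score):
    """Run one attempt per direction in parallel and return them best-first."""
    with ThreadPoolExecutor(max_workers=len(directions)) as pool:
        results = list(pool.map(lambda d: attempt(problem, d), directions))
    return sorted(zip(directions, results),
                  key=lambda pair: score(pair[1]),
                  reverse=True)


# Toy stand-ins: each "agent" returns a labelled draft, and the score is a
# made-up rating table standing in for a real review step.
ratings = {"bold": 2, "minimal": 3, "playful": 1}


def attempt(problem, direction):
    return {"direction": direction, "draft": f"{problem} ({direction})"}


def score(result):
    return ratings[result["direction"]]


ranked = run_variants("hero image", ["bold", "minimal", "playful"], attempt, score)
for direction, result in ranked:
    print(direction, "->", result["draft"])
# minimal -> hero image (minimal)
# bold -> hero image (bold)
# playful -> hero image (playful)
```

The function only ranks. Picking the winner stays with the person looking at the results.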

Skills Are Basically Saved Lessons

When I say I use skills, I don't mean magic.

A skill is basically a saved way of working.

If I keep telling Codex the same instruction over and over, I probably don't need a better prompt. I need to make the setup remember the lesson.

Maybe it belongs in the project instructions.

Maybe it belongs in a skill.

Maybe it belongs in a custom agent.

Maybe it belongs in an automation.

This is one part people miss. The system gets better because the setup accumulates habits.

Read the actual project before making suggestions.

Use the right browser lane.

Generate the image, then inspect the result.

Render the document before saying it is done.

Run the build before claiming the site works.

Keep private files private.

Use a smaller agent for a bounded subtask.

That's not prompt decoration. That's me taking the stuff I keep saying and making the environment remember it.

My Job Now

I don't think typing is the bottleneck anymore.

The bottleneck is being good at giving commands.

What am I actually trying to do? Which folder is the target? Which app is in scope? Which computer is doing the work? Which agent owns which part? What should stay private? What check proves the work moved?

That's mostly what I'm doing now.

I still decide what matters.

I still judge taste.

I still pick the direction.

But I don't want to spend my life moving between tabs, files, dashboards, terminals, browser windows, and app UIs just to keep a workflow alive.

I want to command the workflow.

That's how I'm using GPT-5.5 and Codex right now.

Not as a single assistant.

As a small team I can command.