AI Automation
Are You Babysitting Your AI Agent?
A developer I coach told me his AI agent was right but slow. The tell was one sentence: I talk to it every couple of minutes.
A developer I coach told me his AI agent was "right but slow." Every output was correct. The work still crawled. Then he said the thing that gave it away. "I talk to it every couple of minutes."
That one sentence is the whole problem. And almost everyone struggling with AI coding agents right now is making the same move without knowing it.
The tell
Call him Pedro. He is a real engineer, the careful kind. Understand the implementation, then implement, then test. Every individual step he takes is correct. That is exactly why it feels right and stays slow.
Talking to the agent every two minutes means he lives inside its loop. He verifies each step by hand. He keeps control of every turn. The work then moves at the speed of his attention, which is human speed, which is the one speed an agent was supposed to free him from. Every long autonomous run dies before it starts, because he interrupts it at minute two.
He sensed something was off. He told me, "if the tech is getting hard, I'm probably missing a previous step." Good instinct. The missing step is small and boring, and it changes everything.
This was never a skill problem
Watch a hundred engineers fight with agents and the same root shows up almost every time. The thing slowing them down has nothing to do with ability. It comes down to trust.
You cannot trust what you cannot verify. So you watch. Watching every step is the rational response to having no other way to know the agent actually did the work. The babysitting is a symptom. The cause underneath is a missing verification layer. Pedro keeps his hands on the wheel because no one built him a dashboard, so the only way he can confirm the car is on the road is to steer it himself the entire drive.
This matters because the fix feels backwards. People assume the answer is a smarter prompt, a second-brain repo, a fancier setup. The answer is usually duller than that. Give the agent a way to check its own work, and the reason to babysit disappears.
What I learned by letting the agent run
Here is a real one. I had an agent fixing a voice setup that came with a known baseline of thirteen behaviors. It started at eleven of thirteen passing. I did not watch it work. I gave it the rule: fix the customer-facing behavior, and do not let the baseline regress. It changed the prompt, re-ran all thirteen checks by itself, caught its own regression, and kept going until the score held. My only job came at the end, reading the diff.
The agent stayed honest across a long stretch because it had something to check against. That is the entire trick. The verification was the leash, and the leash is what let me drop the rope.
Another time I watched an agent grind one stubborn problem across a hundred and one messages, fixing and re-checking itself the whole way, because the test was built into the loop. I was reviewing the outcome at the end. I was never steering the middle.
I also wear a device that swapped its own apps, its language model, and its speech-to-text in about twenty-four hours of mostly autonomous work, updating itself over the air. None of that happens while someone taps the agent on the shoulder every two minutes. Long autonomous arcs and constant interruption cannot exist in the same workflow. You pick one.
Anthropic gave the opposite instinct a name
This is bigger than my own runs. Anthropic's engineering team wrote in April about a behavior they call "context anxiety," where Claude Sonnet 4.5 would wrap up a task early as it sensed its context window filling, almost like a worker rushing to look finished before a deadline. They built fixes into the runtime so the model could keep going on long-horizon work without that premature exit.
Read that against how most people operate. The labs are spending serious engineering effort making agents run longer without a human in the seat. Most operators are still sitting in the seat anyway, ending the run by hand at minute two. The tools are racing toward autonomy while the humans hold them at arm's length.
The shift that fixes it
The move is from "prompt, watch, tweak" to "scope, define done, release, review the diff." Same person, same agent, completely different throughput. Three habits get you there.
1. Front-load understanding, then let go.
Spend your words at the start. Understand the implementation. Define what "done" looks like in plain terms. That part deserves your full attention and your hard-won judgment. Once the work is scoped, the tweaking and the testing belong to the agent. Pedro already does the hard thinking well. He just spends it in the wrong place, dribbling it out two minutes at a time instead of loading it all up front and stepping back.
2. A test or QA gate is the leash.
Done means tests green, lint clean, and the QA workflow passes. Write that gate before you write the feature. A gate the agent can run on its own is the thing that lets you walk away from a fifty-minute run, because the agent checks its own work instead of waiting on you to check it. Skip the gate, and babysitting every two minutes becomes the only sane option you have left, since you own no other way to trust the result. I post short videos on this constantly. An AI eval is a practice test for your agent, a way to find and fix mistakes before the work goes live. Boring on the surface. Load-bearing underneath.
3. Fan out.
One serial thread caps you at the attention of one human. Run four independent features in parallel instead. Or take a single feature, slice it into parts, and let an agent grind each part and its bugs while you review finished slices. I run several products without a full-time team by working exactly this way. The same principle shows up in how I build phone agents. Stop trying to make one agent do everything, and give each job its own agent. Parallel beats serial the moment you trust each lane to check itself.
A way to actually measure it
Most of this stays a feeling. You believe you give the agent room. You probably do not. So I built a small tool that reads your Claude Code session logs and scores it for you. It runs entirely in your browser. Nothing gets uploaded, because no one should paste their work logs onto a stranger's server. It measures three signals that expose babysitting fast.
- Median cadence: how often you interrupt the agent
- Longest autonomous arc: the longest stretch the agent ran with no input from you
- Self-check calls: how often the agent ran a test, a lint, or a build on its own
A cadence of two minutes with zero self-checks is a babysitter. Long arcs with the agent running its own tests is an operator. The score tells you which one you are this week, and the logs cannot be gamed, because they record what you actually did instead of what you think you did.
Run the free babysitting-score tool on your own logs. It parses everything locally and tells you your number in a few seconds.
The uncomfortable part
The slowest operators are usually the strongest engineers. They know enough to spot every small mistake the second it appears, so they cannot stop reaching in. The verification layer exists for them most of all. It lets them pull their hands off the middle of the work and put their judgment where it pays, on the diff at the end, where one sharp review beats a hundred small interruptions.
So the real upgrade is not learning to prompt better. It is building the one boring thing that earns you the right to look away. Give the agent a test it can run without you. The babysitting stops being necessary, because it was never the thing holding the work together in the first place.
Build the gate first. Then go do something else for fifty minutes, and read the diff when it pings you.
What is your cadence number? Run it on your last session and tell me in the comments. I want to see how many of you are operators and how many are still stuck in the seat.
Get the operator notes.
Real numbers, autonomous systems, and what breaks. For readers, not service buyers.