Apr 4, 2026

The seven levels of agentic engineering.

From autocomplete to goal-driven autonomy: a working taxonomy for how AI agents are actually being used to build software in 2026, and how to tell which level you are at.

Everyone is "using AI to code" now. The phrase is so broad that it tells you almost nothing. A junior dev typing prompts into a chat window and a senior engineer running a fleet of long-lived autonomous agents are both, technically, doing the same thing. They are not doing the same thing.

After a year of building real systems with Claude, watching teams adopt agentic tools, and shipping a handful of products where agents wrote a meaningful share of the code, I've started to see a clear ladder. Seven rungs. Each one demands a different mental model, a different toolchain, and — most importantly — a different definition of what your job actually is.

This essay walks the ladder.

Level 0 — Autocomplete

The starting point. The model lives inside your editor and finishes the line you're already typing. Copilot's original pitch. Cursor's tab key.

At Level 0 the human is fully in the loop. The model is a faster keyboard. It guesses the next token, the next line, occasionally the next function. You read every character before it lands. Latency is the only metric that matters.

The skill at this level is the same skill you've always had: knowing what you want to type. The model just helps you type it faster. Most engineers who say "AI didn't really change my job" are stuck here, and from inside Level 0 they're right. It really doesn't change much.

Level 1 — Conversational pair

You stop typing into the editor and start typing into a chat box. "Refactor this method." "Write a test for this." "Explain why this is slow."

The model now produces blocks of code instead of tokens. You read them, paste them, run them, fix them. The loop is still tight — minutes per turn — but the unit of work is bigger. You are no longer composing the code. You are composing the prompt.

This is where most professional engineers actually live in 2026. It's a real productivity bump — maybe two times for routine work — but it has hard ceilings. The model has no memory of your codebase. You spend a lot of time copy-pasting context in. You get better at writing prompts than at writing code, which feels weird and is slightly true.

The mistake at Level 1 is thinking you've arrived. You haven't. You've just learned to use a chat interface.

Level 2 — Editor agents

The model gets a body. It reads files itself. It runs tools. It writes files. It can see your project structure, jump between files, and make coordinated edits across multiple locations at once.

Cursor's agent mode, Copilot Workspace, the agent panes inside JetBrains and VS Code — these are all Level 2. You give a task ("add pagination to the posts index"), the agent fans out across the relevant files, makes the edits, and comes back with a diff. You review. You merge. Or you reject and refine.

Three things change at Level 2:

First, the unit of work jumps another order of magnitude. You're now thinking in tasks, not snippets. "Build the form, the controller, the migration, and the tests" is one ask, not four.

Second, context becomes the bottleneck. The agent is only as good as the slice of the codebase it can hold in its head. The whole craft becomes shaping that slice — through file layout, naming, well-placed comments, and ruthless pruning of dead code. Your repo is now an interface for a new kind of reader.

Third, you stop reading every line. This is the psychological leap. You start trusting diffs the way you trust well-tested library code: spot-check, run, move on. People who can't make this leap stall here forever.

But Level 2 has a fundamental limitation: the agent can edit files, but it can't run your code. It doesn't know if the tests pass. It doesn't see the runtime error. It's writing in the dark, and you're the one who flips the light switch.

Level 3 — Task agents

The agent gets a terminal.

This is the rung that changes everything. At Level 3, the agent can not only read and write files — it can execute them. Run the test suite. See the failure. Read the stack trace. Edit the code. Run the tests again. Iterate until green. All without you touching the keyboard.

Claude Code, Aider in terminal mode, Devin, Codex CLI — these are Level 3 tools. The defining property is the verify-edit loop: the agent proposes a change, checks its own work, and self-corrects. The human goes from being the runtime to being the reviewer.

This is a bigger jump than it sounds. At Level 2, every iteration goes through you: the agent writes, you run, you report the error, the agent tries again. At Level 3, the agent can do ten iterations in the time it takes you to read one diff. The loop speed goes from human-scale to machine-scale.
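
The verify-edit loop is easy to sketch. Below is a minimal, dependency-free simulation; `run_tests`, `propose_patch`, and `apply_patch` are hypothetical stand-ins for the agent's real tools (test runner, model call, file edits), not any particular product's API.

```python
def verify_edit_loop(run_tests, propose_patch, apply_patch, max_iters=10):
    """The Level 3 loop: test, read the failure, patch, test again.

    run_tests() -> (passed, log); propose_patch(log) -> patch;
    apply_patch(patch) mutates the workspace.
    """
    for i in range(max_iters):
        passed, log = run_tests()
        if passed:
            return True, i               # green after i patches
        apply_patch(propose_patch(log))  # the model call lives in propose_patch
    return run_tests()[0], max_iters     # out of budget: report final state

# Simulated workspace: one bug, fixed by the first "patch".
state = {"bug": True}
ok, patches = verify_edit_loop(
    run_tests=lambda: (not state["bug"], "AssertionError" if state["bug"] else ""),
    propose_patch=lambda log: "fix",
    apply_patch=lambda patch: state.update(bug=False),
)
# ok is True, patches is 1
```

The point of the sketch is the shape: the human appears nowhere inside the loop, only at the return value.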

The skills that emerge at Level 3:

  • Environment design. The agent needs a workspace it can run in — the right tests, the right dev server, the right database fixtures. If your dev environment requires twelve manual steps to set up, an agent can't use it. Level 3 forces you to fix your developer experience, which was probably broken for humans too.
  • Guardrail engineering. The agent has a terminal. What can it not do? Can it delete production data? Push to main? Install packages? The boundary between "sandbox" and "blast radius" becomes a design decision, not an afterthought.
  • CLAUDE.md / AGENTS.md as a craft. These files are no longer nice-to-haves. They are the agent's onboarding docs. The quality of the file directly determines the quality of the output. Writing a great CLAUDE.md is as important as writing great code.
  • Review posture shift. You're no longer reviewing "did the agent understand my prompt?" You're reviewing "did the agent's self-verified solution actually solve the right problem?" The mistakes are subtler and more interesting.
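
The guardrail point can be made concrete with a default-deny allowlist on the agent's terminal. The binaries and policy below are invented for illustration; real setups vary, but the shape (deny by default, carve out human-only subcommands) is the part that matters.

```python
import shlex

# Hypothetical policy, not any tool's real config: binaries the agent
# may invoke unattended, and git subcommands that stay human-only.
ALLOWED_BINARIES = {"pytest", "ruff", "mypy", "git"}
HUMAN_ONLY_GIT = {"push", "reset", "rebase"}

def is_permitted(command: str) -> bool:
    """Decide whether the agent may run this shell command unattended."""
    argv = shlex.split(command)
    if not argv or argv[0] not in ALLOWED_BINARIES:
        return False                      # default-deny: rm, curl, pip, ...
    if argv[0] == "git" and any(a in HUMAN_ONLY_GIT for a in argv[1:]):
        return False                      # no pushing to main from a sandbox
    return True
```

Twenty lines like these are the difference between "sandbox" and "blast radius."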

Most teams that say they're "doing agentic engineering" are somewhere between Level 2 and Level 3 right now. The difference is whether the agent can verify its own work. If you're still copy-pasting errors back into chat, you're at Level 2. If the agent sees the error and fixes it before you do, you're at Level 3.

Level 4 — Background agents

The agent moves out of your editor and into a queue.

You file a "ticket" — sometimes literally an issue, sometimes a row in a database, sometimes a comment on a PR — and the agent picks it up, works on it for minutes or hours, opens a pull request, and pings you. You weren't watching. You were doing other work, or sleeping, or in a meeting.

GitHub's agent-on-issues, Devin's async mode, Claude's background tasks, the homegrown loops people are quietly running on their own infra — all Level 4. The defining property is asynchrony. The human is no longer the throttle.

This is the first level where the org chart starts to bend. If one engineer can dispatch ten background agents, what is that engineer? Not a coder. Not really a manager. Something in between. A foreman, maybe. A tech lead with a team that doesn't need lunch breaks but does need remarkably specific instructions.

The skills that matter at Level 4:

  • Task framing. The ability to write a brief that an agent can actually execute without clarification. This is harder than it sounds and is roughly the same skill as writing a great Jira ticket for a competent but new engineer.
  • Verification at scale. You can't deeply review ten PRs an hour. You need tests, linters, type checks, screenshot diffs, and a sense for which kinds of changes are safe to skim and which demand a real read.
  • Containment. Agents need sandboxes. Ephemeral branches, ephemeral environments, ephemeral databases. The blast radius of a bad change has to be small enough that "merge and see" is a reasonable strategy.
  • Triage. When five agents come back and three of them got it wrong, you need a fast way to decide which to retry, which to fix yourself, and which to throw away.
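
Verification at scale and triage usually end up encoded in a small gate that every agent PR passes through before a human sees it. A sketch, with invented path prefixes and thresholds; the real numbers are a policy decision, not a constant.

```python
from dataclasses import dataclass, field

@dataclass
class AgentPR:
    ci_green: bool
    files_changed: int
    paths: list[str] = field(default_factory=list)

# Hypothetical path prefixes that always demand a real read.
SENSITIVE = ("src/billing/", "src/auth/", "db/migrations/")

def review_tier(pr: AgentPR) -> str:
    """Triage an agent PR into 'retry', 'deep-review', or 'skim'."""
    if not pr.ci_green:
        return "retry"        # never spend human attention on red CI
    if any(p.startswith(SENSITIVE) for p in pr.paths) or pr.files_changed > 20:
        return "deep-review"  # sensitive surface or large blast radius
    return "skim"             # CI, size, and surface all say: spot-check
```

The gate doesn't replace judgment; it rations it.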

The economics get weird at Level 4. The cost of agent runs (tokens, compute, CI minutes) becomes a real budget line. But the output per engineer-hour can be five to ten times what it was at Level 2 — if the task framing is good and the verification infrastructure is solid.

Level 5 — Agent fleets

Multiple agents working on the same codebase at the same time, possibly coordinating, possibly stepping on each other's toes, definitely producing more code than any human can review one line at a time.

A planning agent decomposes a feature into tasks. A coding agent picks up each task. A review agent reads the diffs. A test agent runs the suite and writes new tests for uncovered branches. A docs agent updates the changelog. They share a workspace. They open PRs against each other. You sit on top of the whole thing as the editor-in-chief.

This is bleeding-edge in 2026 and very few teams are doing it well. The hard problems are not the ones you'd expect. It isn't the AI quality — frontier models are plenty good. It's everything around them:

  • Merge conflict economics. When two agents touch the same file, who wins? How do you replay a task on top of a moved target without burning thousands of tokens?
  • Cost discipline. A poorly-bounded agent fleet can spend a thousand dollars on a feature that would have taken a human an afternoon. You need budgets, kill switches, and a sense of when to give up.
  • Trust calibration. Some tasks are safe to ship without human review (a typo fix, a dependency bump that passes CI). Some require deep reading. The fleet has to know the difference, and so do you.
  • Memory. Without a real sense of "what we already tried and why it didn't work," agent fleets re-invent the same broken solutions in a loop. Persistent memory is the unsexy but critical infrastructure layer.
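
The memory layer can start embarrassingly simple: an append-only ledger that agents read before starting a task and write after finishing one. A sketch, with an invented file name and schema:

```python
import json
import pathlib

LEDGER = pathlib.Path("agent_ledger.jsonl")  # hypothetical shared file

def record_attempt(task: str, approach: str, outcome: str) -> None:
    """Append one attempt (merged / failed / abandoned) to the ledger."""
    with LEDGER.open("a") as f:
        f.write(json.dumps({"task": task, "approach": approach,
                            "outcome": outcome}) + "\n")

def prior_failures(task: str) -> list[str]:
    """Approaches already tried and failed for this task. Feed this into
    the next agent's prompt so the fleet stops repeating dead ends."""
    if not LEDGER.exists():
        return []
    entries = [json.loads(line) for line in LEDGER.read_text().splitlines()]
    return [e["approach"] for e in entries
            if e["task"] == task and e["outcome"] == "failed"]
```

A flat JSONL file is obviously not the end state, but it's enough to break the loop of re-invented broken solutions.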

The engineer at Level 5 is doing something that doesn't yet have a clean name. It is not coding. It is not management. It is closer to directing, in the film sense — you're choosing the shots, setting the tone, cutting the bad takes, making sure the whole thing adds up to a coherent piece of work.

Level 6 — Goal-driven autonomy

The final rung, and mostly theoretical. You hand the agent a goal — "grow ARR 20% in Q2," "rewrite the billing system off Stripe," "ship a competitor to X" — and it figures out the rest. Plans, executes, measures, iterates. You check in occasionally to set direction and approve the things only a human can approve: pricing, hiring, public communication, irreversible decisions.

Nobody is here yet. The capability isn't the bottleneck — the trust is. We don't have the verification infrastructure to let an agent commit to anything with real-world consequences without a human in the loop. We don't have insurance products. We don't have the legal framework. We don't, frankly, have the social contract.

But the distance between Level 5 and Level 6 is shrinking. Every improvement in model reasoning, every new tool-use capability, every better eval framework makes the trust gap narrower. The question isn't whether we'll get there. It's whether the people building the trust infrastructure — the guardrails, the evals, the kill switches, the audit trails — can keep up with the people building the capability.

The engineers who will be most valuable in a Level 6 world are not the ones writing the most code today. They're the ones building the judgment layer — the systems that decide what's safe to automate and what still needs a human in the chair.

Where you actually are

Be honest. Most engineers reading this are at Level 1, drifting toward Level 2. Most teams are at Level 2 or 3, with one or two people experimenting at Level 4 and the rest of the org pretending the experiments don't exist.

There is no shame in being on a lower rung. The rungs are real and each one takes work to climb. But there is a cost to refusing to climb, and the cost compounds. Every rung up multiplies the unit of work you can take on without losing control. The engineers who climb fastest are not necessarily the smartest — they are the ones who are willing to stop reading every line, willing to trust the loop, willing to redefine "doing the work" in ways that feel uncomfortable at first.

What changes at each level

A shorthand:

  • L0 → L1: You stop typing code and start typing intent.
  • L1 → L2: You stop pasting context and start shaping a codebase the agent can navigate.
  • L2 → L3: You stop being the runtime. The agent verifies its own work.
  • L3 → L4: You stop watching and start dispatching.
  • L4 → L5: You stop dispatching one task at a time and start directing a small studio.
  • L5 → L6: You stop telling it what to build and start telling it what to want.

The skill ceiling at every level is higher than it looks from the adjacent rungs. Level 1 looks trivial from Level 2. Level 2 looks naive from Level 3. Level 3 looks manual from Level 4. The ladder doesn't get easier — it gets stranger.

What I'd tell my past self

A year ago I was a Level 2 engineer who thought he was at Level 3. I was running agents in my editor, calling it "agentic," and quietly burning out because the loop still ran at human speed. The shift came when I forced myself to:

  • Give the agent a terminal, not just an editor. The moment the agent could run tests and see its own failures, the iteration speed changed by an order of magnitude. This was the single biggest unlock.
  • Write tasks instead of typing prompts. A task is a brief. A prompt is a sentence. The difference is a table of contents.
  • Build verification infrastructure first. Tests, types, screenshot diffs, smoke runs. Without these, every agent run ends in a slow, expensive read-through. With them, you can ship.
  • Trust the diff. Spot-check, run, move on. Be wrong sometimes. Catch it in the next loop. The economics only work if you let the loop carry you.
  • Keep a running ledger of what worked. Memory across sessions matters more than memory within a session. Write things down, not for future humans, but for future agents.

None of this is comfortable. All of it is the job now.

The seven levels are not a leaderboard. They are a map. Find yourself on it. Decide which rung you actually want to be on. And start climbing — slowly, carefully, with your eyes open.


Marcin Urbanski

Engineering lead. 11+ years shipping distributed systems at scale.