the m9sh.
Blog / Apr 7, 2026

AI against underengineering.

The real story of agentic engineering is not 10x velocity; it is finally being able to afford the boring, correct, invisible work that turns brittle systems into real ones.

Everybody loves to talk about overengineering. It’s the villain in every war story, the thing your principal engineer warns about at architecture review, the crime your last tech lead committed before leaving. Overengineering is the noun we reach for when we want to sound wise.

It’s also, for most software in production today, not the problem.

The problem is the opposite. The problem is underengineering: codebases held together by heuristics and hope, touching production without tests, without retries, without idempotency, without a story for the day something goes sideways. Not because the engineers didn’t know better — almost all of them did — but because nobody had the budget, the patience, or the stomach for the boring work that would make it right.

AI agents are the first tool we’ve ever had that might actually change that equation. Not because they write “better” code than a senior engineer (they don’t), but because they do the boring, correct thing at a cost that finally makes the boring, correct thing affordable.

This essay is about why the real opportunity of agentic engineering isn’t speed. It’s rigor.

The underengineering you’ve been living with

Walk into almost any five-year-old production codebase and you will find the same stuff:

  • A critical external API call with no timeout, no retry, no circuit breaker.
  • A background job that assumes it runs exactly once, and quietly double-charges customers when it doesn’t.
  • A migration that nobody ran in staging first.
  • A “temporary” cache that’s been load-bearing for three years.
  • An integration test suite that hasn’t been green since Q2.
  • A README that describes the system as it was planned, not as it exists.
  • A feature flag that was supposed to be deleted two releases ago and now nobody is sure who owns it.
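The double-charging job in that list is the canonical idempotency failure, and the fix is smaller than its reputation suggests. A minimal sketch, assuming a client-supplied key and using an in-memory dict to stand in for a durable table (all names here are illustrative):

```python
# Hypothetical: make a charge job safe to re-run by deduplicating on a key.
# `processed` stands in for a durable table with a unique constraint on the key;
# a real system would also have to handle the concurrent-insert race.
processed = {}

def charge_once(idempotency_key, customer_id, amount_cents, charge_fn):
    """Run charge_fn at most once per key, even if the job is delivered twice."""
    if idempotency_key in processed:
        return processed[idempotency_key]  # redelivery: replay the prior result
    result = charge_fn(customer_id, amount_cents)
    processed[idempotency_key] = result
    return result
```

Delivered twice with the same key, the second call replays the recorded result instead of charging the customer again.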

Every one of these is a conscious shortcut. Someone on the team, at some point, knew the right thing to do. They wrote a ticket. They put it in the backlog. They moved on, because there was a release on Friday and the right thing was going to take two days and nobody had two days.

Multiply that by every engineer on the team, every sprint, for five years, and you get the state of production software in 2026: not a cathedral, not a mess of premature abstractions, but a scaffolded hut with a lot of duct tape in the load-bearing places.

We have a long list of words for the things humans build when they know better and do it anyway. The most honest one is underengineering.

Why humans underengineer

It’s not stupidity. Engineers skip the boring correctness work for three structural reasons.

The work is invisible. Nobody thanks you for the retry logic that stopped a 3am incident from happening. The reward loop for defensive engineering is, by definition, an absence. Features ship and get demoed. Hardening does neither.

The work is boring. Writing the sixth integration test for a third-party webhook is not why you became an engineer. It doesn’t stretch you. It doesn’t teach you anything. It doesn’t feel like building. It feels like homework.

The work is slow. Refactoring the data layer to add idempotency keys everywhere is a two-week project that ships nothing visible. In a world of two-week sprints and quarterly OKRs, two weeks with nothing visible is a career risk.

So engineers make a rational trade. They ship the feature, file the ticket, and move on. The hut gets another room, and the duct tape gets a little more load.

This is the gap AI agents can actually close.

The real opportunity isn’t speed

The dominant 2025–2026 narrative around AI coding tools has been about velocity — “10x faster,” “ship in a weekend,” “one-person unicorn.” And with that narrative came the backlash: reporting from InfoQ, Stack Overflow, and The New Stack pointing out that AI-assisted teams churn out “highly functional but architecturally blind” code faster than anyone can review it. JetBrains started calling it shadow tech debt. Margaret Storey named a sharper version of the same idea: cognitive debt, the accumulating mental tax of reasoning about a system you didn’t really write.

All of that is real. If your whole agent story is “chat with the model, paste the diff, merge,” you are not shipping rigor. You are shipping debt at a higher clock speed.

But that’s a Level-1 use of agents, and it’s not where the interesting story is. The interesting story is what happens when you stop pointing agents at the thing you were already doing, and start pointing them at the thing you were never going to do.

The boring correctness work. The hardening. The tests that don’t exist. The retry logic. The idempotency. The migrations that should have been split into three. The observability that should have been wired up before the incident, not after.

Agents don’t get bored. They don’t feel the invisibility. They don’t care about quarterly OKRs. An agent will happily grind on integration tests for six hours while you sleep. It will refactor a hundred callsites in one go without asking whether this is really the best use of its time. It will write the retry wrapper, then write the test for the retry wrapper, then write the doc comment for the retry wrapper, then open the PR.

That’s not a 10x multiplier on a human. That’s a tool that finally matches the shape of the work humans were always skipping.

What to actually hand the agents

Here’s the playbook I’ve been running. It assumes you have a real agent harness — not a chat window, but something that can run tests, read logs, and iterate on failure — and it assumes you trust it inside a bounded scope. If you don’t have that yet, build it first. Everything downstream depends on it.

1. Close the correctness gaps you already know about

Every team has a list. It’s in the backlog tagged “tech debt” or “hardening” or “later”. Pull that list out. Hand the agents the tickets nobody wants to do:

  • Add retries with jitter to every external call that doesn’t have them.
  • Add idempotency keys to every endpoint that mutates state.
  • Add timeouts to every unbounded operation.
  • Add structured logging around every seam where data crosses a system boundary.
  • Add tests for every bug fix in the last 12 months that didn’t ship with one.

None of this is glamorous. All of it is compoundingly valuable. All of it is the kind of work that used to be economically irrational to prioritize.
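To make the first bullet concrete, here is a minimal sketch of a retry wrapper with capped exponential backoff and full jitter. The function name and defaults are illustrative, not from any particular library:

```python
import random
import time

def retry_with_jitter(fn, *, attempts=4, base_delay=0.1, max_delay=2.0,
                      retryable=(ConnectionError, TimeoutError),
                      sleep=time.sleep):
    """Call fn(), retrying transient failures with capped exponential backoff
    plus full jitter, so a fleet of clients does not retry in lockstep."""
    for attempt in range(attempts):
        try:
            return fn()
        except retryable:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the real error
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))  # full jitter

# Demo: a call that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

result = retry_with_jitter(flaky, sleep=lambda _: None)  # no real sleeping in the demo
```

Injecting `sleep` keeps the wrapper testable, which is exactly the property you want when an agent is the one writing the tests for it.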

2. Write the tests you were never going to write

Agents are uniquely good at the long tail of tests — the edge cases, the null handling, the timezone bugs, the unicode edge cases, the “what happens if this field is missing” cases. The ones a human would file under “not worth it” and skip.

Point the agent at a module, ask for an exhaustive list of edge cases, let it propose tests, then review the ones it writes. You will find real bugs. Every time. I have never done this exercise on a production system and not found at least one live issue.
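A sketch of what that exercise produces, using a hypothetical payload parser as the module under test — the helper is made up, but the shape of the edge-case list is the point:

```python
def parse_amount_cents(payload):
    """Hypothetical helper under test: read an 'amount' field from a webhook
    payload and normalize it to integer cents."""
    raw = payload.get("amount")
    if raw is None or raw == "":
        raise ValueError("missing amount")
    return round(float(raw) * 100)

def run_edge_case_tests():
    # The happy path a human would write anyway:
    assert parse_amount_cents({"amount": "10.00"}) == 1000
    # The long tail an agent can be asked to enumerate:
    assert parse_amount_cents({"amount": 0.1}) == 10      # float, not string
    assert parse_amount_cents({"amount": "0"}) == 0       # zero is valid
    for bad in ({}, {"amount": None}, {"amount": ""}):    # missing / null / empty
        try:
            parse_amount_cents(bad)
            raise AssertionError(f"expected ValueError for {bad!r}")
        except ValueError:
            pass

run_edge_case_tests()
```

Each of those "bad" cases is a one-line test a human would skip and an agent will write without complaint.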

3. Refactor the shortcuts into the real thing

Find the places in the codebase where everyone has always said “one day we should do this properly” and let an agent do it properly. Database access that should go through a repository. HTTP calls that should go through a client with backoff. Serialization that should live in one place instead of six. The kind of refactor a human would have scheduled as a Q3 initiative.

A careful agent, running with tests as a safety net, can do this kind of work in hours. The result is not overengineering. It’s the baseline the code should have had from day one.
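A sketch of what "database access through a repository" can look like on the other side of such a refactor — sqlite3 is used here purely for illustration, and the table and class names are made up:

```python
import sqlite3
from dataclasses import dataclass

@dataclass(frozen=True)
class User:
    id: int
    email: str

class UserRepository:
    """The one place user persistence lives; callers never see SQL."""
    def __init__(self, db):
        self._db = db

    def get(self, user_id):
        row = self._db.execute(
            "SELECT id, email FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        return User(*row) if row else None

    def add(self, user):
        self._db.execute(
            "INSERT INTO users (id, email) VALUES (?, ?)", (user.id, user.email)
        )

# Demo against an in-memory database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
repo = UserRepository(db)
repo.add(User(1, "a@example.com"))
```

The mechanical part — finding every raw query and routing it through the seam — is exactly the hundred-callsite grind an agent is suited for.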

4. Build the observability you promised yourself

Every team has promised itself “proper observability” at some point. Most never get there, because wiring up metrics, traces, and structured logs across a whole codebase is exactly the kind of tedious, mechanical, crosscutting work that makes humans reach for the backlog button.

Agents love that work. Give them a list of operations that should be instrumented and a convention to follow, and let them grind. The resulting telemetry is what lets you actually understand the system in production — which is, incidentally, the opposite of cognitive debt.
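A sketch of the kind of convention you can hand an agent: one decorator, one structured JSON log line per operation. The schema and names here are invented for illustration, not a standard:

```python
import functools
import json
import logging
import time

logger = logging.getLogger("telemetry")

def instrumented(operation):
    """Emit one structured log line per call: operation, outcome, duration."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            status = "error"
            try:
                result = fn(*args, **kwargs)
                status = "ok"
                return result
            finally:  # runs on success and on failure alike
                logger.info(json.dumps({
                    "op": operation,
                    "status": status,
                    "duration_ms": round((time.monotonic() - start) * 1000, 2),
                }))
        return wrapper
    return decorator

@instrumented("billing.charge")  # hypothetical operation name
def charge(amount_cents):
    return {"charged": amount_cents}
```

Once the convention exists, "instrument these forty operations" becomes a checklist, not a project.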

5. Write the docs, for real

Not the README. The real docs. The one-page runbook for every critical background job. The data flow diagram for the payment path. The list of every feature flag and who owns it. The onboarding doc that doesn’t lie about the current state of the system.

Agents can generate a first draft of all of this from the code itself. A human still has to review it and fix the wrong parts. But a reviewed agent draft is infinitely better than the nothing that exists today.

The guardrails that make this work

None of the above is a license to let agents run wild. If you do this without structure, you will generate exactly the shadow tech debt the backlash is worried about. A few rules I’ve converged on:

Keep the human in the architecture seat. Agents are allowed to make mechanical and local decisions. Shape, structure, and module boundaries are yours. If the agent wants to introduce a new abstraction, that’s a review moment, not a merge moment.

Tests are the contract, not a suggestion. The agent’s definition of “done” is green tests. If there are no tests, the agent is doing step 2 before step 1.

Every diff is reviewable by one human in under ten minutes. If it isn’t, split it. Shadow tech debt is what happens when diffs stop being reviewable. Small, focused, boring PRs are the antidote.

Treat agent output as junior-engineer output. Not because the model is stupid, but because the review posture is the same. You’re looking for subtle wrong turns and missing context, not typos.

Keep a log of what the agents did and why. This is the single best defense against cognitive debt. When the system misbehaves in six months, you want to be able to read the story of how it got that way — not scroll through a hundred squashed commits with messages like “fix per review”.

Engineering, finally

For most of my career, the tension in this job has been the same. You know what the right thing to build is. You know the hardening work, the tests, the refactors, the observability, the runbooks. And you know you won’t get to most of them, because there isn’t enough of you, and the visible work will always win.

Agentic engineering, done well, is the first time that equation has actually changed. Not because a model is smarter than you. It isn’t. But because the boring, correct, invisible, slow work — the work that turns underengineered systems into engineered ones — finally has a worker that’s suited to it.

The teams that will get the most out of this era are not the teams trying to replace senior engineers with agents. They are the teams using agents to finally do the engineering their senior engineers always wanted to do.

The hut gets a foundation. The duct tape comes off. The system becomes legible again.

That’s the upside worth optimizing for.


Marcin Urbanski

Engineering lead. 11+ years shipping distributed systems at scale.