Apr 2, 2026

Building foundations that outlast you.

At some point in every growing organization, someone becomes the person who sees the whole system. Here is the playbook for doing that job well — and making yourself unnecessary, which is the whole point.

At some point in every growing engineering organization, someone becomes the person who sees the whole system. Not by title. Not by choice, usually. By gravity. They're the one who touched the CI pipeline last, who wrote the Terraform modules everyone copies, who gets pulled into every cross-team architecture discussion because they're the only person who knows how both sides work.

If that person is you right now, this essay is for you. Not a war story about how great it is to be indispensable — it isn't, and you shouldn't want to be. This is the playbook for how to do the job well, build the foundations that let others move independently, and eventually make yourself unnecessary. Which is, quietly, the whole point.

What the job actually is

Being the platform person on a team is not about building everything yourself. It's about deciding how things get built so that multiple teams can build on top of a shared foundation without stepping on each other. You own the abstractions, the conventions, the shared infrastructure, and — crucially — the "why" behind decisions that nobody else remembers making.

In practice, there are three types of work, and every week is a rotation between them:

  • Unblocking. Helping others move. Reviewing PRs, answering questions, fixing the deploy that broke because someone used the wrong parameter. This is the work that makes you feel useful and slowly eats your entire calendar if you let it.
  • Building. Moving things yourself. Writing the actual code, the infrastructure, the tooling. This is the work that produces leverage, and it's the first thing to get squeezed out by meetings. Protect it aggressively.
  • Documenting. Making the system legible to everyone, not just you. Writing up decisions, updating dashboards, recording why things are the way they are. This is the work that makes you replaceable, which is the goal.

If you're spending all your time on the first type and none on the third, you're building a single point of failure. You're not a platform. You're a bottleneck with good intentions.

Conventions and shared code — you need both

The first instinct when you take on platform responsibility is to build shared things. A common library. A shared deployment pipeline. An internal framework that standardizes how services talk to each other. That instinct is correct — shared code is how you eliminate duplication, enforce correctness, and give every team a running start on new services.

But shared code without conventions is a trap. You end up with a library that works perfectly for the team that wrote it and confuses everyone else. And conventions without shared code are a suggestion that slowly drifts into six different interpretations across six teams.

The answer is both, working together:

  • Shared code for the mechanical things. Terraform modules, logging wrappers, CI/CD templates, HTTP client factories with retry and circuit-breaker baked in. Things where consistency isn't a preference — it's a correctness requirement. If every team writes their own retry logic, half of them will get the backoff formula wrong. Ship a library. Own it as a team.
  • Conventions for the structural things. How services are named. How configs are organized. What the logging fields mean. How migrations are sequenced. Where secrets live. These are the decisions that shared code can't enforce — they're about shape, not implementation. Write them down. Review for compliance. Update them when they stop fitting.
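The backoff point above is worth making concrete. Here is a minimal Python sketch of the kind of shared retry helper the first bullet describes — the names and defaults are illustrative, not from the essay:

```python
import random
import time

def with_retries(fn, max_attempts=5, base_delay=0.1, cap=5.0):
    """Call fn, retrying on exception with exponential backoff and full jitter.

    The delay before attempt n is uniform in [0, min(cap, base_delay * 2**n)]:
    the "full jitter" variant, which avoids synchronized retry storms.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(cap, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))
```

The jitter is the detail teams most often get wrong when they roll their own: without it, every client retries on the same schedule and the original thundering herd simply repeats.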

The key is that neither works alone:

  • A shared Terraform module library and a Terraform style guide. The module handles the boilerplate; the guide explains when to use it and how to extend it.
  • A shared logging wrapper and a logging convention. The wrapper implements the convention; the convention explains what the fields mean and why they matter.
  • A CI/CD template and a CI/CD standard. The template gives you a working pipeline in five minutes; the standard tells you what's required (security scans, type checks, test coverage thresholds) and what's optional.
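To make the wrapper-plus-convention pairing concrete: a hypothetical Python sketch of a logging helper that implements a written field convention. The field names here are invented for illustration; the point is that the code enforces what the document explains.

```python
import json
from datetime import datetime, timezone

# The convention (documented separately from this code): every log line is one
# JSON object with 'ts', 'level', 'service', 'env', and 'msg'; extra context
# goes under 'ctx' so ad-hoc field names never collide across teams.
def log_line(service, env, level, msg, **ctx):
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "service": service,
        "env": env,
        "msg": msg,
    }
    if ctx:
        record["ctx"] = ctx
    return json.dumps(record, sort_keys=True)
```

The wrapper makes the convention impossible to skip; the written standard tells a new engineer what `ctx` is for and why the top-level keys are fixed.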

When you're part of a team — even a small one — you can split this naturally. One person owns the library, another reviews for convention compliance, the whole team contributes examples and edge cases. The ownership is distributed. The standard is shared. Nobody is a bottleneck because the code and the conventions reinforce each other.

The failure mode to watch for: building shared code and assuming the convention is implicit. It isn't. If the only documentation for how the logging wrapper works is the source code, you don't have a convention — you have a dependency. Write the standard separately from the code. Make it readable by someone who's never seen the library. That's how teams scale without losing coherence.

The migration pattern that actually works

Big infrastructure projects — AWS migrations, framework upgrades, database moves — are where one-person platform teams either prove their value or drown. The key insight is that you should never try to do the whole thing yourself, and you should never try to parallelize it from day one.

The pattern that works, every time:

  1. Pick the smallest, least critical service. Migrate it completely. Learn everything that's going to go wrong on something where the blast radius is small.
  2. Extract the reusable patterns from that first migration. Terraform modules, runbooks, checklists, automation scripts. Now you have patterns, not guesses.
  3. Migrate the next service using those patterns. Faster this time. Fix the patterns where they don't fit. The patterns get better with each iteration.
  4. Hand the patterns to the teams. Let them migrate their own services using your templates. You review. They execute. They learn the new infrastructure by doing, not by reading your docs.
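Step 2's "extract the reusable patterns" can be as plain as a checklist runner that later migrations reuse. A hypothetical Python sketch — the step names and structure are invented for illustration:

```python
# Each step is a (name, check_fn) pair extracted from the first migration;
# later services run the same list and get a per-step pass/fail report,
# so nothing learned on service #1 is forgotten by service #4.
def run_checklist(service, steps):
    results = {}
    for name, check in steps:
        try:
            results[name] = bool(check(service))
        except Exception:
            results[name] = False  # a crashing check counts as a gap
    return results
```

Handing this to other teams is the whole trick: they run the checklist against their own service, and the report tells them exactly what's left.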

The first service might take three weeks. The last three might take one week each — done by other engineers following the templates you built. That's the leverage: you do the hard, ambiguous first migration yourself, extract the repeatable pattern, then multiply through the team.

The total wall-clock time is shorter than doing everything yourself, and at the end of it, multiple people understand the new infrastructure instead of just you. That's not a side effect. That's the whole point.

Build observability that explains itself

When you're the only person who deeply understands the system, you can't afford to also be the only person who can respond to alerts. That's a recipe for burnout, pager fatigue, and a team that never develops operational confidence.

The fix is building observability that is self-documenting. Every dashboard, every alert, every runbook should be usable by someone who has never seen it before. Concretely:

  • Every dashboard gets a description panel at the top explaining what it monitors and what "healthy" looks like. If someone opens the dashboard and doesn't know whether the numbers are good or bad, the dashboard is incomplete.
  • Every alert includes context in the message itself. Not "DynamoDB throttling detected" — but "DynamoDB throttling on [table] in [region]. Likely cause: [typical scenario]. Runbook: [link]. Last occurrence: [date]." The person who gets paged at 2am shouldn't need to call you to understand what they're looking at.
  • Saved queries, not tribal knowledge. Instead of "look at the logs," create named, saved CloudWatch Insights queries with descriptions. "Run this query, the answer is in the first row" is a better runbook than "grep for the error."
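The alert-context bullet can be enforced mechanically rather than by discipline. A hypothetical Python sketch of a formatter every alert passes through, so the required context can't be omitted:

```python
# Following the convention above: the page itself carries the table, region,
# likely cause, runbook link, and last occurrence, so the responder never
# needs to call the platform engineer just to understand the message.
def format_alert(signal, table, region, likely_cause, runbook_url, last_seen):
    return (
        f"{signal} on {table} in {region}. "
        f"Likely cause: {likely_cause}. "
        f"Runbook: {runbook_url}. "
        f"Last occurrence: {last_seen}."
    )
```

Because the arguments are required, an alert without a runbook link fails at definition time — in code review, not at 2am.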

The test: can any engineer on the team respond to any alert without calling you? If the answer is no, your observability isn't done. It's not about the tooling — Prometheus, Grafana, CloudWatch, whatever. It's about whether the system explains itself to people who didn't build it.

Mentoring without managing

If you're in this role, you probably aren't a manager. You don't do performance reviews. You don't control anyone's career trajectory. Your mentoring is purely opt-in, which means it only works if it's genuinely useful.

Three things that work better than scheduled 1:1s:

PR reviews as teaching moments. Not "this is wrong, fix it" — but "here's why I'd do this differently, and here's the context you might not have." The most valuable thing you can transfer in a PR review isn't a code correction. It's the historical context that explains why the system is the way it is. Engineers who understand the why make better decisions than engineers who only know the what.

Pairing on ambiguity. When someone is stuck on a genuinely hard problem — not a syntax error, but an architectural decision with no clear right answer — sit with them for an hour. Don't give them the answer. Model the process of working through ambiguity. How do you decompose a problem you don't understand? How do you decide between two approaches when you don't have enough information? That process is the skill, and it's invisible until someone watches you do it out loud.

Sharing failures openly. Talk about decisions you got wrong. Migrations that took longer than you estimated. Designs that didn't survive contact with production. The engineers who learn the most from you are the ones who see that experience doesn't mean you stop making mistakes — it means you recognize them faster, recover more gracefully, and document them so the next person doesn't repeat them.

Know when to grow

Being the platform-focused engineer is a phase, not a permanent identity. It works when the system is small enough for a few people to hold in their heads and the team is lean enough that conventions and shared code together cover most of the surface. It stops working when:

  • You're the bottleneck on more than two things simultaneously. If three teams are waiting on you for three different reasons in the same week, you've outgrown the model. The leverage of conventions isn't enough; you need more people or automated enforcement.
  • The conventions need policing, not just documentation. If teams are drifting from the standard and you're spending more time reviewing for compliance than building, you need linters, CI checks, policy-as-code — or a second person.
  • You can't draw the system from memory anymore. When the failure propagation graph has more nodes than you can hold in your head, the system has outgrown the one-person model. Not because you're not smart enough, but because the cognitive load is unsustainable.
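The "linters, CI checks, policy-as-code" option can start very small. A hypothetical sketch of a naming-convention check that runs in CI — the naming rule itself is invented for illustration:

```python
import re

# Hypothetical convention: service names are '<team>-<domain>-svc',
# lowercase with hyphens. Running this in CI catches drift mechanically,
# instead of one person catching it in review.
NAME_RE = re.compile(r"^[a-z]+(-[a-z0-9]+)+-svc$")

def check_service_names(names):
    """Return the names that violate the convention (empty list means pass)."""
    return [n for n in names if not NAME_RE.match(n)]
```

A check like this is the cheapest way to turn a documented convention into an enforced one without adding a second reviewer.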

The hardest part of this role is saying "we've outgrown me" before something breaks. It feels like admitting failure. It's actually the most responsible thing you can do. The goal was never to be indispensable. The goal was to build the foundations that let the team scale — and then to be honest about when the foundations need more people standing on them.

The short version

If you find yourself as the sole platform engineer in a growing organization:

  1. Protect your build time. It's the source of all leverage.
  2. Pair conventions with shared code. Neither scales alone; together they reinforce each other.
  3. Do the first hard thing yourself, extract the pattern, hand it to the team.
  4. Build observability that doesn't require you to explain it.
  5. Mentor by sharing process and context, not answers.
  6. Know when the model has outgrown you, and say so before it breaks.

The best outcome isn't "I'm the only one who understands the system." The best outcome is "anyone on the team can understand the system, because I made it legible." That's the job. Making the invisible visible, so the team can move without you in the room.



Marcin Urbanski

Engineering lead. 11+ years shipping distributed systems at scale.