The Kubernetes upgrade that taught me release engineering is a leadership problem

The Kubernetes 1.34 upgrade should have taken two days. It took six weeks. Not because the technical work was hard. The API deprecations were well-documented. The migration path was clear. The actual kubectl and manifest changes were straightforward.

It took six weeks because three teams each assumed someone else owned the migration plan.

Leadership got more concrete for me once I realized that release engineering and infrastructure are really trust systems. This post builds on what I learned earlier in “Kubernetes 1.33 and the features that finally made me stop questioning container orchestration for small teams.” The infrastructure stack, ctrlpane, and even my dotfiles all orbit the same idea now: the best teams move fast because the defaults are stable, not because the heroics are impressive.

How a Two-Day Task Becomes Six Weeks

The platform team assumed application teams would update their manifests first. Application teams assumed platform would handle the cluster upgrade and their manifests would just work. The SRE function assumed platform would coordinate the sequence. Nobody scheduled the work. Nobody wrote the runbook. Nobody said the words “I own this.”

For four weeks, the upgrade existed as a Jira ticket that moved between boards. Each team triaged it as not their top priority because they genuinely believed the prerequisite work belonged to someone else. By the time I noticed the stall, we had burned a month of calendar time on a task that required roughly sixteen hours of engineering effort.

Release Engineering Is Coordination, Not Code

I pulled three engineers into a room. One from platform, one from the payments team, one from the API team. I asked a single question: who is going to write the step-by-step sequence for this upgrade, including who does what and in what order?

Silence. Then Marcus from platform said he could write it. Two days later he had a runbook. Two days after that, the upgrade was done. The technical work was exactly as easy as we all knew it was. The missing piece was never technical.

Day 1: Platform upgrades the control plane on staging, runs the conformance suite
Day 1: Application teams review deprecated API warnings from staging cluster logs
Day 2: Application teams submit manifest patches, platform reviews and merges
Day 2: Platform upgrades production control plane, application teams verify their services
Day 2: SRE monitors error rates for 4 hours, platform declares the upgrade complete

Five steps. Two days. That is what the work actually was. Everything before that was coordination theater.
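For illustration, the five steps above can be written down as data, so the sequence and the handoffs are explicit rather than implied. This is a minimal sketch, not the runbook Marcus wrote: the step keys, the `blocked_by` edges, and the helper names are assumptions added here, since the original only lists the steps by day.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class Step:
    key: str                          # hypothetical short identifier
    day: int
    owner: str                        # who does the work in this step
    action: str
    blocked_by: tuple[str, ...] = ()  # assumed dependencies, not from the original


RUNBOOK = [
    Step("staging-upgrade", 1, "platform",
         "Upgrade the control plane on staging, run the conformance suite"),
    Step("review-warnings", 1, "application teams",
         "Review deprecated API warnings from staging cluster logs",
         ("staging-upgrade",)),
    Step("manifest-patches", 2, "application teams",
         "Submit manifest patches; platform reviews and merges",
         ("review-warnings",)),
    Step("prod-upgrade", 2, "platform",
         "Upgrade the production control plane; application teams verify services",
         ("manifest-patches",)),
    Step("monitor", 2, "sre",
         "Monitor error rates for 4 hours; platform declares the upgrade complete",
         ("prod-upgrade",)),
]


def ordered_steps(steps):
    """Return steps in an order that respects every blocked_by edge."""
    by_key = {s.key: s for s in steps}
    unknown = {dep for s in steps for dep in s.blocked_by} - by_key.keys()
    if unknown:
        raise ValueError(f"blocked_by refers to unknown steps: {sorted(unknown)}")
    graph = {s.key: set(s.blocked_by) for s in steps}
    # static_order() yields prerequisites first and raises CycleError on a cycle.
    return [by_key[k] for k in TopologicalSorter(graph).static_order()]


def handoffs(steps):
    """List the points where consecutive steps pass from one owner to another."""
    ordered = ordered_steps(steps)
    return [(a.owner, b.owner, b.key)
            for a, b in zip(ordered, ordered[1:]) if a.owner != b.owner]


if __name__ == "__main__":
    for step in ordered_steps(RUNBOOK):
        print(f"Day {step.day} [{step.owner}] {step.action}")
    for giver, receiver, key in handoffs(RUNBOOK):
        print(f"handoff: {giver} -> {receiver} at {key}")
```

The handoff list is the useful part. Each entry marks a place where one team waits on another, which is exactly where our upgrade sat for four weeks.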

What I Changed Permanently

After the k8s incident, I made release engineering an explicit leadership function. Every cross-cutting technical migration now has three things before any work begins:

A named owner. Not a team. A human. Their job is the sequence, not the code.
A written runbook with explicit handoff points. Who does what, in what order, and what blocks what.
A calendar commitment. The work goes on the sprint. Not the backlog. The sprint.

The pattern I missed was simple: cross-team technical work does not self-organize. Individual team work self-organizes beautifully because ownership is clear. The moment work crosses a team boundary, it needs a coordinator or it stalls. That coordinator is a leadership function, not a technical one.
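The three prerequisites above are simple enough to check mechanically before a migration is allowed to start. The sketch below is one hypothetical way to do that. The `Migration` fields, the `TEAM_NAMES` set, and the example values are illustrative assumptions, not a description of any tooling we actually run.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical team names, used only to catch "owner = a team" mistakes.
TEAM_NAMES = {"platform", "sre", "payments", "api"}


@dataclass
class Migration:
    name: str
    owner: Optional[str] = None    # a named human, not a team
    runbook: Optional[str] = None  # path or link to the written runbook
    sprint: Optional[str] = None   # the sprint the work is committed to


def missing_prerequisites(m: Migration) -> list[str]:
    """Return the reasons this migration is not ready to start (empty if ready)."""
    problems = []
    if not m.owner:
        problems.append("no named owner")
    elif m.owner.strip().lower() in TEAM_NAMES:
        problems.append(f"owner '{m.owner}' is a team, not a person")
    if not m.runbook:
        problems.append("no written runbook")
    if not m.sprint:
        problems.append("not committed to a sprint")
    return problems


if __name__ == "__main__":
    stalled = Migration("Kubernetes 1.34 upgrade", owner="platform")
    print(missing_prerequisites(stalled))

    # Placeholder values; the runbook path and sprint name are made up.
    ready = Migration("Kubernetes 1.34 upgrade", owner="Marcus",
                      runbook="runbooks/k8s-1.34.md", sprint="current")
    print(missing_prerequisites(ready))
```

A check like this only catches a missing owner. It cannot make someone say “I own this,” so it supplements the conversation rather than replacing it.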

The Broader Lesson

Every organization has work that falls between team boundaries. Database migrations that affect three services. Security patches that touch every deployment pipeline. Dependency upgrades that require synchronized releases. This work is rarely hard. It is almost always under-coordinated.

By the time I wrote this, the lesson was bigger than the tool or incident. The job had become setting defaults a team could trust, then proving those defaults in systems like infrastructure and ctrlpane. That is leadership work, not just technical taste.

Release engineering fails when nobody owns the sequence. The technical work is usually the easy part.

I now budget roughly 10% of my own time for identifying and assigning cross-cutting work before it stalls. It is the highest-leverage leadership activity I have found. Not because the work itself is important, though it usually is, but because stalled cross-cutting work erodes team trust faster than almost anything else. When three teams each believe the others are blocking them, resentment builds even though nobody is actually at fault.

The Kubernetes upgrade cost us six weeks of calendar time and about two days of engineering time. The gap between those two numbers is entirely a leadership failure. Mine.

Release engineering is invisible when it works and catastrophic when it fails. The Kubernetes upgrade stalled because we treated it as a technical task when it was actually a coordination problem. The migration sequence required input from three teams, a rollback plan, and a communication timeline. None of that was technical work. All of it was leadership work.

After the upgrade finally shipped, we documented the release engineering process as a first-class artifact, not a footnote in the runbook. Every major infrastructure change now starts with a migration owner, a timeline, and a stakeholder communication plan. The technical steps are the easy part. The hard part is making sure everyone knows what is happening, when it is happening, and what to do if it goes wrong. That is the leadership lesson Kubernetes taught me.

The upgrade itself took two days of focused engineering work once the coordination was in place. Two days of work, six weeks of calendar time. The ratio tells you everything about where the real bottleneck was. Infrastructure upgrades at any scale are leadership problems first and technical problems second. The teams that internalize this ship upgrades on schedule. The teams that treat upgrades as purely technical work accumulate drift until the next upgrade becomes even harder.