The on-call rotation that was just me, and why I finally admitted that was not sustainable
For 14 months I was the only person who got paged at 3 AM. The real reason was not team size. It was that I did not trust anyone else to handle production incidents.
From April 2024 to June 2025, every production alert that fired at FinanceOps went to my phone. At 3 AM, at dinner, during my daughter’s school play. I was the entire on-call rotation. The team had grown to four engineers by early 2025, but the alert routing still pointed to me. When someone asked why we did not have a rotation, I said the team was too small. The real reason was that I did not trust anyone else to handle a production incident in our financial system without making things worse.
This is where the homelab stopped being a hobby and started acting like a leadership tool. It also builds on what I learned earlier in “Building Cloudflare Tunnel access to our homelab staging so the team stops saying ‘works on my machine.’” The infrastructure and ctrlpane work gave me a cheap place to pressure-test release habits, GitOps discipline, and failure modes before I asked the team to trust those defaults at work.
The Real Reason
The trust issue was not about the team’s competence. Raj, Deepa, and Marcus were all capable engineers. The issue was that our production environment had no runbooks, no documented incident response procedures, and no way for someone other than me to understand what a specific alert meant or what to do about it. The system was operable by exactly one person, and that person had built it.
This was entirely my fault. Every time I resolved a production incident, I fixed the immediate problem and moved on. I did not document what the alert meant, what I investigated, what I found, or what I did to fix it. The knowledge lived in my head. Making someone else on-call without that knowledge was not delegation. It was abandonment. I would have been setting them up to fail at 3 AM with no context and no guidance.
But acknowledging that the knowledge transfer had not happened did not excuse the delay. I had 14 months to write runbooks. I had 14 months to pair with someone during an incident. I had 14 months to build the documentation that would make a rotation possible. I chose to keep handling incidents myself because it was faster in the moment and because, if I am honest, being the only person who could save production made me feel indispensable.
The Wake-Up Call
The wake-up call was not burnout, although I was burned out. It was a conversation with our CEO in May 2025 where he asked what would happen if I were unavailable for a week. Not on vacation where I could check Slack. Genuinely unavailable. Hospitalized. Out of contact. What happens to production?
The honest answer was: nobody would know what to do. Alerts would fire. Engineers would look at dashboards they had never studied. They would read error messages they had never seen. They would make guesses about which service was failing and why. Some of those guesses would be wrong. In a financial system, wrong guesses during incident response can turn a degradation into a data integrity issue.
The CEO did not lecture me. He just said, “So you are a single point of failure for the entire company.” He was right. I had been the bottleneck on code review, architecture decisions, and deploys, and I had worked to remove each of those bottlenecks. But on-call was the one bottleneck I had actively maintained by not building the systems that would let someone else take it.
Building the Rotation
Building a real on-call rotation took six weeks of dedicated effort. The work was not technical. It was documentation and knowledge transfer.
- Documented every production alert: what it means, what to check first, what the likely causes are, and what the remediation steps are. 42 alerts, 42 runbook entries.
- Created an incident response playbook: how to acknowledge an alert, how to assess severity, when to escalate, how to communicate status updates, and how to write a post-incident review.
- Paired with each engineer during three real incidents: they drove, I guided. Each pairing built confidence and uncovered gaps in the runbooks.
- Set up a PagerDuty rotation: one week on, three weeks off. Each engineer takes one week per month. I take the same rotation as everyone else (the cadence is sketched just after this list).
- Ran three incident simulation drills: fake alerts during business hours where the on-call engineer had to follow the playbook while the team observed and provided feedback.
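For anyone who wants to see the cadence concretely, here is a minimal sketch of how one week on, three weeks off lays out across four people. The names and start date are placeholders for illustration, not our actual PagerDuty configuration:

```python
from datetime import date, timedelta

# Placeholder names and start date; the real schedule lives in PagerDuty.
engineers = ["Deepa", "Marcus", "Raj", "me"]
rotation_start = date(2025, 6, 16)  # hypothetical first handoff (a Monday)

def on_call_for(week_index: int) -> str:
    """One week on, three weeks off: each engineer covers every fourth week."""
    return engineers[week_index % len(engineers)]

# Print the first eight handoffs so the cadence is visible.
for week in range(8):
    start = rotation_start + timedelta(weeks=week)
    end = start + timedelta(days=6)
    print(f"{start} to {end}: {on_call_for(week)}")
```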
The runbook documentation was the most time-consuming part. Each alert entry needed enough context for someone seeing the alert for the first time to understand what was happening and what to do without calling me. I wrote 42 runbook entries in three weeks. Several of them revealed alerts that were misconfigured or no longer relevant, which we cleaned up as part of the process.
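To give a sense of the shape of an entry, here is a minimal sketch of the fields each one carried, matching the structure in the list above (what it means, what to check first, likely causes, remediation), plus the kind of completeness check that keeps entries honest. The alert name and details are invented for illustration, not copied from our runbooks:

```python
# A hypothetical runbook entry; the field names mirror the structure described
# above, but the alert and its details are invented for illustration.
runbook_entry = {
    "alert": "payments_api_p99_latency_high",
    "meaning": "p99 latency on the payments API has exceeded its SLO for 10 minutes.",
    "first_checks": [
        "Grafana dashboard: payments API latency broken down by endpoint.",
        "Recent deploys to the payments service in the last hour.",
    ],
    "likely_causes": [
        "Slow downstream dependency (database or third-party gateway).",
        "Connection pool exhaustion under a traffic spike.",
    ],
    "remediation": [
        "Roll back the most recent deploy if the timings line up.",
        "Scale the service and file a ticket for a daytime investigation.",
    ],
    "escalate_to": "secondary on-call, then engineering lead",
}

REQUIRED_FIELDS = {
    "alert", "meaning", "first_checks", "likely_causes", "remediation", "escalate_to",
}

def is_complete(entry: dict) -> bool:
    """An entry is usable at 3 AM only if every field is present and non-empty."""
    return all(entry.get(field) for field in REQUIRED_FIELDS)

assert is_complete(runbook_entry)
```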
The First Rotation
Deepa took the first on-call rotation in mid-June. During her week, two alerts fired. The first was a Kafka consumer lag warning that she diagnosed and resolved in 20 minutes using the runbook. The second was a PostgreSQL connection pool saturation alert at 11 PM that she handled by scaling up PgBouncer and filing a ticket for investigation the next morning. Both resolutions were correct. Neither required my involvement.
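The runbook step behind that first alert boils down to: confirm the lag, identify the consumer group, and decide whether it is draining on its own. A minimal sketch of that check, assuming the kafka-python client; the broker address and group ID are placeholders, and our actual tooling differs:

```python
from kafka import KafkaConsumer
from kafka.admin import KafkaAdminClient

# Placeholders; the real broker and consumer group come from the runbook entry.
BOOTSTRAP = "kafka.internal:9092"
GROUP_ID = "payments-consumer"

admin = KafkaAdminClient(bootstrap_servers=BOOTSTRAP)
consumer = KafkaConsumer(bootstrap_servers=BOOTSTRAP)

# Committed offsets for the group, keyed by TopicPartition.
committed = admin.list_consumer_group_offsets(GROUP_ID)
# Latest offsets on the broker for the same partitions.
end_offsets = consumer.end_offsets(list(committed.keys()))

# Lag per partition: how far behind the group is from the head of the log.
for tp, meta in sorted(committed.items()):
    lag = end_offsets[tp] - meta.offset
    print(f"{tp.topic}[{tp.partition}]: lag={lag}")
```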
- Total incidents during Deepa’s first rotation: 2
- Incidents resolved without escalation: 2
- Runbook entries that needed updating based on her experience: 3
- My involvement: reading the post-incident summaries the next morning
The runbook updates she suggested after her rotation were valuable because they reflected a fresh perspective. Steps that I considered obvious were ambiguous to someone encountering the alert for the first time. She added clarity to three entries that made them more useful for Marcus and Raj when their rotations came.
What I Learned
By this stage the job had changed. I was no longer just picking a tool or fixing a bug. I was responsible for a blast radius that reached product, compliance, sales, and hiring. That is exactly why I kept pressure-testing the same lesson inside lifeos and bisen-apps.
If your on-call rotation is one person, you do not have a rotation. You have a single point of failure that will eventually fail. The barrier to sharing on-call is never team size. It is the knowledge transfer work you have been avoiding because handling incidents yourself is easier in the short term.
Three months into the rotation, during my off weeks I sleep through the night for the first time in over a year. The team handles incidents competently. The runbooks improve with every rotation as each engineer adds their perspective. The system is more resilient with four people who can operate it than it ever was with one person who built it. I should have done this 12 months earlier. The only thing that stopped me was me.