The day our monolith's database hit 80% CPU and nobody noticed until sales called
A slow query compounded by a missing index took our PostgreSQL instance to the brink. Sales noticed before engineering did because client reports were timing out.
The first sign something was wrong came from our head of sales, not from any of our monitoring dashboards. She pinged me on Slack at 2:47 PM on a Tuesday asking why three different clients were reporting that their reconciliation reports were hanging. I pulled up Grafana and saw our primary PostgreSQL instance sitting at 82 percent CPU utilization. It had been climbing for over an hour. Our alerting threshold was set at 90 percent, which meant we were ten minutes from a full-blown outage and nobody on the engineering team had any idea.
This is where the homelab stopped being a hobby and started acting like a leadership tool. It also builds on what I learned earlier in “The night our database ran out of connections and what I learned about pooling.” The infrastructure and ctrlpane work gave me a cheap place to pressure-test release habits, GitOps discipline, and failure modes before I asked the team to trust those defaults at work.
How We Got There
The root cause was embarrassingly simple. A product manager had requested a new dashboard widget showing transaction volume trends over the past 90 days. The engineer who built it wrote a query that scanned the entire transactions table with a date range filter but no index on the created_at column for the specific partition in question. In development with 10,000 rows it ran in 40 milliseconds. In production with 14 million rows it took 47 seconds per execution. The widget refreshed every 60 seconds. Multiple clients viewing dashboards simultaneously meant we had a dozen copies of this query competing for CPU.
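The query looked roughly like the sketch below. The transactions table, the created_at filter, and the 90-day window are the real details from the incident; the grouping and the amount column are illustrative stand-ins, since I am not reproducing the exact production SQL here.

```sql
-- Illustrative shape of the widget query, not the exact production statement.
-- The table, created_at filter, and 90-day window are real; amount and the
-- daily grouping are stand-ins.
SELECT date_trunc('day', created_at) AS day,
       count(*)                      AS txn_count,
       sum(amount)                   AS volume
FROM transactions
WHERE created_at >= now() - interval '90 days'
GROUP BY 1
ORDER BY 1;
```

Without an index covering created_at on the affected partition, PostgreSQL has to fall back to scanning the table, which is why the runtime tracked the row count so closely between development and production.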
The fix was a single CREATE INDEX statement that took the query from 47 seconds to 3 milliseconds. The entire incident, from first client complaint to resolution, was 38 minutes. But the damage was already done. Three clients had escalated to their account managers. Our sales team had to make apologetic phone calls. And engineering looked like we were asleep at the wheel, which, in a sense, we were.
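The general shape of that fix is nothing more exotic than a b-tree index on the timestamp column. The name and exact definition below are illustrative rather than the literal statement we ran; on a busy table you also want CONCURRENTLY so the build does not block writes.

```sql
-- Illustrative index definition; the name is made up, and the real statement
-- may have targeted a specific partition.
-- CONCURRENTLY avoids a write-blocking lock but cannot run inside a
-- transaction block.
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_transactions_created_at
    ON transactions (created_at);
```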
Why Our Alerting Failed
We had monitoring. We had dashboards. We had alerting rules configured in Grafana. But our thresholds had been copied from a blog post during the initial setup six months earlier and never revisited. The CPU alert was set at 90 percent, which sounds reasonable until you realize that PostgreSQL performance degrades long before you hit that number. Query latency starts climbing noticeably around 60 to 70 percent CPU on our instance size. By 80 percent, queries that normally take 5 milliseconds were taking 200.
- CPU alert threshold was set at 90 percent when degradation starts at 60 percent
- No alert on query latency percentiles, only on raw resource utilization
- No alert on the number of active connections or queries running longer than 10 seconds
- Dashboard panels existed but nobody had them open during working hours
- No Slack integration for warning-level alerts, only critical ones
The fundamental problem was that we had built observability infrastructure without building observability culture. The dashboards existed as artifacts of an initial setup sprint. Nobody looked at them daily. Nobody had iterated on the thresholds after we learned how our production workload actually behaved. We treated monitoring as a one-time setup task instead of an ongoing practice.
What We Changed
After the incident we spent a full day overhauling our alerting configuration. The changes were not technically complex but they required actually thinking about what mattered versus what was easy to measure.
- Dropped CPU warning threshold to 60 percent, critical to 75 percent
- Added P95 and P99 query latency alerts with a 500 millisecond warning threshold
- Added an alert for any query running longer than 10 seconds (see the pg_stat_activity sketch after this list)
- Added connection pool utilization alerts at 70 percent capacity
- Set up a dedicated Slack channel for warning-level alerts that the on-call engineer monitors
- Instituted a weekly review of the top 10 slowest queries from pg_stat_statements
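Most of the new checks reduce to simple queries against system views. The sketch below is illustrative rather than our literal alert definitions, but the thresholds match the list above: one check for statements running longer than 10 seconds, one for connection utilization against max_connections.

```sql
-- Illustrative checks behind the new alerts, not our exact alert rules.

-- Queries that have been running for more than 10 seconds:
SELECT pid,
       now() - query_start AS runtime,
       state,
       left(query, 80)     AS query_preview
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '10 seconds'
ORDER BY runtime DESC;

-- Connection utilization as a fraction of max_connections (warn at 0.70).
-- Counts every backend in pg_stat_activity, so it is a slight overestimate.
SELECT count(*)::float
       / current_setting('max_connections')::int AS connection_utilization
FROM pg_stat_activity;
```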
The weekly query review turned out to be the highest-value change. We started catching performance regressions before they became incidents. A new query that suddenly appears in the top 10 slowest list is a leading indicator that something changed in a recent deployment. We catch it during the review instead of waiting for a client to call.
```sql
SELECT query,
       calls,
       mean_exec_time::numeric(10,2)  AS avg_ms,
       total_exec_time::numeric(10,2) AS total_ms,
       rows
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
```

The Real Lesson
This incident was not really about a missing index. Missing indexes happen. The real failure was that our observability setup gave us a false sense of security. We had the tools but not the practices. We had the dashboards but not the habits. Observability without proper thresholds is just expensive logging. You are collecting data nobody acts on until it is too late.
By this stage the job had changed. I was no longer just picking a tool or fixing a bug. I was carrying the blast radius of engineering decisions across product, compliance, sales, and hiring. That is exactly why I kept pressure-testing the same lesson inside ftryos and pipeline-sdk.
The gap between having monitoring and being observable is the same gap between owning a fire extinguisher and knowing where it is when the kitchen is on fire.
I also learned that your alerting thresholds need to be calibrated to your specific workload, not copied from a generic guide. PostgreSQL on an 8-core instance with mixed OLTP and reporting queries behaves differently than PostgreSQL on a 32-core instance running pure OLTP. The numbers that matter are the ones you discover by watching your own system under real load over weeks and months.
Sales calling before engineering notices is the kind of organizational failure that erodes trust between teams. It took weeks to rebuild confidence with our account managers that we actually had production under control. The technical fix took 38 minutes. Rebuilding trust took the rest of the quarter.