The day our monolith's database hit 80% CPU and nobody noticed until sales called
A slow query compounded by a missing index took our PostgreSQL instance to the brink. Sales noticed before engineering did because client reports were timing out.
The first sign something was wrong came from our head of sales, not from any of our monitoring dashboards. She pinged me on Slack at 2:47 PM on a Tuesday asking why three different clients were reporting that their reconciliation reports were hanging. I pulled up Grafana and saw our primary PostgreSQL instance sitting at 82 percent CPU utilization. It had been climbing for over an hour. Our alerting threshold was set at 90 percent, which meant we were ten minutes from a full-blown outage and nobody on the engineering team had any idea.
This is where the homelab stopped being a hobby and started acting like a leadership tool. It also builds on what I learned earlier in “The night our database ran out of connections and what I learned about pooling.” The infrastructure and ctrlpane work gave me a cheap place to pressure-test release habits, GitOps discipline, and failure modes before I asked the team to trust those defaults at work.
How We Got There
The root cause was embarrassingly simple. A product manager had requested a new dashboard widget showing transaction volume trends over the past 90 days. The engineer who built it wrote a query that scanned the entire transactions table with a date range filter but no index on the created_at column for the specific partition in question. In development with 10,000 rows it ran in 40 milliseconds. In production with 14 million rows it took 47 seconds per execution. The widget refreshed every 60 seconds. Multiple clients viewing dashboards simultaneously meant we had a dozen copies of this query competing for CPU.
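The query looked roughly like the sketch below. The transactions table, the created_at filter, and the 90-day window are the real details from the incident; the grouping and the amount column are illustrative stand-ins, since I am not reproducing the exact production SQL here.

```sql
-- Illustrative shape of the widget query, not the exact production statement.
-- The table, created_at filter, and 90-day window are real; amount and the
-- daily grouping are stand-ins.
SELECT date_trunc('day', created_at) AS day,
       count(*)                      AS txn_count,
       sum(amount)                   AS volume
FROM transactions
WHERE created_at >= now() - interval '90 days'
GROUP BY 1
ORDER BY 1;
```

Without an index covering created_at on the affected partition, PostgreSQL has to fall back to scanning the table, which is why the runtime tracked the row count so closely between development and production.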
The fix was a single CREATE INDEX statement that took the query from 47 seconds to 3 milliseconds. The entire incident, from first client complaint to resolution, was 38 minutes. But the damage was already done. Three clients had escalated to their account managers. Our sales team had to make apologetic phone calls. And engineering looked like we were asleep at the wheel, which, in a sense, we were.
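The general shape of that fix is nothing more exotic than a b-tree index on the timestamp column. The name and exact definition below are illustrative rather than the literal statement we ran; on a busy table you also want CONCURRENTLY so the build does not block writes.

```sql
-- Illustrative index definition; the name is made up, and the real statement
-- may have targeted a specific partition.
-- CONCURRENTLY avoids a write-blocking lock but cannot run inside a
-- transaction block.
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_transactions_created_at
    ON transactions (created_at);
```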
Why Our Alerting Failed
We had monitoring. We had dashboards. We had alerting rules configured in Grafana. But our thresholds had been copied from a blog post during the initial setup six months earlier and never revisited. The CPU alert was set at 90 percent, which sounds reasonable until you realize that PostgreSQL performance degrades long before you hit that number. Query latency starts climbing noticeably around 60 to 70 percent CPU on our instance size. By 80 percent, queries that normally take 5 milliseconds were taking 200.
- CPU alert threshold was set at 90 percent when degradation starts at 60 percent
- No alert on query latency percentiles, only on raw resource utilization
- No alert on the number of active connections or queries running longer than 10 seconds
- Dashboard panels existed but nobody had them open during working hours
- No Slack integration for warning-level alerts, only critical ones
The fundamental problem was that we had built observability infrastructure without building observability culture. The dashboards existed as artifacts of an initial setup sprint. Nobody looked at them daily. Nobody had iterated on the thresholds after we learned how our production workload actually behaved. We treated monitoring as a one-time setup task instead of an ongoing practice.
What We Changed
After the incident we spent a full day overhauling our alerting configuration. The changes were not technically complex but they required actually thinking about what mattered versus what was easy to measure.
- Dropped CPU warning threshold to 60 percent, critical to 75 percent
- Added P95 and P99 query latency alerts with a 500 millisecond warning threshold
- Added an alert for any query running longer than 10 seconds (see the pg_stat_activity sketch after this list)
- Added connection pool utilization alerts at 70 percent capacity
- Set up a dedicated Slack channel for warning-level alerts that the on-call engineer monitors
- Instituted a weekly review of the top 10 slowest queries from pg_stat_statements
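Most of the new checks reduce to simple queries against system views. The sketch below is illustrative rather than our literal alert definitions, but the thresholds match the list above: one check for statements running longer than 10 seconds, one for connection utilization against max_connections.

```sql
-- Illustrative checks behind the new alerts, not our exact alert rules.

-- Queries that have been running for more than 10 seconds:
SELECT pid,
       now() - query_start AS runtime,
       state,
       left(query, 80)     AS query_preview
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '10 seconds'
ORDER BY runtime DESC;

-- Connection utilization as a fraction of max_connections (warn at 0.70).
-- Counts every backend in pg_stat_activity, so it is a slight overestimate.
SELECT count(*)::float
       / current_setting('max_connections')::int AS connection_utilization
FROM pg_stat_activity;
```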
The weekly query review turned out to be the highest-value change. We started catching performance regressions before they became incidents. A new query that suddenly appears in the top 10 slowest list is a leading indicator that something changed in a recent deployment. We catch it during the review instead of waiting for a client to call.
```sql
SELECT query,
       calls,
       mean_exec_time::numeric(10,2)  AS avg_ms,
       total_exec_time::numeric(10,2) AS total_ms,
       rows
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
```

The Real Lesson
This incident was not really about a missing index. Missing indexes happen. The real failure was that our observability setup gave us a false sense of security. We had the tools but not the practices. We had the dashboards but not the habits. Observability without proper thresholds is just expensive logging. You are collecting data nobody acts on until it is too late.
By this stage the job had changed. I was no longer just picking a tool or fixing a bug. I was carrying the blast radius of engineering decisions across product, compliance, sales, and hiring. That is exactly why I kept pressure-testing the same lesson inside ftryos and pipeline-sdk.
The gap between having monitoring and being observable is the same gap between owning a fire extinguisher and knowing where it is when the kitchen is on fire.
I also learned that your alerting thresholds need to be calibrated to your specific workload, not copied from a generic guide. PostgreSQL on an 8-core instance with mixed OLTP and reporting queries behaves differently than PostgreSQL on a 32-core instance running pure OLTP. The numbers that matter are the ones you discover by watching your own system under real load over weeks and months.
Sales calling before engineering notices is the kind of organizational failure that erodes trust between teams. It took weeks to rebuild confidence with our account managers that we actually had production under control. The technical fix took 38 minutes. Rebuilding trust took the rest of the quarter.