Anshul Bisen · Portfolio

The night our database ran out of connections and what I learned about pooling

At 2am our payment processing froze. Every database connection was consumed by long-running report queries sharing a pool with transactions.

The PagerDuty alert fired at 2:17am on a Thursday. Payment processing is down. I rolled out of bed, opened my laptop, and checked the application dashboard. Every API endpoint was returning 500 errors. The error logs showed a single message repeating hundreds of times: “Error: timeout exceeded when trying to connect.” Our application could not get a database connection because every connection in the pool was occupied and not releasing.

The incident lasted 23 minutes. Here is what happened, why it happened, and the connection pooling architecture I rebuilt the next day to make sure it never happens again.

Where the data model or query started fighting back.

In that first stretch at FinanceOps, I was still learning how to wear the Head of Engineering title without hiding behind it. It also builds on what I learned earlier in “PostgreSQL 17 just dropped and COPY performance alone makes the upgrade worth it.” The only credibility that mattered was whether the decision survived contact with real money, ugly edge cases, and the next person I would eventually hire. That same bias toward strict boundaries later shaped how I approached ftryos and pipeline-sdk: make correctness boring before you make the API clever.

The operational artifact behind the argument.

The Timeline

At 2:00am, our nightly financial report generator kicked off. It runs a series of analytical queries that aggregate payment data across all tenants for the previous day. These queries are intentionally complex: they join five tables, aggregate millions of rows, and produce summary reports that get emailed to stakeholders at 6am.

  • 2:00am — Report generator starts. Opens 8 database connections for parallel query execution.
  • 2:05am — The report queries are all running. Each one takes 5-10 minutes because they scan large date ranges.
  • 2:12am — Payment processor webhook arrives for a real-time transaction. The webhook handler requests a database connection.
  • 2:12am — Pool is at capacity. All 20 connections are occupied: 8 by reports, 12 by normal API traffic that started piling up.
  • 2:13am — Connection requests start timing out. API endpoints return 500 errors.
  • 2:17am — PagerDuty fires. Payment processing is fully stalled.
  • 2:25am — I manually kill the report queries. Connections release. Payment processing resumes.
  • 2:40am — All queued webhooks processed. No data lost, but 23 minutes of degraded service.
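The 2:25am step, manually killing the report queries, came down to a single statement against pg_stat_activity. The sketch below only builds that SQL as a string so it stays self-contained; pg_terminate_backend and pg_stat_activity are real PostgreSQL features, but the helper name and the cutoff parameter are my own.

```javascript
// Build the emergency "kill long-running queries" statement.
// pg_stat_activity and pg_terminate_backend exist in PostgreSQL;
// this wrapper is an illustrative sketch, not production tooling.
function buildKillLongQueriesSQL(minMinutes) {
  return [
    'SELECT pg_terminate_backend(pid)',
    'FROM pg_stat_activity',
    "WHERE state = 'active'",
    `  AND now() - query_start > interval '${minMinutes} minutes'`,
    '  AND pid <> pg_backend_pid();', // never kill our own session
  ].join('\n');
}
```

At 2:25am I ran the equivalent of buildKillLongQueriesSQL(5) by hand; the moment those backends died, the pool drained and payments resumed.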

Why the Default Pool Configuration Is Dangerous

Our Node.js application used the default pg pool with 20 max connections. Every database operation, from a fast single-row lookup to a slow multi-million-row aggregation, drew from the same pool. This is the default configuration in every Node.js PostgreSQL tutorial and it is dangerously naive for mixed workloads.

// BEFORE: Single shared pool (dangerous)
const { Pool } = require('pg');

const pool = new Pool({
  connectionString: DATABASE_URL,
  max: 20, // Every query type shares these 20 connections
});

The problem is workload interference. Fast transactional queries (payment processing, API responses) need connections for milliseconds. Slow analytical queries (reports, reconciliation) hold connections for minutes. When slow queries occupy the pool, fast queries cannot get connections even though the database server has plenty of capacity.
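The interference is easy to reproduce without a database. The sketch below is a toy in-memory pool, my own model rather than the real pg Pool API: two slow "report" tasks hold every connection, and a fast "payment" acquire times out exactly the way our webhook handler did.

```javascript
// Toy pool model (illustrative only, not the pg Pool API).
class TinyPool {
  constructor(max) {
    this.max = max;      // hard cap on concurrent connections
    this.inUse = 0;
    this.waiters = [];   // callers queued for a free connection
  }

  acquire(timeoutMs) {
    if (this.inUse < this.max) {
      this.inUse++;
      return Promise.resolve();
    }
    // Pool exhausted: wait for a release, or time out.
    return new Promise((resolve, reject) => {
      const waiter = () => { clearTimeout(timer); this.inUse++; resolve(); };
      const timer = setTimeout(() => {
        this.waiters = this.waiters.filter((w) => w !== waiter);
        reject(new Error('timeout exceeded when trying to connect'));
      }, timeoutMs);
      this.waiters.push(waiter);
    });
  }

  release() {
    this.inUse--;
    const next = this.waiters.shift();
    if (next) next();
  }
}

async function demo() {
  const pool = new TinyPool(2);
  // Two slow analytical queries grab the whole pool for 200 ms.
  for (let i = 0; i < 2; i++) {
    pool.acquire(1000).then(() => setTimeout(() => pool.release(), 200));
  }
  // A fast transactional query can only wait 50 ms, so it starves.
  try {
    await pool.acquire(50);
    return 'connected';
  } catch (err) {
    return err.message;
  }
}
```

Running demo() yields the same error our logs showed hundreds of times, even though the "database" behind this toy pool is completely idle.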

The Fix: Separate Pools With PgBouncer

I rebuilt the connection architecture with PgBouncer sitting between the application and PostgreSQL, managing two separate pools with different configurations.

PgBouncer Configuration:
[databases]
transactional = host=localhost dbname=financeops pool_mode=transaction pool_size=30
analytical = host=localhost dbname=financeops pool_mode=session pool_size=5
[pgbouncer]
listen_port = 6432
max_client_conn = 200
default_pool_size = 30
reserve_pool_size = 5
reserve_pool_timeout = 3

  • Transactional pool (30 connections, transaction mode): Serves API endpoints, webhook handlers, and any query that completes in under 1 second. Transaction mode releases the connection back to the pool at the end of each transaction, not at the end of the client session.
  • Analytical pool (5 connections, session mode): Serves report generation, reconciliation, and any long-running query. Session mode is required because analytical queries use cursors and prepared statements that are connection-scoped.
  • Reserve pool (5 connections): Emergency buffer that activates when the transactional pool is exhausted. Provides a 3-second grace period before rejecting connections.

// AFTER: Separate pools via PgBouncer
const { Pool } = require('pg');

const txPool = new Pool({
  connectionString: PGBOUNCER_TRANSACTIONAL_URL,
  max: 50, // Client-side limit; PgBouncer manages the server side
});

const analyticsPool = new Pool({
  connectionString: PGBOUNCER_ANALYTICAL_URL,
  max: 10,
});
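To keep call sites honest, I also wanted query routing to be explicit rather than a matter of grabbing whichever pool was in scope. A minimal sketch, assuming the two Pool instances above; makeRouter and the workload labels are my own names, not a pg API.

```javascript
// Hypothetical routing helper: every query declares its workload class,
// so a report can never silently borrow a transactional connection.
function makeRouter(txPool, analyticsPool) {
  return function poolFor(workload) {
    if (workload === 'transactional') return txPool;      // sub-second queries
    if (workload === 'analytical') return analyticsPool;  // minutes-long scans
    throw new Error('unknown workload: ' + workload);     // fail loudly
  };
}

// Usage (assuming txPool and analyticsPool from above):
// const poolFor = makeRouter(txPool, analyticsPool);
// const result = await poolFor('transactional').query('SELECT ...');
```

Making the workload class a required argument means a new endpoint cannot compile its way into the wrong pool by accident.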

The Result

After deploying the new architecture, I re-ran the exact scenario that caused the incident. The report generator opened connections from the analytical pool while payment webhooks used the transactional pool. Zero interference. Payment processing continued without any latency increase while reports ran in parallel on their dedicated connections.

The reserve pool has activated twice in the three weeks since deployment, both during traffic spikes where the transactional pool briefly hit capacity. Both times, the reserve connections served requests for about 500 milliseconds before the spike subsided. No timeouts, no dropped requests.

  • Mean API latency during report generation: unchanged from baseline (previously doubled).
  • Connection timeout errors: zero since deployment (previously 2-3 per week during report windows).
  • PgBouncer overhead: 12 MB of RAM. Trivial.
  • Total database connections used: 35 (30 transactional + 5 analytical) versus 20 before. More connections total but better isolated.

The system after the boring-but-correct fix.

Looking back, this is one of those builder-phase decisions that bought me leadership credibility before I had any leadership title equity. I was still proving I could be trusted with the boring, consequential calls. That instinct carried straight into ftryos and pipeline-sdk.

If your application mixes fast transactional queries with slow analytical queries and they share a connection pool, you have an incident waiting to happen. Separate your pools. Use PgBouncer. It takes an afternoon to set up and prevents the 2am wake-up call.

Connection pool exhaustion is one of those failures that teaches you how little you understand about your own system under load. The fix was straightforward: separate pools behind PgBouncer, sensible sizing, and connection timeouts. The diagnosis, though, required understanding every layer between the application and the database. After this incident, we added connection pool metrics to our standard dashboard and set alerts at 70 percent utilization. The best infrastructure lesson is always the one that makes you build better observability before the next failure arrives.
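That 70 percent alert reduces to a tiny check. A sketch, assuming counts shaped like PgBouncer's SHOW POOLS output (active clients against the configured pool size); the helper names are mine.

```javascript
// Flag a pool when active clients cross the alert threshold.
// cl_active and pool_size mirror PgBouncer's SHOW POOLS columns;
// the functions themselves are an illustrative sketch.
function poolUtilization(clActive, poolSize) {
  return clActive / poolSize;
}

function shouldAlert(clActive, poolSize, threshold = 0.7) {
  return poolUtilization(clActive, poolSize) >= threshold;
}
```

With our 30-connection transactional pool, shouldAlert(21, 30) fires (utilization exactly 0.7) while shouldAlert(12, 30) stays quiet, which gives us several minutes of warning before the pool can actually exhaust.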