Anshul Bisen · portfolio

The PostgreSQL query that took 47 seconds and how I got it to 3 milliseconds

A financial report query joined five tables across 2 million rows and took 47 seconds. The fix was not adding an index.

Our CFO opened the monthly financial summary report and waited. And waited. After 47 seconds the browser timed out. She messaged me: “The report page is broken.” It was not broken. The query behind it was just catastrophically slow: five tables joined across 2 million rows, a date filter that still matched nearly half a million of them, and PostgreSQL doing nested loop joins because the planner chose wrong. I spent the next two days turning a 47-second query into a 3-millisecond query, and the fix was not what I expected.

Where the data model or query started fighting back.

In that first stretch at FinanceOps, I was still learning how to wear the Head of Engineering title without hiding behind it. This story also builds on what I learned earlier in “Why we picked React Server Components over a separate API layer.” The only credibility that mattered was whether the decision survived contact with real money, ugly edge cases, and the next person I would eventually hire. That same bias toward strict boundaries later shaped how I approached ftryos and pipeline-sdk: make correctness boring before you make the API clever.

The operational artifact behind the argument.

The Original Query

The monthly financial summary aggregates payment data across tenants, groups by status and currency, and joins against invoice and customer tables for enrichment. The query was written during the MVP phase when we had 10,000 rows. At 2 million rows, it fell apart.

-- The 47-second query
SELECT
    c.name AS customer_name,
    i.invoice_number,
    p.status,
    p.currency,
    SUM(p.amount_cents) AS total_cents,
    COUNT(*) AS payment_count
FROM payments p
JOIN invoices i ON p.invoice_id = i.id
JOIN customers c ON i.customer_id = c.id
JOIN payment_methods pm ON p.payment_method_id = pm.id
JOIN tenants t ON p.tenant_id = t.id
WHERE p.created_at >= '2024-07-01'
  AND p.created_at < '2024-08-01'
GROUP BY c.name, i.invoice_number, p.status, p.currency
ORDER BY total_cents DESC;

Looking at this query now, the performance problems are obvious. But at the time, I assumed the indexes were the issue because that is what every performance guide tells you to check first. I was wrong.

What EXPLAIN ANALYZE Actually Showed

The first thing I did was run EXPLAIN ANALYZE. The output was 94 lines long. The critical section was buried in the middle.
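For reference, the invocation is just the report query wrapped in EXPLAIN. One caveat worth knowing before you run it: ANALYZE actually executes the query, so expect to wait the full 47 seconds for the output (BUFFERS is optional but adds per-node I/O detail).

```sql
-- ANALYZE executes the query for real timings; BUFFERS adds buffer hit/read counts
EXPLAIN (ANALYZE, BUFFERS)
SELECT
    c.name AS customer_name,
    i.invoice_number,
    p.status,
    p.currency,
    SUM(p.amount_cents) AS total_cents,
    COUNT(*) AS payment_count
FROM payments p
JOIN invoices i ON p.invoice_id = i.id
JOIN customers c ON i.customer_id = c.id
JOIN payment_methods pm ON p.payment_method_id = pm.id
JOIN tenants t ON p.tenant_id = t.id
WHERE p.created_at >= '2024-07-01'
  AND p.created_at < '2024-08-01'
GROUP BY c.name, i.invoice_number, p.status, p.currency
ORDER BY total_cents DESC;
```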

Nested Loop  (cost=0.85..847293.42 rows=1847293 width=52)
  ->  Seq Scan on payments p  (cost=0.00..142847.00 rows=487293 width=24)
        Filter: (created_at >= '2024-07-01' AND created_at < '2024-08-01')
        Rows Removed by Filter: 1512707
  ->  Index Scan on invoices i  (cost=0.42..1.44 rows=1 width=28)
        Index Cond: (id = p.invoice_id)
Planning Time: 2.4 ms
Execution Time: 47342 ms

The planner chose a Nested Loop join between payments and invoices. For each of the 487,293 payment rows in the date range, PostgreSQL performed an index lookup on the invoices table. That is 487,293 individual index scans. Each scan took only microseconds, but at roughly 97 microseconds apiece, half a million of them add up to 47 seconds.

We had an index on payments.created_at and an index on invoices.id. The indexes were fine. The problem was the join strategy. PostgreSQL chose Nested Loop because its cost estimates suggested the payments scan would return fewer rows than it actually did. The statistics were stale.
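One way to confirm the staleness, rather than guess at it, is to check when the planner's statistics were last refreshed. A quick sketch using the standard pg_stat_user_tables view:

```sql
-- If last_analyze / last_autoanalyze predate the growth to 2M rows,
-- the planner is still estimating from a much smaller table
SELECT relname, n_live_tup, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname IN ('payments', 'invoices');
```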

Fix 1: Update Statistics

The first fix was trivial. The table statistics had not been refreshed since we crossed 1 million rows. ANALYZE updates the statistics that the query planner uses for cost estimation.

ANALYZE payments;
ANALYZE invoices;

After ANALYZE, the planner switched from Nested Loop to Hash Join. The query dropped from 47 seconds to 4.2 seconds. Better, but still unacceptable for a dashboard page.
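To keep this from regressing as the table keeps growing, one option (a sketch, not necessarily what we ran) is to make autovacuum analyze the hot table more aggressively:

```sql
-- Default autovacuum_analyze_scale_factor is 0.1: 10% of rows must change
-- before auto-analyze fires. On a 2M-row table that is ~200k stale rows.
ALTER TABLE payments SET (autovacuum_analyze_scale_factor = 0.01);
```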

Fix 2: Partial Index on Status

The monthly report only shows completed and refunded payments. Pending and failed payments are excluded. But the query was scanning all payments in the date range and then filtering by status in the GROUP BY. A partial index that only covers the statuses we care about dramatically reduced the scan.

CREATE INDEX idx_payments_report
ON payments (created_at, tenant_id, status, currency, amount_cents)
WHERE status IN ('completed', 'refunded');

This covering index includes all the columns the query needs from the payments table, so PostgreSQL can satisfy the query from the index alone without touching the heap. The partial WHERE clause means the index only covers about 60% of the table (completed and refunded payments), making it smaller and faster to scan.
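To verify the index-only scan actually kicks in, a hypothetical check (the exact plan shape depends on the state of the visibility map, which VACUUM maintains):

```sql
EXPLAIN (ANALYZE)
SELECT status, currency, SUM(amount_cents)
FROM payments
WHERE created_at >= '2024-07-01'
  AND created_at < '2024-08-01'
  AND status IN ('completed', 'refunded')
GROUP BY status, currency;
-- Look for a node like:
--   Index Only Scan using idx_payments_report on payments
--   Heap Fetches: 0
-- A non-zero Heap Fetches count means the visibility map is stale
-- and PostgreSQL is still visiting the heap for some rows.
```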

Fix 3: Materialize the Intermediate Result

The remaining bottleneck was the join against the invoices and customers tables. Even with Hash Join, joining 290,000 payment rows against the full invoices table was slow because the invoices table has 40 columns and the join pulled all of them into memory.

-- Materialize the payment aggregation first
WITH payment_summary AS (
    SELECT
        invoice_id,
        status,
        currency,
        SUM(amount_cents) AS total_cents,
        COUNT(*) AS payment_count
    FROM payments
    WHERE created_at >= '2024-07-01'
      AND created_at < '2024-08-01'
      AND status IN ('completed', 'refunded')
    GROUP BY invoice_id, status, currency
)
SELECT
    c.name AS customer_name,
    i.invoice_number,
    ps.status,
    ps.currency,
    ps.total_cents,
    ps.payment_count
FROM payment_summary ps
JOIN invoices i ON ps.invoice_id = i.id
JOIN customers c ON i.customer_id = c.id
ORDER BY ps.total_cents DESC;

The CTE aggregates payments first, reducing 290,000 rows to about 8,000 grouped rows, so the join against invoices and customers operates on 8,000 rows instead of 290,000. The rewrite also drops the payment_methods and tenants joins entirely: the original query never selected a column from either table, so they were MVP-era scaffolding adding join cost for nothing. The difference is enormous.
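One caveat worth flagging: since PostgreSQL 12, CTEs are no longer optimization fences by default, so the planner is free to inline payment_summary back into the outer query. If the aggregate-before-join behavior matters, it can be pinned explicitly:

```sql
-- MATERIALIZED forces the CTE to be computed once, up front,
-- instead of letting the planner inline it into the outer query
WITH payment_summary AS MATERIALIZED (
    SELECT invoice_id, status, currency,
           SUM(amount_cents) AS total_cents,
           COUNT(*) AS payment_count
    FROM payments
    WHERE created_at >= '2024-07-01'
      AND created_at < '2024-08-01'
      AND status IN ('completed', 'refunded')
    GROUP BY invoice_id, status, currency
)
SELECT c.name AS customer_name, i.invoice_number,
       ps.status, ps.currency, ps.total_cents, ps.payment_count
FROM payment_summary ps
JOIN invoices i ON ps.invoice_id = i.id
JOIN customers c ON i.customer_id = c.id
ORDER BY ps.total_cents DESC;
```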


The Final Result

After all three fixes, the query runs in 3 milliseconds. From 47 seconds to 3 milliseconds. A 15,000x improvement. The fixes were: update statistics so the planner makes good decisions, add a partial covering index for the specific query pattern, and restructure the query to aggregate before joining.

The system after the boring-but-correct fix.

Looking back, this is one of those builder-phase decisions that bought me leadership credibility before I had any leadership title equity. I was still proving I could be trusted with the boring, consequential calls. That instinct carried straight into ftryos and pipeline-sdk.

When a query is slow, do not guess. Run EXPLAIN ANALYZE and read every line. The planner is telling you exactly what went wrong. Nine times out of ten, the fix is not adding an index. It is helping the planner make better decisions.
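A practical coda: to catch the next 47-second query before the CFO does, the auto_explain extension can log the plan of anything slow automatically. A sketch for a single session (production use would set these in postgresql.conf or via session_preload_libraries):

```sql
LOAD 'auto_explain';
SET auto_explain.log_min_duration = '1s';  -- log plans for queries over 1 second
SET auto_explain.log_analyze = on;         -- include actual row counts and timings
```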