Building a fintech reconciliation engine that handles ten million transactions a day
PostgreSQL partitioned tables, Node.js worker pools, idempotent batch processing, and a custom diffing algorithm. Sub-minute reconciliation at ten million daily transactions.
The FinanceOps reconciliation engine is the most critical system we operate. It matches financial records across banks, payment processors, and our internal ledger to verify that every penny is accounted for. At our current scale, that means processing roughly ten million transactions per day with sub-minute reconciliation windows for each batch.
Building a system at this scale with a small team required deliberate architectural choices at every layer. Here is how we did it.
By this point I cared less about sounding smart and more about making the tradeoff legible. It also builds on what I learned earlier in “S3 Vectors at re:Invent made me reconsider our entire RAG architecture.” The systems had enough history that every database or eventing opinion had receipts behind it. That is the same posture I now bring to longer-lived experiments like ftryos and pipeline-sdk: if the constraint is real, say it plainly and design around it.
PostgreSQL Partitioned Tables
Transaction data grows fast. Ten million records per day means roughly 300 million per month. Querying a single table with billions of rows is not viable within the latency requirements we operate under.
We partition the transactions table by date using PostgreSQL declarative partitioning. Each day gets its own partition. The reconciliation engine primarily queries the current day and the previous day, so the working set stays small relative to the total data volume.
- Daily partitions keep the working set manageable. Index scans on a 10-million-row partition are fast. Index scans on a 3-billion-row table are not.
- Partition pruning eliminates irrelevant data automatically. A query filtered by date only touches the relevant partitions.
- Old partitions can be archived to cold storage without affecting active query performance. We archive partitions older than 90 days to S3.
- VACUUM operations complete quickly on small partitions. A full VACUUM on a billion-row table can take hours and impact production queries.
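A daily-partition scheme like the one above can be sketched as a small DDL generator that the ingestion service runs ahead of each day. The table name `transactions`, its columns, and the partition-naming convention here are illustrative, not our production schema:

```typescript
// Sketch of daily partition management for a declaratively partitioned
// transactions table. Table and column names are assumptions.
//
// Parent table, created once:
//   CREATE TABLE transactions (
//     id bigint GENERATED ALWAYS AS IDENTITY,
//     source text NOT NULL,
//     amount_cents bigint NOT NULL,
//     currency char(3) NOT NULL,
//     occurred_at timestamptz NOT NULL
//   ) PARTITION BY RANGE (occurred_at);

/** Format a Date as YYYY-MM-DD in UTC. */
function isoDate(d: Date): string {
  return d.toISOString().slice(0, 10);
}

/** Generate the DDL that creates the partition covering one UTC day. */
export function dailyPartitionDdl(day: Date): string {
  const from = isoDate(day);
  const next = new Date(day.getTime() + 24 * 60 * 60 * 1000);
  const to = isoDate(next);
  const name = `transactions_${from.replace(/-/g, "_")}`;
  return (
    `CREATE TABLE IF NOT EXISTS ${name} PARTITION OF transactions\n` +
    `  FOR VALUES FROM ('${from}') TO ('${to}');`
  );
}
```

Queries filtered on `occurred_at` then hit only the relevant daily partitions, which is what makes partition pruning automatic.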
The Matching Algorithm
Reconciliation is fundamentally a matching problem: given a transaction record from source A and a record from source B, determine whether they represent the same real-world event.
This is harder than it sounds because different sources represent the same transaction differently. The bank sends a settlement record with their reference number. The payment processor sends a webhook with their transaction ID. Our internal ledger has its own primary key. Amounts may differ slightly due to fees applied at different stages. Timestamps differ because systems process events at different times.
Our matching algorithm uses a tiered approach:
- Tier 1: Exact match on idempotency key. When all sources use a shared key, matching is trivial. This covers about 70% of transactions.
- Tier 2: Composite match on amount, currency, and time window. For sources without shared keys, we match on the combination of transaction amount, currency, and a configurable time window. This covers another 25%.
- Tier 3: Fuzzy match with manual review. The remaining 5% cannot be automatically matched due to fee discrepancies, timing issues, or data quality problems. These get flagged for human review with the candidate matches ranked by confidence score.
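The tiered fallthrough can be sketched as a single function. The record shape, the default window, and especially the Tier 3 confidence heuristic are illustrative stand-ins, not the production scoring model:

```typescript
// Sketch of the tiered matcher. Field names, the default time window,
// and the confidence weighting are assumptions for illustration.
interface TxnRecord {
  idempotencyKey?: string;
  amountCents: number;
  currency: string;
  occurredAt: number; // epoch millis
}

type MatchResult =
  | { tier: 1 | 2; matched: true }
  | { tier: 3; matched: false; confidence: number };

export function matchRecords(
  a: TxnRecord,
  b: TxnRecord,
  windowMs = 15 * 60 * 1000, // configurable Tier 2 time window
): MatchResult {
  // Tier 1: a shared idempotency key makes matching trivial.
  if (a.idempotencyKey && a.idempotencyKey === b.idempotencyKey) {
    return { tier: 1, matched: true };
  }
  // Tier 2: composite match on amount, currency, and time window.
  const inWindow = Math.abs(a.occurredAt - b.occurredAt) <= windowMs;
  if (a.amountCents === b.amountCents && a.currency === b.currency && inWindow) {
    return { tier: 2, matched: true };
  }
  // Tier 3: no automatic match; score the candidate so human review
  // sees the most plausible pairs first. Toy heuristic: closer amounts
  // in the same currency rank higher.
  const amountDelta = Math.abs(a.amountCents - b.amountCents);
  const confidence =
    (a.currency === b.currency ? 0.5 : 0) + 0.5 / (1 + amountDelta / 100);
  return { tier: 3, matched: false, confidence };
}
```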
Node.js Worker Pools
The reconciliation engine runs as a pool of Node.js workers processing batches in parallel. Each worker pulls a batch of unreconciled transactions, runs the matching algorithm, and writes the results back.
- Batch size is tuned to 5,000 transactions per batch. Large enough for efficient database queries, small enough for predictable memory usage.
- Worker count is set to the number of available CPU cores minus one. The remaining core handles the orchestrator process that assigns batches and monitors progress.
- Each worker is stateless. If a worker crashes, the batch it was processing gets reassigned automatically. No transaction is lost because the batch assignment uses a PostgreSQL advisory lock.
- Idempotent processing means any batch can be safely reprocessed. Workers write results with ON CONFLICT clauses that update rather than duplicate.
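The crash-safe batch assignment described above can be modeled in memory to show the invariant: a batch is either pending, assigned, or done, and a crashed worker's batch always returns to the queue. In production this state lives in PostgreSQL and the claim itself is an advisory lock; the class and method names here are hypothetical:

```typescript
// In-memory sketch of the orchestrator's batch-assignment bookkeeping.
// Production uses PostgreSQL advisory locks for the claim; this class
// only illustrates the reassignment invariant.
type BatchId = number;

export class BatchTracker {
  private pending: BatchId[] = [];
  private assigned = new Map<number /* workerId */, BatchId>();

  enqueue(batch: BatchId): void {
    this.pending.push(batch);
  }

  /** A worker claims the next batch, or null if none are pending. */
  claim(workerId: number): BatchId | null {
    const batch = this.pending.shift();
    if (batch === undefined) return null;
    this.assigned.set(workerId, batch);
    return batch;
  }

  /** Idempotent completion: safe to call more than once per worker. */
  complete(workerId: number): void {
    this.assigned.delete(workerId);
  }

  /** On crash, the worker's batch goes back to the front of the queue. */
  reassign(workerId: number): void {
    const batch = this.assigned.get(workerId);
    if (batch !== undefined) {
      this.assigned.delete(workerId);
      this.pending.unshift(batch);
    }
  }
}
```

Because workers are stateless and writes use ON CONFLICT upserts, replaying a reassigned batch produces the same result rows rather than duplicates.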
Sub-Minute Reconciliation Windows
The target is to reconcile each batch within 60 seconds of all source records arriving. Achieving this required eliminating every unnecessary step from the processing pipeline.
- Records are indexed on ingestion, not during reconciliation. By the time the matching algorithm runs, all relevant records are already indexed and queryable.
- The matching query uses a materialized view that pre-joins transaction records with their source metadata. The view refreshes every 30 seconds.
- Results are written in bulk using PostgreSQL COPY instead of individual INSERT statements. The throughput difference is roughly 10x for our batch sizes.
- Monitoring is real-time. A Grafana dashboard shows reconciliation lag by source, match rate by tier, and processing throughput per worker.
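The bulk-write step hinges on serializing result rows into COPY's text format before streaming them to the server (for example via a client library such as pg-copy-streams). A minimal serializer, assuming the COPY text conventions of tab-delimited fields, one row per line, `\N` for NULL, and backslash-escaped control characters:

```typescript
// Sketch of serializing result rows into PostgreSQL COPY text format.
// The resulting payload can be piped to `COPY results FROM STDIN`;
// the target table name and column order are assumptions.
type CopyValue = string | number | null;

function escapeCopyField(value: CopyValue): string {
  if (value === null) return "\\N"; // COPY's NULL marker
  return String(value)
    .replace(/\\/g, "\\\\") // escape backslashes first
    .replace(/\t/g, "\\t")
    .replace(/\n/g, "\\n")
    .replace(/\r/g, "\\r");
}

/** Render rows as one COPY text payload: tab-delimited, newline-terminated. */
export function toCopyPayload(rows: CopyValue[][]): string {
  return (
    rows.map((row) => row.map(escapeCopyField).join("\t")).join("\n") + "\n"
  );
}
```

Batching 5,000 rows into one COPY payload is what delivers the roughly 10x throughput gain over per-row INSERTs.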
What We Got Wrong Initially
The first version of the reconciliation engine tried to process transactions individually instead of in batches. It worked at 100,000 transactions per day. At 1 million, it fell over. The transition to batch processing was the single most impactful architectural change we made.
The second mistake was trying to reconcile in real time instead of in windows. Real-time reconciliation sounds ideal, but it means processing every transaction the moment it arrives, before all sources have reported. This creates false mismatches that generate noise for the operations team. Switching to reconciliation windows, where we wait until all sources are expected to have reported before matching, reduced false mismatch rates by 90%.
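The window discipline reduces to one readiness check: match only once every expected source has reported, or a hard deadline has passed so a slow source cannot stall reconciliation forever. A sketch, where the source names and deadline are illustrative:

```typescript
// Sketch of the window-readiness check. Source names, the field layout,
// and the deadline policy are assumptions for illustration.
interface WindowState {
  expectedSources: string[];
  reportedSources: Set<string>;
  opensAt: number; // epoch millis when the window opened
  deadlineMs: number; // max time to wait before matching anyway
}

export function isReadyToReconcile(w: WindowState, now: number): boolean {
  const allReported = w.expectedSources.every((s) => w.reportedSources.has(s));
  const deadlinePassed = now - w.opensAt >= w.deadlineMs;
  return allReported || deadlinePassed;
}
```

Holding matching until this returns true is what cut the false-mismatch noise: records that would have looked unmatched at arrival time are simply not evaluated until their counterparts are expected to exist.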
This is the phase where individual scars finally turned into repeatable operating principles. The goal stopped being sounding clever and became leaving behind a system that stays sane without me in the room. That is how I build ftryos and pipeline-sdk too.
Reconciliation at scale is a batch processing problem, not a real-time processing problem. The instinct to process immediately is wrong. The discipline to wait for complete data and process in bulk is what makes it work.
The system processes ten million transactions daily. The infrastructure cost is roughly $800 per month in database and compute. The reconciliation accuracy rate is 99.7% before human review and 100% after. It is the system I am most proud of at FinanceOps, not because it is clever, but because it is simple and it works.