Migrating from REST to a hybrid REST/event architecture without stopping the train
Our reconciliation engine needed async event processing but API consumers expected synchronous responses. The strangler fig pattern applied to a fintech data pipeline.
By January 2025, our REST API had grown to over 200 endpoints. Most of them were standard CRUD operations that worked fine synchronously. But our reconciliation pipeline had outgrown the request-response model. Bank statement ingestion, transaction matching, and exception routing all needed to happen asynchronously. A single reconciliation batch could take five minutes to process. Holding an HTTP connection open for five minutes is not an architecture, it is a prayer.
By this phase the work was no longer "just build it." It also built on what I learned earlier in "The meeting where product, sales, and engineering all had different definitions of 'real-time'." Every architecture choice had a people cost, an audit cost, and a recovery cost when production disagreed with the plan. That is roughly when the line between FinanceOps systems and projects like flowscape or ftryos got interesting to me: a design only counts if the operators can live with it.
Why Not a Big-Bang Rewrite
The tempting approach was to move everything to an event-driven architecture. Kafka for all internal communication, REST facades only for external consumers. In theory, this gives you clean separation between synchronous API responses and asynchronous processing. In practice, a big-bang migration of 200 endpoints is a multi-month project that freezes feature development.
We chose the strangler fig pattern instead. The idea is simple: wrap the existing system with a new one, migrate one capability at a time, and keep the old system running until the new one has fully replaced it. In our case, this meant introducing Kafka for new internal domain events while keeping REST for everything that already worked. Old endpoints stayed REST. New processing pipelines used events. The two systems coexisted.
The strangler fig pattern gets talked about a lot in conference talks but rarely with the messy details of what happens when both systems need to interact. Our reconciliation pipeline was the first migration target, and it touched almost every other part of the system. Client records, bank connections, transaction history, reporting. Extracting it cleanly required drawing boundaries we had never drawn before.
The Hybrid Architecture
The architecture we landed on has three layers. External API consumers talk to our REST API exactly as before. The REST API publishes domain events to Kafka when state changes. Internal processing services consume events from Kafka and do their work asynchronously. When processing completes, a notification event gets published, and the WebSocket layer pushes an update to the dashboard.
```
Client -> REST API -> PostgreSQL (immediate response)
                   -> Kafka (domain event published)
                   -> Reconciliation Service (async processing)
                   -> Kafka (result event)
                   -> WebSocket (dashboard update)
```

The key insight is that the REST API response does not wait for event processing. When a client uploads a bank statement, the API immediately responds with a 202 Accepted and a job ID. The client can poll the job status endpoint or subscribe to WebSocket updates. The actual reconciliation happens asynchronously, triggered by the Kafka event.
- REST stays for external communication: uploads, queries, authentication
- Kafka handles internal domain events: statement.uploaded, reconciliation.started, match.found, match.disputed
- PostgreSQL remains the source of truth for all state
- WebSocket pushes real-time updates to connected dashboard users
- Job status endpoints let non-WebSocket clients poll for completion
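The asynchronous hand-off is easiest to see as code. This is a minimal in-memory sketch of the 202 Accepted contract, not our actual implementation: the names (`createJob`, `acceptUpload`, `jobStatus`, the response body shape) are illustrative, and in production the job record lives in PostgreSQL rather than a Map.

```javascript
// In-memory stand-in for the jobs table.
const jobs = new Map()

let nextId = 0
function createJob(kind) {
  const id = `job-${++nextId}`
  jobs.set(id, { id, kind, status: 'pending', result: null })
  return id
}

// What the REST handler returns immediately, before any
// reconciliation work has happened.
function acceptUpload(statement) {
  const jobId = createJob('reconciliation')
  return {
    status: 202,
    body: { jobId, statusUrl: `/jobs/${jobId}` },
  }
}

// The async consumer flips the job when the result event arrives.
function completeJob(jobId, result) {
  const job = jobs.get(jobId)
  job.status = 'completed'
  job.result = result
}

// Polling endpoint for clients without WebSocket support.
function jobStatus(jobId) {
  const job = jobs.get(jobId)
  return { status: job.status, result: job.result }
}
```

A client polls the status URL until `status` is `completed`, or skips polling entirely by subscribing to the WebSocket channel.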
What Went Sideways
The first problem was event ordering. When a bank statement upload triggers a reconciliation run that produces 500 match results, those results publish to Kafka as individual events. Kafka guarantees ordering within a partition, but if your partition key is wrong, related events can arrive out of order. We originally partitioned by event type, which meant match events for the same statement could land on different partitions and be processed in arbitrary order.
The fix was partitioning by statement ID instead of event type. All events related to a single statement land on the same partition and are processed in order. This was a two-line code change that took three days to diagnose because the symptoms were non-deterministic. Sometimes events arrived in order. Sometimes they did not. The test suite always passed because it processed events sequentially.
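The property the fix relies on is easiest to demonstrate with a toy partitioner. The hash below is illustrative only (Kafka's default partitioner uses murmur2), but the guarantee is the same: every event keyed by the same statement ID maps to the same partition, so they are consumed in order.

```javascript
// Toy key-hash partitioner, a stand-in for Kafka's default
// (which uses murmur2, not this).
function partitionFor(key, numPartitions) {
  let hash = 0
  for (const ch of key) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0
  }
  return hash % numPartitions
}

// Before: keyed by event type, so match events for one
// statement could land on different partitions.
const byType = (event) => partitionFor(event.type, 12)

// After: keyed by statement ID, so all events for a statement
// share a partition and preserve their order.
const byStatement = (event) => partitionFor(event.statementId, 12)

const events = [
  { type: 'match.found', statementId: 'stmt-42' },
  { type: 'match.disputed', statementId: 'stmt-42' },
]
// byStatement assigns both events to the same partition;
// byType only does so if the two types happen to hash together.
```

In producer terms, the change is just which field you pass as the message key; everything downstream follows from that.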
The second problem was dual writes. When the REST API writes to PostgreSQL and publishes to Kafka in the same request handler, either operation can fail independently. If the database write succeeds but the Kafka publish fails, you have state in the database that no consumer knows about. If the Kafka publish succeeds but the database write fails, consumers process an event for state that does not exist.
```javascript
// Dangerous: dual write without transactional guarantee
await db.statements.insert(statement)
await kafka.publish('statement.uploaded', statement)
// If kafka.publish fails, the statement exists
// but no consumer will process it
```
```javascript
// Safer: transactional outbox pattern
await db.transaction(async (tx) => {
  await tx.statements.insert(statement)
  await tx.outbox.insert({
    topic: 'statement.uploaded',
    payload: statement,
  })
})
// A separate process polls the outbox and
// publishes to Kafka with at-least-once delivery
```

We implemented the transactional outbox pattern, where domain events are written to an outbox table within the same database transaction as the state change. A separate polling process reads the outbox and publishes to Kafka. This guarantees that if the state change commits, the event will eventually be published. It adds latency, about 500 milliseconds in our case, but eliminates the dual-write consistency problem.
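The other half of the pattern is the relay that drains the outbox. Here is a minimal sketch under stated assumptions: the outbox is modeled as an in-memory array (real rows would be selected with a `published_at IS NULL` filter and row locking), `publish` is an injected callback, and in production the publish call would be awaited against the Kafka producer.

```javascript
// Drain unpublished outbox rows, marking each one published
// only after the broker accepted it.
function drainOutbox(outbox, publish) {
  for (const row of outbox) {
    if (row.publishedAt) continue
    try {
      publish(row.topic, row.payload)
      row.publishedAt = Date.now()
    } catch (err) {
      // Leave the row unpublished; the next poll retries it.
      // This is where at-least-once delivery comes from: a
      // publish that succeeded just before a crash is replayed,
      // so consumers must tolerate duplicates.
    }
  }
}
```

Because a crash between the publish and the marking step replays the event, consumers deduplicate, for example on the outbox row ID carried in the event.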
Three Months In
The hybrid architecture has been running for three months. The reconciliation pipeline is fully event-driven and processes batches in the background without blocking API responses. Reporting, client management, and authentication are still pure REST and there is no plan to migrate them because they do not need asynchronous processing.
By this stage the job had changed. I was no longer just picking a tool or fixing a bug. I was carrying the blast radius across product, compliance, sales, and hiring. That is exactly why I kept pressure-testing the same lesson inside portfolio, pipeline-sdk, and dotfiles.
Not everything needs to be event-driven. The hybrid approach lets you use events where they add value and REST where synchronous responses are fine. The worst architectural mistake is making everything the same when the requirements are different.
If I were starting from scratch, I would still start with REST and add events only when specific processing pipelines outgrow the request-response model. Event-driven architecture is powerful but it adds operational complexity. Kafka is another piece of infrastructure to monitor, partition, and debug. The strangler fig pattern lets you adopt that complexity incrementally, which is the only sane way for a small team to evolve their architecture.