Running Kafka at startup scale is a decision you will regret exactly once
Kafka was justified for our payment event streaming. The operational burden in year two consumed a disproportionate share of platform bandwidth.
We adopted Kafka in month six of FinanceOps. The use case was clean: payment event streaming. Banks send webhook events. Payment processors send status updates. Our reconciliation engine needs to consume these events reliably, in order, without losing a single one. Kafka was the obvious choice. Every architecture blog said so.
The first year was fine. Kafka worked exactly as advertised. Events flowed. Consumers consumed. Partitions partitioned. We felt smart.
The second year, Kafka became the single largest line item in our platform team’s time budget.
By this point I cared less about sounding smart and more about making the tradeoff legible. It also builds on what I learned earlier in “Fintech compliance is not a checkbox. It is an architecture constraint.” The systems had enough history that every database or eventing opinion had receipts behind it. That is the same posture I now bring to longer-lived experiments like ftryos and pipeline-sdk: if the constraint is real, say it plainly and design around it.
What Goes Wrong at Startup Scale
Kafka is designed for organizations with dedicated infrastructure teams. It assumes someone is watching partition rebalancing, monitoring consumer lag, tuning retention policies, managing broker upgrades, and debugging the occasional split-brain scenario. On a four-person engineering team, that someone is everyone, which means it is no one.
- Consumer rebalancing during deployments caused processing pauses of 30-90 seconds. For payment events, 90 seconds of pause means 90 seconds of customers wondering why their transaction is stuck. (The consumer config sketch after this list shows the settings that soften this.)
- Consumer lag monitoring required custom tooling because the default metrics were noisy and the alerting thresholds were wrong for our volume.
- Broker upgrades required careful rolling restarts that we could only do during low-traffic windows, which in fintech means Sunday at 3 AM.
- Partition count decisions made in month six were wrong by month twelve, and repartitioning is one of those operations that sounds simple and is not: adding partitions changes which partition each key hashes to, breaking per-key ordering across the change, and Kafka cannot shrink a partition count at all, which means a new topic and a migration of every producer and consumer.
- Schema evolution across topics required a schema registry that added another piece of infrastructure to maintain.
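For what it is worth, two consumer settings would have blunted the deployment pauses: cooperative incremental rebalancing and static group membership. Here is a minimal sketch using the confluent-kafka Python client, assuming a reasonably recent broker and client; the broker address, topic, group, and instance id are placeholders, not our actual config.

```python
from confluent_kafka import Consumer

# Sketch only: broker address, topic, group, and instance id are placeholders.
conf = {
    "bootstrap.servers": "localhost:9092",
    "group.id": "payment-reconciliation",
    # Incremental cooperative rebalancing: only partitions that actually
    # move get revoked, instead of pausing every consumer in the group.
    "partition.assignment.strategy": "cooperative-sticky",
    # Static membership: a restarted worker that rejoins under the same
    # instance id within the session timeout triggers no rebalance at all.
    "group.instance.id": "recon-worker-1",
    "session.timeout.ms": 45000,
    "enable.auto.commit": False,
}

def process(msg):
    """Placeholder handler; real code would update reconciliation state."""
    print(msg.topic(), msg.partition(), msg.offset())

consumer = Consumer(conf)
consumer.subscribe(["payment-events"])
try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            raise RuntimeError(msg.error())
        process(msg)
        consumer.commit(msg)  # commit only after successful processing
finally:
    consumer.close()
```

Static membership is a tradeoff, not a free lunch: fewer rebalances in exchange for slower failover, since the session timeout becomes the worst-case pause when a worker actually dies rather than restarts.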
The operational burden was roughly 15-20% of our platform engineer’s time. For a team of four, that is not sustainable. We were spending more time operating the message bus than building the product features that used it.
The Decision Framework I Wish I Had Used
In hindsight, the question was never “should we use Kafka?” The question was “does our use case justify the operational cost of Kafka?” Here is the framework I would use now:
- Do you need strict ordering guarantees? If yes, Kafka is strong here. If you need ordering within a partition but not globally, simpler queues with routing can work.
- Do you need replay capability? Kafka’s log retention lets you replay events. If your consumers are idempotent and you have other recovery mechanisms, you might not need this.
- Do you have someone who will own Kafka operations? Not occasionally. Regularly. If the answer is no, the operational debt will accumulate faster than the technical benefits.
- Is your event volume high enough to justify the complexity? If you are processing fewer than 10,000 events per second, a PostgreSQL-backed queue (sketched after this list) or a managed SQS/SNS setup handles the volume with a fraction of the operational overhead.
- Are you using a managed Kafka service? If self-hosting, multiply the operational cost by three. Managed services like Confluent Cloud reduce the burden but add significant cost.
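To make the PostgreSQL-backed queue concrete, here is a minimal sketch of the pattern: a plain table drained with SELECT ... FOR UPDATE SKIP LOCKED, so concurrent workers never claim the same row. The connection string, table, and column names are hypothetical.

```python
import psycopg2

# Sketch of a PostgreSQL-backed queue. Connection string, table, and
# column names are hypothetical.
conn = psycopg2.connect("dbname=financeops")

def handle(payload):
    """Placeholder for real event processing."""
    print(payload)

def claim_batch(batch_size=10):
    """Atomically claim and process up to batch_size unprocessed events."""
    with conn:  # one transaction: claimed rows stay locked until commit
        with conn.cursor() as cur:
            cur.execute(
                """
                SELECT id, payload
                FROM payment_events
                WHERE processed_at IS NULL
                ORDER BY id
                LIMIT %s
                FOR UPDATE SKIP LOCKED
                """,
                (batch_size,),
            )
            rows = cur.fetchall()
            for event_id, payload in rows:
                handle(payload)
                cur.execute(
                    "UPDATE payment_events SET processed_at = now() WHERE id = %s",
                    (event_id,),
                )
    return len(rows)
```

The ceiling is lower than Kafka's, but at the volumes in the framework above a single indexed table is usually enough, and the queue lives inside the same transactional boundary as the rest of your data.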
What We Would Do Differently
We did not migrate off Kafka. Now that the system is stable, the cost of a migration would outweigh anything we would recover in engineering time. But if I were starting FinanceOps today with the same requirements, I would make different choices:
- For payment webhook ingestion: a simple SQS queue with a dead-letter queue for failed processing. The ordering guarantees we needed were per-customer, not global, and SQS FIFO queues handle that with message groups; see the first sketch after this list.
- For reconciliation event streaming: PostgreSQL logical replication to a read replica that the reconciliation engine queries. No message bus at all.
- For real-time notifications: a lightweight Redis Streams setup on the Redis infrastructure we already run; see the second sketch below.
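Here is roughly what that webhook path looks like with boto3. The queue URL and event fields are invented for illustration; the dead-letter queue itself is attached separately through the queue's RedrivePolicy attribute, so it does not appear in the code.

```python
import json
import boto3

sqs = boto3.client("sqs")

# Hypothetical FIFO queue URL; FIFO queue names must end in ".fifo".
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/payment-webhooks.fifo"

def publish_webhook(event: dict) -> None:
    """Publish one webhook event, ordered per customer."""
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps(event),
        # All events for one customer share a group, so SQS delivers them
        # in order within that customer -- the guarantee we actually needed.
        MessageGroupId=event["customer_id"],
        # Dedup on the provider's event id so webhook retries are no-ops.
        MessageDeduplicationId=event["event_id"],
    )

def handle(event: dict) -> None:
    """Placeholder for real processing."""
    print(event)

def drain_once() -> None:
    """Long-poll a batch; undeleted failures eventually land in the DLQ."""
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
    )
    for msg in resp.get("Messages", []):
        handle(json.loads(msg["Body"]))
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```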
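And the notifications path, on the Redis we already operate. A sketch with redis-py; the stream, group, and consumer names are placeholders.

```python
import redis

r = redis.Redis()

STREAM, GROUP, CONSUMER = "notifications", "notifiers", "worker-1"  # placeholders

# Create the consumer group once; ignore the error if it already exists.
try:
    r.xgroup_create(STREAM, GROUP, id="$", mkstream=True)
except redis.ResponseError:
    pass

def publish(user_id: str, text: str) -> None:
    """Append one notification to the stream."""
    r.xadd(STREAM, {"user_id": user_id, "text": text})

def consume_once() -> None:
    """Read up to 10 new entries, process, then acknowledge."""
    batches = r.xreadgroup(GROUP, CONSUMER, {STREAM: ">"}, count=10, block=5000)
    for _stream, entries in batches:
        for entry_id, fields in entries:
            print(fields)  # placeholder delivery logic
            r.xack(STREAM, GROUP, entry_id)
```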
The only use case where I would still choose Kafka is the cross-service event bus that multiple teams need to consume from independently. That is genuinely hard to do well with simpler tools. But with a four-person team, we did not have multiple teams consuming independently. We had one team producing and one team consuming. Kafka was overkill.
The Regret Window
By the time I wrote this, the lesson was bigger than the tool or incident. The job had become setting defaults a team could trust, then proving those defaults in systems like ftryos and pipeline-sdk. That is leadership work, not just technical taste.
You regret Kafka exactly once: the quarter when the operational burden exceeds the engineering benefit and you realize the migration cost means you are stuck.
That quarter was Q2 of our second year. We got through it by writing better runbooks, automating the consumer lag monitoring, and accepting that Kafka operations are just part of our infrastructure tax. The system is stable now. The regret has faded into acceptance.
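That lag automation is less exotic than it sounds. The core of it is a loop like the sketch below, again with hypothetical names: per partition, compare the group's committed offset against the high watermark, and the difference is the lag.

```python
from confluent_kafka import Consumer, TopicPartition, OFFSET_INVALID

# Sketch: report consumer lag per partition. Broker, topic, group, and
# partition count are illustrative, not our real deployment.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "payment-reconciliation",  # the group being measured
    "enable.auto.commit": False,
})

partitions = [TopicPartition("payment-events", p) for p in range(6)]

for tp in consumer.committed(partitions, timeout=10):
    low, high = consumer.get_watermark_offsets(tp, timeout=10)
    # No committed offset yet means the group would start at the log start.
    committed = tp.offset if tp.offset != OFFSET_INVALID else low
    print(f"partition {tp.partition}: lag={high - committed}")

consumer.close()
```

Run that on a schedule and alert on lag that stays high for several minutes rather than on momentary spikes; that is one way to cut the noise the stock metrics produced at our volume.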
But I tell every startup CTO the same thing: unless your event volume, ordering requirements, and team size all point to Kafka, start with the simplest queue that handles your throughput. You can always migrate to Kafka later if you outgrow it. You cannot easily migrate away from Kafka once your architecture depends on it.
Kafka at startup scale is a lesson in operational overhead versus architectural elegance. The event-driven architecture was technically correct for our domain, but the operational burden of running Kafka with a team of four was unsustainable. Consumer lag monitoring, partition rebalancing, schema evolution, and broker maintenance consumed engineering hours that should have gone to product development. If I were building the same system today, I would start with a simpler message queue and migrate to Kafka only when the throughput and ordering guarantees justified the operational cost.