Running Kafka at startup scale is a decision you will regret exactly once
Kafka was justified for our payment event streaming. The operational burden in year two consumed a disproportionate share of platform bandwidth.
We adopted Kafka in month six of FinanceOps. The use case was clean: payment event streaming. Banks send webhook events. Payment processors send status updates. Our reconciliation engine needs to consume these events reliably, in order, without losing a single one. Kafka was the obvious choice. Every architecture blog said so.
The first year was fine. Kafka worked exactly as advertised. Events flowed. Consumers consumed. Partitions partitioned. We felt smart.
The second year, Kafka became the single largest line item in our platform team’s time budget.
By this point I cared less about sounding smart and more about making the tradeoff legible. It also builds on what I learned earlier in “Fintech compliance is not a checkbox. It is an architecture constraint.” The systems had enough history that every database or eventing opinion had receipts behind it. That is the same posture I now bring to longer-lived experiments like ftryos and pipeline-sdk: if the constraint is real, say it plainly and design around it.
What Goes Wrong at Startup Scale
Kafka is designed for organizations with dedicated infrastructure teams. It assumes someone is watching partition rebalancing, monitoring consumer lag, tuning retention policies, managing broker upgrades, and debugging the occasional split-brain scenario. On a four-person engineering team, that someone is everyone, which means it is no one.
- Consumer rebalancing during deployments caused processing pauses of 30-90 seconds. For payment events, 90 seconds of pause means 90 seconds of customers wondering why their transaction is stuck. (The consumer config sketch after this list shows the settings that soften this.)
- Consumer lag monitoring required custom tooling because the default metrics were noisy and the alerting thresholds were wrong for our volume.
- Broker upgrades required careful rolling restarts that we could only do during low-traffic windows, which in fintech means Sunday at 3 AM.
- Partition count decisions made in month six were wrong by month twelve, and repartitioning is one of those operations that sounds simple and is not: adding partitions changes which partition each key hashes to, breaking per-key ordering across the change, and Kafka cannot shrink a partition count at all, which means a new topic and a migration of every producer and consumer.
- Schema evolution across topics required a schema registry that added another piece of infrastructure to maintain.
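For what it is worth, two consumer settings would have blunted the deployment pauses: cooperative incremental rebalancing and static group membership. Here is a minimal sketch using the confluent-kafka Python client, assuming a reasonably recent broker and client; the broker address, topic, group, and instance id are placeholders, not our actual config.

```python
from confluent_kafka import Consumer

# Sketch only: broker address, topic, group, and instance id are placeholders.
conf = {
    "bootstrap.servers": "localhost:9092",
    "group.id": "payment-reconciliation",
    # Incremental cooperative rebalancing: only partitions that actually
    # move get revoked, instead of pausing every consumer in the group.
    "partition.assignment.strategy": "cooperative-sticky",
    # Static membership: a restarted worker that rejoins under the same
    # instance id within the session timeout triggers no rebalance at all.
    "group.instance.id": "recon-worker-1",
    "session.timeout.ms": 45000,
    "enable.auto.commit": False,
}

def process(msg):
    """Placeholder handler; real code would update reconciliation state."""
    print(msg.topic(), msg.partition(), msg.offset())

consumer = Consumer(conf)
consumer.subscribe(["payment-events"])
try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            raise RuntimeError(msg.error())
        process(msg)
        consumer.commit(msg)  # commit only after successful processing
finally:
    consumer.close()
```

Static membership is a tradeoff, not a free lunch: fewer rebalances in exchange for slower failover, since the session timeout becomes the worst-case pause when a worker actually dies rather than restarts.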
The operational burden was roughly 15-20% of our platform engineer’s time. For a team of four, that is not sustainable. We were spending more time operating the message bus than building the product features that used it.
The Decision Framework I Wish I Had Used
In hindsight, the question was never “should we use Kafka?” The question was “does our use case justify the operational cost of Kafka?” Here is the framework I would use now:
- Do you need strict ordering guarantees? If yes, Kafka is strong here. If you need ordering within a partition but not globally, simpler queues with routing can work.
- Do you need replay capability? Kafka’s log retention lets you replay events. If your consumers are idempotent and you have other recovery mechanisms, you might not need this.
- Do you have someone who will own Kafka operations? Not occasionally. Regularly. If the answer is no, the operational debt will accumulate faster than the technical benefits.
- Is your event volume high enough to justify the complexity? If you are processing fewer than 10,000 events per second, a PostgreSQL-backed queue (sketched after this list) or a managed SQS/SNS setup handles the volume with a fraction of the operational overhead.
- Are you using a managed Kafka service? If self-hosting, multiply the operational cost by three. Managed services like Confluent Cloud reduce the burden but add significant cost.
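To make the PostgreSQL-backed queue concrete, here is a minimal sketch of the pattern: a plain table drained with SELECT ... FOR UPDATE SKIP LOCKED, so concurrent workers never claim the same row. The connection string, table, and column names are hypothetical.

```python
import psycopg2

# Sketch of a PostgreSQL-backed queue. Connection string, table, and
# column names are hypothetical.
conn = psycopg2.connect("dbname=financeops")

def handle(payload):
    """Placeholder for real event processing."""
    print(payload)

def claim_batch(batch_size=10):
    """Atomically claim and process up to batch_size unprocessed events."""
    with conn:  # one transaction: claimed rows stay locked until commit
        with conn.cursor() as cur:
            cur.execute(
                """
                SELECT id, payload
                FROM payment_events
                WHERE processed_at IS NULL
                ORDER BY id
                LIMIT %s
                FOR UPDATE SKIP LOCKED
                """,
                (batch_size,),
            )
            rows = cur.fetchall()
            for event_id, payload in rows:
                handle(payload)
                cur.execute(
                    "UPDATE payment_events SET processed_at = now() WHERE id = %s",
                    (event_id,),
                )
    return len(rows)
```

The ceiling is lower than Kafka's, but at the volumes in the framework above a single indexed table is usually enough, and the queue lives inside the same transactional boundary as the rest of your data.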
What We Would Do Differently
We did not migrate off Kafka. Now that the system is stable, the cost of a migration would outweigh anything we would recover in engineering time. But if I were starting FinanceOps today with the same requirements, I would make different choices:
- For payment webhook ingestion: a simple SQS queue with a dead-letter queue for failed processing. The ordering guarantees we needed were per-customer, not global, and SQS FIFO queues handle that with message groups; see the first sketch after this list.
- For reconciliation event streaming: PostgreSQL logical replication to a read replica that the reconciliation engine queries. No message bus at all.
- For real-time notifications: a lightweight Redis Streams setup on the Redis infrastructure we already run; see the second sketch below.
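Here is roughly what that webhook path looks like with boto3. The queue URL and event fields are invented for illustration; the dead-letter queue itself is attached separately through the queue's RedrivePolicy attribute, so it does not appear in the code.

```python
import json
import boto3

sqs = boto3.client("sqs")

# Hypothetical FIFO queue URL; FIFO queue names must end in ".fifo".
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/payment-webhooks.fifo"

def publish_webhook(event: dict) -> None:
    """Publish one webhook event, ordered per customer."""
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps(event),
        # All events for one customer share a group, so SQS delivers them
        # in order within that customer -- the guarantee we actually needed.
        MessageGroupId=event["customer_id"],
        # Dedup on the provider's event id so webhook retries are no-ops.
        MessageDeduplicationId=event["event_id"],
    )

def handle(event: dict) -> None:
    """Placeholder for real processing."""
    print(event)

def drain_once() -> None:
    """Long-poll a batch; undeleted failures eventually land in the DLQ."""
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
    )
    for msg in resp.get("Messages", []):
        handle(json.loads(msg["Body"]))
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```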
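And the notifications path, on the Redis we already operate. A sketch with redis-py; the stream, group, and consumer names are placeholders.

```python
import redis

r = redis.Redis()

STREAM, GROUP, CONSUMER = "notifications", "notifiers", "worker-1"  # placeholders

# Create the consumer group once; ignore the error if it already exists.
try:
    r.xgroup_create(STREAM, GROUP, id="$", mkstream=True)
except redis.ResponseError:
    pass

def publish(user_id: str, text: str) -> None:
    """Append one notification to the stream."""
    r.xadd(STREAM, {"user_id": user_id, "text": text})

def consume_once() -> None:
    """Read up to 10 new entries, process, then acknowledge."""
    batches = r.xreadgroup(GROUP, CONSUMER, {STREAM: ">"}, count=10, block=5000)
    for _stream, entries in batches:
        for entry_id, fields in entries:
            print(fields)  # placeholder delivery logic
            r.xack(STREAM, GROUP, entry_id)
```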
The only use case where I would still choose Kafka is the cross-service event bus that multiple teams need to consume from independently. That is genuinely hard to do well with simpler tools. But with a four-person team, we did not have multiple teams consuming independently. We had one team producing and one team consuming. Kafka was overkill.
The Regret Window
By the time I wrote this, the lesson was bigger than the tool or incident. The job had become setting defaults a team could trust, then proving those defaults in systems like ftryos and pipeline-sdk. That is leadership work, not just technical taste.
You regret Kafka exactly once: the quarter when the operational burden exceeds the engineering benefit and you realize the migration cost means you are stuck.
That quarter was Q2 of our second year. We got through it by writing better runbooks, automating the consumer lag monitoring, and accepting that Kafka operations are just part of our infrastructure tax. The system is stable now. The regret has faded into acceptance.
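That lag automation is less exotic than it sounds. The core of it is a loop like the sketch below, again with hypothetical names: per partition, compare the group's committed offset against the high watermark, and the difference is the lag.

```python
from confluent_kafka import Consumer, TopicPartition, OFFSET_INVALID

# Sketch: report consumer lag per partition. Broker, topic, group, and
# partition count are illustrative, not our real deployment.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "payment-reconciliation",  # the group being measured
    "enable.auto.commit": False,
})

partitions = [TopicPartition("payment-events", p) for p in range(6)]

for tp in consumer.committed(partitions, timeout=10):
    low, high = consumer.get_watermark_offsets(tp, timeout=10)
    # No committed offset yet means the group would start at the log start.
    committed = tp.offset if tp.offset != OFFSET_INVALID else low
    print(f"partition {tp.partition}: lag={high - committed}")

consumer.close()
```

Run that on a schedule and alert on lag that stays high for several minutes rather than on momentary spikes; that is one way to cut the noise the stock metrics produced at our volume.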
But I tell every startup CTO the same thing: unless your event volume, ordering requirements, and team size all point to Kafka, start with the simplest queue that handles your throughput. You can always migrate to Kafka later if you outgrow it. You cannot easily migrate away from Kafka once your architecture depends on it.
Kafka at startup scale is a lesson in operational overhead versus architectural elegance. The event-driven architecture was technically correct for our domain, but the operational burden of running Kafka with a team of four was unsustainable. Consumer lag monitoring, partition rebalancing, schema evolution, and broker maintenance consumed engineering hours that should have gone to product development. If I were building the same system today, I would start with a simpler message queue and migrate to Kafka only when the throughput and ordering guarantees justified the operational cost.