My homelab staging cluster caught a production bug before CI did
A memory leak that only manifested after 48 hours of sustained load. CI tests pass in seconds. The homelab k3s cluster caught what automated tests could not.
On a Tuesday morning, a Grafana alert on my homelab k3s cluster flagged a slow memory climb in the payment reconciliation worker. Memory usage had been creeping up for 36 hours straight, a pattern that looked nothing like the sawtooth of normal garbage collection. I checked the staging environment. Same build. Same config. Same slow climb.
CI had passed the build with green checkmarks two days earlier. Unit tests, integration tests, linting, type checking. Everything clean. Because CI tests run in seconds, and this bug only appeared after 48 hours of sustained load.
Leadership became concrete for me once I realized that release engineering and infrastructure are really trust systems. This builds on what I learned earlier in “The Kubernetes upgrade that taught me release engineering is a leadership problem.” The infrastructure stack, ctrlpane, and even my dotfiles all orbit the same idea now: the best teams move fast because the defaults are stable, not because the heroics are impressive.
What CI Cannot See
Continuous integration is optimized for fast feedback. Tests run in isolated environments, process a handful of fixtures, and exit. This is exactly right for catching regressions in business logic, type errors, and broken contracts. It is completely wrong for catching a class of bugs that only manifest under sustained operation.
The bug was a closure that captured a growing array of transaction IDs in the reconciliation batch processor. Each batch appended its IDs to an array that was supposed to be cleared after the batch completed. A refactor three days earlier had moved the array initialization outside the batch loop. The processor still worked perfectly for any individual batch. But over thousands of batches, the array grew without bound.
No unit test would catch this. The unit test processes one batch, maybe five. The array grows to a few hundred entries, well within normal memory. The bug only becomes visible when the processor runs continuously for hours, processing thousands of batches, and the array grows to millions of entries.
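The faulty pattern is easier to see in code. This is a hypothetical Python reconstruction; the real worker's names and reconciliation logic are not from the incident, only the shape of the bug is:

```python
def reconcile(batch, ids):
    """Stand-in for the real reconciliation work."""
    pass

def process_batches_leaky(batches):
    seen_ids = []                        # BUG: hoisted out of the loop by the refactor
    for batch in batches:
        for txn in batch:
            seen_ids.append(txn["id"])   # grows without bound across batches
        reconcile(batch, seen_ids)
    return len(seen_ids)                 # scales with every batch ever processed

def process_batches_fixed(batches):
    seen_ids = []                        # guard for an empty feed
    for batch in batches:
        seen_ids = []                    # fix: a fresh list per batch
        for txn in batch:
            seen_ids.append(txn["id"])
        reconcile(batch, seen_ids)
    return len(seen_ids)                 # bounded by one batch
```

A single-batch unit test passes both versions identically; only a sustained run of thousands of batches separates them.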
Why the Homelab Caught It
My homelab staging cluster runs a mirror of our production deployment on a mini PC with 32GB of RAM. It processes a synthetic transaction feed that generates realistic load 24/7. It is not load testing. It is sustained-operation testing, which is a fundamentally different thing.
- CI tests exercise code paths for seconds. The homelab exercises them for days.
- CI tests use fixture data. The homelab uses a synthetic feed that mimics real transaction patterns.
- CI tests measure correctness. The homelab Grafana dashboards measure behavior over time: memory, CPU, connection counts, queue depths.
- CI gives you pass/fail. The homelab gives you trends, and trends reveal slow-burn problems that pass/fail cannot.
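A synthetic feed does not need to be sophisticated to be useful. A minimal Python sketch of the idea, where the merchant names, amount distribution, and rate are illustrative assumptions rather than the real feed:

```python
import itertools
import random
import time

def synthetic_feed(merchants=("acme", "globex", "initech")):
    """Endless stream of realistic-looking transactions."""
    for txn_id in itertools.count():
        yield {
            "id": txn_id,
            "merchant": random.choice(merchants),
            # log-normal amounts: many small payments, a long tail of large ones
            "amount_cents": int(random.lognormvariate(8, 1.2)),
            "ts": time.time(),
        }

def run_feed(worker, rate_per_sec=50, limit=None):
    """Drive the worker at a steady pace so load is sustained, not bursty."""
    for n, txn in enumerate(synthetic_feed()):
        if limit is not None and n >= limit:
            return n
        worker(txn)
        time.sleep(1.0 / rate_per_sec)
```

The point is the pacing loop, not the payload: a steady trickle around the clock is what exposes slow-burn behavior.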
The Grafana dashboard showed a clean linear memory increase starting at exactly the timestamp of the deployment. No human would have noticed this in production for at least another week because production memory is noisy. The homelab, running a single consistent workload, made the signal unmistakable.
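The signal the dashboard made obvious can also be checked mechanically: a leak is a steady positive slope that a straight line fits well, while a GC sawtooth fits poorly and has near-zero slope. A least-squares sketch in Python, with thresholds that are illustrative rather than anything Grafana actually alerts on:

```python
def linear_trend(samples):
    """Least-squares slope and R^2 over evenly spaced memory samples."""
    n = len(samples)
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    slope = sxy / sxx
    ss_tot = sum((y - mean_y) ** 2 for y in samples)
    ss_res = sum((y - (mean_y + slope * (x - mean_x))) ** 2
                 for x, y in enumerate(samples))
    r2 = 1 - ss_res / ss_tot if ss_tot else 0.0
    return slope, r2

def looks_like_leak(samples, min_slope=0.5, min_r2=0.9):
    """True when memory climbs steadily rather than oscillating."""
    slope, r2 = linear_trend(samples)
    return slope > min_slope and r2 > min_r2
```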
The Fix and the Lesson
The fix was two lines of code: move the array initialization back inside the batch loop. The PR took ten minutes. But the fix would have been far more expensive had we found this bug in production, where the reconciliation worker running out of memory would have stalled payment processing for real customers.
The deeper lesson is that your cheapest infrastructure can be your most honest testing environment. My homelab k3s cluster cost $400 in hardware. It runs on a shelf in my office. It has caught three production bugs in the last year that CI could not have found, all of them slow-burn resource issues that only appear under sustained operation.
What I Run on the Homelab Now
After this incident, I expanded what the homelab staging mirror covers:
- All long-running background workers with synthetic load feeds
- Connection pool behavior under sustained concurrent access
- Cache eviction patterns over multi-day windows
- Database connection churn during simulated deploy cycles
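Most of these checks reduce to the same harness: run one unit of work in a loop and watch a resource counter over time. A minimal Python sketch using tracemalloc; the real cluster watches container metrics in Grafana rather than in-process counters, so this is the idea at its smallest:

```python
import tracemalloc

def soak(worker_step, iterations, sample_every=100):
    """Run one unit of worker work repeatedly, sampling heap usage.

    Returns the samples; feed them to a trend check or plot them.
    """
    tracemalloc.start()
    samples = []
    for i in range(iterations):
        worker_step()
        if i % sample_every == 0:
            current, _peak = tracemalloc.get_traced_memory()
            samples.append(current)
    tracemalloc.stop()
    return samples
```

A flat series passes; a monotonically rising one is exactly the straight-line climb the dashboard showed.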
Earlier in this story I was mostly trying to survive the blast radius myself. Here I was trying to design a system where the team did not need heroics in the first place. The same philosophy now shapes the infrastructure stack and ctrlpane.
CI tells you if your code is correct. Sustained staging tells you if your code is stable. You need both.
The total cost of running this environment is electricity and the occasional Saturday morning when I update the k3s cluster. The value is catching bugs that would otherwise reach production and manifest as the kind of slow degradation that erodes customer trust before anyone files a ticket. Every startup should have a junk drawer environment that runs your code the way production does: continuously, messily, and without mercy.
The homelab staging environment caught the bug because it ran the same workload patterns as production, just at smaller scale. CI tests run in isolation with mocked dependencies and synthetic data. Staging runs real services talking to real databases with realistic traffic shapes. The gap between those two environments is where production bugs hide.

After this incident, we added a weekly staging smoke test that replays a subset of production traffic patterns. The cost is minimal, a few extra minutes of compute on hardware that would otherwise sit idle, and the confidence it provides is disproportionate. The homelab is not a toy. It is the cheapest production-grade testing environment available to a small team, and it earns its rack space every month.
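The weekly replay can be as simple as preserving the relative spacing between recorded events. A Python sketch where the event format and `send` hook are assumptions; `speedup` compresses a production hour into staging minutes:

```python
import time

def replay(events, speedup=10.0, send=print):
    """Replay recorded (timestamp, payload) events, preserving their
    relative spacing compressed by `speedup`."""
    if not events:
        return 0
    start = events[0][0]
    t0 = time.monotonic()
    sent = 0
    for ts, payload in events:
        target = (ts - start) / speedup
        delay = target - (time.monotonic() - t0)
        if delay > 0:
            time.sleep(delay)   # keep the traffic shape, not just the volume
        send(payload)
        sent += 1
    return sent
```

Preserving spacing matters because bursty and smooth arrivals stress connection pools and queues very differently, even at the same total volume.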