
Grafana, Loki, and Tempo: building an observability stack that a four-person team actually uses

Most observability guides assume a platform team. We do not have one. The hard part was not installation but building dashboards engineers actually check daily.

In early 2025, I rebuilt our observability stack from scratch. Grafana and Loki had been running since our early days, but the dashboards were unused, the log queries were ad hoc, and we had no distributed tracing at all. The rebuild was not about tooling. We had the right tools. It was about building an observability practice that four engineers would actually use daily instead of touching only when something was already on fire.

The kind of infrastructure that teaches you by breaking.

This is where the homelab stopped being a hobby and started acting like a leadership tool. It also builds on what I learned earlier in “ArgoCD and GitOps for a team of four: overkill or exactly right.” The infrastructure and ctrlpane work gave me a cheap place to pressure-test release habits, GitOps discipline, and failure modes before I asked the team to trust those defaults at work.

The infrastructure mess that made the lesson stick.

Why Most Observability Setups Fail for Small Teams

Most observability content online is written for organizations with dedicated platform engineers or SRE teams. It recommends comprehensive dashboards covering hundreds of metrics, multi-layer alerting with escalation policies, and trace-sampling strategies for high-throughput services. None of that is practical for a four-person team where everyone is also writing features.

Our first observability setup had 12 Grafana dashboards. Nobody looked at them. We had 30 alert rules; the alerts were so noisy that engineers muted the Slack channel. We had Loki collecting logs from every service but no agreed-upon log format, so queries required remembering which service used which field names. The tools were installed. The practice of observability was not.

The rebuild started with a question: what are the five things we most need to know about our system health at any given moment? Not fifty things. Five. Everything else would be available for investigation but not on the primary dashboard.

The Five Dashboards That Matter

We settled on five dashboard panels, all on a single Grafana dashboard that is set as the browser homepage on every engineer's workstation.

  • Kafka consumer lag: the single most important metric for our async reconciliation pipeline. If lag is growing, something is wrong.
  • API response time P95: if the 95th percentile response time exceeds 500 milliseconds, the user experience is degrading.
  • PostgreSQL active connections and query latency: connection pool saturation and slow query trends, the two database problems that have caused our worst incidents.
  • Error rate by service: 5xx responses per minute across all services. Any spike above baseline triggers investigation.
  • Deployment timeline: a vertical line annotation showing when each deployment happened, correlated with all other metrics to instantly see if a deploy caused a change.

That is it. Five panels. One dashboard. If all five are green, the system is healthy and you can focus on feature work. If any of them changes, you investigate. The simplicity is the point. A dashboard with 50 panels is a dashboard nobody looks at.
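The panels above reduce to a handful of Prometheus queries. A sketch of three of them follows; the exact metric names depend on which exporters you run (kafka-exporter for consumer lag, a standard HTTP duration histogram for latency), so treat these as illustrative, not as our exact dashboard JSON.

```promql
# Kafka consumer lag per consumer group (kafka-exporter naming assumed)
sum by (consumergroup) (kafka_consumergroup_lag)

# API response time P95, assuming a conventional request-duration histogram
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Error rate by service: 5xx responses per minute
sum by (service) (rate(http_requests_total{status=~"5.."}[1m])) * 60
```

The deployment timeline panel is not a query at all: it is a Grafana annotation layer, so deploy markers overlay every other panel on the same dashboard.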

Structured Logging With Loki

The Loki setup was rebuilt around a single principle: every log line must be a JSON object with a consistent schema. No more unstructured text logs that require regex queries. No more service-specific field names.

{
  "level": "error",
  "msg": "reconciliation match failed",
  "service": "reconciliation-engine",
  "traceId": "abc123def456",
  "statementId": "stmt_789",
  "error": "downstream timeout after 5000ms",
  "duration_ms": 5012
}

Every log line includes level, msg, service, and traceId. Domain-specific fields like statementId are added where relevant. The consistent schema means Loki queries work the same way across all services. Finding all errors for a specific reconciliation statement is a single query regardless of which service produced the log.

{service="reconciliation-engine"} | json | level="error" | statementId="stmt_789"
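Enforcing the schema is a few lines of code. Here is a minimal sketch of a logger that emits it; the field names (level, msg, service, traceId) match the example above, but the function names are illustrative, not our actual library.

```javascript
// Minimal structured logger sketch: fixed schema keys first,
// domain-specific fields merged in afterwards.
function formatLine(level, msg, service, fields = {}) {
  return JSON.stringify({ level, msg, service, ...fields });
}

function createLogger(service) {
  const emit = (level) => (msg, fields) =>
    process.stdout.write(formatLine(level, msg, service, fields) + "\n");
  return { info: emit("info"), warn: emit("warn"), error: emit("error") };
}

const log = createLogger("reconciliation-engine");
log.error("reconciliation match failed", {
  traceId: "abc123def456",
  statementId: "stmt_789",
  error: "downstream timeout after 5000ms",
  duration_ms: 5012,
});
```

One line per JSON object on stdout is all Loki's json parser needs; the container runtime and the log agent handle shipping.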

Distributed Tracing With Tempo

Tempo was the newest addition. We use OpenTelemetry auto-instrumentation for our Node.js services, which captures HTTP requests, database queries, and Kafka message processing with zero manual instrumentation code. The traces flow to Tempo and are queryable from Grafana.

The killer feature is exemplars: clickable links from a metric spike directly to the traces that caused it. When the P95 response time spikes on the dashboard, clicking the spike shows the specific requests that were slow. From the trace, you can see which database query or downstream call caused the latency. The debugging flow goes from “something is slow” to “this specific query on this specific request is slow” in three clicks.

  • Auto-instrumentation captures HTTP, PostgreSQL, and Kafka spans without code changes
  • Trace IDs propagate through HTTP headers and Kafka message headers automatically
  • Log lines include the trace ID, linking structured logs to distributed traces
  • Grafana Explore allows querying traces by duration, service, or error status
  • We sample 10 percent of traces in production to keep storage costs manageable
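The whole setup above fits in one bootstrap file loaded before the application. This is a sketch using the standard OpenTelemetry Node.js packages; the Tempo endpoint, service name, and file name are placeholders, and SDK APIs shift between versions, so check against the version you install.

```javascript
// tracing.js — load before the app, e.g. `node --require ./tracing.js server.js`
const { NodeSDK } = require("@opentelemetry/sdk-node");
const {
  getNodeAutoInstrumentations,
} = require("@opentelemetry/auto-instrumentations-node");
const { OTLPTraceExporter } = require("@opentelemetry/exporter-trace-otlp-http");
const { TraceIdRatioBasedSampler } = require("@opentelemetry/sdk-trace-base");

const sdk = new NodeSDK({
  serviceName: "reconciliation-engine", // placeholder service name
  traceExporter: new OTLPTraceExporter({
    url: "http://tempo:4318/v1/traces", // Tempo's OTLP/HTTP endpoint (assumed address)
  }),
  // Matches the 10 percent production sampling mentioned above.
  sampler: new TraceIdRatioBasedSampler(0.1),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```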

The Impact

After rebuilding the observability stack around these principles, our mean time to detection for production issues dropped by roughly 60 percent. Not because the tools changed. Because the practice changed. Engineers look at the dashboard every morning. They notice when a metric drifts before it becomes an incident. They follow traces from metrics to logs without context-switching between tools.

Homelab, but treated like a real environment.

By this stage the job had changed. I was no longer just picking a tool or fixing a bug. I was carrying the blast radius across product, compliance, sales, and hiring. That is exactly why I kept pressure-testing the same lesson inside infrastructure and ctrlpane.

Observability for a small team is not about having the most comprehensive monitoring. It is about having the five metrics that actually tell you whether your system is healthy, structured logs that are queryable without regex, and traces that connect the dots when something goes wrong. Install less, configure better, and make it the browser homepage.

The entire stack runs on our k3s cluster consuming about 2 GB of RAM and 10 GB of disk. Grafana, Loki, and Tempo are free and open source. The setup took a weekend. The cultural shift took a month. The weekend was the easy part.