Observability on a shoestring: Grafana, Loki, and Tempo for free
We could not afford Datadog. I self-hosted the Grafana stack on our k3s homelab cluster and pointed production at it.
Datadog quoted us $0.10 per GB for log management, plus $400 per month. Four hundred dollars a month for observability at a pre-seed startup that was watching every dollar. I could not justify it. But I also could not fly blind. The database incident two months earlier proved that running without observability is just waiting for the next fire to get out of control before you notice the smoke.
So I self-hosted the entire Grafana observability stack on our k3s homelab cluster. Grafana for dashboards, Loki for logs, Tempo for distributed traces. Total cost: zero dollars beyond the hardware I already owned. Total RAM usage: about 1 GB. Setup time: one weekend.
A lot of my month-one leadership came through infrastructure choices that looked small from the outside. It also builds on what I learned earlier in “ArgoCD on a single-node k3s cluster: overkill or exactly right.” I was building the muscle memory that later fed the infrastructure and ctrlpane projects at home: reproducible defaults, cheap feedback loops, and enough observability that I did not need to guess under pressure.
The Architecture
The stack runs on the homelab k3s cluster, not on our production infrastructure. Production services send telemetry outbound to the homelab through a Cloudflare Tunnel. This keeps the observability system completely separate from production, which means a Grafana crash does not affect the product and a production outage does not take down our monitoring.
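The tunnel side of this is a short cloudflared config. A sketch of what the ingress rules might look like (the tunnel name, hostnames, and service addresses below are placeholders, not our actual setup):

```yaml
# cloudflared config on the homelab side (hostnames are illustrative)
tunnel: homelab-observability
credentials-file: /etc/cloudflared/homelab-observability.json
ingress:
  - hostname: loki.example.dev
    service: http://loki.monitoring.svc.cluster.local:3100
  - hostname: otel.example.dev
    service: http://otel-collector.monitoring.svc.cluster.local:4318
  - service: http_status:404   # catch-all rule, required last
```

Production hosts only ever see the public tunnel hostnames; nothing on the homelab is directly exposed.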
- Grafana: Dashboard and alerting UI. Connects to Loki for log queries and Tempo for trace exploration. Runs on about 200 MB of RAM.
- Loki: Log aggregation. Receives logs via Promtail agents on production hosts. Uses filesystem storage on the homelab SSD. Runs on about 400 MB of RAM.
- Tempo: Distributed tracing backend. Receives traces via OpenTelemetry Collector. Uses filesystem storage. Runs on about 300 MB of RAM.
- OpenTelemetry Collector: Receives traces from production services and forwards them to Tempo. Also scrapes Prometheus metrics. Runs on about 100 MB of RAM.
- Promtail: Log shipper. Runs on each production host and tails container logs. Forwards to Loki via the Cloudflare Tunnel.
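A trimmed Promtail config for one of those production hosts could look like the following (the Loki push URL and log paths are illustrative; Docker's default JSON log location is assumed):

```yaml
# promtail.yaml on a production host
server:
  http_listen_port: 9080
positions:
  filename: /var/lib/promtail/positions.yaml   # remembers how far each file was read
clients:
  - url: https://loki.example.dev/loki/api/v1/push   # Loki behind the tunnel
scrape_configs:
  - job_name: containers
    static_configs:
      - targets: [localhost]
        labels:
          job: docker
          host: prod-1
          __path__: /var/lib/docker/containers/*/*-json.log
```

The `positions` file is what lets Promtail restart without re-shipping or dropping log lines.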
OpenTelemetry Auto-Instrumentation
The fastest way to get traces from a Node.js application is OpenTelemetry auto-instrumentation. You add the SDK, configure an exporter, and every HTTP request, database query, and external API call gets traced automatically, with no changes to your application logic.
```ts
// instrumentation.ts — loaded before the application
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': { enabled: true },
      '@opentelemetry/instrumentation-pg': { enabled: true },
      '@opentelemetry/instrumentation-express': { enabled: true },
    }),
  ],
});

sdk.start();
```

The file has to load before the application entry point, for example with `node --require ./instrumentation.js server.js`. With this configuration, every incoming HTTP request creates a trace. Every PostgreSQL query within that request creates a child span. Every outgoing HTTP call to a payment processor creates another child span. The full request lifecycle is visible in Tempo without adding a single line of instrumentation to the application code.
The Three Dashboards That Catch 90 Percent of Issues
I built three Grafana dashboards that cover the vast majority of production issues. Each one is designed to answer a specific question in under ten seconds.
- API Health: Request rate, error rate, P50/P95/P99 latency, grouped by endpoint. If the error rate spikes or latency doubles, I see it here first. Alert threshold: error rate above 5% for 5 minutes.
- Database Performance: Active connections, query duration by type, lock wait time, replication lag. This catches slow queries and connection pool issues before they cascade. Alert threshold: average query duration above 500ms for 3 minutes.
- Payment Pipeline: Payment processing volume, success rate, reconciliation match rate, webhook delivery latency. This is the business-critical dashboard. Alert threshold: success rate below 98% for 2 minutes.
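Expressed as a Prometheus-style rule, the API Health threshold comes out to something like this (`http_requests_total` is an assumed metric name; the real label names depend on how the services are instrumented):

```yaml
groups:
  - name: api-health
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        annotations:
          summary: "API error rate above 5% for 5 minutes"
```

The `for: 5m` clause is what keeps a single bad scrape from paging anyone; the condition must hold continuously before the alert fires.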
Each dashboard has alert rules that send notifications via a Slack webhook. The thresholds are deliberately conservative. I would rather get a false positive and dismiss it than miss a real incident. In three months, we have had six alerts: four real issues caught early and two false positives from brief traffic spikes.
Resource Allocation and Retention
Running the full observability stack on 1 GB of RAM requires careful resource management. Loki and Tempo use filesystem storage instead of S3 or GCS, which is fine for a small operation but limits retention to what the SSD can hold.
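On the Loki side, filesystem storage and retention come down to a few config keys. An abridged sketch (directories are illustrative, and the exact keys vary somewhat between Loki versions):

```yaml
storage_config:
  filesystem:
    directory: /data/loki/chunks

limits_config:
  retention_period: 336h   # 14 days

compactor:
  working_directory: /data/loki/compactor
  retention_enabled: true     # compactor enforces retention_period
  retention_delete_delay: 2h  # grace period before chunks are deleted
```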
- Log retention: 14 days. Older logs are automatically deleted by Loki compaction. For compliance-critical audit logs, we store them separately in PostgreSQL with no expiration.
- Trace retention: 7 days. Traces are large and 7 days is enough to debug any recent issue. For production incidents, we export the relevant traces to a JSON file before they expire.
- Dashboard data: Prometheus metrics have 30-day retention. Grafana dashboards point at the Prometheus data source for historical charts.
- Total disk usage: About 15 GB for 14 days of logs and 7 days of traces across our four production hosts.
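The trace-export step above is a small script. A sketch, assuming Node 18+ and a placeholder Tempo address (`/api/traces/<id>` is Tempo's trace-by-ID endpoint):

```typescript
// Sketch: save a Tempo trace to JSON before retention expires it.
// TEMPO_URL is a placeholder for wherever Tempo is reachable.
import { writeFileSync } from 'node:fs';

const TEMPO_URL = process.env.TEMPO_URL ?? 'http://tempo.homelab.local:3200';

export function traceUrl(traceId: string): string {
  return `${TEMPO_URL}/api/traces/${traceId}`;
}

export async function exportTrace(traceId: string): Promise<string> {
  const res = await fetch(traceUrl(traceId)); // Node 18+ global fetch
  if (!res.ok) throw new Error(`Tempo returned ${res.status}`);
  const file = `trace-${traceId}.json`;
  writeFileSync(file, JSON.stringify(await res.json(), null, 2));
  return file;
}

// Usage: TRACE_ID=<id> npx ts-node export-trace.ts
if (process.env.TRACE_ID) {
  exportTrace(process.env.TRACE_ID).then((f) => console.log(`wrote ${f}`));
}
```

Exported JSON files go into the incident's postmortem folder, so the evidence survives the 7-day window.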
Is This Production-Grade?
Honestly, no. A self-hosted observability stack on a homelab is not the same as Datadog or Grafana Cloud. It has no redundancy, no cross-region replication, and if my homelab goes down I lose monitoring until it comes back up. But for a pre-seed startup spending $0, the tradeoff is clear. We get 90% of the value of a managed service at 0% of the cost. When we raise our next round, Grafana Cloud is the likely upgrade because it uses the same query language and dashboard format. The migration will be straightforward.
That was the pattern of my first months at FinanceOps: I did not have management scar tissue yet, so I earned trust by making technical decisions that stayed boring under pressure. The same bias toward strict defaults still shows up in infrastructure and ctrlpane today.
Observability is not optional, but it does not have to be expensive. If you cannot afford a managed observability service, self-host the Grafana stack. One weekend of setup is cheaper than one undetected production incident.