Observability on a shoestring: Grafana, Loki, and Tempo for free
We could not afford Datadog. I self-hosted the Grafana stack on our k3s homelab cluster and pointed production at it.
Datadog quoted us $0.10 per GB for log management, plus $400 per month. Four hundred dollars a month for observability at a pre-seed startup that was watching every dollar. I could not justify it. But I also could not fly blind. The database incident two months earlier proved that running without observability is just waiting for the next fire to get out of control before you notice the smoke.
So I self-hosted the entire Grafana observability stack on our k3s homelab cluster. Grafana for dashboards, Loki for logs, Tempo for distributed traces. Total cost: zero dollars beyond the hardware I already owned. Total RAM usage: about 1 GB. Setup time: one weekend.
A lot of my month-one leadership came through infrastructure choices that looked small from the outside. It also builds on what I learned earlier in “ArgoCD on a single-node k3s cluster: overkill or exactly right.” I was building the muscle memory that later fed the infrastructure and ctrlpane projects at home: reproducible defaults, cheap feedback loops, and enough observability that I did not need to guess under pressure.
The Architecture
The stack runs on the homelab k3s cluster, not on our production infrastructure. Production services send telemetry outbound to the homelab through a Cloudflare Tunnel. This keeps the observability system completely separate from production, which means a Grafana crash does not affect the product and a production outage does not take down our monitoring.
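The tunnel side of this is a short cloudflared config. A sketch of what the ingress rules might look like (the tunnel name, hostnames, and service addresses below are placeholders, not our actual setup):

```yaml
# cloudflared config on the homelab side (hostnames are illustrative)
tunnel: homelab-observability
credentials-file: /etc/cloudflared/homelab-observability.json
ingress:
  - hostname: loki.example.dev
    service: http://loki.monitoring.svc.cluster.local:3100
  - hostname: otel.example.dev
    service: http://otel-collector.monitoring.svc.cluster.local:4318
  - service: http_status:404   # catch-all rule, required last
```

Production hosts only ever see the public tunnel hostnames; nothing on the homelab is directly exposed.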
- Grafana: Dashboard and alerting UI. Connects to Loki for log queries and Tempo for trace exploration. Runs on about 200 MB of RAM.
- Loki: Log aggregation. Receives logs via Promtail agents on production hosts. Uses filesystem storage on the homelab SSD. Runs on about 400 MB of RAM.
- Tempo: Distributed tracing backend. Receives traces via OpenTelemetry Collector. Uses filesystem storage. Runs on about 300 MB of RAM.
- OpenTelemetry Collector: Receives traces from production services and forwards them to Tempo. Also scrapes Prometheus metrics. Runs on about 100 MB of RAM.
- Promtail: Log shipper. Runs on each production host and tails container logs. Forwards to Loki via the Cloudflare Tunnel.
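A trimmed Promtail config for one of those production hosts could look like the following (the Loki push URL and log paths are illustrative; Docker's default JSON log location is assumed):

```yaml
# promtail.yaml on a production host
server:
  http_listen_port: 9080
positions:
  filename: /var/lib/promtail/positions.yaml   # remembers how far each file was read
clients:
  - url: https://loki.example.dev/loki/api/v1/push   # Loki behind the tunnel
scrape_configs:
  - job_name: containers
    static_configs:
      - targets: [localhost]
        labels:
          job: docker
          host: prod-1
          __path__: /var/lib/docker/containers/*/*-json.log
```

The `positions` file is what lets Promtail restart without re-shipping or dropping log lines.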
OpenTelemetry Auto-Instrumentation
The fastest way to get traces from a Node.js application is OpenTelemetry auto-instrumentation. You add the SDK, configure an exporter, and every HTTP request, database query, and external API call gets traced automatically, with no changes to your application logic.
```ts
// instrumentation.ts — loaded before the application
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': { enabled: true },
      '@opentelemetry/instrumentation-pg': { enabled: true },
      '@opentelemetry/instrumentation-express': { enabled: true },
    }),
  ],
});

sdk.start();
```

The file has to load before the application entry point, for example with `node --require ./instrumentation.js server.js`. With this configuration, every incoming HTTP request creates a trace. Every PostgreSQL query within that request creates a child span. Every outgoing HTTP call to a payment processor creates another child span. The full request lifecycle is visible in Tempo without adding a single line of instrumentation to the application code.
The Three Dashboards That Catch 90 Percent of Issues
I built three Grafana dashboards that cover the vast majority of production issues. Each one is designed to answer a specific question in under ten seconds.
- API Health: Request rate, error rate, P50/P95/P99 latency, grouped by endpoint. If the error rate spikes or latency doubles, I see it here first. Alert threshold: error rate above 5% for 5 minutes.
- Database Performance: Active connections, query duration by type, lock wait time, replication lag. This catches slow queries and connection pool issues before they cascade. Alert threshold: average query duration above 500ms for 3 minutes.
- Payment Pipeline: Payment processing volume, success rate, reconciliation match rate, webhook delivery latency. This is the business-critical dashboard. Alert threshold: success rate below 98% for 2 minutes.
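Expressed as a Prometheus-style rule, the API Health threshold comes out to something like this (`http_requests_total` is an assumed metric name; the real label names depend on how the services are instrumented):

```yaml
groups:
  - name: api-health
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        annotations:
          summary: "API error rate above 5% for 5 minutes"
```

The `for: 5m` clause is what keeps a single bad scrape from paging anyone; the condition must hold continuously before the alert fires.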
Each dashboard has alert rules that send notifications via a Slack webhook. The thresholds are deliberately conservative. I would rather get a false positive and dismiss it than miss a real incident. In three months, we have had six alerts: four real issues caught early and two false positives from brief traffic spikes.
Resource Allocation and Retention
Running the full observability stack on 1 GB of RAM requires careful resource management. Loki and Tempo use filesystem storage instead of S3 or GCS, which is fine for a small operation but limits retention to what the SSD can hold.
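On the Loki side, filesystem storage and retention come down to a few config keys. An abridged sketch (directories are illustrative, and the exact keys vary somewhat between Loki versions):

```yaml
storage_config:
  filesystem:
    directory: /data/loki/chunks

limits_config:
  retention_period: 336h   # 14 days

compactor:
  working_directory: /data/loki/compactor
  retention_enabled: true     # compactor enforces retention_period
  retention_delete_delay: 2h  # grace period before chunks are deleted
```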
- Log retention: 14 days. Older logs are automatically deleted by Loki compaction. For compliance-critical audit logs, we store them separately in PostgreSQL with no expiration.
- Trace retention: 7 days. Traces are large and 7 days is enough to debug any recent issue. For production incidents, we export the relevant traces to a JSON file before they expire.
- Dashboard data: Prometheus metrics have 30-day retention. Grafana dashboards point at the Prometheus data source for historical charts.
- Total disk usage: About 15 GB for 14 days of logs and 7 days of traces across our four production hosts.
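The trace-export step above is a small script. A sketch, assuming Node 18+ and a placeholder Tempo address (`/api/traces/<id>` is Tempo's trace-by-ID endpoint):

```typescript
// Sketch: save a Tempo trace to JSON before retention expires it.
// TEMPO_URL is a placeholder for wherever Tempo is reachable.
import { writeFileSync } from 'node:fs';

const TEMPO_URL = process.env.TEMPO_URL ?? 'http://tempo.homelab.local:3200';

export function traceUrl(traceId: string): string {
  return `${TEMPO_URL}/api/traces/${traceId}`;
}

export async function exportTrace(traceId: string): Promise<string> {
  const res = await fetch(traceUrl(traceId)); // Node 18+ global fetch
  if (!res.ok) throw new Error(`Tempo returned ${res.status}`);
  const file = `trace-${traceId}.json`;
  writeFileSync(file, JSON.stringify(await res.json(), null, 2));
  return file;
}

// Usage: TRACE_ID=<id> npx ts-node export-trace.ts
if (process.env.TRACE_ID) {
  exportTrace(process.env.TRACE_ID).then((f) => console.log(`wrote ${f}`));
}
```

Exported JSON files go into the incident's postmortem folder, so the evidence survives the 7-day window.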
Is This Production-Grade?
Honestly, no. A self-hosted observability stack on a homelab is not the same as Datadog or Grafana Cloud. It has no redundancy, no cross-region replication, and if my homelab goes down I lose monitoring until it comes back up. But for a pre-seed startup spending $0, the tradeoff is clear. We get 90% of the value of a managed service at 0% of the cost. When we raise our next round, Grafana Cloud is the likely upgrade because it uses the same query language and dashboard format. The migration will be straightforward.
That was the pattern of my first months at FinanceOps: I did not have management scar tissue yet, so I earned trust by making technical decisions that stayed boring under pressure. The same bias toward strict defaults still shows up in infrastructure and ctrlpane today.
Observability is not optional, but it does not have to be expensive. If you cannot afford a managed observability service, self-host the Grafana stack. One weekend of setup is cheaper than one undetected production incident.