OpenTelemetry and Grafana: A production-grade observability stack with Tempo, Mimir and Loki
Deploy a production-grade observability stack with OpenTelemetry and Grafana: traces in Tempo, metrics in Mimir, logs in Loki, and SLO-driven alerting in hours.
What this OpenTelemetry + Grafana stack delivers
This article describes a complete, production-ready observability stack built around OpenTelemetry (OTel) and Grafana components that you can deploy in a single weekend. The stack unifies traces, metrics, and logs under a single UI and is designed so teams instrument once and route telemetry to multiple backends without vendor lock-in. The key pieces are the OpenTelemetry Collector (deployed in gateway mode), Grafana Tempo for distributed traces, Grafana Mimir for scalable Prometheus-compatible metrics, Grafana Loki for label-indexed logs, Grafana for visualization and dashboards, and AlertManager for routing and paging alerts. Prerequisites are Kubernetes 1.25 or newer, Helm 3, and basic familiarity with YAML; the estimated end-to-end deployment time is roughly 3–5 hours.
Why OpenTelemetry ends the vendor-lock debate
OpenTelemetry is presented here as the standardized telemetry SDK and protocol that decouples instrumentation from backend choice. The stack relies on OTel so application teams instrument code once (or use auto-instrumentation) and send OTLP to a single Collector gateway. From that gateway the Collector can forward traces to Tempo, metrics to Mimir, and logs to Loki — or to third-party vendors simultaneously — without code changes. The Collector also provides batching, validation and basic processing (for example memory limiting and attribute enrichment), so the ingress point is a controlled, vendor-agnostic surface for all services.
Architecture: Collector gateway to Tempo, Mimir, Loki, Grafana and AlertManager
The deployment pattern centers on a single ingress Collector in gateway mode receiving OTLP from apps. That Collector validates, batches, and routes telemetry to three backends: Tempo for traces (object-storage backed), Mimir for long-term Prometheus-compatible metrics, and Loki for structured log aggregation. Grafana sits on top as a unified UI with Explore, dashboards and linked cross-data-source workflows; AlertManager deduplicates and routes alerts to systems such as PagerDuty or Slack.
Conceptually:
- Applications → OTLP → OTel Collector (gateway)
- Collector → Tempo (traces), Mimir (metrics), Loki (logs)
- Grafana queries Tempo/Mimir/Loki for visualization and correlation
- AlertManager receives alerts (Prometheus rules) and routes pages
The recommended minimum storage footprint called out for a small production deployment is approximately 100 GB for Tempo, 50 GB for Loki, and 50 GB for Mimir; Tempo and Mimir are configured to use object storage (S3/GCS/MinIO) for durability and lower index costs.
Installing the OpenTelemetry Collector in gateway mode
Deploy the Collector as the single ingress gateway rather than as a per-node agent. In gateway mode the Collector exposes OTLP endpoints (gRPC and HTTP), applies processors such as batch and memory limiter, optionally upserts attributes like environment, and exports traces, metrics and logs to configured destinations. The Collector configuration routes traces to an OTLP/Tempo exporter, metrics to a Prometheus remote-write exporter pointed at Mimir, and logs to Loki’s push API.
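A gateway configuration along these lines covers the pipeline just described; the backend service names, ports, and limits below are illustrative assumptions, not values from the original deployment:

```yaml
# Collector gateway sketch: OTLP in, Tempo/Mimir/Loki out.
# Service DNS names and tuning values are assumptions for illustration.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 25
  attributes:
    actions:
      - key: deployment.environment
        value: production
        action: upsert        # the attribute enrichment mentioned above
  batch:
    send_batch_size: 8192
    timeout: 5s

exporters:
  otlp/tempo:
    endpoint: tempo-distributor.observability:4317
    tls:
      insecure: true          # assume in-cluster traffic; enable TLS in production
  prometheusremotewrite:
    endpoint: http://mimir-distributor.observability:8080/api/v1/push
  loki:
    endpoint: http://loki-gateway.observability:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, attributes, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]
```

The `memory_limiter` runs first in each pipeline so backpressure is applied before any other processing, which is the recommended ordering for gateway Collectors.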
The guide used Helm charts to install the Collector: add the OpenTelemetry Helm repository and deploy the opentelemetry-collector chart with a values file that configures receivers, processors, and exporters. After deployment, confirm ingestion by sending a test span, metric, or log through the Collector and checking the Collector logs for the corresponding trace IDs — that is the primary verification step.
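The install and verification steps might look like the following; the release name, namespace, and values file are assumptions, and `telemetrygen` is an OpenTelemetry contrib tool swapped in here as one convenient way to emit a test span:

```shell
# Add the OpenTelemetry Helm repo and install the Collector as a gateway
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update
helm install otel-collector open-telemetry/opentelemetry-collector \
  --namespace observability --create-namespace \
  --set mode=deployment \
  -f collector-values.yaml

# Verify ingestion with a synthetic trace, then look for trace IDs in the logs
telemetrygen traces --otlp-endpoint otel-collector.observability:4317 \
  --otlp-insecure --traces 1
kubectl -n observability logs deploy/otel-collector | grep -i traceid
```

`mode=deployment` gives the central gateway Deployment described above rather than the chart's per-node DaemonSet mode.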
Auto-instrumentation for Java, Python, Node.js and minimal Go setup
To minimize code changes, use OTel auto-instrumentation where available:
- Java: attach the OpenTelemetry Java agent via JAVA_TOOL_OPTIONS and set OTEL_EXPORTER_OTLP_ENDPOINT to point at the Collector so JVM apps start exporting traces and metrics without changing application code.
- Python: install the opentelemetry distro and the OTLP exporter, then bootstrap auto-instrumentation and point the exporter at the Collector before launching the app.
- Node.js: install the node auto-instrumentations package and run the process under the opentelemetry instrument command with a service name and exporter endpoint configured.
- Go: the Go ecosystem typically requires a small amount of manual setup; initialize a tracer with the OTLP gRPC exporter configured to the Collector endpoint. The article’s deployment pattern treats Go as “minimal manual instrumentation required.”
These approaches let teams capture spans, metrics and logs with minimal developer friction and enable trace ID propagation across language boundaries.
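On Kubernetes, the Java path above typically reduces to environment configuration in the pod spec. The agent path, service names, and Collector address below are illustrative assumptions:

```yaml
# Pod spec fragment: Java auto-instrumentation via the OTel Java agent.
# Agent jar location and Collector endpoint are assumptions for illustration.
env:
  - name: JAVA_TOOL_OPTIONS
    value: "-javaagent:/otel/opentelemetry-javaagent.jar"
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://otel-collector.observability:4317"
  - name: OTEL_SERVICE_NAME
    value: "checkout-service"
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: "deployment.environment=production"
```

The same `OTEL_*` environment variables are honored by the Python and Node.js auto-instrumentation as well, so one ConfigMap can drive exporter settings across languages.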
Deploying Tempo for distributed tracing on object storage
Tempo is the tracing backend in this stack and is configured to write raw trace data into object storage (for example S3 or MinIO) while indexing only by trace ID. Because there is no full-text or attribute index to build and maintain, storage and query-side costs stay low. Tempo’s configuration includes the S3 bucket and credentials (or a MinIO endpoint and access keys for an on-prem object store), ingestion rate limits, and distributor receivers listening on OTLP gRPC.
The recommended Helm-based deployment pattern uses the Grafana Helm charts for Tempo and sets the query-frontend URL that Grafana will use when connecting to Tempo as a data source. Tempo is intentionally set up with object-storage-backed retention and a worker pool configuration so ingestion throughput and queue depth match expected load.
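A Tempo configuration fragment matching that description might look like this; the bucket name, MinIO endpoint, and credential placeholders are assumptions:

```yaml
# Tempo fragment: object-storage backend plus OTLP gRPC ingestion.
# Bucket, endpoint, and credentials are illustrative assumptions.
storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces
      endpoint: minio.observability:9000   # MinIO for on-prem; omit for AWS S3
      access_key: <s3-access-key>
      secret_key: <s3-secret-key>
      insecure: true

distributor:
  receivers:
    otlp:
      protocols:
        grpc:

overrides:
  defaults:
    ingestion:
      rate_limit_bytes: 20000000           # per-tenant ingest cap (assumed value)
```

With `backend: s3` only the trace-ID index lives alongside the blocks, which is what keeps Tempo's storage bill dominated by cheap object storage.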
Prometheus + Mimir for horizontally scalable long-term metrics
Mimir takes the place of single-node Prometheus to provide horizontal scalability, replication and durable blocks storage backed by object storage. The stack configures Mimir’s blocks storage to use S3/MinIO, replication factor for high availability (for example a replication factor of three for ingesters), and multiple replicas for ingesters, distributors and queriers to achieve capacity and redundancy. Existing Prometheus deployments can be migrated by converting local TSDB data into blocks and then pointing Prometheus remote_write at Mimir’s distributor push endpoint so future metrics flow into the horizontally scaled store.
Mimir is used as the Prometheus-compatible query endpoint that Grafana will set as the default Prometheus data source, enabling familiar PromQL-based dashboards and alerting rules to continue working at scale.
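The remote_write hand-off described above is a small Prometheus configuration fragment. The service DNS name below is an assumption; `/api/v1/push` is Mimir's distributor push path:

```yaml
# prometheus.yml fragment: forward all scraped metrics to Mimir
remote_write:
  - url: http://mimir-distributor.observability:8080/api/v1/push
    headers:
      X-Scope-OrgID: primary   # tenant ID, only needed when multi-tenancy is enabled
```

Existing scrape configs and recording rules stay untouched; only the write path changes, which is what makes the migration incremental.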
Loki for label-indexed log aggregation and structured querying
Loki operates like “Prometheus for logs” by indexing only labels rather than full text, making long-term log storage cheaper. The stack configures Loki to store chunks in object storage, sets a schema config and boltdb-shipper for index metadata, and imposes ingestion and stream limits to protect cluster stability. Example queries demonstrate how to filter logs by namespace and application labels, search for entries containing “error”, parse JSON payloads, filter on fields such as latency, and render a formatted line that includes a trace_id for fast trace-to-log correlation.
The recommended Loki configuration includes a chunk-store lookback period (for example 28 days), ingestion throttles and S3/MinIO credentials for persistent storage.
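The query patterns described above look roughly like this in LogQL; the label names, JSON fields, and threshold are illustrative assumptions:

```logql
# Errors for one app: label filter, substring match, JSON parse,
# field filter, and a formatted line carrying the trace_id.
{namespace="production", app="checkout"}
  |= "error"
  | json
  | latency > 250ms
  | line_format "trace_id={{.trace_id}} {{.message}}"
```

Because only `namespace` and `app` are index labels, the `|= "error"` and JSON stages run as streaming filters over chunks, which is the cost model the section describes.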
Connecting Tempo, Mimir and Loki inside Grafana for fast correlation
Grafana is configured with three provisioned data sources: Mimir (Prometheus-compatible), Tempo (traces) and Loki (logs). The data source configuration enables trace-to-log linking — for example by pointing Tempo at Loki as its logs data source, and by defining derived fields on Loki so log entries containing trace_id values become clickable links that open the matching Tempo trace. Grafana dashboard providers ship the SLO dashboards as files mounted into Grafana’s dashboards folder, so teams get out-of-the-box SLO panels.
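A provisioning fragment wiring up that correlation might look like this; the UIDs, URLs, and the trace_id regex are assumptions (note the `$$` escape, since Grafana provisioning files expand `$VAR` as environment variables):

```yaml
# Grafana datasource provisioning fragment (UIDs and URLs are assumptions)
apiVersion: 1
datasources:
  - name: Mimir
    uid: mimir
    type: prometheus
    url: http://mimir-query-frontend.observability:8080/prometheus
    isDefault: true
  - name: Loki
    uid: loki
    type: loki
    url: http://loki-gateway.observability:3100
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: "trace_id=(\\w+)"   # assumes logs carry trace_id=<id>
          datasourceUid: tempo
          url: "$${__value.raw}"            # clickable link into Tempo
  - name: Tempo
    uid: tempo
    type: tempo
    url: http://tempo-query-frontend.observability:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki
        filterByTraceID: true
```

The `derivedFields` entry is what turns a trace_id in a Loki log line into a link, and `tracesToLogsV2` provides the reverse jump from a Tempo span to its logs.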
In practice, the workflow is: an operator sees elevated latency in a Grafana service-health panel, jumps to the trace explorer to open a slow trace in Tempo, and from the trace drills into correlated logs in Loki or into Mimir metrics for the affected service. That triage loop is the core value proposition of a unified observability UI.
SLO dashboards, error budgets and alerting on symptoms, not causes
The stack includes a prebuilt SLO dashboard template focused on three SLIs: availability, remaining error budget, and latency percentiles (for example P99). The availability SLI and error-budget math are defined so panels show a 30-day SLI and compute the remaining budget as a normalized percentage; thresholds drive the panel coloring for on-call clarity.
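Sketched as Prometheus recording rules, assuming a 99.9% availability target and an `http_requests_total` counter (both illustrative, not from the original deployment):

```yaml
# Recording rules: 30-day availability SLI and normalized remaining budget
groups:
  - name: slo-availability
    rules:
      - record: sli:availability:ratio_rate30d
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[30d]))
          /
          sum(rate(http_requests_total[30d]))
      # remaining budget = 1 - (observed error ratio / allowed error ratio),
      # where the allowed error ratio for a 99.9% target is 1 - 0.999
      - record: slo:error_budget:remaining_ratio
        expr: |
          1 - (1 - sli:availability:ratio_rate30d) / (1 - 0.999)
```

`slo:error_budget:remaining_ratio` lands at 1 when no budget has been spent and 0 when the budget is exhausted, which maps directly onto the dashboard's normalized-percentage panel and its color thresholds.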
Alerting is designed around symptoms — for example “error budget exhausted” — rather than low-level causes like per-pod CPU spikes. Prometheus alerting rules evaluate SLO equations and produce alerts labeled with service and severity; AlertManager routes critical alerts to PagerDuty and warnings to Slack. AlertManager routing groups alerts, defines wait and repeat intervals, deduplicates, and maps severities to receivers so paging behavior matches operational intent.
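The routing behavior described above might be expressed as the following AlertManager fragment; receiver names are assumptions, and the credential placeholders must be filled in (AlertManager does not expand environment variables on its own):

```yaml
# alertmanager.yml fragment: severity-based routing to PagerDuty and Slack
route:
  receiver: slack-warnings          # default for anything not matched below
  group_by: [alertname, service]
  group_wait: 30s                   # wait to batch related alerts into one page
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity = critical
      receiver: pagerduty-oncall

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: <pagerduty-routing-key>
  - name: slack-warnings
    slack_configs:
      - api_url: https://hooks.slack.com/services/<webhook-path>
        channel: "#alerts"
```

Grouping by `alertname` and `service` is what gives the deduplication behavior: one page per failing service per symptom, not one per firing rule evaluation.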
Dashboards every on-call engineer needs
Instead of monstrous 50-panel dashboards, the deployment advocates three focused dashboards:
- Service Health (RED method): request rate, error ratio by status code, latency P50/P95/P99, and saturation metrics such as CPU/memory per pod and queue depth.
- Trace Explorer: top slow traces, trace heatmap, Tempo-derived service dependency graph and high-error-trace panels.
- Error-Budget Burndown: daily remaining budget trend, burn-rate windows (1h/6h/24h), multi-burn alert state and top offending services.
These three dashboards form the typical on-call triage flow: detect via Service Health → inspect traces → consult burn rate and priority via the burndown chart.
Operational checklist and production readiness
Before considering the stack production-ready, the deployment checklist includes:
- Ingestion testing: push a sample span/metric/log through the Collector and verify it appears in Tempo/Mimir/Loki.
- Retention and sizing: set retention windows — suggested defaults include Mimir 30 days, Tempo 14 days, Loki 30 days — and ensure object storage has versioning and backups.
- Authentication: secure Grafana via OAuth (Google/GitHub) and protect Mimir/Loki ingesters with basic auth or other controls.
- Alert testing: deliberately break a test service and verify that PagerDuty receives the page according to the AlertManager configuration.
- Runbooks: link each alert to an operational runbook so pages include remediation steps and owners.
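The suggested retention defaults map to per-component settings roughly like these; each fragment belongs in its own component's configuration file, and the field names follow current upstream docs but should be verified against the versions you deploy:

```yaml
# Mimir (runtime/limits config): keep metric blocks for 30 days
limits:
  compactor_blocks_retention_period: 30d

# Tempo: keep trace blocks for 14 days
compactor:
  compaction:
    block_retention: 336h   # 14 days

# Loki (limits_config): keep log chunks for 30 days
limits_config:
  retention_period: 720h    # 30 days
```

Retention here only bounds what the query engines serve; the object-storage versioning and backups called out in the checklist are what protect the underlying data.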
The stack also suggests adding additional OTel receivers to collect database telemetry (PostgreSQL, Redis, MongoDB) and optionally adding synthetic monitoring with a Blackbox exporter.
Broader implications for observability, developers and businesses
This deployment pattern illustrates a shift in how teams can approach observability: instrument once, control telemetry at the Collector, and choose backends that separate storage concerns (object storage for traces and chunks) from query engines. For developer teams, auto-instrumentation lowers the bar to capturing cross-service traces and preserving trace context across languages. For platform and SRE teams, horizontally scalable components (Mimir, Tempo backed by object storage, Loki) allow predictable operational boundaries and cost control compared with a single vendor-managed suite.
On the business side, the stack presents a cost argument: the author reports using this collection of open components for clients with a software cost of $0 per month excluding storage, versus a quoted $15k/month spend on a vendor solution. That figure underscores the trade-offs organizations weigh between managed vendor convenience and the operational responsibilities of self-hosting.
How teams can adopt this stack and what to expect
Adopters should plan for:
- Short-term work (3–5 hours) to stand up the base components if prerequisites are met.
- Minor application changes only for languages without full auto-instrumentation (for example Go), and straightforward environment configuration for agent-based auto-instrumentation for Java, Python and Node.js.
- An operational commitment to storage management and access control for Mimir, Tempo and Loki; object storage lifecycle and versioning are critical for data durability.
- A shift in alerting philosophy: move from noisy, low-level cause alerts to SLO-driven symptom alerts that reflect end-user impact and error budgets.
The stack integrates naturally with related topics that platform teams often manage: developer tools for CI/CD to bake instrumentation into images, security software for protecting endpoints and credentials, and automation platforms for lifecycle operations (Helm upgrades, backup jobs, and retention policy changes). Grafana dashboards and the SLO dashboards described are natural internal links to pages about monitoring, SRE practices, and incident response playbooks.
You now have a full-stack approach that combines OpenTelemetry’s single-instrumentation promise, Tempo’s low-cost trace storage, Mimir’s Prometheus-compatibility at scale, Loki’s label-first log model, and Grafana’s integrated UI with SLO-focused dashboards. The deployment model emphasizes end-to-end trace-to-log-to-metric correlation, SLO-first alerting, and storage choices that reduce indexing costs.
Looking ahead, expect continued convergence around standard telemetry protocols and more mature operator tooling for running these projects at scale; teams adopting this stack can incrementally add database and synthetic telemetry receivers, refine SLOs and error-budget policies, and incorporate more automated operational tooling for backups and upgrades so the observability platform becomes a managed component of the infrastructure.