Observability in Practice: Logs, Metrics, and Tracing in Production
There is a difference between monitoring and observability that most teams learn the hard way: during an incident at 3am, when all the dashboards are green and something is clearly, undeniably broken.
Monitoring tells you when predefined things go wrong. You set thresholds, you get alerts. This works well for problems you have seen before and thought to instrument. It fails completely for novel failure modes, the ones that actually wake you up at night.
Observability is the property of a system that allows you to understand its internal state from its external outputs. An observable system lets you ask arbitrary questions ("which users are experiencing this specific failure mode?", "what changed in the 15 minutes before latency spiked?") without needing to have anticipated those questions in advance.
The difference is not philosophical. It determines whether your team spends incidents running scripts and correlating log files by hand, or whether they can ask the right question and get the answer in seconds.
This article covers how to implement observability correctly using the three pillars (logs, metrics, and tracing), with practical tooling choices and a worked incident example.
The three pillars of observability
Every observable system emits three types of telemetry, and each answers a different class of question.
Logs answer "what happened?" They are the narrative record of events in your system: structured, timestamped records of what code did when. Logs are the highest-information signal: they can contain arbitrary context about any event.
Metrics answer "how is the system behaving over time?" They are numerical measurements aggregated over time: request rate, error rate, latency percentiles, CPU usage, memory consumption. Metrics are the lowest-cardinality signal: cheap to store, easy to query, perfect for alerting and trending.
Traces answer "what did this specific request do?" They follow a single operation (a user request, a background job, a scheduled task) across all the components it touched (services, databases, caches, queues), with timing information for each step.
The mistake most teams make is treating these as separate systems with separate purposes. True observability comes from connecting them: a trace links to the logs emitted during that trace, and to the metric timeseries that spiked at the same timestamp.
OpenTelemetry: the unified standard
Before diving into implementation, it is worth understanding OpenTelemetry, because it changes the tooling landscape.
OpenTelemetry (OTel) is the CNCF-backed open standard for telemetry instrumentation. It provides:
- APIs: language-specific interfaces for emitting traces, metrics, and logs from your code.
- SDKs: the implementation of those APIs for each supported language (Go, Python, Java, JavaScript, .NET, Ruby, and more).
- A protocol (OTLP): a vendor-neutral wire protocol for shipping telemetry from your services to any backend.
- Collectors: agents that receive, process, and export telemetry from your infrastructure.
The critical benefit: you instrument your code once, and you can send that telemetry to any backend (Grafana, Datadog, Honeycomb, Jaeger, AWS X-Ray) by changing configuration, not code. You avoid vendor lock-in at the instrumentation layer.
In 2024, OpenTelemetry is the right default for all new instrumentation. If you are using a vendor-specific SDK, you are creating a migration cost that will materialize eventually.
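As a concrete sketch of that decoupling, a minimal Collector configuration might look like the following. The `tempo.example.internal` endpoint is a placeholder; a real deployment would also configure TLS, authentication, and additional pipelines for metrics and logs.

```yaml
# Minimal OpenTelemetry Collector pipeline: receive OTLP from services,
# batch, and export. Swapping the exporter changes the backend without
# touching application code.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch: {}

exporters:
  otlp:
    # Placeholder backend; point this at Tempo, Jaeger, a vendor, etc.
    endpoint: tempo.example.internal:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```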
Implementing structured logging
Raw text logs are an antipattern. Searching through gigabytes of `[2026-03-29 03:17:42] ERROR Something went wrong` messages requires grep, is slow, and cannot be filtered or aggregated reliably.
Structured logging emits log events as key-value pairs (typically JSON), making them queryable, filterable, and aggregatable by any observability backend.
A well-structured log event:
```json
{
  "timestamp": "2026-03-29T03:17:42.000Z",
  "level": "error",
  "service": "payment-processor",
  "version": "2.4.1",
  "trace_id": "abc123def456",
  "span_id": "789ghi012",
  "user_id": "user_9872643",
  "order_id": "order_8812734",
  "event": "payment_failed",
  "error": "stripe_timeout",
  "amount_cents": 24999,
  "currency": "usd",
  "duration_ms": 5023,
  "message": "Payment failed after 5023ms: Stripe API timeout"
}
```
Notice what this enables:
- Find all payment failures for a specific user: `user_id = "user_9872643" AND event = "payment_failed"`
- Find all Stripe timeouts in the last hour: `error = "stripe_timeout" AND timestamp > now-1h`
- Correlate with a specific request trace: `trace_id = "abc123def456"`
- Alert on error rate: `count(level="error") / count(*) > 0.05`
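A logger that emits events in this shape can be sketched on top of the stdlib `logging` module with a JSON formatter. The service name and version below are illustrative; in production they would be injected from your deployment environment, and the `trace_id` from the active request context.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "timestamp": datetime.now(timezone.utc).isoformat(timespec="milliseconds"),
            "level": record.levelname.lower(),
            "service": "payment-processor",  # illustrative; inject your service name
            "version": "2.4.1",              # illustrative; inject your build version
            "message": record.getMessage(),
        }
        # Merge structured context passed via `extra={"context": {...}}`.
        event.update(getattr(record, "context", {}))
        return json.dumps(event)


logger = logging.getLogger("payment-processor")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "Payment failed after 5023ms: Stripe API timeout",
    extra={"context": {
        "trace_id": "abc123def456",
        "event": "payment_failed",
        "error": "stripe_timeout",
        "duration_ms": 5023,
    }},
)
```

In a real codebase you would wrap this behind a small helper so every call site gets the trace context injected automatically rather than passing it by hand.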
The fields every log event should have
Always:
- `timestamp`: ISO 8601 with millisecond precision
- `level`: error, warn, info, debug
- `service`: name of the emitting service
- `version`: deployed version of the service
- `trace_id`: links to the distributed trace
When available (inject from request context):
- `user_id`, `account_id`, `org_id`: the subject of the operation
- `request_id`: unique ID for the inbound request
- Relevant entity IDs (`order_id`, `payment_id`, etc.)
For errors:
- `error`: machine-readable error code
- `error_detail`: human-readable detail
- `stack_trace`: for unexpected exceptions
Log levels as a discipline
Use levels deliberately:
- `error`: unexpected failures that require investigation or immediate action. Should always page or alert.
- `warn`: something unexpected but handled. Degrades gracefully. Worth monitoring for accumulation.
- `info`: significant business events. "Order placed." "User signed up." "Scheduled job started/completed." Keep signal-to-noise high.
- `debug`: internal state useful during development. Disabled in production by default; should be possible to enable per service without redeployment.
The most common mistake: treating every log line as equally important. If 80% of your logs are `info` and every request generates 50 of them, your signal is buried in noise.
Implementing metrics correctly
Metrics are the foundation of alerting. They are cheap to store (a timeseries of numbers), fast to query, and the right tool for trending, thresholds, and SLO tracking.
The four golden signals
Google's Site Reliability Engineering book defined four signals that, if measured for any service, give you nearly complete visibility into that service's health:
- Latency: how long it takes to serve a request. Distinguish between the latency of successful and failed requests.
- Traffic: how much demand is being placed on the system. Requests per second, messages processed per minute, data volume.
- Errors: the rate of failed requests. Failed vs. total, distinguishing explicit failures (HTTP 500), implicit failures (HTTP 200 with the wrong content), and policy failures (an SLA commits to a <1s response, but the service responds in 2s).
- Saturation: how "full" your service is. CPU, memory, disk, network, whichever is the most constrained resource. Also queue depth for async systems.
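The four signals above boil down to a handful of raw numbers recorded at the service boundary. As a toy sketch (in practice you would use a real metrics client such as the Prometheus or OpenTelemetry SDK rather than hand-rolling this):

```python
import time
from dataclasses import dataclass, field


@dataclass
class GoldenSignals:
    """Toy in-process recorder showing which raw numbers each of the
    four golden signals is derived from. Not a production metrics client."""
    started_at: float = field(default_factory=time.monotonic)
    latencies_ms: list = field(default_factory=list)  # latency
    requests: int = 0                                 # traffic
    errors: int = 0                                   # errors
    queue_depth: int = 0                              # saturation proxy

    def observe(self, latency_ms: float, ok: bool) -> None:
        """Record one completed request."""
        self.requests += 1
        self.latencies_ms.append(latency_ms)
        if not ok:
            self.errors += 1

    def traffic_rps(self) -> float:
        elapsed = max(time.monotonic() - self.started_at, 1e-9)
        return self.requests / elapsed

    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


signals = GoldenSignals()
signals.observe(48.0, ok=True)
signals.observe(5023.0, ok=False)
```

A real client would additionally bucket latencies into histograms so percentiles can be aggregated across instances, which is exactly where the next section picks up.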
Percentiles over averages
Average latency is almost always the wrong metric to alert on. A system responding in 50ms for 99% of requests and 5000ms for 1% has an average of ~100ms, which looks fine. But 1% of requests at 5 seconds means 1 in 100 users is having a terrible experience.
Use percentiles:
- P50 (median): the typical user experience.
- P95: the experience at the 95th percentile; a good proxy for what your slower users routinely see.
- P99: the worst 1% of requests. Often where infrastructure problems first appear.
- P99.9: the one-in-a-thousand experience. Critical for financial systems, healthcare, and anywhere the worst case has severe consequences.
For a checkout flow, P99 latency under 2 seconds might be your SLO. Setting the alert on P99, not P50, means you catch tail latency degradation before it becomes a majority experience.
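The averages-hide-the-tail effect is easy to verify numerically. This sketch uses the nearest-rank percentile, one of several common definitions, and scales the 50ms/5000ms example to 10,000 samples so the slow 1% sits clearly inside the tail:

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample >= p% of the data."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# 99% of requests at 50ms, 1% at 5000ms.
latencies = [50.0] * 9900 + [5000.0] * 100
avg = sum(latencies) / len(latencies)

assert round(avg, 1) == 99.5                   # the average looks healthy
assert percentile(latencies, 50) == 50.0       # so does the median
assert percentile(latencies, 95) == 50.0       # and even P95
assert percentile(latencies, 99.5) == 5000.0   # the tail only shows up here
```

Production systems compute percentiles from histogram buckets rather than raw samples, which trades a little precision for the ability to aggregate across many service instances.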
SLOs and error budgets
Service Level Objectives (SLOs) are the formalization of "what does good look like?". An SLO is a target for a metric over a time window:
- 99.9% of requests to the checkout API complete in under 2 seconds over a 30-day rolling window.
- 99.95% of authentication requests succeed over a 28-day rolling window.
An error budget is the complement: the allowed amount of badness. At 99.9% availability, you have 43.8 minutes of allowed downtime per month. When the error budget is healthy, teams can move fast. When it is depleted, reliability takes priority over feature velocity.
SLO-based alerting is fundamentally different from threshold-based alerting. Instead of alerting when error rate exceeds 5%, you alert when you are burning error budget faster than your SLO period allows, before the budget is exhausted.
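The arithmetic behind error budgets and burn rates is simple enough to write out directly; the figures below are illustrative (a 30-day window gives 43.2 minutes of budget at 99.9%; the 43.8-minute figure above uses the length of an average calendar month):

```python
def allowed_downtime_minutes(slo: float, window_days: float) -> float:
    """Error budget for an availability SLO, as minutes of downtime."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(observed_error_rate: float, slo: float) -> float:
    """How many times faster than sustainable the budget is being spent.

    A burn rate of 1.0 exhausts the budget exactly at the end of the
    window; SLO-based alerts fire on sustained burn rates well above 1.
    """
    return observed_error_rate / (1 - slo)


# 99.9% over a 30-day window -> 43.2 minutes of allowed downtime.
assert round(allowed_downtime_minutes(0.999, 30), 1) == 43.2

# A 5% error rate against a 99.9% SLO burns budget 50x too fast.
assert round(burn_rate(0.05, 0.999), 6) == 50.0
```

Multi-window burn-rate alerting (e.g. paging only when both a short and a long window show a high burn rate) is the standard refinement, because it filters out brief blips that would never threaten the budget.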
Implementing distributed tracing
Distributed tracing is the hardest of the three pillars to implement correctly and the most valuable when you do.
A trace is a directed acyclic graph of spans. Each span represents a unit of work: a service handling a request, a database query, a call to an external API. Spans have:
- A unique span ID
- The parent span ID (which creates the tree structure)
- A trace ID (shared by all spans in the same request lifecycle)
- Start time and duration
- Operation name
- Attributes (key-value context: HTTP method, DB statement, user ID)
- Status (OK, Error)
- Events (point-in-time annotations within the span)
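The span structure above maps naturally onto a small data type. This is a schematic sketch of the concept, not the OpenTelemetry API; real SDKs manage span lifecycles, context, and export for you.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Schematic span: exactly the fields listed above, nothing more."""
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    name: str
    start_time: float
    duration_ms: Optional[float] = None
    attributes: dict = field(default_factory=dict)
    status: str = "OK"
    events: list = field(default_factory=list)

    @classmethod
    def start(cls, name: str, parent: "Optional[Span]" = None) -> "Span":
        # A child shares its parent's trace_id; a root span mints a new one.
        return cls(
            trace_id=parent.trace_id if parent else uuid.uuid4().hex,
            span_id=uuid.uuid4().hex[:16],
            parent_span_id=parent.span_id if parent else None,
            name=name,
            start_time=time.monotonic(),
        )

    def end(self) -> None:
        self.duration_ms = (time.monotonic() - self.start_time) * 1000


root = Span.start("checkout-api")
child = Span.start("inventory-service", parent=root)
child.end()
```

The parent/child links are what let a tracing backend reassemble all spans sharing one `trace_id` into the request tree.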
Propagation: the mechanism that makes tracing work
Tracing only works when the trace context is propagated across service boundaries. When Service A calls Service B, it must include the current trace ID and span ID in the outbound request headers. Service B reads those headers, creates a child span, and includes the parent context in any downstream calls it makes.
OpenTelemetry handles this automatically for most common libraries (HTTP clients, gRPC, messaging). The W3C Trace Context standard defines the header format (`traceparent`), ensuring interoperability across different tracing systems.
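The `traceparent` header itself has a fixed, hyphen-separated shape: version, 32-hex-char trace ID, 16-hex-char parent span ID, and flags. A simplified sketch of building and parsing it (a spec-compliant parser would also validate hex digits and tolerate future versions):

```python
def build_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Format a W3C Trace Context header: version-traceid-spanid-flags."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header: str) -> dict:
    """Extract trace context from an inbound traceparent header."""
    version, trace_id, span_id, flags = header.split("-")
    assert len(trace_id) == 32 and len(span_id) == 16
    return {
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": flags == "01",
    }


# Trace and span IDs below are the illustrative values from the W3C spec.
header = build_traceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
ctx = parse_traceparent(header)
```

Service B would parse this header, start a child span with `ctx["span_id"]` as the parent, and emit a fresh `traceparent` (same trace ID, its own span ID) on every downstream call.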
A real example: debugging a checkout latency spike
Imagine this scenario: P99 latency on your checkout API spikes from 800ms to 4.2 seconds. Your Grafana alert fires. Here is how observability makes this tractable.
Step 1: Metrics narrow the time window. The spike started at exactly 14:23:07 UTC. Error rate is unchanged: it is purely a latency issue.
Step 2: You find a slow trace. In your tracing backend (Jaeger/Grafana Tempo), you filter for checkout traces during that window with duration > 2 seconds. You find dozens.
Step 3: The trace reveals the bottleneck. Opening a slow trace, you see:
```
checkout-api (total: 4198ms)
├── auth-service (12ms)          ← fine
├── inventory-service (4100ms)   ← THIS
│   └── postgres query: SELECT * FROM inventory WHERE sku IN (...) (4089ms)
└── payment-service (23ms)       ← fine
```
Step 4: The span attributes give the query. The postgres span includes the query text as an attribute: `SELECT * FROM inventory WHERE sku IN (...)`, a query executing over a table that grew significantly in the last 24 hours. A missing index.
Step 5: Logs confirm the diagnosis. Filtering logs with trace_id = [slow trace ID], you see the inventory service logged a warning: "Query execution time 4089ms exceeded slow query threshold of 1000ms".
Without distributed tracing, this investigation might take an hour. With it, from alert to root cause is five minutes.
The observability stack: practical tool choices
The open-source stack (Grafana ecosystem)
For teams that want to own their observability infrastructure:
| Function | Tool |
|---|---|
| Metrics storage | Prometheus + Thanos (for long-term retention) |
| Log storage | Grafana Loki |
| Trace storage | Grafana Tempo |
| Visualization | Grafana |
| Alerting | Prometheus Alertmanager (or Grafana Alerting) |
| Instrumentation | OpenTelemetry SDK + Collector |
The Grafana stack is production-ready, extremely mature, and free to operate. The operational cost (managing Prometheus, Loki, and Tempo) is non-trivial at scale β be realistic about your team's capacity to run this infrastructure.
The managed stack
For teams that want to focus on product rather than observability infrastructure:
- Datadog β the most complete SaaS observability platform. Best-in-class alerting, APM, and log management. Expensive at scale.
- Honeycomb β purpose-built for high-cardinality observability. Exceptional for query-driven debugging. The best tool for teams operating complex distributed systems.
- Grafana Cloud β managed Grafana + Loki + Tempo + Prometheus. The economics of the open-source stack without the operational burden.
- AWS CloudWatch / X-Ray β viable if you are all-in on AWS. Good integration with the AWS ecosystem; observability UX is weaker than purpose-built tools.
Choosing a strategy
Start with managed tooling unless you have a dedicated platform team and a clear reason to own the infrastructure. The operational cost of running a production-grade Prometheus + Loki + Tempo stack is significant and often underestimated.
Building an alerting philosophy that works
Alerts have a well-known failure mode: alert fatigue. When every conceivably page-worthy condition has an alert, and many of those alerts are noisy, teams learn to ignore them. When a real incident fires, it is 30 minutes before anyone investigates because everyone assumed it was another false positive.
Principles for high-signal alerting
Alert on symptoms, not causes. Users experience symptoms: high latency, elevated error rates, degraded functionality. Alert on these. "CPU is at 80%" is a cause β it may or may not affect users. "P99 latency is above SLO threshold" is a symptom.
Every alert must be actionable. If you cannot point to a runbook entry or a specific investigation path when an alert fires, it should not be an alert. Non-actionable alerts become noise.
Use PagerDuty/OpsGenie tiers deliberately. Not every alert should wake someone up. Use tiered routing: page immediately for SLO violations and service-down events; email or Slack for warning-level degradation that does not yet breach SLO.
Deduplicate and suppress. A single infrastructure problem (a database slowdown) can trigger 50 alerts from 50 services. Grouping and deduplication in your alerting tool prevents 50-page storms that paralyze on-call engineers.
Run monthly alert reviews. Categorize every alert that fired in the last month: actionable and correct, noisy (false positive), or not fired but should have. Remove or adjust based on evidence.
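Grouping can be sketched as keying alerts by a shared label set, so 50 service-level alerts caused by one database slowdown collapse into a single notification. This mirrors what Alertmanager-style `group_by` configuration does; the label names below are illustrative.

```python
from collections import defaultdict


def group_alerts(alerts: list, group_by: tuple = ("alertname", "cluster")) -> dict:
    """Collapse alerts that share the grouping labels into one notification
    group, in the style of Alertmanager's group_by."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(label) for label in group_by)
        groups[key].append(alert)
    return dict(groups)


# Fifty services all alerting on the same underlying database problem.
storm = [
    {"labels": {"alertname": "HighLatency", "cluster": "prod", "service": f"svc-{i}"}}
    for i in range(50)
]
grouped = group_alerts(storm)
assert len(grouped) == 1  # one notification instead of fifty pages
```

Real alerting tools add timing on top of this (an initial wait to let a group fill up, and repeat intervals for ongoing incidents), but the grouping key is the core idea.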
Production incident response: observability in action
The true test of your observability investment is how it changes incident response.
A well-observed system follows this pattern during an incident:
- Detection: an alert fires on a user-facing symptom (elevated error rate, SLO burn rate).
- Scope: metrics show which services and which user populations are affected, when it started, and whether it is improving or worsening.
- Isolation: distributed traces and correlated logs narrow the problem to a specific service, query, endpoint, or third-party dependency.
- Diagnosis: span attributes and log context reveal the specific failure: a bad query, a misconfigured timeout, a dependency that started returning errors.
- Mitigation: a fix or workaround is deployed. Metrics confirm recovery in real time.
- Postmortem: the trace replay recreates the exact failure sequence, informing the long-term fix.
Without observability, steps 2-4 are guesswork. With it, they are decisions driven by evidence.
Practical advice for getting started
If you are starting from zero, the sequence that avoids the most pain:
- Structured logging first. It is the lowest-effort, highest-return investment. Pick a structured logging library for your stack and enforce it. Log the baseline fields: timestamp, level, service, trace_id.
- Metrics for the four golden signals. Instrument your main service boundary: request rate, error rate, P95 and P99 latency, and the most constrained resource. Set initial SLOs and alerts.
- Add tracing for your highest-complexity service. Start with one service. Validate that traces are being emitted and visible. Verify context propagation to downstream dependencies.
- Connect the three pillars. Ensure your logs contain the trace_id. Ensure your traces link to logs in your log query UI. Ensure your metrics dashboards have time-anchored links to relevant traces.
- Declare your SLOs officially. Write them down where the team can see them. This transforms observability from infrastructure into an organizational commitment.
Closing thought
Observability is not a tool purchase. It is a practice that a team builds over time β instrument, observe, learn, adjust. The teams that do it well do not have better tools than the teams that struggle. They have a culture of treating observability as a first-class engineering concern, not an afterthought.
The engineers who build truly observable systems are the ones who ask "how will I debug this in production?" before shipping a feature β not after the 3am page.
That mindset is the difference between a system you can reason about and a system you can only hope is working.