Observability in Practice: Logs, Metrics, and Tracing in Production
There is a difference between monitoring and observability that most teams learn the hard way: during an incident at 3am, when all the dashboards are green and something is clearly, undeniably broken.
Monitoring tells you when predefined things go wrong. You set thresholds, you get alerts. This works well for problems you have seen before and thought to instrument. It fails completely for novel failure modes, the ones that actually wake you up at night.
Observability is the property of a system that allows you to understand its internal state from its external outputs. An observable system lets you ask arbitrary questions ("which users are experiencing this specific failure mode?", "what changed in the 15 minutes before latency spiked?") without needing to have anticipated those questions in advance.
The difference is not philosophical. It determines whether your team spends incidents running scripts and correlating log files by hand, or whether they can ask the right question and get the answer in seconds.
This article covers how to implement observability correctly using the three pillars (logs, metrics, and tracing), with practical tooling choices and a worked incident example.
The three pillars of observability
Every observable system emits three types of telemetry, and each answers a different class of question.
Logs answer "what happened?" They are the narrative record of events in your system: structured, timestamped records of what code did when. Logs are the highest-information signal: they can contain arbitrary context about any event.
Metrics answer "how is the system behaving over time?" They are numerical measurements aggregated over time: request rate, error rate, latency percentiles, CPU usage, memory consumption. Metrics are the lowest-cardinality signal: cheap to store, easy to query, perfect for alerting and trending.
Traces answer "what did this specific request do?" They follow a single operation (a user request, a background job, a scheduled task) across all the components it touched (services, databases, caches, queues), with timing information for each step.
The mistake most teams make is treating these as separate systems with separate purposes. True observability comes from connecting them: a trace links to the logs emitted during that trace, and to the metric timeseries that spiked at the same timestamp.
OpenTelemetry: the unified standard
Before diving into implementation, it is worth understanding OpenTelemetry, because it changes the tooling landscape.
OpenTelemetry (OTel) is the CNCF-backed open standard for telemetry instrumentation. It provides:
- APIs: language-specific interfaces for emitting traces, metrics, and logs from your code.
- SDKs: the implementation of those APIs for each supported language (Go, Python, Java, JavaScript, .NET, Ruby, and more).
- A protocol (OTLP): a vendor-neutral wire protocol for shipping telemetry from your services to any backend.
- Collectors: agents that receive, process, and export telemetry from your infrastructure.
The critical benefit: you instrument your code once, and you can send that telemetry to any backend (Grafana, Datadog, Honeycomb, Jaeger, AWS X-Ray) by changing configuration, not code. You avoid vendor lock-in at the instrumentation layer.
In 2024, OpenTelemetry is the right default for all new instrumentation. If you are using a vendor-specific SDK, you are creating a migration cost that will materialize eventually.
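As a concrete sketch of that decoupling, a minimal Collector configuration might look like the following. The `tempo.example.internal` endpoint is a placeholder; a real deployment would also configure TLS, authentication, and additional pipelines for metrics and logs.

```yaml
# Minimal OpenTelemetry Collector pipeline: receive OTLP from services,
# batch, and export. Swapping the exporter changes the backend without
# touching application code.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch: {}

exporters:
  otlp:
    # Placeholder backend; point this at Tempo, Jaeger, a vendor, etc.
    endpoint: tempo.example.internal:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```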
Implementing structured logging
Raw text logs are an antipattern. Searching through gigabytes of `[2026-03-29 03:17:42] ERROR Something went wrong` messages requires grep, is slow, and cannot be filtered or aggregated reliably.
Structured logging emits log events as key-value pairs (typically JSON), making them queryable, filterable, and aggregatable by any observability backend.
A well-structured log event:
```json
{
  "timestamp": "2026-03-29T03:17:42.000Z",
  "level": "error",
  "service": "payment-processor",
  "version": "2.4.1",
  "trace_id": "abc123def456",
  "span_id": "789ghi012",
  "user_id": "user_9872643",
  "order_id": "order_8812734",
  "event": "payment_failed",
  "error": "stripe_timeout",
  "amount_cents": 24999,
  "currency": "usd",
  "duration_ms": 5023,
  "message": "Payment failed after 5023ms: Stripe API timeout"
}
```
Notice what this enables:
- Find all payment failures for a specific user: `user_id = "user_9872643" AND event = "payment_failed"`
- Find all Stripe timeouts in the last hour: `error = "stripe_timeout" AND timestamp > now-1h`
- Correlate with a specific request trace: `trace_id = "abc123def456"`
- Alert on error rate: `count(level="error") / count(*) > 0.05`
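A logger that emits events in this shape can be sketched on top of the stdlib `logging` module with a JSON formatter. The service name and version below are illustrative; in production they would be injected from your deployment environment, and the `trace_id` from the active request context.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "timestamp": datetime.now(timezone.utc).isoformat(timespec="milliseconds"),
            "level": record.levelname.lower(),
            "service": "payment-processor",  # illustrative; inject your service name
            "version": "2.4.1",              # illustrative; inject your build version
            "message": record.getMessage(),
        }
        # Merge structured context passed via `extra={"context": {...}}`.
        event.update(getattr(record, "context", {}))
        return json.dumps(event)


logger = logging.getLogger("payment-processor")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "Payment failed after 5023ms: Stripe API timeout",
    extra={"context": {
        "trace_id": "abc123def456",
        "event": "payment_failed",
        "error": "stripe_timeout",
        "duration_ms": 5023,
    }},
)
```

In a real codebase you would wrap this behind a small helper so every call site gets the trace context injected automatically rather than passing it by hand.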
The fields every log event should have
Always:
- `timestamp`: ISO 8601 with millisecond precision
- `level`: error, warn, info, debug
- `service`: name of the emitting service
- `version`: deployed version of the service
- `trace_id`: links to the distributed trace
When available (inject from request context):
- `user_id`, `account_id`, `org_id`: the subject of the operation
- `request_id`: unique ID for the inbound request
- Relevant entity IDs (`order_id`, `payment_id`, etc.)
For errors:
- `error`: machine-readable error code
- `error_detail`: human-readable detail
- `stack_trace`: for unexpected exceptions
Log levels as a discipline
Use levels deliberately:
- `error`: unexpected failures that require investigation or immediate action. Should always page or alert.
- `warn`: something unexpected but handled. Degrades gracefully. Worth monitoring for accumulation.
- `info`: significant business events. "Order placed." "User signed up." "Scheduled job started/completed." Keep signal-to-noise high.
- `debug`: internal state useful during development. Disabled in production by default; should be possible to enable per service without redeployment.
The most common mistake: treating every log line as equally important. If 80% of your logs are `info` and every request generates 50 of them, your signal is buried in noise.
Implementing metrics correctly
Metrics are the foundation of alerting. They are cheap to store (a timeseries of numbers), fast to query, and the right tool for trending, thresholds, and SLO tracking.
The four golden signals
Google's Site Reliability Engineering book defined four signals that, if measured for any service, give you nearly complete visibility into that service's health:
- Latency: how long it takes to serve a request. Distinguish between the latency of successful and failed requests.
- Traffic: how much demand is being placed on the system. Requests per second, messages processed per minute, data volume.
- Errors: the rate of failed requests. Failed vs. total, distinguishing explicit failures (HTTP 500), implicit failures (HTTP 200 with the wrong content), and policy failures (an SLA commits to a <1s response, but the service responds in 2s).
- Saturation: how "full" your service is. CPU, memory, disk, network, whichever is the most constrained resource. Also queue depth for async systems.
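The four signals above boil down to a handful of raw numbers recorded at the service boundary. As a toy sketch (in practice you would use a real metrics client such as the Prometheus or OpenTelemetry SDK rather than hand-rolling this):

```python
import time
from dataclasses import dataclass, field


@dataclass
class GoldenSignals:
    """Toy in-process recorder showing which raw numbers each of the
    four golden signals is derived from. Not a production metrics client."""
    started_at: float = field(default_factory=time.monotonic)
    latencies_ms: list = field(default_factory=list)  # latency
    requests: int = 0                                 # traffic
    errors: int = 0                                   # errors
    queue_depth: int = 0                              # saturation proxy

    def observe(self, latency_ms: float, ok: bool) -> None:
        """Record one completed request."""
        self.requests += 1
        self.latencies_ms.append(latency_ms)
        if not ok:
            self.errors += 1

    def traffic_rps(self) -> float:
        elapsed = max(time.monotonic() - self.started_at, 1e-9)
        return self.requests / elapsed

    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


signals = GoldenSignals()
signals.observe(48.0, ok=True)
signals.observe(5023.0, ok=False)
```

A real client would additionally bucket latencies into histograms so percentiles can be aggregated across instances, which is exactly where the next section picks up.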
Percentiles over averages
Average latency is almost always the wrong metric to alert on. A system responding in 50ms for 99% of requests and 5000ms for 1% has an average of ~100ms, which looks fine. But 1% of requests at 5 seconds means 1 in 100 users is having a terrible experience.
Use percentiles:
- P50 (median): the typical user experience.
- P95: the experience at the 95th percentile; a good proxy for what your slower users routinely see.
- P99: the worst 1% of requests. Often where infrastructure problems first appear.
- P99.9: the one-in-a-thousand experience. Critical for financial systems, healthcare, and anywhere the worst case has severe consequences.
For a checkout flow, P99 latency under 2 seconds might be your SLO. Setting the alert on P99, not P50, means you catch tail latency degradation before it becomes a majority experience.
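The averages-hide-the-tail effect is easy to verify numerically. This sketch uses the nearest-rank percentile, one of several common definitions, and scales the 50ms/5000ms example to 10,000 samples so the slow 1% sits clearly inside the tail:

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample >= p% of the data."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# 99% of requests at 50ms, 1% at 5000ms.
latencies = [50.0] * 9900 + [5000.0] * 100
avg = sum(latencies) / len(latencies)

assert round(avg, 1) == 99.5                   # the average looks healthy
assert percentile(latencies, 50) == 50.0       # so does the median
assert percentile(latencies, 95) == 50.0       # and even P95
assert percentile(latencies, 99.5) == 5000.0   # the tail only shows up here
```

Production systems compute percentiles from histogram buckets rather than raw samples, which trades a little precision for the ability to aggregate across many service instances.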
SLOs and error budgets
Service Level Objectives (SLOs) are the formalization of "what does good look like?". An SLO is a target for a metric over a time window:
- 99.9% of requests to the checkout API complete in under 2 seconds over a 30-day rolling window.
- 99.95% of authentication requests succeed over a 28-day rolling window.
An error budget is the complement: the allowed amount of badness. At 99.9% availability, you have 43.8 minutes of allowed downtime per month. When the error budget is healthy, teams can move fast. When it is depleted, reliability takes priority over feature velocity.
SLO-based alerting is fundamentally different from threshold-based alerting. Instead of alerting when error rate exceeds 5%, you alert when you are burning error budget faster than your SLO period allows, before the budget is exhausted.
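The arithmetic behind error budgets and burn rates is simple enough to write out directly; the figures below are illustrative (a 30-day window gives 43.2 minutes of budget at 99.9%; the 43.8-minute figure above uses the length of an average calendar month):

```python
def allowed_downtime_minutes(slo: float, window_days: float) -> float:
    """Error budget for an availability SLO, as minutes of downtime."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(observed_error_rate: float, slo: float) -> float:
    """How many times faster than sustainable the budget is being spent.

    A burn rate of 1.0 exhausts the budget exactly at the end of the
    window; SLO-based alerts fire on sustained burn rates well above 1.
    """
    return observed_error_rate / (1 - slo)


# 99.9% over a 30-day window -> 43.2 minutes of allowed downtime.
assert round(allowed_downtime_minutes(0.999, 30), 1) == 43.2

# A 5% error rate against a 99.9% SLO burns budget 50x too fast.
assert round(burn_rate(0.05, 0.999), 6) == 50.0
```

Multi-window burn-rate alerting (e.g. paging only when both a short and a long window show a high burn rate) is the standard refinement, because it filters out brief blips that would never threaten the budget.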
Implementing distributed tracing
Distributed tracing is the hardest of the three pillars to implement correctly and the most valuable when you do.
A trace is a directed acyclic graph of spans. Each span represents a unit of work: a service handling a request, a database query, a call to an external API. Spans have:
- A unique span ID
- The parent span ID (which creates the tree structure)
- A trace ID (shared by all spans in the same request lifecycle)
- Start time and duration
- Operation name
- Attributes (key-value context: HTTP method, DB statement, user ID)
- Status (OK, Error)
- Events (point-in-time annotations within the span)
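The span structure above maps naturally onto a small data type. This is a schematic sketch of the concept, not the OpenTelemetry API; real SDKs manage span lifecycles, context, and export for you.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Schematic span: exactly the fields listed above, nothing more."""
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    name: str
    start_time: float
    duration_ms: Optional[float] = None
    attributes: dict = field(default_factory=dict)
    status: str = "OK"
    events: list = field(default_factory=list)

    @classmethod
    def start(cls, name: str, parent: "Optional[Span]" = None) -> "Span":
        # A child shares its parent's trace_id; a root span mints a new one.
        return cls(
            trace_id=parent.trace_id if parent else uuid.uuid4().hex,
            span_id=uuid.uuid4().hex[:16],
            parent_span_id=parent.span_id if parent else None,
            name=name,
            start_time=time.monotonic(),
        )

    def end(self) -> None:
        self.duration_ms = (time.monotonic() - self.start_time) * 1000


root = Span.start("checkout-api")
child = Span.start("inventory-service", parent=root)
child.end()
```

The parent/child links are what let a tracing backend reassemble all spans sharing one `trace_id` into the request tree.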
Propagation: the mechanism that makes tracing work
Tracing only works when the trace context is propagated across service boundaries. When Service A calls Service B, it must include the current trace ID and span ID in the outbound request headers. Service B reads those headers, creates a child span, and includes the parent context in any downstream calls it makes.
OpenTelemetry handles this automatically for most common libraries (HTTP clients, gRPC, messaging). The W3C Trace Context standard defines the header format (`traceparent`), ensuring interoperability across different tracing systems.
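The `traceparent` header itself has a fixed, hyphen-separated shape: version, 32-hex-char trace ID, 16-hex-char parent span ID, and flags. A simplified sketch of building and parsing it (a spec-compliant parser would also validate hex digits and tolerate future versions):

```python
def build_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Format a W3C Trace Context header: version-traceid-spanid-flags."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header: str) -> dict:
    """Extract trace context from an inbound traceparent header."""
    version, trace_id, span_id, flags = header.split("-")
    assert len(trace_id) == 32 and len(span_id) == 16
    return {
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": flags == "01",
    }


# Trace and span IDs below are the illustrative values from the W3C spec.
header = build_traceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
ctx = parse_traceparent(header)
```

Service B would parse this header, start a child span with `ctx["span_id"]` as the parent, and emit a fresh `traceparent` (same trace ID, its own span ID) on every downstream call.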
A real example: debugging a checkout latency spike
Imagine this scenario: P99 latency on your checkout API spikes from 800ms to 4.2 seconds. Your Grafana alert fires. Here is how observability makes this tractable.
Step 1: Metrics narrow the time window. The spike started at exactly 14:23:07 UTC. Error rate is unchanged: it is purely a latency issue.
Step 2: You find a slow trace. In your tracing backend (Jaeger/Grafana Tempo), you filter for checkout traces during that window with duration > 2 seconds. You find dozens.
Step 3: The trace reveals the bottleneck. Opening a slow trace, you see:
```
checkout-api (total: 4198ms)
├── auth-service (12ms)          ← fine
├── inventory-service (4100ms)   ← THIS
│   └── postgres query: SELECT * FROM inventory WHERE sku IN (...) (4089ms)
└── payment-service (23ms)       ← fine
```
Step 4: The span attributes give the query. The postgres span includes the query text as an attribute: `SELECT * FROM inventory WHERE sku IN (...)`, a query executing over a table that grew significantly in the last 24 hours. A missing index.
Step 5: Logs confirm the diagnosis. Filtering logs with trace_id = [slow trace ID], you see the inventory service logged a warning: "Query execution time 4089ms exceeded slow query threshold of 1000ms".
Without distributed tracing, this investigation might take an hour. With it, from alert to root cause is five minutes.
The observability stack: practical tool choices
The open-source stack (Grafana ecosystem)
For teams that want to own their observability infrastructure:
| Function | Tool |
|---|---|
| Metrics storage | Prometheus + Thanos (for long-term retention) |
| Log storage | Grafana Loki |
| Trace storage | Grafana Tempo |
| Visualization | Grafana |
| Alerting | Prometheus Alertmanager (or Grafana Alerting) |
| Instrumentation | OpenTelemetry SDK + Collector |
The Grafana stack is production-ready, extremely mature, and free to operate. The operational cost (managing Prometheus, Loki, and Tempo) is non-trivial at scale β be realistic about your team's capacity to run this infrastructure.
The managed stack
For teams that want to focus on product rather than observability infrastructure:
- Datadog β the most complete SaaS observability platform. Best-in-class alerting, APM, and log management. Expensive at scale.
- Honeycomb β purpose-built for high-cardinality observability. Exceptional for query-driven debugging. The best tool for teams operating complex distributed systems.
- Grafana Cloud β managed Grafana + Loki + Tempo + Prometheus. The economics of the open-source stack without the operational burden.
- AWS CloudWatch / X-Ray β viable if you are all-in on AWS. Good integration with the AWS ecosystem; observability UX is weaker than purpose-built tools.
Choosing a strategy
Start with managed tooling unless you have a dedicated platform team and a clear reason to own the infrastructure. The operational cost of running a production-grade Prometheus + Loki + Tempo stack is significant and often underestimated.
Building an alerting philosophy that works
Alerts have a well-known failure mode: alert fatigue. When every conceivably page-worthy condition has an alert, and many of those alerts are noisy, teams learn to ignore them. When a real incident fires, it is 30 minutes before anyone investigates because everyone assumed it was another false positive.
Principles for high-signal alerting
Alert on symptoms, not causes. Users experience symptoms: high latency, elevated error rates, degraded functionality. Alert on these. "CPU is at 80%" is a cause β it may or may not affect users. "P99 latency is above SLO threshold" is a symptom.
Every alert must be actionable. If you cannot point to a runbook entry or a specific investigation path when an alert fires, it should not be an alert. Non-actionable alerts become noise.
Use PagerDuty/OpsGenie tiers deliberately. Not every alert should wake someone up. Use tiered routing: page immediately for SLO violations and service-down events; email or Slack for warning-level degradation that does not yet breach SLO.
Deduplicate and suppress. A single infrastructure problem (a database slowdown) can trigger 50 alerts from 50 services. Grouping and deduplication in your alerting tool prevents 50-page storms that paralyze on-call engineers.
Run monthly alert reviews. Categorize every alert that fired in the last month: actionable and correct, noisy (false positive), or not fired but should have. Remove or adjust based on evidence.
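Grouping can be sketched as keying alerts by a shared label set, so 50 service-level alerts caused by one database slowdown collapse into a single notification. This mirrors what Alertmanager-style `group_by` configuration does; the label names below are illustrative.

```python
from collections import defaultdict


def group_alerts(alerts: list, group_by: tuple = ("alertname", "cluster")) -> dict:
    """Collapse alerts that share the grouping labels into one notification
    group, in the style of Alertmanager's group_by."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(label) for label in group_by)
        groups[key].append(alert)
    return dict(groups)


# Fifty services all alerting on the same underlying database problem.
storm = [
    {"labels": {"alertname": "HighLatency", "cluster": "prod", "service": f"svc-{i}"}}
    for i in range(50)
]
grouped = group_alerts(storm)
assert len(grouped) == 1  # one notification instead of fifty pages
```

Real alerting tools add timing on top of this (an initial wait to let a group fill up, and repeat intervals for ongoing incidents), but the grouping key is the core idea.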
Production incident response: observability in action
The true test of your observability investment is how it changes incident response.
A well-observed system follows this pattern during an incident:
- Detection: an alert fires on a user-facing symptom (elevated error rate, SLO burn rate).
- Scope: metrics show which services and which user populations are affected, when it started, and whether it is improving or worsening.
- Isolation: distributed traces and correlated logs narrow the problem to a specific service, query, endpoint, or third-party dependency.
- Diagnosis: span attributes and log context reveal the specific failure: a bad query, a misconfigured timeout, a dependency that started returning errors.
- Mitigation: a fix or workaround is deployed. Metrics confirm recovery in real time.
- Postmortem: the trace replay recreates the exact failure sequence, informing the long-term fix.
Without observability, steps 2-4 are guesswork. With it, they are decisions driven by evidence.
Practical advice for getting started
If you are starting from zero, the sequence that avoids the most pain:
- Structured logging first. It is the lowest-effort, highest-return investment. Pick a structured logging library for your stack and enforce it. Log the baseline fields: timestamp, level, service, trace_id.
- Metrics for the four golden signals. Instrument your main service boundary: request rate, error rate, P95 and P99 latency, and the most constrained resource. Set initial SLOs and alerts.
- Add tracing for your highest-complexity service. Start with one service. Validate that traces are being emitted and visible. Verify context propagation to downstream dependencies.
- Connect the three pillars. Ensure your logs contain the trace_id. Ensure your traces link to logs in your log query UI. Ensure your metrics dashboards have time-anchored links to relevant traces.
- Declare your SLOs officially. Write them down where the team can see them. This transforms observability from infrastructure into an organizational commitment.
Closing thought
Observability is not a tool purchase. It is a practice that a team builds over time β instrument, observe, learn, adjust. The teams that do it well do not have better tools than the teams that struggle. They have a culture of treating observability as a first-class engineering concern, not an afterthought.
The engineers who build truly observable systems are the ones who ask "how will I debug this in production?" before shipping a feature β not after the 3am page.
That mindset is the difference between a system you can reason about and a system you can only hope is working.