Target Audience: Developers and SREs operating microservices
Prerequisites: Three Pillars of Observability
After Reading: You’ll understand distributed tracing and be able to analyze request flows between services

TL;DR
  • Trace: Entire path of one request (composed of multiple Spans)
  • Span: Single unit of work (start/end time, metadata)
  • Context Propagation: Passing the trace ID and parent span ID between services
  • Sampling: Store only a portion of total traces (cost optimization)

Why Is Distributed Tracing Necessary?#

In a microservices architecture, a single request passes through multiple services, which makes it hard to tell where a delay occurs.

graph LR
    USER["User"] --> GW["API Gateway"]
    GW --> ORDER["Order Service"]
    ORDER --> PAYMENT["Payment Service"]
    ORDER --> INVENTORY["Inventory Service"]
    PAYMENT --> DB1["Payment DB"]
    INVENTORY --> DB2["Inventory DB"]

    style PAYMENT fill:#ffcdd2

This diagram shows a microservices architecture in which a single user request passes through the API Gateway to the Order, Payment, and Inventory services and their databases.

Problem: The response is slow, but it is unclear which service causes the delay.

Solution: Use distributed tracing to see how much time is spent in each segment.


Core Concepts#

Trace and Span#

graph TB
    subgraph "Trace (Entire Request)"
        S1["Span: API Gateway<br>0-250ms"]
        S2["Span: Order Service<br>10-200ms"]
        S3["Span: Payment Service<br>20-180ms"]
        S4["Span: Payment DB<br>30-150ms"]
    end

    S1 --> S2
    S2 --> S3
    S3 --> S4

This diagram shows how Spans are nested within a single Trace (API Gateway, Order, Payment, DB) with their respective time ranges.

| Term | Description |
|---|---|
| Trace | Entire request path (unique Trace ID) |
| Span | Individual unit of work (unique Span ID) |
| Parent Span | The span that called the current span |
| Root Span | First span in a trace (no parent) |

Span Structure#

{
  "traceId": "abc123def456",
  "spanId": "span001",
  "parentSpanId": null,
  "operationName": "HTTP GET /orders",
  "serviceName": "order-service",
  "startTime": 1704700800000,
  "duration": 245,
  "tags": {
    "http.method": "GET",
    "http.status_code": 200,
    "http.url": "/orders/123"
  },
  "logs": [
    {
      "timestamp": 1704700800100,
      "message": "Fetching order from database"
    }
  ]
}
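The JSON above maps naturally onto a small value type. The following is a minimal sketch, not the data model of any particular tracing backend: `SpanRecord` and `TraceModel` are hypothetical names, and the sketch assumes that `startTime` and `duration` are both in milliseconds.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical in-memory model of the span JSON; field names mirror the example. */
record SpanRecord(String traceId, String spanId, String parentSpanId,
                  String operationName, String serviceName,
                  long startTimeMillis, long durationMillis) {

    /** A root span has no parent. */
    boolean isRoot() {
        return parentSpanId == null;
    }

    /** End time derived from start time plus duration (both assumed to be milliseconds). */
    long endTimeMillis() {
        return startTimeMillis + durationMillis;
    }
}

final class TraceModel {
    private TraceModel() {}

    /** Returns the root span of a trace, if the list contains one. */
    static Optional<SpanRecord> findRoot(List<SpanRecord> spans) {
        return spans.stream().filter(SpanRecord::isRoot).findFirst();
    }

    /** Returns the direct children of the given span. */
    static List<SpanRecord> childrenOf(SpanRecord parent, List<SpanRecord> spans) {
        return spans.stream()
                .filter(s -> parent.spanId().equals(s.parentSpanId()))
                .toList();
    }
}
```

Following `parentSpanId` links from the root in this way is how a flat list of spans is turned back into the nested view of a trace.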

Context Propagation#

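Context propagation passes the trace ID and the calling span's ID from one service to the next, usually in request headers, so that each service can attach its spans to the same trace.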

sequenceDiagram
    participant A as Service A
    participant B as Service B
    participant C as Service C

    A->>B: HTTP Request<br>traceparent: 00-abc123-span1-01
    Note over B: Extract context<br>Create child span
    B->>C: HTTP Request<br>traceparent: 00-abc123-span2-01
    Note over C: Extract context<br>Create child span

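This diagram shows how trace context is propagated between services via the traceparent header during HTTP requests. The IDs in the diagram are shortened for readability.

W3C Trace Context format (trace-id is 32 hex characters, parent-id is 16 hex characters):

traceparent: 00-{trace-id}-{parent-id}-{trace-flags}
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01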
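To make the header layout concrete, here is a minimal sketch of parsing a traceparent value and building the header for the next hop. It assumes version `00` and the W3C field lengths (32 hex characters for the trace ID, 16 for the span ID, 2 for the flags). `TraceParent` and `forChildCall` are hypothetical names; in practice an instrumentation library such as OpenTelemetry handles this step.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Parsed fields of a W3C traceparent header (version 00 only). */
record TraceParent(String traceId, String parentId, boolean sampled) {

    // "00", then 32 hex chars of trace-id, 16 hex chars of parent-id, 2 hex chars of flags
    private static final Pattern FORMAT =
            Pattern.compile("00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

    /** Parses a header value; throws IllegalArgumentException if it is malformed. */
    static TraceParent parse(String header) {
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Malformed traceparent: " + header);
        }
        String traceId = m.group(1);
        String parentId = m.group(2);
        // All-zero IDs are invalid according to the specification.
        if (traceId.chars().allMatch(c -> c == '0') || parentId.chars().allMatch(c -> c == '0')) {
            throw new IllegalArgumentException("All-zero trace-id or parent-id: " + header);
        }
        int flags = Integer.parseInt(m.group(3), 16);
        // The lowest flag bit marks the trace as sampled.
        return new TraceParent(traceId, parentId, (flags & 0x01) != 0);
    }

    /** Builds the header for an outgoing call: same trace ID, this service's span as the new parent. */
    String forChildCall(String currentSpanId) {
        return "00-" + traceId + "-" + currentSpanId + "-" + (sampled ? "01" : "00");
    }
}
```

The trace ID stays constant across every hop, while the span ID field is replaced by the ID of the span that makes the outgoing call. This is what lets the backend reassemble the parent/child tree.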

Tool Comparison#

| Tool | Features | Suitable For |
|---|---|---|
| Jaeger | CNCF project, excellent UI | Kubernetes environments |
| Zipkin | Lightweight, easy setup | Quick start |
| Tempo | Grafana integration, low cost | When using Grafana |
| AWS X-Ray | AWS integration | AWS environments |

Architecture (Jaeger)#

graph TB
    APP["Application"] --> |"spans"| AGENT["Jaeger Agent"]
    AGENT --> COLLECTOR["Jaeger Collector"]
    COLLECTOR --> STORAGE["Storage<br>(Elasticsearch/Cassandra)"]
    STORAGE --> QUERY["Jaeger Query"]
    QUERY --> UI["Jaeger UI"]

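This diagram shows the classic Jaeger data flow: Application, Agent, Collector, Storage, Query, and UI. The Agent is deprecated in recent Jaeger releases; applications can instead export spans over OTLP directly to the Collector.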

Spring Boot Configuration#

Add Dependencies#

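// build.gradle.kts
dependencies {
    // Actuator provides the management.tracing.* auto-configuration
    implementation("org.springframework.boot:spring-boot-starter-actuator")
    // Bridges the Micrometer Tracing API to OpenTelemetry
    implementation("io.micrometer:micrometer-tracing-bridge-otel")
    // Exports spans over OTLP
    implementation("io.opentelemetry:opentelemetry-exporter-otlp")
}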

application.yml#

management:
  tracing:
    sampling:
      probability: 1.0  # Development: 100%, Production: 0.1 (10%)
  otlp:
    tracing:
      endpoint: http://jaeger:4318/v1/traces

logging:
  pattern:
    level: "%5p [${spring.application.name:},%X{traceId:-},%X{spanId:-}]"

Manual Span Creation#

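import io.micrometer.tracing.Span;
import io.micrometer.tracing.Tracer;
import lombok.RequiredArgsConstructor;
import org.springframework.stereotype.Service;

@Service
@RequiredArgsConstructor
public class OrderService {
    private final Tracer tracer; // Micrometer Tracing facade, injected by Spring

    public Order processOrder(OrderRequest request) {
        // Start a new span as a child of the span currently in scope
        Span span = tracer.nextSpan().name("processOrder").start();
        // Put the span in scope so nested calls and log lines pick up its context
        try (Tracer.SpanInScope ws = tracer.withSpan(span)) {
            span.tag("order.type", request.getType());
            span.event("Processing started");

            // createOrder(...) is the existing business logic (not shown)
            Order order = createOrder(request);

            span.event("Order created");
            return order;
        } catch (RuntimeException e) {
            span.error(e); // Record the failure on the span before rethrowing
            throw e;
        } finally {
            span.end(); // Always end the span so it is exported
        }
    }
}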

Sampling Strategy#

Storing every trace drives up storage and ingestion costs as traffic grows. Sampling keeps only a subset of traces to keep those costs under control.

Sampling Methods#

| Method | Description | Suitable For |
|---|---|---|
| Probabilistic | Collect fixed percentage | General use |
| Rate Limiting | Collect N per second | Traffic spikes |
| Tail-based | Prioritize errors/slow requests | Problem analysis focus |

| Environment | Sampling Rate | Reason |
|---|---|---|
| Development | 100% | Track all requests |
| Staging | 50% | Sufficient data |
| Production | 1-10% | Cost optimization |
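The first two methods are simple enough to sketch in a few lines. This is an illustrative sketch, not the implementation used by any specific SDK: `RatioSampler` and `RateLimitingSampler` are hypothetical names, the trace ID is assumed to be 32 lowercase hex characters, and the rate limiter uses a plain fixed one-second window with the current time passed in by the caller.

```java
/** Probabilistic sampling: keeps a fixed fraction of traces, decided from the trace ID. */
final class RatioSampler {
    private final double ratio;

    RatioSampler(double ratio) {
        if (ratio < 0.0 || ratio > 1.0) {
            throw new IllegalArgumentException("ratio must be in [0, 1]");
        }
        this.ratio = ratio;
    }

    /** traceId is assumed to be 32 lowercase hex characters. */
    boolean shouldSample(String traceId) {
        if (ratio >= 1.0) return true;
        if (ratio <= 0.0) return false;
        // Treat the lower 64 bits of the trace ID as a pseudo-random number, so that
        // every service reaches the same decision for the same trace.
        long lower = Long.parseUnsignedLong(traceId.substring(16), 16) & Long.MAX_VALUE;
        return lower < (long) (ratio * Long.MAX_VALUE);
    }
}

/** Rate limiting: keeps at most maxPerSecond traces in each one-second window. */
final class RateLimitingSampler {
    private final int maxPerSecond;
    private long windowStartMillis;
    private int keptInWindow;
    private boolean started;

    RateLimitingSampler(int maxPerSecond) {
        this.maxPerSecond = maxPerSecond;
    }

    /** nowMillis is supplied by the caller, e.g. System.currentTimeMillis(). */
    synchronized boolean shouldSample(long nowMillis) {
        if (!started || nowMillis - windowStartMillis >= 1000) {
            // Open a new window and reset the counter.
            started = true;
            windowStartMillis = nowMillis;
            keptInWindow = 0;
        }
        if (keptInWindow < maxPerSecond) {
            keptInWindow++;
            return true;
        }
        return false;
    }
}
```

Deriving the decision from the trace ID rather than from a fresh random number avoids partial traces: either all services keep a given trace or none do. Tail-based sampling needs the finished trace and is therefore usually done in a collector rather than in the application.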

100% Collection on Errors#

# OpenTelemetry Collector configuration
processors:
  tail_sampling:
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow
        type: latency
        latency:
          threshold_ms: 1000
      - name: probabilistic
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
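The decision this configuration expresses can be sketched as plain code. The sketch assumes that the three policies are combined with OR (a trace is kept if any policy matches) and that latency means the time from the earliest span start to the latest span end. `FinishedSpan` and `TailSamplingDecision` are hypothetical names, and the random roll is passed in by the caller so that the logic stays deterministic.

```java
import java.util.List;

/** Minimal view of a completed span: error flag plus start and end time in milliseconds. */
record FinishedSpan(boolean error, long startMillis, long endMillis) {}

final class TailSamplingDecision {
    static final long SLOW_THRESHOLD_MILLIS = 1000; // mirrors threshold_ms
    static final double BASELINE_PERCENTAGE = 10.0; // mirrors sampling_percentage

    private TailSamplingDecision() {}

    /**
     * Decides whether to keep a completed trace.
     * roll is a number in [0, 100) supplied by the caller, e.g. Math.random() * 100.
     */
    static boolean keep(List<FinishedSpan> spans, double roll) {
        // Policy "errors": any span with an error status keeps the whole trace.
        boolean hasError = spans.stream().anyMatch(FinishedSpan::error);

        // Policy "slow": the trace duration spans earliest start to latest end.
        long start = spans.stream().mapToLong(FinishedSpan::startMillis).min().orElse(0);
        long end = spans.stream().mapToLong(FinishedSpan::endMillis).max().orElse(0);
        boolean isSlow = end - start >= SLOW_THRESHOLD_MILLIS;

        // Policy "probabilistic": keep a baseline share of ordinary traces.
        boolean baseline = roll < BASELINE_PERCENTAGE;

        return hasError || isSlow || baseline;
    }
}
```

Because the decision needs every span of the trace, the collector has to buffer spans until the trace is complete. That buffering is the main operational cost of tail-based sampling compared with the head-based methods.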

Connecting Logs/Metrics#

Connection via Trace ID#

graph LR
    METRIC["Metric Alert<br>Error rate spike"]
    LOG["Logs<br>Search by trace_id"]
    TRACE["Trace<br>Detailed flow"]

    METRIC --> |"Check time range"| LOG
    LOG --> |"Extract trace_id"| TRACE

This diagram shows the analysis flow from metric alerts to logs, extracting trace_id from logs to connect to traces.

Include Trace ID in Logs#

// Micrometer Tracing puts traceId and spanId into the logging MDC,
// so the log pattern can print them with %X{traceId:-} and %X{spanId:-}

// Log output example
2026-01-12 10:30:00 INFO [order-service,abc123def456,span001] Order created: 12345
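Going from a log line to its trace means pulling the trace ID back out of the text. The following is a minimal sketch that assumes the `[service,traceId,spanId]` layout produced by the log pattern in this document; `LogTraceIds` is a hypothetical helper, not part of Spring Boot.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LogTraceIds {
    // Matches the "[service,traceId,spanId]" block produced by the log pattern.
    private static final Pattern CORRELATION =
            Pattern.compile("\\[([^,\\]]*),([^,\\]]*),([^,\\]]*)\\]");

    private LogTraceIds() {}

    /** Returns the trace ID of a log line, or empty if the line carries none. */
    static Optional<String> traceIdOf(String logLine) {
        Matcher m = CORRELATION.matcher(logLine);
        // Lines logged outside a trace have an empty traceId field.
        if (m.find() && !m.group(2).isEmpty()) {
            return Optional.of(m.group(2));
        }
        return Optional.empty();
    }
}
```

Log tooling typically applies the same kind of regular expression to turn the trace ID in each log line into a link to the trace view.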

Connection in Grafana#

1. Identify error rate spike in dashboard
2. Go to Explore → Loki
3. Search {service="order-service"} |= "ERROR"
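4. Click the trace_id link in a log line (this assumes a derived field for the trace ID is configured on the Loki data source)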
5. View full trace in Tempo/Jaeger

Analysis Patterns#

Finding Bottleneck Segments#

Trace analysis (total duration per span, children included):
├─ API Gateway (10ms) ✓
├─ Order Service (2055ms)
│   ├─ Validation (5ms) ✓
│   └─ Payment Call (2000ms) ← bottleneck!
│       └─ Payment DB (1800ms) ← root cause
└─ Response (5ms) ✓
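The reasoning behind "bottleneck" versus "root cause" is a self-time calculation: a span's own cost is its duration minus the time spent in its direct children. The sketch below uses hypothetical names (`TimedSpan`, `BottleneckFinder`) and assumes that children run sequentially, so their durations can simply be summed.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Minimal span view for timing analysis; parentSpanId is null for the root. */
record TimedSpan(String spanId, String parentSpanId, String name, long durationMillis) {}

final class BottleneckFinder {
    private BottleneckFinder() {}

    /** Self time per span name: duration minus the summed duration of direct children. */
    static Map<String, Long> selfTimes(List<TimedSpan> spans) {
        // Total time spent in the direct children of each span.
        Map<String, Long> childTotal = new HashMap<>();
        for (TimedSpan s : spans) {
            if (s.parentSpanId() != null) {
                childTotal.merge(s.parentSpanId(), s.durationMillis(), Long::sum);
            }
        }
        Map<String, Long> self = new HashMap<>();
        for (TimedSpan s : spans) {
            long children = childTotal.getOrDefault(s.spanId(), 0L);
            // Clamp at zero in case children overlap or clocks are skewed.
            self.put(s.name(), Math.max(0L, s.durationMillis() - children));
        }
        return self;
    }

    /** Name of the span with the largest self time. */
    static String slowest(List<TimedSpan> spans) {
        return selfTimes(spans).entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElseThrow();
    }
}
```

Applied to the trace above, Payment Call has a duration of 2000 ms but a self time of only 200 ms, because 1800 ms of it is spent in Payment DB. That is why the database span, not the call that wraps it, is the root cause.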

Error Tracking#

Trace analysis (total duration per span, children included):
├─ API Gateway (10ms) ✓
└─ Order Service (250ms) ✗ Error
    ├─ Inventory Check (200ms)
    └─ Error: "Insufficient stock"

Service Dependency Map#

Visualize connections between services in Jaeger/Tempo:

graph LR
    GW["API Gateway"] --> ORDER["Order"]
    GW --> USER["User"]
    ORDER --> PAYMENT["Payment"]
    ORDER --> INVENTORY["Inventory"]
    ORDER --> NOTIFICATION["Notification"]
    PAYMENT --> PAYMENT_DB["Payment DB"]
    INVENTORY --> INVENTORY_DB["Inventory DB"]

This diagram shows a service dependency map visualizing the connections from API Gateway to each service and database.
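A dependency map like this can be derived from span data alone: every parent/child pair whose spans belong to different services is one edge. The following sketch illustrates the idea with hypothetical names (`ServiceSpan`, `DependencyMap`); it is not how Jaeger or Tempo implement the feature internally.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Minimal span view for dependency analysis; parentSpanId is null for the root. */
record ServiceSpan(String spanId, String parentSpanId, String serviceName) {}

final class DependencyMap {
    private DependencyMap() {}

    /** Returns "caller -> callee" edges for parent/child pairs that cross a service boundary. */
    static Set<String> edges(List<ServiceSpan> spans) {
        // Index spans by ID so that each span's parent service can be looked up.
        Map<String, String> serviceBySpanId = new HashMap<>();
        for (ServiceSpan s : spans) {
            serviceBySpanId.put(s.spanId(), s.serviceName());
        }
        Set<String> edges = new TreeSet<>(); // sorted and de-duplicated
        for (ServiceSpan s : spans) {
            String parentService = serviceBySpanId.get(s.parentSpanId());
            // Skip root spans and calls that stay inside one service.
            if (parentService != null && !parentService.equals(s.serviceName())) {
                edges.add(parentService + " -> " + s.serviceName());
            }
        }
        return edges;
    }
}
```

Aggregating these edges over many traces yields the full map. With aggressive sampling, rarely used call paths may be missing from it.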

Alerting Rules#

Trace-based Alerts#

# Prometheus alerting rule (Tempo integration)
groups:
  - name: tracing
    rules:
      - alert: HighTraceErrorRate
        expr: |
          sum(rate(traces_spanmetrics_calls_total{status_code="STATUS_CODE_ERROR"}[5m]))
          / sum(rate(traces_spanmetrics_calls_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High trace error rate"

      - alert: SlowSpans
        expr: |
          histogram_quantile(0.99, sum(rate(traces_spanmetrics_latency_bucket[5m])) by (le, service))
          > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 span latency > 2s"

Key Summary#

| Concept | Description |
|---|---|
| Trace | Entire request path |
| Span | Individual unit of work |
| Context | Tracking info passed between services |
| Sampling | Selective collection for cost optimization |

Implementation Checklist:

  • Add OpenTelemetry SDK
  • Configure sampling rate
  • Include trace_id in logs
  • Deploy Jaeger/Tempo
  • Integrate with Grafana

Next Steps#

| Recommended Order | Document | What You’ll Learn |
|---|---|---|
| 1 | OpenTelemetry | Standardized instrumentation |
| 2 | Full-Stack Example | Integration hands-on |