Target Audience: Developers and SREs operating microservices
Prerequisites: Three Pillars of Observability
After Reading: You’ll understand distributed tracing and be able to analyze request flows between services

TL;DR#

Key Summary:

  • Trace: Entire path of one request (composed of multiple Spans)
  • Span: Single unit of work (start/end time, metadata)
  • Context Propagation: Passing Trace ID between services
  • Sampling: Store only a portion of total traces (cost optimization)

Why Is Distributed Tracing Necessary?#

In microservices, a single request passes through multiple services. It’s hard to identify where delays occur.

graph LR
    USER["User"] --> GW["API Gateway"]
    GW --> ORDER["Order Service"]
    ORDER --> PAYMENT["Payment Service"]
    ORDER --> INVENTORY["Inventory Service"]
    PAYMENT --> DB1["Payment DB"]
    INVENTORY --> DB2["Inventory DB"]

    style PAYMENT fill:#ffcdd2

Problem: The response is slow, but you can’t tell where the time is being spent.

Solution: Use distributed tracing to measure the time spent in each segment.


Core Concepts#

Trace and Span#

graph TB
    subgraph "Trace (Entire Request)"
        S1["Span: API Gateway<br>0-250ms"]
        S2["Span: Order Service<br>10-200ms"]
        S3["Span: Payment Service<br>20-180ms"]
        S4["Span: Payment DB<br>30-150ms"]
    end

    S1 --> S2
    S2 --> S3
    S3 --> S4

| Term | Description |
| --- | --- |
| Trace | Entire request path (unique Trace ID) |
| Span | Individual unit of work (unique Span ID) |
| Parent Span | The Span that called the current Span |
| Root Span | The first Span (has no Parent) |
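The parent/child relationships above can be sketched as a tiny data model. This is a hypothetical illustration of the concept, not a real tracing-library API:

```java
import java.util.UUID;

// Hypothetical model of Trace/Span identity (illustration only).
public class SpanModel {
    public record Span(String traceId, String spanId, String parentSpanId, String name) {
        // A Root Span has no parent.
        public boolean isRoot() { return parentSpanId == null; }

        // A child span inherits the trace ID but gets a fresh span ID,
        // and records its caller's span ID as parent.
        public Span child(String childName) {
            return new Span(traceId, newId(), spanId, childName);
        }
    }

    static String newId() {
        return UUID.randomUUID().toString().replace("-", "").substring(0, 16);
    }

    public static Span newRootSpan(String name) {
        return new Span(newId(), newId(), null, name);
    }
}
```

Every span in one request shares the same `traceId`; the `parentSpanId` chain is what lets the backend reassemble the tree shown above.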

Span Structure#

{
  "traceId": "abc123def456",
  "spanId": "span001",
  "parentSpanId": null,
  "operationName": "HTTP GET /orders",
  "serviceName": "order-service",
  "startTime": 1704700800000,
  "duration": 245,
  "tags": {
    "http.method": "GET",
    "http.status_code": 200,
    "http.url": "/orders/123"
  },
  "logs": [
    {
      "timestamp": 1704700800100,
      "message": "Fetching order from database"
    }
  ]
}

Context Propagation#

The mechanism for passing the Trace ID (and the parent Span ID) from one service to the next, usually via HTTP headers.

sequenceDiagram
    participant A as Service A
    participant B as Service B
    participant C as Service C

    A->>B: HTTP Request<br>traceparent: 00-abc123-span1-01
    Note over B: Extract context<br>Create child span
    B->>C: HTTP Request<br>traceparent: 00-abc123-span2-01
    Note over C: Extract context<br>Create child span

W3C Trace Context format (the trace-id is 32 hex characters, the span-id 16, and the final flags byte carries the sampling decision):

traceparent: 00-{trace-id}-{span-id}-{flags}
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
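A minimal parser for this header might look like the following. This is a sketch for version `00` only; real implementations such as the OpenTelemetry SDK handle many more edge cases:

```java
// Sketch of parsing a W3C "traceparent" header (version 00 only).
public class TraceParent {
    public record Context(String traceId, String spanId, boolean sampled) {}

    // Expects "00-{32 hex trace-id}-{16 hex span-id}-{2 hex flags}".
    public static Context parse(String header) {
        String[] parts = header.split("-");
        if (parts.length != 4 || !parts[0].equals("00")
                || parts[1].length() != 32 || parts[2].length() != 16) {
            throw new IllegalArgumentException("malformed traceparent: " + header);
        }
        // Bit 0 of the flags byte is the "sampled" flag.
        boolean sampled = (Integer.parseInt(parts[3], 16) & 0x01) == 1;
        return new Context(parts[1], parts[2], sampled);
    }
}
```

The receiving service extracts this context, then creates a child span that reuses the trace-id and records the incoming span-id as its parent.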

Tool Comparison#

| Tool | Features | Suitable For |
| --- | --- | --- |
| Jaeger | CNCF project, excellent UI | Kubernetes environments |
| Zipkin | Lightweight, easy setup | Quick start |
| Tempo | Grafana integration, low cost | When using Grafana |
| AWS X-Ray | AWS integration | AWS environments |

Architecture (Jaeger)#

graph TB
    APP["Application"] --> |"spans"| AGENT["Jaeger Agent"]
    AGENT --> COLLECTOR["Jaeger Collector"]
    COLLECTOR --> STORAGE["Storage<br>(Elasticsearch/Cassandra)"]
    STORAGE --> QUERY["Jaeger Query"]
    QUERY --> UI["Jaeger UI"]
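For local experiments, this whole pipeline can be collapsed into one container with the `jaegertracing/all-in-one` image, which bundles collector, query, and UI with in-memory storage. A minimal sketch (image tag and OTLP settings are assumptions; check the Jaeger docs for your version):

```yaml
# docker-compose.yml — hypothetical local Jaeger setup
services:
  jaeger:
    image: jaegertracing/all-in-one:1.57
    environment:
      COLLECTOR_OTLP_ENABLED: "true"  # accept OTLP spans directly
    ports:
      - "16686:16686"   # Jaeger UI
      - "4318:4318"     # OTLP/HTTP span ingest
```

With this running, the `http://jaeger:4318/v1/traces` endpoint used in the Spring Boot configuration below resolves to the all-in-one container.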

Spring Boot Configuration#

Add Dependencies#

// build.gradle.kts
dependencies {
    implementation("io.micrometer:micrometer-tracing-bridge-otel")
    implementation("io.opentelemetry:opentelemetry-exporter-otlp")
}

application.yml#

management:
  tracing:
    sampling:
      probability: 1.0  # Development: 100%, Production: 0.1 (10%)
  otlp:
    tracing:
      endpoint: http://jaeger:4318/v1/traces

logging:
  pattern:
    level: "%5p [${spring.application.name:},%X{traceId:-},%X{spanId:-}]"

Manual Span Creation#

@Service
@RequiredArgsConstructor
public class OrderService {
    private final Tracer tracer;

    public Order processOrder(OrderRequest request) {
        Span span = tracer.nextSpan().name("processOrder").start();
        try (Tracer.SpanInScope ws = tracer.withSpan(span)) {
            span.tag("order.type", request.getType());
            span.event("Processing started");

            Order order = createOrder(request);

            span.event("Order created");
            return order;
        } catch (Exception e) {
            span.error(e); // record the failure on the span
            throw e;
        } finally {
            span.end(); // always close the span, even on failure
        }
    }
}

Sampling Strategy#

Storing every trace quickly becomes expensive at scale. Sampling keeps only a representative subset to control cost.

Sampling Methods#

| Method | Description | Suitable For |
| --- | --- | --- |
| Probabilistic | Collect a fixed percentage | General use |
| Rate Limiting | Collect N traces per second | Traffic spikes |
| Tail-based | Prioritize errors/slow requests | Problem analysis focus |

Recommended rates per environment:

| Environment | Sampling Rate | Reason |
| --- | --- | --- |
| Development | 100% | Track all requests |
| Staging | 50% | Sufficient data |
| Production | 1-10% | Cost optimization |
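Probabilistic samplers are typically implemented by hashing the trace ID, so every span of the same trace gets the same decision. A minimal sketch (not any specific library's algorithm):

```java
// Sketch of a trace-ID-based probabilistic sampling decision.
// Hashing the trace ID keeps the decision consistent across
// all services that see the same trace.
public class ProbabilisticSampler {
    public static boolean shouldSample(String traceId, double probability) {
        if (probability >= 1.0) return true;
        if (probability <= 0.0) return false;
        // Map the trace ID to a stable bucket in [0, 10000) and compare.
        long h = traceId.hashCode() & 0x7fffffffL;
        return (h % 10_000) / 10_000.0 < probability;
    }
}
```

Because the decision is a pure function of the trace ID, a downstream service that re-evaluates it reaches the same verdict as the upstream one, keeping traces whole.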

100% Collection on Errors#

# OpenTelemetry Collector configuration
processors:
  tail_sampling:
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow
        type: latency
        latency:
          threshold_ms: 1000
      - name: probabilistic
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
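The three policies above can be paraphrased as plain decision logic. This is a sketch, not the actual Collector implementation; `hash01` stands in for the probabilistic draw, and the slowest span is used as a proxy for trace latency:

```java
import java.util.List;

// Sketch of tail-based sampling: keep all error traces, all traces
// slower than 1s, and ~10% of the rest (hypothetical model).
public class TailSampler {
    public record Span(boolean error, long durationMs) {}

    public static boolean keep(List<Span> trace, double hash01) {
        boolean hasError = trace.stream().anyMatch(Span::error);
        long slowest = trace.stream().mapToLong(Span::durationMs).max().orElse(0);
        if (hasError) return true;      // policy: status_code ERROR
        if (slowest > 1000) return true; // policy: latency > 1000ms
        return hash01 < 0.10;           // policy: probabilistic 10%
    }
}
```

The key property of tail-based sampling is visible here: the decision needs the *whole* trace, which is why it runs in the Collector after spans are buffered, not in the application.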

Connecting Logs/Metrics#

Connection via Trace ID#

graph LR
    METRIC["Metric Alert<br>Error rate spike"]
    LOG["Logs<br>Search by trace_id"]
    TRACE["Trace<br>Detailed flow"]

    METRIC --> |"Check time range"| LOG
    LOG --> |"Extract trace_id"| TRACE

Include Trace ID in Logs#

// Spring Boot includes the IDs automatically via the MDC
// Log pattern: %X{traceId:-}

// Example log output
2026-01-12 10:30:00 INFO [order-service,abc123def456,span001] Order created: 12345

Connection in Grafana#

1. Identify error rate spike in dashboard
2. Go to Explore → Loki
3. Search {service="order-service"} |= "ERROR"
4. Click trace_id in log
5. View full trace in Tempo/Jaeger

Analysis Patterns#

Finding Bottleneck Segments#

Trace analysis:
├─ API Gateway (10ms) ✓
├─ Order Service (50ms) ✓
│   ├─ Validation (5ms) ✓
│   └─ Payment Call (2000ms) ← bottleneck!
│       └─ Payment DB (1800ms) ← root cause
└─ Response (5ms) ✓
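One way to automate this kind of analysis is to compute each span's self time (its own duration minus that of its direct children) and pick the maximum. A sketch with a hypothetical span model:

```java
import java.util.Comparator;
import java.util.List;

// Sketch: find the bottleneck span as the one with the largest
// self time = duration minus direct children's durations.
public class BottleneckFinder {
    public record Span(String id, String parentId, String name, long durationMs) {}

    public static Span slowestSelfTime(List<Span> spans) {
        return spans.stream()
            .max(Comparator.comparingLong(s -> selfTime(s, spans)))
            .orElseThrow();
    }

    static long selfTime(Span s, List<Span> spans) {
        long children = spans.stream()
            .filter(c -> s.id().equals(c.parentId()))
            .mapToLong(Span::durationMs)
            .sum();
        return s.durationMs() - children;
    }
}
```

Self time matters because a span like the Payment Call above is slow only by inheritance; the span that actually spends the time (the Payment DB) is the root cause.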

Error Tracking#

Trace analysis:
├─ API Gateway (10ms) ✓
├─ Order Service (50ms) ✗ Error
│   ├─ Inventory Check (200ms)
│   └─ Error: "Insufficient stock"

Service Dependency Map#

Visualize connections between services in Jaeger/Tempo:

graph LR
    GW["API Gateway"] --> ORDER["Order"]
    GW --> USER["User"]
    ORDER --> PAYMENT["Payment"]
    ORDER --> INVENTORY["Inventory"]
    ORDER --> NOTIFICATION["Notification"]
    PAYMENT --> PAYMENT_DB["Payment DB"]
    INVENTORY --> INVENTORY_DB["Inventory DB"]

Alerting Rules#

Trace-based Alerts#

# Prometheus alerting rule (Tempo integration)
groups:
  - name: tracing
    rules:
      - alert: HighTraceErrorRate
        expr: |
          sum(rate(traces_spanmetrics_calls_total{status_code="STATUS_CODE_ERROR"}[5m]))
          / sum(rate(traces_spanmetrics_calls_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High trace error rate"

      - alert: SlowSpans
        expr: |
          histogram_quantile(0.99, sum(rate(traces_spanmetrics_latency_bucket[5m])) by (le, service))
          > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 span latency > 2s"

Key Summary#

| Concept | Description |
| --- | --- |
| Trace | Entire request path |
| Span | Individual unit of work |
| Context | Tracking info passed between services |
| Sampling | Selective collection for cost optimization |

Implementation Checklist:

  • Add OpenTelemetry SDK
  • Configure sampling rate
  • Include trace_id in logs
  • Deploy Jaeger/Tempo
  • Grafana integration

Next Steps#

| Recommended Order | Document | What You’ll Learn |
| --- | --- | --- |
| 1 | OpenTelemetry | Standardized instrumentation |
| 2 | Full-Stack Example | Integration hands-on |