Time Required: 30 minutes
Prerequisites: Spring Boot Metrics
What You’ll Learn: Tracking down issues by connecting the three pillars (Metrics, Logs, Traces)
Goal#
“Error rate spike” → “Find trace_id in logs” → “Identify root cause with traces”
graph LR
A["1. Metrics<br>Detect 5% error rate"]
B["2. Logs<br>Find trace_id"]
C["3. Traces<br>Discover bottleneck"]
D["4. Resolution<br>Optimize DB query"]
A --> B --> C --> D

Scenario: Order Service Failure Analysis#
System Architecture#
graph LR
USER["User"] --> GW["API Gateway"]
GW --> ORDER["Order Service"]
ORDER --> PAYMENT["Payment Service"]
ORDER --> INVENTORY["Inventory Service"]
PAYMENT --> DB["Payment DB"]

Step 1: Failure Detection (Metrics)#
Grafana Alert Triggered#
Alert: HighErrorRate
Service: order-service
Error Rate: 5.2%
Threshold: 1%

Check Dashboard#
# Confirm error rate spike
sum(rate(http_server_requests_seconds_count{application="order-service",status=~"5.."}[5m]))
/ sum(rate(http_server_requests_seconds_count{application="order-service"}[5m]))
# Which endpoint?
sum by (uri) (rate(http_server_requests_seconds_count{application="order-service",status=~"5.."}[5m]))

Result: Errors occurring on the POST /orders endpoint
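The ratio in the first query is simply 5xx requests over all requests. As a rough illustration of the arithmetic (hypothetical sample counts, not real Prometheus output), the same calculation in Python:

```python
def error_rate(counts_by_status: dict[str, int]) -> float:
    """Share of 5xx responses among all responses, mirroring the PromQL ratio above."""
    total = sum(counts_by_status.values())
    errors = sum(n for status, n in counts_by_status.items() if status.startswith("5"))
    return errors / total if total else 0.0

# Hypothetical request counts over a 5m window for order-service.
counts = {"200": 950, "500": 48, "503": 4}
print(f"error rate: {error_rate(counts):.1%}")  # error rate: 5.2%
```

The same division, done continuously over sliding windows, is what drives the alert.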
Step 2: Log Analysis (Logs)#
Search Error Logs in Loki#
{app="order-service"} |= "ERROR" | json
# Errors in the last 10 minutes
{app="order-service"} | json | level="ERROR"
# Specific endpoint
{app="order-service"} |= "/orders" |= "ERROR"

Log Results#
{
  "timestamp": "2026-01-12T10:30:00Z",
  "level": "ERROR",
  "service": "order-service",
  "message": "Payment processing failed",
  "traceId": "abc123def456",
  "spanId": "span001",
  "error": "Connection timeout",
  "user_id": "user-789"
}

Key Information: traceId: abc123def456
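Because the logs are structured JSON, the LogQL `json` stage (and any downstream tooling) can filter by level and read the traceId field directly. A minimal Python equivalent, using the log entry above:

```python
import json

# The error log entry from above, as the raw JSON string Loki stored (abbreviated).
raw = '{"level": "ERROR", "message": "Payment processing failed", "traceId": "abc123def456"}'

entry = json.loads(raw)
if entry["level"] == "ERROR":
    trace_id = entry["traceId"]
    print(trace_id)  # abc123def456
```

This is why emitting traceId as a top-level JSON field, rather than burying it in the message text, pays off during incidents.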
Step 3: Trace Analysis (Traces)#
Search by Trace ID in Tempo#
- Grafana → Explore → Tempo
- Search by Trace ID: abc123def456
Trace Results#
Trace: abc123def456 (Total: 3500ms)
├─ order-service: POST /orders (3500ms)
│  ├─ validateOrder (10ms) ✓
│  ├─ checkInventory (50ms) ✓
│  └─ processPayment (3400ms) ✗
│     └─ payment-service: /process (3400ms)
│        └─ DB Query (3300ms) ← Bottleneck!

Discovery: Payment DB query taking 3.3 seconds
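What makes the DB query stand out is its self time: a span's duration minus the time spent in its children. A hand-rolled sketch over the spans above (invented span IDs; real span data would come from Tempo's API):

```python
# (span_id, parent_id, name, duration_ms) — mirrors the trace tree above.
spans = [
    ("s0", None, "order-service: POST /orders", 3500),
    ("s1", "s0", "validateOrder", 10),
    ("s2", "s0", "checkInventory", 50),
    ("s3", "s0", "processPayment", 3400),
    ("s4", "s3", "payment-service: /process", 3400),
    ("s5", "s4", "DB Query", 3300),
]

def self_times(spans):
    """Per-span duration minus children's durations: where time is actually spent."""
    own = {sid: dur for sid, _, _, dur in spans}
    for _, parent, _, dur in spans:
        if parent is not None:
            own[parent] -= dur
    names = {sid: name for sid, _, name, _ in spans}
    return {names[sid]: ms for sid, ms in own.items()}

times = self_times(spans)
slowest = max(times, key=times.get)
print(slowest, times[slowest])  # DB Query 3300
```

processPayment's 3400ms nearly all belongs to its children; the leaf DB Query span owns 3300ms of it, which is why it, and not its parents, is the bottleneck.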
Step 4: Root Cause Analysis#
Check DB Metrics#
# DB connection utilization
hikaricp_connections_active{application="payment-service"}
/ hikaricp_connections_max{application="payment-service"}
# Slow queries
rate(pg_stat_statements_total_time_seconds_sum[5m])
/ rate(pg_stat_statements_calls_total[5m])

Result: Connection pool saturated + specific query slow
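The first query is just active connections divided by the pool maximum. A hypothetical check in Python (gauge values invented; in practice they come from the HikariCP metrics above):

```python
def pool_utilization(active: int, maximum: int) -> float:
    """HikariCP pool utilization: hikaricp_connections_active / hikaricp_connections_max."""
    return active / maximum

# Invented gauge values matching the saturated-pool scenario.
active, maximum = 10, 10
ratio = pool_utilization(active, maximum)
print(f"utilization={ratio:.0%} saturated={ratio >= 0.9}")  # utilization=100% saturated=True
```

A sustained ratio near 1.0 means requests are queueing for connections, which shows up as exactly the kind of multi-second spans seen in the trace.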
Resolution#
- Add DB index
- Increase connection pool size
- Optimize query
Integrated Configuration#
Link Grafana Datasources#
# grafana/provisioning/datasources/datasources.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    uid: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - name: traceId
          matcherRegex: '"traceId":"([^"]+)"'
          url: '$${__value.raw}'
          datasourceUid: tempo
  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki
        filterByTraceID: true
        filterBySpanID: true
      tracesToMetrics:
        datasourceUid: prometheus

Logs → Traces Connection#
Click the traceId field in a Loki log line to jump to the corresponding trace in Tempo
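That jump works because the Loki datasource's derivedFields entry extracts the trace ID from each log line with its matcherRegex. A quick Python sketch of the same extraction (same pattern as in the config above):

```python
import re

# Same pattern as matcherRegex in the Loki datasource configuration above.
MATCHER = re.compile(r'"traceId":"([^"]+)"')

log_line = '{"level":"ERROR","message":"Payment processing failed","traceId":"abc123def456"}'
match = MATCHER.search(log_line)
print(match.group(1))  # abc123def456
```

If your logging format renders the field differently (say, with spaces around the colon), the regex must be adjusted to match, or the link will not appear.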
Traces → Logs Connection#
Click a span in Tempo to display the Loki logs for that time range
Traces → Metrics Connection#
Click a service in Tempo to display its Prometheus metrics
Dashboard Template#
Unified Dashboard#
┌─────────────────────────────────────────────────────────────┐
│ Row 1: Core Metrics (Prometheus) │
│ [P99 Latency] [RPS] [Error Rate] [Active Traces] │
├─────────────────────────────────────────────────────────────┤
│ Row 2: Error Logs (Loki) │
│ Live tail: {app=~".*"} | json | level="ERROR" │
├─────────────────────────────────────────────────────────────┤
│ Row 3: Recent Traces (Tempo) │
│ Table: Recent traces with errors │
├─────────────────────────────────────────────────────────────┤
│ Row 4: Service Map (Tempo) │
│ Service dependency graph │
└─────────────────────────────────────────────────────────────┘

Alert → Analysis Workflow#
1. Receive Alert#
# Alertmanager → Slack
Alert: HighErrorRate
Service: order-service
Dashboard: https://grafana/d/order-service
Runbook: https://wiki/runbook/high-error-rate

2. Check Dashboard#
- Verify error rate, P99
- Identify affected endpoints
3. Search Logs#
{app="order-service"} | json | level="ERROR" | line_format "{{.traceId}}: {{.message}}"

4. Analyze Traces#
- Verify entire path with Trace ID
- Identify slow Spans
5. Take Action#
- Resolve root cause
- Verify improvement with metrics
Verification Checklist#
- All three pillars (metrics, logs, traces) are being collected
- Loki → Tempo connection working
- Tempo → Loki connection working
- Tempo → Prometheus connection working
- Unified dashboard configured
Next Steps#
| Recommended Order | Document | What You’ll Learn |
|---|---|---|
| 1 | Debugging High Latency | Problem resolution |
| 2 | Post-Alert Action Guide | Response methods |