- Target Scenario: P99 response time exceeds the SLA (500ms)
- Goal: Identify and resolve the bottleneck
- Duration: 15-30 minutes (depending on problem complexity)
- Success Criteria: P99 response time recovers below the SLA threshold (500ms)
## Problem Scenario
```
Alert: HighP99Latency
Service: order-service
P99: 2.5s (threshold: 500ms)
Duration: 10 minutes
```

## Diagnostic Workflow
```mermaid
graph TD
    A["1. Scope Analysis<br>Which service? Since when?"]
    B["2. Segment Analysis<br>Where is it slow?"]
    C["3. Resource Check<br>CPU/Memory/DB?"]
    D["4. Root Cause<br>Code? Query? External?"]
    E["5. Resolution"]
    A --> B --> C --> D --> E
```

## Step 1: Scope Analysis
### Check Impact Scope
```promql
# Which service is slow?
topk(5,
  histogram_quantile(0.99,
    sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
  )
)

# Since when did it become slow?
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket{service="order-service"}[5m]))
)
# → Time range: last 1 hour
```

### Check Specific Endpoints
```promql
# P99 by endpoint
histogram_quantile(0.99,
  sum by (uri, le) (rate(http_request_duration_seconds_bucket{service="order-service"}[5m]))
)
```

Result: the `POST /orders` endpoint is slow.
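The `histogram_quantile` queries above estimate P99 by linear interpolation inside the histogram bucket where the quantile falls. A minimal Python sketch of that calculation (simplified: it assumes the quantile lands in a finite bucket and ignores Prometheus's special handling of the `+Inf` bucket; the bucket counts are illustrative, not from this incident):

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound.
    """
    total = buckets[-1][1]
    rank = q * total  # number of observations at or below the quantile
    prev_bound, prev_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            # Linear interpolation within the bucket, as Prometheus does
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper_bound - prev_bound) * fraction
        prev_bound, prev_count = upper_bound, count
    return buckets[-1][0]


# Illustrative counts: 90% of requests finish under 0.5s,
# but the slowest tail stretches toward 2.5s
buckets = [(0.1, 50), (0.25, 70), (0.5, 90), (2.5, 99), (5.0, 100)]
print(histogram_quantile(0.99, buckets))  # → 2.5
```

This also shows why reported quantiles are only as precise as the bucket boundaries: any P99 between 0.5s and 2.5s is interpolated from the same two buckets.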
## Step 2: Segment Analysis (Tracing)
### Find Slow Traces
- Grafana → Explore → Tempo
- Filter by Duration > 2s
- Open the slowest traces and inspect span durations
### Trace Analysis Results
```
Trace: abc123 (Total: 2500ms)
├─ order-service: POST /orders (2500ms)
│  ├─ validateRequest (5ms) ✓
│  ├─ checkInventory (30ms) ✓
│  ├─ calculatePrice (10ms) ✓
│  └─ saveOrder (2400ms) ← Bottleneck!
│     └─ DB INSERT (2350ms) ← Root Cause
```

## Step 3: Resource Check
### Service Resources
```promql
# CPU usage (%)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage (%)
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Time spent in GC pauses (seconds of pause per second)
rate(jvm_gc_pause_seconds_sum[5m])
```

### DB Resources
```promql
# DB connection pool usage (%)
hikaricp_connections_active / hikaricp_connections_max * 100

# Threads waiting for a connection
hikaricp_connections_pending
```

Result: the DB connection pool is at 90% usage, and connection requests are queuing.
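Whether a pool of a given size can keep up follows from Little's law: the average number of connections in use equals the request rate times the average time each request holds a connection. A quick sanity-check sketch (the rate and timing below are illustrative assumptions, not measurements from this incident):

```python
def connections_in_use(request_rate_per_s, avg_hold_time_s):
    # Little's law: L = λ × W
    # average busy connections = arrival rate × time a connection is held
    return request_rate_per_s * avg_hold_time_s


# Illustrative: 50 orders/s, each holding a connection for ~0.3s on average
demand = connections_in_use(50, 0.3)
print(demand)  # → 15.0
```

At this illustrative load a pool of 10 saturates (requests queue as `hikaricp_connections_pending` shows), while 20 leaves headroom. Note that raising the pool size treats the symptom; if each query stays slow, hold time grows and the larger pool saturates too.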
## Step 4: Root Cause Analysis
### Check DB Queries
```sql
-- PostgreSQL slow queries
-- (on PostgreSQL 13+ the columns are named mean_exec_time / total_exec_time)
SELECT query, calls, mean_time, total_time
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 10;
```

Result: the `INSERT INTO orders` query is slow.
### Causes
- Large data volume in table without indexes
- DB connection pool saturation
- Transaction lock contention
## Step 5: Resolution
### Immediate Actions
```yaml
# Increase the connection pool
spring:
  datasource:
    hikari:
      maximum-pool-size: 20  # previously 10
```

### Root Solution
```sql
-- Add indexes
CREATE INDEX idx_orders_user_id ON orders(user_id);
CREATE INDEX idx_orders_created_at ON orders(created_at);
```

### Verify Improvements
```promql
# Check P99 changes
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket{service="order-service"}[5m]))
)
```

## Preventive Measures
### Add Alert Rules
```yaml
- alert: DBConnectionPoolHigh
  expr: hikaricp_connections_active / hikaricp_connections_max > 0.8
  for: 5m
  labels:
    severity: warning

- alert: SlowDBQueries
  expr: |
    rate(spring_data_repository_invocations_seconds_sum[5m])
      / rate(spring_data_repository_invocations_seconds_count[5m])
    > 0.5
  for: 5m
  labels:
    severity: warning
```

### Recording Rules
```yaml
- record: service:db_query_duration:avg
  expr: |
    rate(spring_data_repository_invocations_seconds_sum[5m])
      / rate(spring_data_repository_invocations_seconds_count[5m])
```

## Common Causes and Solutions
| Cause | Symptoms | Solution |
|---|---|---|
| Slow DB queries | Specific endpoints only | Indexes, query optimization |
| Connection pool shortage | General slowness | Increase pool size |
| GC pauses | Periodic latency spikes | Heap/GC tuning |
| External API | Specific calls only | Circuit breaker, timeout |
| CPU saturation | General slowness | Scale out |
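For the external-API row above, the circuit-breaker pattern can be sketched as follows. This is a minimal illustration only; in a Spring service a library such as Resilience4j would normally provide this, and the class name and thresholds here are made up for the example:

```python
import time


class CircuitBreaker:
    """Fail fast after repeated failures instead of waiting on timeouts."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Open: reject immediately, protecting our latency budget
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success closes the circuit again
        return result
```

The point for latency: once the breaker is open, slow external calls stop consuming request threads (and DB connections held across the call), so one degraded dependency cannot drag the whole service past its SLA.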
## Checklist
- Identify impact scope (service, endpoint)
- Confirm bottleneck segment via traces
- Check resource usage
- Identify root cause
- Verify metrics normalize after resolution
- Add preventive alerts