Target Audience: Developers new to Observability concepts
Prerequisites: Basic understanding of web application architecture
After Reading: You’ll understand the role of each pillar and know when to use which
TL;DR#
Key Summary:
- Metrics: “How much?” - Numerically measurable states (CPU 80%, response time 200ms)
- Logs: “What happened?” - Detailed records of individual events
- Traces: “From where to where?” - Tracking the entire path of a request
- The 3 pillars are complementary, most effective when used together
Why Are All Three Pillars Necessary?#
Analogy: A Doctor’s Diagnostic Process
When a patient complains “My stomach hurts,” how does the doctor diagnose?
- Temperature, blood pressure, pulse measurement (= Metrics): “Fever 38.5°C, blood pressure normal” → Understanding state through numbers
- Interview and symptom recording (= Logs): “Started 3 days ago, worsens after meals” → Collecting detailed context
- CT/X-ray imaging (= Traces): “Inflammation confirmed in appendix area” → Visualizing problem location
Diagnosis is impossible with just one. You can’t know it’s appendicitis from temperature alone, and you can’t pinpoint the exact location from symptom records alone. System observation is the same.
Understanding Through an Actual Incident#
Let’s analyze a single incident with all 3 pillars.
Situation: “Order API response has become slow”
```mermaid
graph TD
    A["Problem Recognition"] --> M["Metrics<br>P99 response time 3s"]
    M --> |"Since when?"| M2["Spiked in last 10 min"]
    M2 --> |"Detailed cause?"| L["Logs<br>DB connection timeout"]
    L --> |"Which request?"| T["Traces<br>Payment Service → DB delay"]
    T --> |"After improvement"| M
```

| Step | Used Pillar | Information Gained |
|---|---|---|
| 1. Anomaly detection | Metrics | “P99 response time spiked to 3s” |
| 2. Time identification | Metrics | “Started 10 minutes ago” |
| 3. Root cause tracking | Logs | “DB connection pool exhausted, timeout occurred” |
| 4. Path analysis | Traces | “Delay in Payment Service → DB segment” |
| 5. Improvement verification | Metrics | “Response time normalized after connection pool expansion” |
Each pillar alone has limitations:
- Metrics only: Know “it’s slow” but not “why”
- Logs only: Can see individual events but hard to grasp overall trends
- Traces only: Can see request flow but hard to grasp overall system state
Metrics#
Definition#
Metrics are numerical data measured over time. They express the system’s state in numbers.
Why Are Metrics Necessary?#
If logs record “what happened,” metrics tell you “how healthy is the system right now.”
Analogy: Car Dashboard
When the engine warning light comes on while driving, you immediately know “there’s a problem.” You don’t need to read the engine logs one by one. Metrics let you grasp the system’s health status at a glance.
The core value of metrics is aggregation and trend analysis. Questions like “How has the average response time changed over the last hour?” or “What’s the traffic difference between 10 AM today and yesterday?” can be answered immediately.
Characteristics#
| Property | Description |
|---|---|
| Aggregatable | Statistical operations like sum, avg, percentile possible |
| Storage efficient | Small footprint as only numbers are stored |
| Time series | Can identify trends based on time axis |
| Alert-suitable | Can set automatic alerts based on thresholds |
Metric Types#
```mermaid
graph LR
    subgraph "Metric Types"
        C["Counter<br>Cumulative Increase"]
        G["Gauge<br>Current Value"]
        H["Histogram<br>Distribution"]
        S["Summary<br>Percentiles"]
    end
    C --> |"Examples"| C1["Request count, Error count"]
    G --> |"Examples"| G1["CPU usage, Memory"]
    H --> |"Examples"| H1["Response time distribution"]
    S --> |"Examples"| S1["P50, P95, P99"]
```

Examples#
```promql
# Available memory (Gauge)
node_memory_MemAvailable_bytes
# Requests per second (Counter → converted with rate)
rate(http_requests_total[5m])
# P99 response time (Histogram)
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
```

When to Use?#
- System state monitoring (CPU, memory, disk)
- SLA/SLO measurement (response time, availability)
- Trend analysis (traffic patterns, growth rate)
- Alert condition settings
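The four metric types above can be sketched as plain data structures. This is a conceptual sketch only, not a real metrics library: the class shapes and bucket bounds are illustrative assumptions.

```python
import bisect

class Counter:
    """Cumulative, increase-only: request counts, error counts."""
    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        self.value += amount

class Gauge:
    """Current value, can go up or down: CPU usage, memory."""
    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value

class Histogram:
    """Distribution via upper-bound buckets: response times."""
    def __init__(self, buckets=(0.1, 0.25, 0.5, 1.0, 2.5, 5.0)):
        self.buckets = list(buckets)                 # upper bounds ("le")
        self.counts = [0] * (len(self.buckets) + 1)  # last slot = +Inf
        self.total = 0

    def observe(self, value):
        # First bucket whose upper bound is >= value
        self.counts[bisect.bisect_left(self.buckets, value)] += 1
        self.total += 1

# Usage
requests = Counter(); requests.inc()   # one request served
cpu = Gauge(); cpu.set(0.8)            # CPU at 80%
latency = Histogram()
for seconds in (0.05, 0.2, 0.2, 3.0):
    latency.observe(seconds)
```

Because each type stores only numbers, aggregation (sum, average, percentile) stays cheap, which is exactly the storage-efficiency property from the table above.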
Logs#
Definition#
Logs are text records of individual events. They record in detail what happened at a specific point in time.
Why Are Logs Necessary?#
If metrics tell you “CPU is at 90%,” logs tell you “why it’s at 90%.”
Analogy: Dashcam Recording
When a car accident happens, knowing just the fact “a collision occurred” (= metrics) doesn’t reveal the cause. Watching the dashcam footage (= logs) reveals context like “the car ahead suddenly braked” or “ran a red light.”
The core value of logs is detailed context and debugging information. When an error occurs, you can trace “which user,” “with what request,” “sending what data” caused the problem.
Characteristics#
| Property | Description |
|---|---|
| Detailed | Contains context and details of events |
| Searchable | Can filter by keywords, patterns |
| Unstructured | Free-form text (structured logs recommended) |
| Storage cost | Large footprint due to text volume |
Log Levels#
```
DEBUG → Detailed info during development
INFO  → Normal operation records
WARN  → Potential problem warnings
ERROR → Error occurred
FATAL → System-stopping level errors
```

Structured Log Example#
```json
{
  "timestamp": "2026-01-12T10:30:00Z",
  "level": "ERROR",
  "service": "order-service",
  "trace_id": "abc123",
  "message": "Order creation failed",
  "error": "Out of stock",
  "order_id": "ORD-456",
  "user_id": "USR-789"
}
```

Advantages of Structured Logs:
- Easy field-by-field search/filtering
- Connect with distributed tracing via Trace ID
- Automatic parsing and dashboard creation possible
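A structured log entry like the example above can be produced with the standard library alone. This is a minimal sketch: the `JsonFormatter` class and the list of extra fields are illustrative, not a standard API.

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "order-service",
            "message": record.getMessage(),
        }
        # Pick up contextual fields passed through `extra=`
        for key in ("trace_id", "error", "order_id", "user_id"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

logger = logging.getLogger("order-service")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

logger.error("Order creation failed",
             extra={"trace_id": "abc123", "error": "Out of stock",
                    "order_id": "ORD-456", "user_id": "USR-789"})
```

Carrying `trace_id` as a first-class field is what makes the log-to-trace connection described later possible.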
When to Use?#
- Detailed error cause analysis
- Debugging and problem reproduction
- Audit records
- Abnormal pattern detection
Traces#
Definition#
Traces record the entire path a request takes through a distributed system. They track a single request as it passes through multiple services.
Why Are Traces Necessary?#
In a microservices environment, a single request can pass through 5, 10, or even 20 services. When you see a metric showing “response is slow,” how do you know which service caused the delay?
Analogy: Package Tracking System
When you order something online, you can see the path “distribution center → regional hub → delivery driver → delivered” through “delivery tracking.” If delivery is delayed, you can immediately see which segment it’s stuck in.
Traces are the “delivery tracking” for requests. They measure the time taken in each segment of the path from API Gateway → Order Service → Payment Service → Database. If the Payment Service segment took 2 seconds, you know exactly where to focus your analysis.
Core Concepts#
```mermaid
graph LR
    subgraph "Trace (Entire Request)"
        S1["Span 1<br>API Gateway<br>50ms"]
        S2["Span 2<br>Order Service<br>120ms"]
        S3["Span 3<br>Payment Service<br>200ms"]
        S4["Span 4<br>Database<br>80ms"]
    end
    S1 --> S2
    S2 --> S3
    S2 --> S4
```

| Term | Description |
|---|---|
| Trace | Entire path of one request (composed of multiple Spans) |
| Span | Single unit of work (includes start/end time, metadata) |
| Trace ID | Unique ID identifying the entire request |
| Span ID | ID identifying individual operations |
| Parent Span | Upper Span that called the current Span |
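The terms in the table above can be expressed as a minimal data structure. This is a conceptual sketch, not a real tracing SDK: the `Span` class and its fields are assumptions for illustration.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    name: str
    trace_id: str                         # shared by every Span in one request
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    parent_span_id: Optional[str] = None  # None means this is the root Span
    start: float = field(default_factory=time.monotonic)
    end: Optional[float] = None

    def finish(self):
        self.end = time.monotonic()

    def duration_ms(self) -> float:
        return (self.end - self.start) * 1000

# One request = one trace_id, propagated to every Span
trace_id = uuid.uuid4().hex
gateway = Span("API Gateway", trace_id)  # root Span (no parent)
order = Span("Order Service", trace_id, parent_span_id=gateway.span_id)
order.finish()
gateway.finish()
```

The parent/child links are what let a tracing backend reassemble the tree shown in the diagram above from individually reported Spans.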
Context Propagation#
The method of passing Trace ID between services.
HTTP Header example:

```
traceparent: 00-abc123-def456-01
```

W3C Trace Context format:

```
version-trace_id-span_id-flags
```

When to Use?#
- Identifying delay segments between microservices
- Visualizing the entire flow of specific requests
- Analyzing dependencies between services
- Identifying bottleneck points
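The traceparent header shown earlier splits into four dash-separated fields. Here is a minimal parsing sketch; `parse_traceparent` is a hypothetical helper, and production systems should rely on an OpenTelemetry propagator instead.

```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent value: version-trace_id-span_id-flags."""
    version, trace_id, span_id, flags = header.split("-")
    return {
        "version": version,
        "trace_id": trace_id,   # 32 hex chars in a spec-compliant header
        "span_id": span_id,     # 16 hex chars in a spec-compliant header
        "sampled": flags == "01",
    }

# The shortened example value from the text
ctx = parse_traceparent("00-abc123-def456-01")
```

Each service reads this header from the incoming request and writes the same `trace_id` (with its own new `span_id`) onto outgoing calls, which is what keeps the whole path stitched together.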
Connecting the Three Pillars#
Connection via Trace ID#
```mermaid
graph TD
    subgraph "Integrated Analysis Flow"
        M["Metrics<br>Error rate spike detected"]
        L["Logs<br>Filter by trace_id"]
        T["Traces<br>Visualize failure segment"]
    end
    M --> |"Check time range"| L
    L --> |"Extract trace_id"| T
    T --> |"After improvement"| M
```

Practical Example: Order Failure Analysis#
1. Detect Anomaly in Metrics

```promql
# Error rate exceeds 5%
sum(rate(http_requests_total{status="500"}[5m]))
  / sum(rate(http_requests_total[5m])) > 0.05
```

2. Search Logs for That Time Period

```
level:ERROR AND service:order-service AND timestamp:[2026-01-12T10:00 TO 2026-01-12T10:30]
```

Found trace_id: abc123 in results
3. Check Entire Path with Trace

```
Trace ID: abc123
├─ API Gateway (10ms) ✓
├─ Order Service (50ms) ✓
├─ Payment Service (2000ms) ✗ ← bottleneck
└─ Inventory Service (30ms) ✓
```

4. Conclusion: Payment Service external API call delay
Tool Selection Guide#
| Pillar | Open Source Tools | Cloud Services |
|---|---|---|
| Metrics | Prometheus, VictoriaMetrics | CloudWatch, Datadog |
| Logs | Loki, Elasticsearch | CloudWatch Logs, Splunk |
| Traces | Jaeger, Tempo | X-Ray, Datadog APM |
| Integration | OpenTelemetry | Datadog, New Relic |
OpenTelemetry unifies all three pillars into a single standard. For new projects, adopting OpenTelemetry is recommended.
Trade-offs#
| Pillar | Advantages | Disadvantages |
|---|---|---|
| Metrics | Storage efficient, alert-suitable | Lacks detailed context |
| Logs | Provides detailed context | High storage cost, analysis difficult |
| Traces | Understand distributed system flow | Complex implementation, sampling needed |
Cost Optimization Strategies#
- Metrics: Measure everything (inexpensive)
- Logs: Long-term retention only for ERROR and above, short-term for DEBUG
- Traces: Apply sampling (1-10% of total)
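Trace sampling at a fixed rate can be made consistent across services by hashing the trace ID, so every service reaches the same keep/drop decision for a given request. This is a simplified sketch; `should_sample` is a hypothetical helper, and OpenTelemetry's TraceIdRatioBased sampler applies the same idea in practice.

```python
import hashlib

def should_sample(trace_id: str, rate: float = 0.1) -> bool:
    """Keep roughly `rate` of traces; same trace_id always gets the same answer."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate

# Every service that sees the same trace_id makes the same decision,
# so sampled traces stay complete across service boundaries.
kept = sum(should_sample(f"trace-{i}", rate=0.1) for i in range(10_000))
```

Deciding up front from the trace ID (head sampling) is cheap; the trade-off is that you may drop the occasional interesting trace, which is why some systems instead buffer spans and sample after the fact (tail sampling).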
Key Summary#
| Question | Metrics | Logs | Traces |
|---|---|---|---|
| What? | Numerical data | Event records | Request paths |
| When? | State monitoring | Cause analysis | Flow tracking |
| Strength? | Trends, alerts | Detailed context | Distributed systems |
Next Steps#
| Recommended Order | Document | What You’ll Learn |
|---|---|---|
| 1 | Metrics Fundamentals | Counter, Gauge, Histogram types |
| 2 | Log Aggregation | Loki vs ELK comparison |
| 3 | Distributed Tracing | Span, Context Propagation |