Target Audience: Developers and SREs who need response time analysis
Prerequisites: Metrics Fundamentals, rate and increase
What You’ll Learn: Calculate accurate percentiles from Histograms and monitor SLAs
TL;DR#
Key Summary:
- histogram_quantile(φ, bucket): Calculate the φ-quantile (0 ≤ φ ≤ 1)
- P50: histogram_quantile(0.5, ...) - Median
- P95: histogram_quantile(0.95, ...) - 95% are at or below this value
- P99: histogram_quantile(0.99, ...) - 99% are at or below this value
- Must always be used with rate() or increase()
Why Percentiles Matter#
Averages are distorted by extreme values. Percentiles better reflect actual user experience.
graph LR
subgraph "Response Time Distribution"
A["90% users: 100ms"]
B["9% users: 200ms"]
C["1% users: 5000ms"]
end
AVG["Average: 158ms<br>❌ Distorted"]
P99["P99: 5000ms<br>✅ Reflects worst experience"]
| Metric | Value | Meaning |
|---|---|---|
| Average | 158ms | Distorted by extremes |
| P50 | 100ms | Half of requests are at or below |
| P95 | 200ms | 95% are at or below |
| P99 | 5000ms | 99% are at or below (1% wait 5 seconds) |
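The distortion can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes a hypothetical sample of 100 requests split 90/9/1 like the distribution above, and the variable names are made up for this example.

```python
# Hypothetical sample of 100 requests matching the 90% / 9% / 1% split.
latencies_ms = [100] * 90 + [200] * 9 + [5000] * 1

# Arithmetic mean: the single 5000ms request pulls it well above 100ms.
mean_ms = sum(latencies_ms) / len(latencies_ms)

# Share of requests that were actually faster than the mean.
faster_than_mean = sum(1 for v in latencies_ms if v < mean_ms) / len(latencies_ms)

print(f"mean = {mean_ms} ms")
print(f"share of requests faster than the mean = {faster_than_mean:.0%}")
```

The mean lands above what the large majority of requests experienced, which is why it says little about either the typical or the worst case.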
Histogram Structure Review#
A Histogram generates 3 types of time series.
# _bucket: Cumulative count per bucket (le = less than or equal)
http_request_duration_seconds_bucket{le="0.1"} 24054 # ≤ 0.1 seconds
http_request_duration_seconds_bucket{le="0.5"} 33444 # ≤ 0.5 seconds
http_request_duration_seconds_bucket{le="1"} 34022 # ≤ 1 second
http_request_duration_seconds_bucket{le="+Inf"} 34122 # All
# _count: Total observation count
http_request_duration_seconds_count 34122
# _sum: Sum of all values
http_request_duration_seconds_sum 2042.53
graph LR
subgraph "Bucket Structure (Cumulative)"
B1["le=0.1<br>24054"]
B2["le=0.5<br>33444"]
B3["le=1.0<br>34022"]
B4["le=+Inf<br>34122"]
end
B1 --> |"includes"| B2
B2 --> |"includes"| B3
B3 --> |"includes"| B4
histogram_quantile() Usage#
Basic Syntax#
histogram_quantile(
φ, # Percentile (value between 0-1)
bucket # _bucket time series (with rate applied)
)
Basic Examples#
# P50 (median)
histogram_quantile(0.5,
rate(http_request_duration_seconds_bucket[5m])
)
# P90
histogram_quantile(0.9,
rate(http_request_duration_seconds_bucket[5m])
)
# P95
histogram_quantile(0.95,
rate(http_request_duration_seconds_bucket[5m])
)
# P99
histogram_quantile(0.99,
rate(http_request_duration_seconds_bucket[5m])
)
Why rate() is Needed#
# ❌ Raw buckets are cumulative since process start; no time window is applied
histogram_quantile(0.99, http_request_duration_seconds_bucket)
# ✅ rate() yields the distribution of requests within the window
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
Percentiles by Group#
Must Keep le Label#
# P99 by service
histogram_quantile(0.99,
sum by (service, le) (
rate(http_request_duration_seconds_bucket[5m])
)
)
The le label must be preserved. Without le, histogram_quantile() cannot tell the buckets apart, so no percentile can be calculated.
# ❌ Aggregating without le discards the bucket boundaries
sum by (service) (rate(..._bucket[5m]))
# ✅ Aggregate including le
sum by (service, le) (rate(..._bucket[5m]))
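To see why le has to survive the aggregation, here is a small Python sketch that mimics the two aggregations on hypothetical data. The pod names, rates, and the sum_by helper are assumptions made for this illustration and are not part of Prometheus.

```python
from collections import defaultdict

# Hypothetical per-instance bucket rates: (service, instance, le) -> requests/sec.
rates = {
    ("api", "pod-1", "0.1"): 8.0,
    ("api", "pod-1", "0.5"): 9.5,
    ("api", "pod-1", "+Inf"): 10.0,
    ("api", "pod-2", "0.1"): 4.0,
    ("api", "pod-2", "0.5"): 5.5,
    ("api", "pod-2", "+Inf"): 6.0,
}

def sum_by(series, keep):
    """Sum values over series that share the label positions listed in keep."""
    out = defaultdict(float)
    for labels, value in series.items():
        out[tuple(labels[i] for i in keep)] += value
    return dict(out)

with_le = sum_by(rates, keep=(0, 2))   # like: sum by (service, le)
without_le = sum_by(rates, keep=(0,))  # like: sum by (service)

print(with_le)     # one value per (service, le): the buckets are still intact
print(without_le)  # one value per service: all buckets collapsed together
```

Keeping le leaves one cumulative count per bucket boundary, which is what a quantile estimate needs. Dropping it adds overlapping cumulative buckets into a single number that no longer describes a distribution.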
Various Grouping Examples#
# P99 by endpoint
histogram_quantile(0.99,
sum by (path, le) (
rate(http_request_duration_seconds_bucket[5m])
)
)
# P95 by service + endpoint
histogram_quantile(0.95,
sum by (service, path, le) (
rate(http_request_duration_seconds_bucket[5m])
)
)
# Overall system P99 (remove all labels except le)
histogram_quantile(0.99,
sum by (le) (
rate(http_request_duration_seconds_bucket[5m])
)
)
Practical Patterns#
SLA Monitoring#
# Check if P99 is below 500ms
histogram_quantile(0.99,
sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
) < 0.5
# Find services violating SLA
histogram_quantile(0.99,
sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
) > 0.5
Alert Rules#
# prometheus/rules/latency.yml
groups:
  - name: latency
    rules:
      - alert: HighP99Latency
        expr: |
          histogram_quantile(0.99,
            sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.service }} P99 latency is {{ $value | humanizeDuration }}"
Compare with Average Response Time#
# Average response time (aggregated across all series)
sum(rate(http_request_duration_seconds_sum[5m]))
/ sum(rate(http_request_duration_seconds_count[5m]))
# P99 response time
histogram_quantile(0.99,
sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
# P99 / Average ratio (degree of imbalance)
# Both sides are aggregated to a single series so their label sets match
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
/ (sum(rate(http_request_duration_seconds_sum[5m])) / sum(rate(http_request_duration_seconds_count[5m])))
Percentile Trend Comparison#
# P50 vs P99 difference (degree of long tail)
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
- histogram_quantile(0.5, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
Accuracy and Bucket Design#
Linear Interpolation#
histogram_quantile() assumes observations are spread evenly inside each bucket and interpolates linearly between the bucket boundaries, so the result is an estimate. Bucket design directly affects its accuracy.
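The interpolation can be sketched in Python. This is a simplified illustration, not Prometheus's actual implementation: it assumes 0 < q ≤ 1, buckets already sorted by upper bound, and a lowest bucket that starts at 0. For convenience it reuses the cumulative counts from the sample exposition in the Histogram Structure Review section, whereas a real query would operate on rate() output.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Simplified sketch: assumes 0 < q <= 1, buckets sorted by upper bound,
    the last bucket is +Inf, and the lowest bucket starts at 0.
    """
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total  # number of observations at or below the quantile
    # First bucket whose cumulative count reaches the rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[idx]
    if math.isinf(upper):
        # Quantile lies beyond the largest finite bound: report that bound.
        return buckets[-2][0]
    lower, below = buckets[idx - 1] if idx > 0 else (0.0, 0.0)
    if count == below:
        return upper
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)

# Cumulative counts taken from the sample exposition in this document.
buckets = [(0.1, 24054), (0.5, 33444), (1.0, 34022), (math.inf, 34122)]
for q in (0.5, 0.95, 0.99, 0.999):
    print(q, round(histogram_quantile(q, buckets), 3))
```

With these counts the P99 estimate comes out at about 0.79 seconds, a point inside the 0.5 to 1 second bucket chosen purely by interpolation. The 0.999 quantile falls into the +Inf bucket, so the sketch returns the largest finite bound, 1.0.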
graph LR
subgraph "Wide Buckets"
W1["le=0.1: 100"]
W2["le=1.0: 500"]
end
WR["P99 ≈ 0.99s<br>(True value anywhere in 0.1-1.0s, Large error)"]
subgraph "Fine-grained Buckets"
N1["le=0.1: 100"]
N2["le=0.25: 200"]
N3["le=0.5: 350"]
N4["le=0.75: 450"]
N5["le=1.0: 500"]
end
NR["P99 ≈ 0.98s<br>(True value within 0.75-1.0s, Small error)"]
Bucket Design Recommendations#
// Spring Boot + Micrometer
import io.micrometer.core.instrument.Timer;
import java.time.Duration;
Timer.builder("http_request_duration_seconds")
.publishPercentileHistogram()
// serviceLevelObjectives() replaces sla(), deprecated since Micrometer 1.5
.serviceLevelObjectives(
Duration.ofMillis(10), // Very fast response
Duration.ofMillis(50), // Fast response
Duration.ofMillis(100), // Normal target
Duration.ofMillis(250), // Slow response starts
Duration.ofMillis(500), // SLA threshold
Duration.ofSeconds(1), // Slow response
Duration.ofSeconds(5) // Near timeout
)
.register(registry);
Bucket Design Principles:
- Concentrate buckets near SLA thresholds
- Cover 90%+ of expected distribution
- Consider cardinality (bucket count × label combinations)
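The cardinality principle is easy to underestimate, so a rough back-of-the-envelope calculation helps. Every number in this sketch is hypothetical (7 finite bucket boundaries, 5 services, 20 paths, 3 status codes). It only shows how bucket count and label combinations multiply.

```python
# All numbers are hypothetical; substitute your own.
finite_buckets = 7                # explicit bucket boundaries
label_combinations = 5 * 20 * 3   # services x paths x status codes (assumed)

# Each label combination exposes every finite bucket, the +Inf bucket,
# plus one _sum and one _count series.
series_per_combination = finite_buckets + 1 + 2
total_series = series_per_combination * label_combinations

print(total_series)
```

Adding one more bucket boundary in this scenario adds one series per label combination, so bucket granularity and label choice have to be weighed together.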
Common Mistakes#
1. Missing le Label#
# ❌ Aggregate without le
histogram_quantile(0.99,
sum by (service) (rate(http_request_duration_seconds_bucket[5m]))
)
# Result: no usable data (series without an le label cannot be treated as buckets)
# ✅ Include le
histogram_quantile(0.99,
sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
)
2. Missing rate()#
# ❌ Used without rate
histogram_quantile(0.99, http_request_duration_seconds_bucket)
# Result: Based on all observations since the counters last reset (usually process start)
# ✅ Apply rate
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
# Result: Based on last 5 minutes
3. Exceeding Bucket Range#
# When buckets only go up to le=1
# If actual P99 is 2 seconds...
histogram_quantile(0.99, ...)
# Result: 1 (upper bound of the largest finite bucket) - inaccurate
# Solution: Add larger buckets that cover the real tail
4. +Inf Display in Grafana#
If you see +Inf values in Grafana graphs, too many requests are landing beyond the largest finite bucket. Check how much of the traffic the finite buckets actually cover.
# Check bucket coverage
http_request_duration_seconds_bucket{le="1"}
/ http_request_duration_seconds_bucket{le="+Inf"}
# Above 0.95 indicates good bucket design
Native Histogram (Prometheus 2.40+)#
Native Histograms manage their buckets automatically. The feature is still experimental, but it removes most of the manual bucket design work.
# With Native Histogram, query the metric directly
histogram_quantile(0.99, rate(http_request_duration_seconds[5m]))
# _bucket suffix not needed
Key Takeaways#
| Percentile | Query | Meaning |
|---|---|---|
| P50 | histogram_quantile(0.5, ...) | Median |
| P90 | histogram_quantile(0.9, ...) | 90% at or below |
| P95 | histogram_quantile(0.95, ...) | 95% at or below |
| P99 | histogram_quantile(0.99, ...) | 99% at or below |
Complete Query Template:
histogram_quantile(
0.99,
sum by (service, le) (
rate(http_request_duration_seconds_bucket[5m])
)
)
Next Steps#
| Recommended Order | Document | What You’ll Learn |
|---|---|---|
| 1 | Recording Rules | Pre-compute complex queries |
| 2 | SRE Golden Signals - Latency | Latency monitoring strategy |