Target Audience: Developers who know basic PromQL syntax Prerequisites: Basic Syntax What You’ll Learn: Aggregate multiple time series to calculate meaningful metrics

TL;DR#

Key Summary:

  • sum: Total - total request count
  • avg: Average - average response time
  • count: Count - number of active instances
  • topk/bottomk: Top/bottom N items
  • by/without: Specify grouping criteria

Aggregation Operators List#

OperatorDescriptionReturns
sumSumSingle value or per-group values
avgAverageSingle value or per-group values
minMinimumSingle value or per-group values
maxMaximumSingle value or per-group values
countNumber of time seriesSingle value or per-group values
stddevStandard deviationSingle value or per-group values
stdvarVarianceSingle value or per-group values
topkTop K itemsK time series
bottomkBottom K itemsK time series
count_valuesCount by valueOne time series per value
quantileQuantileSingle value or per-group values

sum (Total)#

Calculates the sum of all time series values.

Basic Usage#

# Total HTTP requests (all instances, all statuses)
sum(http_requests_total)

# Total requests per second
sum(rate(http_requests_total[5m]))

Aggregation by Group#

# Request count by status code
sum by (status) (http_requests_total)

# Result:
# {status="200"} 15000
# {status="404"} 500
# {status="500"} 100

# Request count by service and status
sum by (service, status) (rate(http_requests_total[5m]))

Practical Examples#

# Total error count
sum(rate(http_requests_total{status=~"5.."}[5m]))

# Error rate by service
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
/ sum by (service) (rate(http_requests_total[5m]))

avg (Average)#

Calculates the average of all time series values.

Basic Usage#

# Overall average CPU usage
avg(node_cpu_seconds_total{mode="idle"})

# Average memory usage per instance
avg by (instance) (
  node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
) / avg by (instance) (node_memory_MemTotal_bytes) * 100

Practical Examples#

# Average response time by service
avg by (service) (
  rate(http_request_duration_seconds_sum[5m])
  / rate(http_request_duration_seconds_count[5m])
)

# Cluster average CPU usage
100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100

count (Count)#

Counts the number of time series.

Basic Usage#

# Number of monitored targets
count(up)

# Number of active targets (up=1)
count(up == 1)

# Number of down targets
count(up == 0)

Aggregation by Group#

# Number of instances per service
count by (job) (up)

# Result:
# {job="api-server"} 5
# {job="web-server"} 3
# {job="database"} 2

Practical Examples#

# Number of services with errors
count(
  sum by (service) (rate(http_requests_total{status=~"5.."}[5m])) > 0
)

# Number of endpoints exceeding SLA response time
count(
  histogram_quantile(0.99, sum by (le, path) (rate(http_request_duration_seconds_bucket[5m])))
  > 0.5
)

min / max (Minimum / Maximum)#

Basic Usage#

# Lowest memory usage
min(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100)

# Highest CPU usage
max(100 - rate(node_cpu_seconds_total{mode="idle"}[5m]) * 100)

Aggregation by Group#

# Maximum response time by service
max by (service) (
  rate(http_request_duration_seconds_sum[5m])
  / rate(http_request_duration_seconds_count[5m])
)

topk / bottomk (Top / Bottom K)#

Basic Usage#

# Top 5 endpoints with most requests
topk(5, sum by (path) (rate(http_requests_total[5m])))

# Top 10 instances with highest memory usage
topk(10, node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)

# Bottom 3 disks with most available space
bottomk(3, node_filesystem_avail_bytes)

Practical Examples#

# Top 5 services with highest error rate
topk(5,
  sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
  / sum by (service) (rate(http_requests_total[5m]))
)

# Top 3 endpoints with longest latency
topk(3,
  histogram_quantile(0.99,
    sum by (le, path) (rate(http_request_duration_seconds_bucket[5m]))
  )
)

by / without (Grouping)#

by: Keep only specified labels#

# Aggregate keeping only status label
sum by (status) (http_requests_total)

# Keep multiple labels
sum by (service, status) (http_requests_total)

without: Keep all labels except specified ones#

# Aggregate excluding instance label
sum without (instance) (http_requests_total)

# Exclude multiple labels
sum without (instance, pod) (http_requests_total)

by vs without#

# These two queries produce identical results
# Labels: {service, instance, status}

# Using by
sum by (service, status) (http_requests_total)

# Using without
sum without (instance) (http_requests_total)

Selection Criteria:

  • Few labels to keep → Use by
  • Few labels to remove → Use without

count_values (Count by Value)#

Counts how many times specific values appear.

# Number of instances by version
count_values("version", app_version)

# Result:
# {version="1.0.0"} 5
# {version="1.1.0"} 3
# {version="2.0.0"} 2

quantile (Quantile)#

Calculates a specific percentile among values.

# P90 CPU usage across all instances
quantile(0.9, node_cpu_seconds_total)

# P50 response time by service
quantile by (service) (0.5,
  rate(http_request_duration_seconds_sum[5m])
  / rate(http_request_duration_seconds_count[5m])
)
quantile() is used for Gauge or calculated values. To calculate percentiles from Histogram bucket data, use histogram_quantile().

Practical Patterns#

Error Rate Calculation#

# Overall error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))
* 100

# Error rate by service
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
/ sum by (service) (rate(http_requests_total[5m]))
* 100

Availability Calculation#

# Success rate (availability)
sum(rate(http_requests_total{status!~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))
* 100

# Availability by service
(1 - sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
     / sum by (service) (rate(http_requests_total[5m])))
* 100

Resource Utilization#

# Overall memory usage
(sum(node_memory_MemTotal_bytes) - sum(node_memory_MemAvailable_bytes))
/ sum(node_memory_MemTotal_bytes) * 100

# Disk usage per instance
(sum by (instance) (node_filesystem_size_bytes)
 - sum by (instance) (node_filesystem_avail_bytes))
/ sum by (instance) (node_filesystem_size_bytes) * 100

Traffic Analysis#

# Traffic percentage by endpoint
sum by (path) (rate(http_requests_total[5m]))
/ sum(rate(http_requests_total[5m]))
* 100

# Top 10 endpoints
topk(10, sum by (path) (rate(http_requests_total[5m])))

Key Takeaways#

OperatorPurposeExample
sumTotalTotal request count
avgAverageAverage CPU usage
countCountNumber of instances
min/maxExtremesMaximum memory
topkTop NTop 5 traffic
byGroupingAggregate by service
withoutExclude labelsAggregate excluding instance

Next Steps#

Recommended OrderDocumentWhat You’ll Learn
1rate and increaseCore concepts for handling Counters
2histogram_quantileCalculate P99 response time