Target Scenario: Prometheus memory/storage spike, slow queries
Goal: Optimize costs by reducing unnecessary time series
Duration: 30 minutes to 1 hour (depending on analysis and fix complexity)
Success Criteria: Time series count reduced below target and memory usage stabilized

Problem Scenario

Alert: PrometheusHighCardinality
Active Series: 2,500,000 (Threshold: 1,000,000)
Memory Usage: 32GB

What is Cardinality?

Cardinality = Number of unique time series

http_requests_total{method="GET", status="200", path="/api/users"}     # 1 series
http_requests_total{method="GET", status="200", path="/api/users/123"} # Another 1!
http_requests_total{method="GET", status="200", path="/api/users/456"} # Another 1!

Problem: If the user_id is embedded in the path label, a new time series is created for every user.
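
To make the counting rule concrete, here is a minimal sketch in plain Java (no Prometheus client involved). It treats each distinct label set as one series, using the three example series above plus one repeat. The class and method names are hypothetical and exist only for illustration.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class SeriesCounter {
    /** Counts distinct label sets; each distinct set is one time series. */
    static int countSeries(List<Map<String, String>> labelSets) {
        // Map equality is content-based, so a repeated label set is stored once.
        Set<Map<String, String>> unique = new HashSet<>(labelSets);
        return unique.size();
    }

    public static void main(String[] args) {
        List<Map<String, String>> observed = List.of(
            Map.of("method", "GET", "status", "200", "path", "/api/users"),
            Map.of("method", "GET", "status", "200", "path", "/api/users/123"),
            Map.of("method", "GET", "status", "200", "path", "/api/users/456"),
            // Same labels as the second entry: no new series
            Map.of("method", "GET", "status", "200", "path", "/api/users/123"));
        System.out.println(countSeries(observed)); // prints 3
    }
}
```

Every new user ID in the path adds one more entry to the set, which is why the count grows with the user base rather than with the number of endpoints.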

Step 1: Cardinality Analysis

Current Time Series Count

# Total time series count (scans every series; the prometheus_tsdb_head_series gauge is cheaper)
count({__name__=~".+"})

# Time series count by metric
count by (__name__) ({__name__=~".+"})

# Top 10 metrics
topk(10, count by (__name__) ({__name__=~".+"}))

Cardinality by Label

# Unique value count by label
count(count by (path) (http_requests_total))

# Series count per path value (a long list of one-off values signals a high cardinality label)
count by (path) (http_requests_total)

Check TSDB Status

# Lists the top metric names and label names by series count and value count
curl -s http://localhost:9090/api/v1/status/tsdb | jq .

Step 2: Identify Problem Labels

Dangerous Patterns

| Pattern | Problem | Expected Time Series |
|---|---|---|
| user_id | One per user | Hundreds of thousands+ |
| request_id | One per request | Unbounded |
| timestamp | One per timestamp value | Unbounded |
| session_id | One per session | Tens of thousands+ |
| URL path (with ID) | One per ID | Tens of thousands+ |

Safe Patterns

| Pattern | Expected Time Series |
|---|---|
| method (GET/POST) | ~5 |
| status (2xx/4xx/5xx) | ~5 |
| service | ~100 |
| endpoint (normalized) | ~100 |

Step 3: Solutions

1. Fix in Application

// ❌ Bad: Dynamic path
Timer.builder("http_requests")
    .tag("path", "/users/123")  // user_id included
    .register(registry);

// ✅ Good: Normalized path
Timer.builder("http_requests")
    .tag("path", "/users/{id}")  // Pattern-based
    .register(registry);
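
When the framework's route pattern is not available at the point where the tag is set, the raw path can be normalized before it becomes a label value. The following is a rough sketch, assuming IDs appear as purely numeric path segments or UUIDs. `PathNormalizer` and its placeholders are hypothetical names, not part of Micrometer.

```java
import java.util.regex.Pattern;

final class PathNormalizer {
    // Assumption: an ID is a whole path segment made of digits only...
    private static final Pattern NUMERIC = Pattern.compile("(?<=/)\\d+(?=/|$)");
    // ...or a whole segment in canonical UUID form.
    private static final Pattern UUID = Pattern.compile(
        "(?<=/)[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}(?=/|$)");

    /** Replaces ID-like segments with fixed placeholders so the label stays bounded. */
    static String normalize(String path) {
        String withoutUuids = UUID.matcher(path).replaceAll("{uuid}");
        return NUMERIC.matcher(withoutUuids).replaceAll("{id}");
    }

    public static void main(String[] args) {
        System.out.println(normalize("/users/123"));            // /users/{id}
        System.out.println(normalize("/users/123/orders/456")); // /users/{id}/orders/{id}
        System.out.println(normalize("/v2/users"));             // /v2/users (unchanged)
    }
}
```

The result would then be passed to `.tag("path", ...)` in place of the raw request path. Routes with other ID shapes (slugs, hashes) would need their own patterns.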

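Spring Boot Configuration

// Assumes Spring Boot 2.x; WebMvcTagsContributor declares two methods, so a lambda cannot implement it
@Configuration
public class MetricsConfig {
    @Bean
    public WebMvcTagsContributor webMvcTagsContributor() {
        return new WebMvcTagsContributor() {
            @Override
            public Iterable<Tag> getTags(HttpServletRequest request, HttpServletResponse response,
                                         Object handler, Throwable exception) {
                // Tag with the matched route pattern (e.g. /users/{id}), never the raw path
                Object pattern = request.getAttribute(HandlerMapping.BEST_MATCHING_PATTERN_ATTRIBUTE);
                return Tags.of("uri", pattern != null ? pattern.toString() : "UNKNOWN");
            }

            @Override
            public Iterable<Tag> getLongRequestTags(HttpServletRequest request, Object handler) {
                return Tags.empty();
            }
        };
    }
}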

2. Prometheus Relabeling

# prometheus.yml
scrape_configs:
  - job_name: 'app'
    metric_relabel_configs:
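      # Remove specific labels. The remaining labels must still identify
      # each series uniquely, or samples within a scrape will collide.
      - action: labeldrop
        regex: 'user_id|request_id|session_id'

      # Exclude specific metrics
      - source_labels: [__name__]
        regex: 'go_.*'
        action: drop

      # Exclude a high cardinality metric entirely.
      # A metric dropped here is never ingested, so recording rules on this
      # server cannot aggregate it; drop only metrics you do not query.
      - source_labels: [__name__]
        regex: 'http_requests_total'
        action: drop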

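3. Aggregate with Recording Rules

# Original: High cardinality
# http_requests_total{method, status, path, instance, pod}

# Recording rule: keep only the labels that queries need.
# The raw series are still stored; the rule adds a small aggregated series for dashboards and alerts.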
groups:
  - name: aggregation
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job, method, status) (rate(http_requests_total[5m]))
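
The effect of `sum by (job, method, status)` on series count can be illustrated with a small stand-alone sketch. It only mimics the grouping step on in-memory samples. It does not model `rate()` or any Prometheus internals, and `SumBy` and `Sample` are hypothetical names.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class SumBy {
    /** One sample: a label set and its current value. */
    static final class Sample {
        final Map<String, String> labels;
        final double value;

        Sample(Map<String, String> labels, double value) {
            this.labels = labels;
            this.value = value;
        }
    }

    /** Groups samples on the kept label names and adds their values. */
    static Map<Map<String, String>, Double> sumBy(List<Sample> samples, List<String> keep) {
        Map<Map<String, String>, Double> out = new HashMap<>();
        for (Sample s : samples) {
            Map<String, String> key = new TreeMap<>();
            for (String name : keep) {
                String v = s.labels.get(name);
                if (v != null) key.put(name, v); // labels not listed in keep are discarded
            }
            out.merge(key, s.value, Double::sum);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Sample> raw = List.of(
            new Sample(Map.of("job", "app", "method", "GET", "status", "200", "path", "/a", "pod", "p1"), 1.0),
            new Sample(Map.of("job", "app", "method", "GET", "status", "200", "path", "/b", "pod", "p1"), 2.0),
            new Sample(Map.of("job", "app", "method", "GET", "status", "200", "path", "/a", "pod", "p2"), 3.0),
            new Sample(Map.of("job", "app", "method", "POST", "status", "500", "path", "/a", "pod", "p1"), 4.0));
        // Four raw series collapse to two once path and pod are aggregated away
        System.out.println(sumBy(raw, List.of("job", "method", "status")).size()); // prints 2
    }
}
```

The aggregated series count depends only on the kept labels, so it stays flat even as new paths and pods appear.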

4. Limit Collection Itself

# prometheus.yml
scrape_configs:
  - job_name: 'app'
    sample_limit: 10000  # If a scrape exceeds this after relabeling, the whole scrape fails
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'go_.*|process_.*'  # Unnecessary metrics
        action: drop

Step 4: Verification

Before/After Comparison

# Time series count change
count({__name__=~".+"})

# Memory usage
process_resident_memory_bytes{job="prometheus"}

# Query time
prometheus_engine_query_duration_seconds

Prevention Guidelines

Label Design Principles

  1. Label values must come from a small, finite set

    • ✅ method=GET|POST|PUT|DELETE
    • ❌ user_id=12345
  2. Estimate label combinations before adding a label

    Worst-case time series = Metric count × (values of Label1) × (values of Label2) × ...
    
    Example: 100 metrics × 5 methods × 5 statuses × 100 paths = 250,000
  3. Keep dynamic values out of labels; record them in logs instead

    // ❌ As label: one series per order
    Counter.builder("orders").tag("order_id", orderId).register(registry);
    
    // ✅ As log
    log.info("Order created: {}", orderId);
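
The estimate in principle 2 is simple enough to check in code before a new label ships. This is a minimal sketch of that worst-case product; `CardinalityEstimate` is a hypothetical helper name, and the second call uses an assumed 100,000 distinct user IDs purely as an illustration.

```java
final class CardinalityEstimate {
    /** Worst case: every metric carries every combination of label values. */
    static long upperBound(long metricCount, long... valuesPerLabel) {
        long total = metricCount;
        for (long n : valuesPerLabel) {
            total = Math.multiplyExact(total, n); // throws on overflow instead of wrapping
        }
        return total;
    }

    public static void main(String[] args) {
        // 100 metrics x 5 methods x 5 statuses x 100 normalized paths
        System.out.println(upperBound(100, 5, 5, 100));     // prints 250000
        // Same metrics if the path label carried 100,000 distinct user IDs (assumed figure)
        System.out.println(upperBound(100, 5, 5, 100_000)); // prints 250000000
    }
}
```

Real deployments rarely hit the full product because not every combination occurs, so the figure is best read as an upper bound.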

Code Review Checklist

  • No dynamic label values?
  • Label combination count estimated?
  • No ID or timestamp in any label?
  • URL labels use route patterns (e.g. /users/{id})?
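
Parts of this checklist can be automated, for example in a unit test that inspects the tags a meter is registered with. The sketch below is one possible shape, under stated assumptions: the denylist is taken from the dangerous patterns table, a "raw ID" is a numeric path segment, and `LabelLint` is a hypothetical helper rather than an existing tool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.regex.Pattern;

final class LabelLint {
    // Assumed denylist of label names whose values are unbounded
    private static final Set<String> DENIED_NAMES =
        Set.of("user_id", "request_id", "session_id", "order_id", "timestamp");
    // Assumed shape of a raw ID inside a URL: a numeric segment after a slash
    private static final Pattern RAW_ID = Pattern.compile("/\\d+(/|$)");

    /** Returns one message per suspicious label, ordered by label name. */
    static List<String> violations(Map<String, String> labels) {
        List<String> found = new ArrayList<>();
        for (Map.Entry<String, String> e : new TreeMap<>(labels).entrySet()) {
            if (DENIED_NAMES.contains(e.getKey())) {
                found.add(e.getKey() + ": unbounded label name");
            } else if (RAW_ID.matcher(e.getValue()).find()) {
                found.add(e.getKey() + ": value contains a raw ID");
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(violations(Map.of("method", "GET", "path", "/users/{id}")));
        System.out.println(violations(Map.of("path", "/users/123", "user_id", "42")));
    }
}
```

A check like this only catches the patterns it is told about, so it supplements the review questions rather than replacing them.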

Quick Reference

| Situation | Solution |
|---|---|
| user_id label | Remove or use logs |
| ID in URL | Pattern-based (/users/{id}) |
| Unnecessary metrics | action: drop |
| High cardinality metrics | Aggregate with Recording Rule |