Target Audience: Developers designing Prometheus metrics for the first time
Prerequisites: Three Pillars of Observability
After Reading: You’ll be able to select the appropriate metric type and implement it correctly

TL;DR#

Key Summary:

  • Counter: Only cumulative increase (request count, error count) → Calculate rate of change with rate()
  • Gauge: Current value (temperature, memory) → Use as-is or avg() for average
  • Histogram: Distribution measurement (response time) → Calculate percentiles with histogram_quantile()
  • Summary: Percentile calculation on client (rarely used)

Why Are Metric Types Important?#

Metric types are not just a technical choice. They reflect the nature of the data.

Analogy: Thermometer vs Pedometer

  • A thermometer shows the current temperature. It might read 25°C yesterday and 30°C today → This is a Gauge
  • A pedometer accumulates steps. Walking 5000 steps yesterday and 3000 today totals 8000 steps → This is a Counter

What if you “accumulated” temperature like a pedometer? Morning 20°C, noon 25°C, evening 18°C would give “63°C” - a meaningless number. Conversely, if you viewed step count as a “current value” like a thermometer, you’d only know “I just walked 3 steps” without knowing how many steps you walked today.

Consequences of Wrong Choices#

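| Wrong Choice | Result |
| --- | --- |
| Request count as Gauge | Signals a snapshot rather than a running total; any decrease is indistinguishable from a restart, so rate() and increase() results can’t be trusted |
| CPU usage as Counter | A counter cannot go down, and rate() treats every drop as a reset, producing meaningless values |
| Response time as Counter | Can’t calculate average/percentiles |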

Counter#

Definition#

A monotonically increasing cumulative value. Can only increase or reset to 0.

Why Are Counters Necessary?#

To answer questions like “How many orders came in today?” or “How many errors occurred this month?” you need cumulative data.

Analogy: Car Odometer

A car odometer never decreases. It doesn’t go back from 100,000 km to 90,000 km. Instead, if you want to know “how much did I drive this week?” you calculate the difference between Monday’s mileage and Sunday’s mileage.

Counters work the same way. The raw value (1 million requests) has little meaning - you need to calculate rate of change over time with rate() or increase() to make it useful.

Characteristics#

graph LR
    subgraph "Counter Behavior"
        A["Start: 0"] --> B["Request 1: 1"]
        B --> C["Request 2: 2"]
        C --> D["Request 3: 3"]
        D --> E["Restart: 0"]
        E --> F["Request 4: 1"]
    end

| Property | Description |
| --- | --- |
| Monotonically increasing | Value never decreases |
| Resettable | Resets to 0 on process restart |
| rate() required | Rate of change is more meaningful than raw value |

Usage Example#

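// Spring Boot + Micrometer
// Order, OrderRequest, and OrderService are application classes assumed to exist.
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class OrderController {
    private final OrderService orderService;
    private final Counter orderCounter;

    public OrderController(OrderService orderService, MeterRegistry registry) {
        this.orderService = orderService;
        this.orderCounter = Counter.builder("orders_total")
            .description("Total number of orders")
            .tag("status", "created")
            .register(registry);
    }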

    @PostMapping("/orders")
    public Order createOrder(@RequestBody OrderRequest request) {
        Order order = orderService.create(request);
        orderCounter.increment();  // Increment by 1
        return order;
    }
}

Metric Output:

orders_total{status="created"} 1523

PromQL Usage#

# Raw value (rarely useful on its own - just the cumulative total)
orders_total

# Orders per second (averaged over the last 5 minutes)
rate(orders_total[5m])

# Orders created in the last 5 minutes
increase(orders_total[5m])

# Orders created in the last hour
increase(orders_total[1h])
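
To make the reset handling behind rate() and increase() more concrete, here is a minimal Java sketch. It is a simplification under stated assumptions: Prometheus also extrapolates to the edges of the time window, which this sketch ignores, and the names `CounterMath`, `increase`, and `ratePerSecond` are hypothetical rather than part of any library.

```java
// Simplified sketch of reset-aware counter math; not Prometheus's actual implementation.
final class CounterMath {
    private CounterMath() {}

    /** Sums the growth between consecutive samples, treating any drop as a reset to 0. */
    static double increase(double[] samples) {
        double total = 0.0;
        for (int i = 1; i < samples.length; i++) {
            double prev = samples[i - 1];
            double curr = samples[i];
            // A lower value means the process restarted, so count everything since 0.
            total += (curr >= prev) ? (curr - prev) : curr;
        }
        return total;
    }

    /** Average per-second rate over the window (no boundary extrapolation). */
    static double ratePerSecond(double[] samples, double windowSeconds) {
        return increase(samples) / windowSeconds;
    }

    public static void main(String[] args) {
        // The counter restarts after reaching 3, like the diagram under "Characteristics".
        double[] samples = {0, 1, 2, 3, 0, 1};
        System.out.println(increase(samples));           // 4.0
        System.out.println(ratePerSecond(samples, 300)); // 4 events spread over 5 minutes
    }
}
```

Because the drop from 3 to 0 is treated as a restart rather than a negative change, the four real events are all counted even though the raw value ends at 1.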

Naming Convention#

# Recommended: _total suffix
http_requests_total
orders_created_total
errors_total

# Not recommended
http_requests_count  # _count is for Histogram/Summary internal use

When to Use?#

  • Request/event counts
  • Error occurrence counts
  • Processed bytes
  • Completed task counts

Gauge#

Definition#

A current state value. Can increase or decrease.

Characteristics#

graph LR
    subgraph "Gauge Behavior"
        A["Memory: 50%"] --> B["Increase: 70%"]
        B --> C["Decrease: 45%"]
        C --> D["Increase: 80%"]
    end

| Property | Description |
| --- | --- |
| Bidirectional | Can increase/decrease |
| Snapshot | State at a specific point in time |
| Direct use | Meaningful without rate() |

Usage Example#

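// Current number of requests being processed
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import java.util.concurrent.atomic.AtomicInteger;
import org.springframework.stereotype.Component;

@Component
public class RequestGauge {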
    private final AtomicInteger inProgress = new AtomicInteger(0);

    public RequestGauge(MeterRegistry registry) {
        Gauge.builder("http_requests_in_progress", inProgress, AtomicInteger::get)
            .description("Requests currently being processed")
            .register(registry);
    }

    public void requestStarted() {
        inProgress.incrementAndGet();
    }

    public void requestFinished() {
        inProgress.decrementAndGet();
    }
}

Metric Output:

http_requests_in_progress 42

PromQL Usage#

# Current value
http_requests_in_progress

# Average (across multiple instances)
avg(http_requests_in_progress)

# Maximum value
max(http_requests_in_progress)

# Per-second rate of change over the last 5 minutes (useful for debugging trends)
deriv(http_requests_in_progress[5m])
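
As a rough illustration of what deriv() computes, the following Java sketch fits a least-squares line through gauge samples and returns its slope per second. Prometheus documents deriv() as using simple linear regression; the class and method names here (`GaugeTrend`, `slopePerSecond`) are hypothetical, and the sketch makes no attempt to match Prometheus's handling of edge cases.

```java
// Least-squares slope of gauge samples, similar in spirit to PromQL's deriv().
final class GaugeTrend {
    private GaugeTrend() {}

    /**
     * Returns the per-second slope of the best-fit line through (timestamp, value) points.
     * Timestamps are in seconds; both arrays must have the same length (at least 2),
     * and the timestamps must not all be identical.
     */
    static double slopePerSecond(double[] timestamps, double[] values) {
        int n = timestamps.length;
        if (n < 2 || values.length != n) {
            throw new IllegalArgumentException("need at least two matching samples");
        }
        double sumT = 0.0;
        double sumV = 0.0;
        for (int i = 0; i < n; i++) {
            sumT += timestamps[i];
            sumV += values[i];
        }
        double meanT = sumT / n;
        double meanV = sumV / n;

        double covariance = 0.0;
        double variance = 0.0;
        for (int i = 0; i < n; i++) {
            double dt = timestamps[i] - meanT;
            covariance += dt * (values[i] - meanV);
            variance += dt * dt;
        }
        return covariance / variance;
    }

    public static void main(String[] args) {
        double[] timestamps = {0, 60, 120}; // one sample per minute
        double[] inProgress = {40, 46, 52}; // gauge climbing by 6 per minute
        System.out.println(slopePerSecond(timestamps, inProgress)); // about 0.1 per second
    }
}
```

A positive slope means the gauge is trending upward over the window, and a negative slope means it is draining.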

When to Use?#

  • CPU/memory usage
  • Current connection count
  • Queue size
  • Physical measurements like temperature, speed
  • Configuration values (version info, etc.)

Histogram#

Definition#

Measures value distribution in buckets (ranges). Used when distribution matters, like response times or request sizes.

Why Are Histograms Necessary?#

“Average response time 200ms” might look good, but in reality 99% could be 50ms while 1% is 15 seconds. Averages are easily skewed by outliers.

Analogy: Test Score Distribution

When a class average is 70 points, two situations are possible:

  1. Most students between 65-75 points → Tightly clustered
  2. Half at 40 points, half at 100 points → Polarized

Both situations look the same by average, but the distribution is completely different. Histograms show distribution like “80% under 100ms,” “95% under 500ms,” “99% under 1 second.”

This is why percentiles like P50, P95, P99 are important. If P99 is 2 seconds, it means “roughly 1 in 100 requests takes 2 seconds or longer.”
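
The response-time example above (99% at 50ms, 1% at 15 seconds) can be checked with a few lines of Java. This is an illustrative sketch only: `LatencyStats` is a hypothetical helper, and it uses the nearest-rank percentile definition, which is one of several common definitions.

```java
import java.util.Arrays;

// Illustrative only: compares the mean with nearest-rank percentiles.
final class LatencyStats {
    private LatencyStats() {}

    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Nearest-rank percentile; percent is a whole number from 1 to 100. */
    static double percentile(double[] values, int percent) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (percent * sorted.length + 99) / 100; // ceil(percent * n / 100)
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        double[] latenciesMs = new double[100];
        Arrays.fill(latenciesMs, 0, 99, 50.0); // 99% of requests: 50 ms
        latenciesMs[99] = 15_000.0;            // 1% of requests: 15 s

        System.out.println(mean(latenciesMs));            // 199.5
        System.out.println(percentile(latenciesMs, 50));  // 50.0
        System.out.println(percentile(latenciesMs, 100)); // 15000.0
    }
}
```

The mean of roughly 200ms describes almost no real request: the typical request takes 50ms and the slowest takes 15 seconds, which is exactly the information a distribution preserves and an average hides.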

Characteristics#

graph TD
    subgraph "Histogram Buckets"
        B1["le=0.1: 100"]
        B2["le=0.5: 350"]
        B3["le=1.0: 480"]
        B4["le=+Inf: 500"]
    end

    B1 --> |"0~0.1s"| N1["100 requests"]
    B2 --> |"0~0.5s"| N2["350 requests"]
    B3 --> |"0~1.0s"| N3["480 requests"]
    B4 --> |"Total"| N4["500 requests"]

| Component | Description |
| --- | --- |
| _bucket | Cumulative count of observations less than or equal to each le boundary |
| _count | Total observation count |
| _sum | Sum of all values |
| le (label) | Less than or Equal |
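
To show how the cumulative `_bucket`, `_count`, and `_sum` values relate to individual observations, here is a small self-contained Java sketch. `SimpleHistogram` is a hypothetical class written for this explanation, not the Micrometer or Prometheus client implementation.

```java
// Minimal cumulative histogram: each observation increments every bucket whose
// upper bound (le) is greater than or equal to the observed value.
final class SimpleHistogram {
    private final double[] upperBounds; // ascending finite bounds; +Inf is implied
    private final long[] bucketCounts;  // cumulative; the last slot is the +Inf bucket
    private long count;
    private double sum;

    SimpleHistogram(double... upperBounds) {
        this.upperBounds = upperBounds.clone();
        this.bucketCounts = new long[upperBounds.length + 1];
    }

    void observe(double value) {
        for (int i = 0; i < upperBounds.length; i++) {
            if (value <= upperBounds[i]) {
                bucketCounts[i]++;
            }
        }
        bucketCounts[upperBounds.length]++; // +Inf always matches
        count++;
        sum += value;
    }

    /** Cumulative count for the bucket at the given index (last index is +Inf). */
    long bucket(int index) {
        return bucketCounts[index];
    }

    long count() {
        return count;
    }

    double sum() {
        return sum;
    }

    public static void main(String[] args) {
        SimpleHistogram h = new SimpleHistogram(0.1, 0.5, 1.0);
        for (double seconds : new double[]{0.05, 0.3, 0.7, 2.0}) {
            h.observe(seconds);
        }
        // Buckets are cumulative: le=0.1 -> 1, le=0.5 -> 2, le=1.0 -> 3, le=+Inf -> 4
        for (int i = 0; i < 4; i++) {
            System.out.println(h.bucket(i));
        }
    }
}
```

Note that the `+Inf` bucket always equals `_count`, which is why the diagram above shows the same total (500) for both.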

Usage Example#

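import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import java.time.Duration;
import org.springframework.stereotype.Component;

@Component
public class RequestTimer {
    private final Timer requestTimer;

    public RequestTimer(MeterRegistry registry) {
        this.requestTimer = Timer.builder("http_request_duration_seconds")
            .description("HTTP request duration")
            .publishPercentileHistogram()  // Publish Micrometer's preset histogram buckets
            // Extra bucket boundaries at the SLO thresholds (sla() is the deprecated older name)
            .serviceLevelObjectives(Duration.ofMillis(100), Duration.ofMillis(500), Duration.ofSeconds(1))
            .register(registry);
    }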
    }

    public void recordRequest(Runnable action) {
        requestTimer.record(action);
    }
}

Metric Output (abbreviated; publishPercentileHistogram() also emits Micrometer's preset buckets):

http_request_duration_seconds_bucket{le="0.1"} 100
http_request_duration_seconds_bucket{le="0.5"} 350
http_request_duration_seconds_bucket{le="1.0"} 480
http_request_duration_seconds_bucket{le="+Inf"} 500
http_request_duration_seconds_count 500
http_request_duration_seconds_sum 245.5

PromQL Usage#

# P50 (median)
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))

# P95
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

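# P99
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# P99 across all instances (aggregate the buckets by le first)
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))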

# Average response time
rate(http_request_duration_seconds_sum[5m])
/ rate(http_request_duration_seconds_count[5m])
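
To see where the percentile numbers come from, the sketch below estimates a quantile from cumulative bucket counts using linear interpolation within the matching bucket. It follows the behavior described in the Prometheus documentation for classic histograms (the lowest bucket is assumed to start at 0, and a quantile that lands in the `+Inf` bucket returns the highest finite bound), but it is a simplified illustration: it works on raw counts rather than on rate() results, and `BucketQuantile` is a hypothetical name.

```java
// Simplified bucket-based quantile estimate in the style of histogram_quantile().
final class BucketQuantile {
    private BucketQuantile() {}

    /**
     * @param q                quantile between 0 and 1
     * @param upperBounds      ascending finite bucket bounds (the +Inf bucket is implied)
     * @param cumulativeCounts one cumulative count per finite bound, plus a final entry for +Inf
     */
    static double quantile(double q, double[] upperBounds, double[] cumulativeCounts) {
        double total = cumulativeCounts[cumulativeCounts.length - 1];
        if (total == 0) {
            return Double.NaN; // no observations
        }
        double rank = q * total;
        for (int i = 0; i < upperBounds.length; i++) {
            if (cumulativeCounts[i] >= rank) {
                double lowerBound = (i == 0) ? 0.0 : upperBounds[i - 1];
                double countBelow = (i == 0) ? 0.0 : cumulativeCounts[i - 1];
                double inBucket = cumulativeCounts[i] - countBelow;
                if (inBucket == 0) {
                    return lowerBound;
                }
                // Assume observations are spread evenly inside the bucket.
                return lowerBound + (upperBounds[i] - lowerBound) * (rank - countBelow) / inBucket;
            }
        }
        // The rank falls in the +Inf bucket: report the highest finite bound.
        return upperBounds[upperBounds.length - 1];
    }

    public static void main(String[] args) {
        // Bucket values from the "Metric Output" example above.
        double[] bounds = {0.1, 0.5, 1.0};
        double[] counts = {100, 350, 480, 500};
        System.out.println(quantile(0.50, bounds, counts)); // about 0.34
        System.out.println(quantile(0.95, bounds, counts)); // about 0.98
        System.out.println(quantile(0.99, bounds, counts)); // 1.0 (capped at the last finite bound)
    }
}
```

The P99 result shows why bucket boundaries matter: 20 of the 500 requests took longer than 1 second, but with no finite bucket above 1.0 the estimate cannot say how much longer.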

Bucket Design#

Number of buckets directly affects cardinality: every bucket boundary is stored as its own time series, so too many buckets increase storage costs.

// Recommended: Design based on SLA
// (serviceLevelObjectives replaces the deprecated sla method in newer Micrometer versions)
.serviceLevelObjectives(
    Duration.ofMillis(50),   // Fast response
    Duration.ofMillis(100),  // Target SLA
    Duration.ofMillis(250),
    Duration.ofMillis(500),
    Duration.ofSeconds(1),   // Slow response threshold
    Duration.ofSeconds(5)    // Near timeout
)
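
A quick back-of-the-envelope calculation shows how bucket count drives cardinality. The sketch below assumes the classic Prometheus histogram layout (one series per finite bucket, plus `+Inf`, `_sum`, and `_count`); the class name and the label counts in the example are hypothetical, and some client libraries expose additional series on top of these.

```java
// Rough estimate of time series produced by one classic histogram metric.
final class HistogramCardinality {
    private HistogramCardinality() {}

    /** Series per label combination: finite buckets + the +Inf bucket + _sum + _count. */
    static long seriesPerLabelSet(int finiteBuckets) {
        return finiteBuckets + 3L;
    }

    /** Total series once every label combination gets its own set of buckets. */
    static long totalSeries(int finiteBuckets, long labelCombinations) {
        return seriesPerLabelSet(finiteBuckets) * labelCombinations;
    }

    public static void main(String[] args) {
        // Six boundaries as configured above; 20 endpoints x 5 status codes is an assumed example.
        System.out.println(seriesPerLabelSet(6));       // 9
        System.out.println(totalSeries(6, 20L * 5L));   // 900
    }
}
```

Doubling the number of boundaries nearly doubles the series count for every label combination, which is why a handful of SLA-aligned buckets is usually preferable to a fine-grained grid.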

When to Use?#

  • Response time/latency
  • Request/response sizes
  • Batch job processing time
  • When percentile calculation (P50, P95, P99) is needed

Summary#

Definition#

Pre-calculates percentiles on the client. Similar to Histogram but difficult to aggregate on the server side.

Histogram vs Summary#

| Item | Histogram | Summary |
| --- | --- | --- |
| Percentile calculation | Server (PromQL) | Client |
| Aggregatable | Can aggregate multiple instances | Cannot aggregate |
| Accuracy | Depends on bucket boundaries | Accurate |
| CPU usage | Server load | Client load |

Summary is rarely used. Percentiles from multiple instances cannot be meaningfully combined (averaging per-instance P99 values does not yield the overall P99), making it unsuitable for distributed environments. Histogram is recommended.
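
The aggregation problem is easy to reproduce with made-up numbers. In the Java sketch below, two hypothetical instances report their own P99, and averaging those values gives a very different answer from the P99 of the combined traffic. The class name, the traffic figures, and the nearest-rank percentile definition are all assumptions chosen for illustration.

```java
import java.util.Arrays;

// Demonstrates that per-instance percentiles cannot simply be averaged.
final class PercentileAggregation {
    private PercentileAggregation() {}

    /** Nearest-rank percentile; percent is a whole number from 1 to 100. */
    static double percentile(double[] values, int percent) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (percent * sorted.length + 99) / 100; // ceil(percent * n / 100)
        return sorted[rank - 1];
    }

    static double[] repeat(double value, int times) {
        double[] out = new double[times];
        Arrays.fill(out, value);
        return out;
    }

    static double[] concat(double[] a, double[] b) {
        double[] out = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }

    public static void main(String[] args) {
        double[] instanceA = repeat(10.0, 1000); // busy instance, fast requests (ms)
        double[] instanceB = repeat(1000.0, 10); // quiet instance, slow requests (ms)

        double averagedP99 = (percentile(instanceA, 99) + percentile(instanceB, 99)) / 2;
        double trueP99 = percentile(concat(instanceA, instanceB), 99);

        System.out.println(averagedP99); // 505.0
        System.out.println(trueP99);     // 10.0
    }
}
```

Histogram buckets avoid this problem because they are plain counts: bucket values from every instance can be summed first, and the percentile is estimated once from the combined counts.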

Type Selection Guide#

graph TD
    Q1{"Does the value only increase?"}
    Q1 --> |"Yes"| C["Counter"]
    Q1 --> |"No"| Q2{"Is distribution important?"}

    Q2 --> |"Yes"| H["Histogram"]
    Q2 --> |"No"| G["Gauge"]

    C --> C1["Request count, Error count"]
    G --> G1["CPU, Memory, Queue size"]
    H --> H1["Response time, Request size"]
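
The decision tree above is small enough to express as a single function. The following Java sketch is only a restatement of the diagram; `MetricTypeSelector` and its parameter names are hypothetical.

```java
// Encodes the selection guide: Counter first, then Histogram vs Gauge.
final class MetricTypeSelector {
    private MetricTypeSelector() {}

    enum MetricType { COUNTER, GAUGE, HISTOGRAM }

    static MetricType choose(boolean onlyIncreases, boolean distributionMatters) {
        if (onlyIncreases) {
            return MetricType.COUNTER;   // request count, error count
        }
        return distributionMatters
            ? MetricType.HISTOGRAM       // response time, request size
            : MetricType.GAUGE;          // CPU, memory, queue size
    }

    public static void main(String[] args) {
        System.out.println(choose(true, false));  // COUNTER
        System.out.println(choose(false, true));  // HISTOGRAM
        System.out.println(choose(false, false)); // GAUGE
    }
}
```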

Quick Reference Table#

| Measurement Target | Type | Reason |
| --- | --- | --- |
| HTTP request count | Counter | Cumulative increase |
| HTTP error count | Counter | Cumulative increase |
| Response time | Histogram | Distribution/percentile needed |
| CPU usage | Gauge | Current state |
| Memory usage | Gauge | Current state |
| Active connections | Gauge | Can increase/decrease |
| Request size | Histogram | Distribution needed |
| Queue pending items | Gauge | Current state |
| Processed bytes | Counter | Cumulative increase |

Naming Convention#

Basic Rules#

# Format
{namespace}_{name}_{unit}_{suffix}

# Examples
http_request_duration_seconds_bucket
process_cpu_seconds_total
node_memory_bytes

Recommendations#

| Item | Rule | Example |
| --- | --- | --- |
| Unit | Use base units | seconds (not milliseconds) |
| Suffix | Counter uses _total | http_requests_total |
| Case | snake_case | orders_created_total |
| Clarity | Specify measurement target | http_request_duration_seconds |
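
These rules can be checked mechanically. The Java sketch below assembles a name from the `{namespace}_{name}_{unit}_{suffix}` format and applies two simple checks; `MetricNames` and its methods are hypothetical helpers, and the regular expression is an assumed approximation of snake_case rather than an official Prometheus validation rule.

```java
import java.util.regex.Pattern;

// Small helpers for building and sanity-checking metric names.
final class MetricNames {
    private MetricNames() {}

    // Lowercase words separated by single underscores (an approximation of snake_case).
    private static final Pattern SNAKE_CASE = Pattern.compile("[a-z][a-z0-9]*(_[a-z0-9]+)*");

    /** Joins the parts with underscores; pass an empty string to skip unit or suffix. */
    static String build(String namespace, String name, String unit, String suffix) {
        StringBuilder sb = new StringBuilder(namespace).append('_').append(name);
        if (!unit.isEmpty()) {
            sb.append('_').append(unit);
        }
        if (!suffix.isEmpty()) {
            sb.append('_').append(suffix);
        }
        return sb.toString();
    }

    static boolean isSnakeCase(String metricName) {
        return SNAKE_CASE.matcher(metricName).matches();
    }

    static boolean hasCounterSuffix(String metricName) {
        return metricName.endsWith("_total");
    }

    public static void main(String[] args) {
        System.out.println(build("process", "cpu", "seconds", "total")); // process_cpu_seconds_total
        System.out.println(build("http", "requests", "", "total"));      // http_requests_total
        System.out.println(isSnakeCase("httpRequestsTotal"));            // false
    }
}
```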

Practical Example: Spring Boot Metrics#

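// Order, OrderRequest, and OrderService are application classes assumed to exist.
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import jakarta.annotation.PostConstruct; // javax.annotation.PostConstruct on Spring Boot 2.x
import java.util.concurrent.atomic.AtomicInteger;
import lombok.RequiredArgsConstructor;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequiredArgsConstructor
public class OrderController {
    private final MeterRegistry registry;
    private final OrderService orderService;  // Injected via the Lombok-generated constructor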

    // Counter: Order creation count
    private Counter orderCounter(String status) {
        return Counter.builder("orders_total")
            .tag("status", status)
            .register(registry);
    }

    // Gauge: Currently processing orders
    private final AtomicInteger ordersInProgress = new AtomicInteger(0);

    @PostConstruct
    void registerGauge() {
        Gauge.builder("orders_in_progress", ordersInProgress, AtomicInteger::get)
            .register(registry);
    }

    // Histogram: Order processing time
    private Timer orderTimer() {
        return Timer.builder("order_processing_duration_seconds")
            .publishPercentileHistogram()
            .register(registry);
    }

    @PostMapping("/orders")
    public Order createOrder(@RequestBody OrderRequest request) {
        ordersInProgress.incrementAndGet();
        try {
            return orderTimer().record(() -> {
                Order order = orderService.create(request);
                orderCounter("success").increment();
                return order;
            });
        } catch (Exception e) {
            orderCounter("failed").increment();
            throw e;
        } finally {
            ordersInProgress.decrementAndGet();
        }
    }
}

Key Summary#

| Type | Purpose | PromQL | Example |
| --- | --- | --- | --- |
| Counter | Cumulative count | rate(), increase() | Request count |
| Gauge | Current state | Use as-is | CPU % |
| Histogram | Distribution measurement | histogram_quantile() | Response time |

Next Steps#

| Recommended Order | Document | What You’ll Learn |
| --- | --- | --- |
| 1 | Prometheus Architecture | Pull model, time series DB |
| 2 | PromQL Syntax Basics | Selectors, label matching |
| 3 | rate and increase | Counter usage |