What are Golden Signals?#

The Four Golden Signals, introduced in the Google SRE book, are the core metrics for understanding service health.

Among hundreds of metrics, the Golden Signals are the “truly important ones”. Like a car dashboard showing only speed, fuel, engine temperature, and warning lights, they provide the minimum set of metrics to grasp system health at a glance.

Analogy: Key Health Check Metrics

During a health checkup, dozens of items are measured, but doctors first look at a few key metrics:

  • Blood pressure (corresponds to Latency): trouble when it runs too high
  • Heart rate (corresponds to Traffic): how hard the system is working
  • Blood sugar (corresponds to Errors): abnormal values signal something is wrong
  • Weight (corresponds to Saturation): creeping toward a limit

If these four are normal, you’re generally healthy. The same applies to systems.

The Four Golden Signals#

```mermaid
graph TD
    subgraph "Golden Signals"
        L["Latency<br>Response Time"]
        T["Traffic<br>Throughput"]
        E["Errors<br>Error Rate"]
        S["Saturation<br>Resource Usage"]
    end

    L --> |"How slow?"| Q1["Response Speed"]
    T --> |"How much?"| Q2["Throughput"]
    E --> |"Failing?"| Q3["Success Rate"]
    S --> |"At limit?"| Q4["Resource Headroom"]
```

| Signal | Measures | Key Question |
| --- | --- | --- |
| Latency | Response time | “How fast is it?” |
| Traffic | Throughput | “How much is it processing?” |
| Errors | Failure rate | “How much is failing?” |
| Saturation | Resource usage | “How much headroom is left?” |

Why These Four?#

Through years of operational experience, Google’s SRE team concluded that “monitoring just these four can detect most problems”.

Completeness#

The four signals cover all aspects of system state:

  1. Latency: Detect response speed issues
  2. Traffic: Understand load level
  3. Errors: Detect correctness issues
  4. Saturation: Predict capacity limits

Other metrics (e.g., heap memory, GC time) are mostly root causes or detailed indicators of these four. For example, long GC pauses increase Latency, and a full disk causes Errors.
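To make the idea concrete, here is a minimal sketch of evaluating the four signals together. The thresholds are made-up illustration values, not recommended SLOs; real limits depend on your service.

```python
from dataclasses import dataclass

@dataclass
class GoldenSignals:
    p99_latency_ms: float   # Latency
    rps: float              # Traffic
    error_rate: float       # Errors, as a fraction (0.0-1.0)
    cpu_utilization: float  # Saturation, as a fraction (0.0-1.0)

def unhealthy_signals(s: GoldenSignals) -> list[str]:
    """Return the signals breaching their (example) thresholds."""
    issues = []
    if s.p99_latency_ms > 500:   # hypothetical latency SLO
        issues.append("latency")
    if s.error_rate > 0.01:      # hypothetical 1% error budget
        issues.append("errors")
    if s.cpu_utilization > 0.85: # hypothetical saturation limit
        issues.append("saturation")
    return issues

print(unhealthy_signals(GoldenSignals(620, 120, 0.002, 0.9)))
# → ['latency', 'saturation']
```

Note that Traffic has no threshold here: by itself it is neither good nor bad, but it explains why the other three move.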

Direct User Impact#

| Signal | User Impact |
| --- | --- |
| Latency ↑ | Slow page loading, increased bounce rate |
| Traffic ↑ | Increased system load |
| Errors ↑ | Features don’t work, trust declines |
| Saturation ↑ | Predicts imminent failure |

Interconnected#

```mermaid
graph LR
    T["Traffic ↑"] --> S["Saturation ↑"]
    S --> L["Latency ↑"]
    S --> E["Errors ↑"]
    L --> E
```

Increased traffic leads to higher saturation, which leads to increased latency and error rates.
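This cascade can be made quantitative with a simple queueing approximation. The sketch below uses the classic M/M/1 formula, latency = service_time / (1 − utilization); it is an illustrative model showing why latency explodes near saturation, not a property of any specific system.

```python
def expected_latency(service_time_s: float, utilization: float) -> float:
    """M/M/1 approximation: latency grows without bound as utilization -> 1."""
    if utilization >= 1.0:
        return float("inf")
    return service_time_s / (1.0 - utilization)

# 50ms of raw service time at increasing saturation levels
for u in (0.5, 0.8, 0.95, 0.99):
    print(f"utilization={u:.2f}  latency={expected_latency(0.05, u) * 1000:.0f}ms")
```

Going from 50% to 99% utilization multiplies latency by 50x, which is why Saturation is treated as a leading indicator for the other signals.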


USE Method and RED Method#

Besides the Golden Signals, two other methodologies are frequently used.

USE (Resource-Focused)#

Utilization, Saturation, Errors - Suitable for infrastructure/resource monitoring

| Metric | Target | Example |
| --- | --- | --- |
| Utilization | Usage rate | 70% CPU usage |
| Saturation | Queue | 10 requests waiting |
| Errors | Error events | Disk I/O errors |

RED (Service-Focused)#

Rate, Errors, Duration - Suitable for microservice monitoring

| Metric | Target | Example |
| --- | --- | --- |
| Rate | Request count | 100 RPS |
| Errors | Error rate | 1% failure |
| Duration | Response time | P99 200ms |

Comparison#

| Methodology | Focus | Suitable For |
| --- | --- | --- |
| Golden Signals | Service + Resource | General purpose |
| USE | Resource | Server, network, storage |
| RED | Service | API, microservices |

Learning Path#

By Signal#

  1. Latency - Measurement strategy (P50, P95, P99)
  2. Traffic - Traffic and throughput monitoring
  3. Errors - Error-rate definition and classification
  4. Saturation - Resource usage and capacity limits
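
As a quick taste of the percentile idea from step 1, here is a sketch that computes P50/P95/P99 from simulated latencies using only Python's standard library (the exponential distribution is just a stand-in for a heavy-tailed latency profile):

```python
import random
import statistics

random.seed(1)
# Simulated request latencies in ms: mostly fast, with a slow tail.
latencies = [random.expovariate(1 / 50) for _ in range(10_000)]

# statistics.quantiles(n=100) returns the 1st..99th percentile cut points.
q = statistics.quantiles(latencies, n=100)
p50, p95, p99 = q[49], q[94], q[98]
print(f"P50={p50:.0f}ms  P95={p95:.0f}ms  P99={p99:.0f}ms")
```

The gap between P50 and P99 is the point: averages hide the tail that your slowest users actually experience.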

By Service Type#

  1. By Service Type - Guides for Web API, Kafka, DB

Quick Reference: PromQL Queries#

Latency#

```promql
# P99 response time
histogram_quantile(0.99,
  sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
)

# Average response time
rate(http_request_duration_seconds_sum[5m])
/ rate(http_request_duration_seconds_count[5m])
```

Traffic#

```promql
# Requests per second (RPS)
sum by (service) (rate(http_requests_total[5m]))

# Bytes per second
sum by (service) (rate(http_request_size_bytes_sum[5m]))
```
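Under the hood, `rate()` turns a monotonically increasing counter into a per-second rate, tolerating counter resets (e.g., process restarts). A simplified Python sketch of the core idea; real `rate()` additionally extrapolates to the window boundaries, which this omits:

```python
def counter_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second rate from (timestamp, counter_value) samples.

    When the counter drops, we assume it restarted from zero
    and count the new value as the increase since the reset.
    """
    increase = 0.0
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        increase += (v1 - v0) if v1 >= v0 else v1
    span = samples[-1][0] - samples[0][0]
    return increase / span

# Five samples over 60s; the counter resets between t=30 and t=45.
samples = [(0, 100), (15, 250), (30, 400), (45, 50), (60, 200)]
print(f"{counter_rate(samples):.2f} req/s")
# total increase = 150 + 150 + 50 + 150 = 500 over 60s
```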

Errors#

```promql
# Error rate (%)
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
/ sum by (service) (rate(http_requests_total[5m]))
* 100

# Error count
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
```
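The same arithmetic in plain Python, with a guard for the idle case (where the PromQL division would simply produce no sample):

```python
def error_rate_pct(error_rps: float, total_rps: float) -> float:
    """Error percentage; treat a fully idle service as 0% errors."""
    if total_rps == 0:
        return 0.0
    return error_rps / total_rps * 100

print(error_rate_pct(0.5, 100))  # → 0.5 (i.e., 0.5% of requests fail)
```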

Saturation#

```promql
# CPU usage
100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100

# Memory usage
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Disk usage
(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100
```
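The memory and disk expressions share the shape `(1 - available / total) * 100`; a trivial Python mirror for checking the arithmetic:

```python
def usage_pct(available_bytes: float, total_bytes: float) -> float:
    """Mirror of the PromQL (1 - avail/total) * 100 pattern."""
    return (1 - available_bytes / total_bytes) * 100

# 6 GiB available of 16 GiB total
print(usage_pct(6 * 2**30, 16 * 2**30))  # → 62.5
```

Note the convention: these expressions use *available* (memory that can be reclaimed, including caches), not *free*, which would overstate memory pressure.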

Recording Rules Template#

```yaml
groups:
  - name: golden_signals
    rules:
      # Latency
      - record: service:http_request_duration_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

      # Traffic
      - record: service:http_requests:rate5m
        expr: sum by (service) (rate(http_requests_total[5m]))

      # Errors
      - record: service:http_requests_errors:ratio_rate5m
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          / sum by (service) (rate(http_requests_total[5m]))

      # Saturation
      - record: instance:node_cpu_utilization:ratio
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
```

Dashboard Layout#

```mermaid
graph TD
    subgraph "Grafana Dashboard"
        ROW1["Row 1: Summary"]
        ROW2["Row 2: Latency"]
        ROW3["Row 3: Traffic"]
        ROW4["Row 4: Errors"]
        ROW5["Row 5: Saturation"]
    end

    ROW1 --> |"4 Stat panels"| S1["P99<br>RPS<br>Error Rate<br>CPU"]
    ROW2 --> |"Time Series"| L["P50/P95/P99<br>Response time"]
    ROW3 --> |"Time Series"| T["RPS<br>Traffic trend"]
    ROW4 --> |"Time Series"| E["Error rate<br>Error count"]
    ROW5 --> |"Gauge + Time"| SAT["CPU/MEM/DISK<br>Usage"]
```

Next Steps#

| Document | Content |
| --- | --- |
| Latency | P50/P95/P99 measurement, SLA setup |
| Traffic | RPS, throughput monitoring |
| Errors | Error classification, error budget |
| Saturation | Resource bottleneck detection |
| By Service Type | Custom metrics for API, Kafka, DB |