Target Audience: SREs operating various service types
Prerequisites: SRE Golden Signals
After reading this: You’ll be able to apply the four signals tailored to service characteristics

TL;DR
  • Key signals differ by service type
  • Web API: focus on Latency and Errors
  • Kafka: focus on Traffic (Lag) and Saturation
  • Database: focus on Latency and Saturation
  • Batch jobs: focus on Errors and Traffic (completion rate)

Why Apply Differently by Service Type?

Why can’t you use the same signals for every service? For a Web API, the most critical signals are Latency and Errors, but for a Kafka Consumer, Consumer Lag and Saturation are what matter most. If you apply the four golden signals identically across all services, you’ll miss the metrics that actually matter and accumulate alert fatigue from meaningless notifications. Prioritizing signals based on each service’s characteristics is how you achieve practical operational visibility.

Key Signals by Service Type

| Service Type | Key Signals | Reason |
| --- | --- | --- |
| Web API | Latency, Errors | Directly affects user experience |
| Kafka | Traffic (Lag), Saturation | Throughput is key |
| Database | Latency, Saturation | Query performance, connections |
| Cache (Redis) | Latency, Saturation | Response speed, memory |
| Batch jobs | Errors, Traffic | Completion rate, throughput |
| Load balancer | Traffic, Errors | Connection distribution |

Web API / Microservices

Key Metrics

graph TD
    subgraph "Web API Key"
        L["Latency<br>P99 response time"]
        E["Errors<br>5xx ratio"]
        T["Traffic<br>RPS"]
        S["Saturation<br>Connections/threads"]
    end

    L --> |"SLA"| SLA["P99 < 500ms"]
    E --> |"SLO"| SLO["Error rate < 0.1%"]

This diagram shows that Latency and Errors are the most critical metrics for Web APIs based on SLA/SLO criteria.

PromQL Queries

# Latency: P99 response time
histogram_quantile(0.99,
  sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
)

# Traffic: Requests per second
sum by (service) (rate(http_requests_total[5m]))

# Errors: Error rate
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
/ sum by (service) (rate(http_requests_total[5m]))

# Saturation: Concurrent requests
sum by (service) (http_server_requests_active)

Recording Rules

groups:
  - name: web_api_golden_signals
    rules:
      - record: service:http_request_duration_seconds:p99
        expr: histogram_quantile(0.99, sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))

      - record: service:http_requests:rate5m
        expr: sum by (service) (rate(http_requests_total[5m]))

      - record: service:http_errors:ratio
        expr: sum by (service) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (service) (rate(http_requests_total[5m]))
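
The SLA/SLO thresholds in the diagram (P99 < 500ms, error rate < 0.1%) can be turned into alerts on top of the recording rules above. The following is a minimal sketch, not a definitive rule set: the group name, alert names, `for` durations, and severities are assumptions chosen for illustration, and it assumes the recorded latency series is in seconds.

```yaml
groups:
  - name: web_api_alerts            # hypothetical group name
    rules:
      # Latency: P99 above the 500ms threshold from the diagram (0.5 seconds)
      - alert: WebApiLatencyP99High
        expr: service:http_request_duration_seconds:p99 > 0.5
        for: 10m                    # assumed hold time to ride out short spikes
        labels:
          severity: warning         # assumed severity
        annotations:
          summary: "Web API P99 latency above 500ms"

      # Errors: 5xx ratio above the 0.1% threshold from the diagram
      - alert: WebApiErrorRatioHigh
        expr: service:http_errors:ratio > 0.001
        for: 5m                     # assumed hold time
        labels:
          severity: critical        # assumed severity
        annotations:
          summary: "Web API 5xx error ratio above 0.1%"
```

Alerting on the recorded series rather than the raw expressions keeps the alert and the dashboard reading the same numbers.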

Kafka

Key Metrics

graph TD
    subgraph "Kafka Key"
        T["Traffic<br>Message throughput"]
        L["Latency<br>Consumer Lag"]
        S["Saturation<br>Disk, partitions"]
        E["Errors<br>Failed messages"]
    end

    L --> |"Important"| LAG["Lag < 10,000"]
    S --> |"Important"| DISK["Disk < 80%"]

This diagram shows that Consumer Lag and disk utilization are the most important monitoring metrics for Kafka.

PromQL Queries

# Traffic: Messages per second
sum by (topic) (rate(kafka_server_brokertopicmetrics_messagesin_total[5m]))

# Latency: Consumer Lag
sum by (consumer_group, topic) (kafka_consumer_group_lag)

# Saturation: Disk usage
kafka_log_log_size / kafka_log_log_max_size * 100

# Saturation: Under-replicated partitions
kafka_server_replicamanager_underreplicatedpartitions

# Errors: Failure rate
rate(kafka_producer_record_error_total[5m])

Alert Rules

groups:
  - name: kafka_alerts
    rules:
      - alert: KafkaConsumerLagHigh
        expr: sum by (consumer_group, topic) (kafka_consumer_group_lag) > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Kafka consumer lag high"

      - alert: KafkaUnderReplicated
        expr: kafka_server_replicamanager_underreplicatedpartitions > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Kafka under-replicated partitions"

      - alert: KafkaDiskHigh
        expr: kafka_log_log_size / kafka_log_log_max_size > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Kafka disk usage high"

Database (PostgreSQL/MySQL)

Key Metrics

graph TD
    subgraph "DB Key"
        L["Latency<br>Query time"]
        S["Saturation<br>Connections, buffer"]
        E["Errors<br>Deadlocks, failures"]
        T["Traffic<br>QPS"]
    end

    S --> |"Important"| CONN["Connections < 80%"]
    L --> |"Important"| SLOW["Slow queries"]

This diagram shows that connection saturation and query latency are the most important monitoring metrics for databases.

PromQL Queries (PostgreSQL)

# Latency: Query time
rate(pg_stat_statements_total_time_seconds_sum[5m])
/ rate(pg_stat_statements_calls_total[5m])

# Traffic: Queries per second (QPS)
sum(rate(pg_stat_statements_calls_total[5m]))

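# Saturation: Connection usage (%)
# Both sides are matched on the instance label; a label-less sum() has no labels
# in common with pg_settings_max_connections and would return nothing
sum by (instance) (pg_stat_activity_count) / on (instance) pg_settings_max_connections * 100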

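# Saturation: Buffer cache hit rate (%)
# Blocks served from shared buffers vs. all block reads (pg_stat_database counters;
# exact metric names depend on the exporter version)
sum(rate(pg_stat_database_blks_hit[5m]))
/ (sum(rate(pg_stat_database_blks_hit[5m])) + sum(rate(pg_stat_database_blks_read[5m]))) * 100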

# Errors: Deadlocks
rate(pg_stat_database_deadlocks_total[5m])

PromQL Queries (MySQL)

# Traffic: QPS
rate(mysql_global_status_queries[5m])

# Latency: Slow queries
rate(mysql_global_status_slow_queries[5m])

# Saturation: Connection usage
mysql_global_status_threads_connected / mysql_global_variables_max_connections * 100

# Saturation: Buffer pool usage
mysql_global_status_innodb_buffer_pool_pages_data
/ mysql_global_status_innodb_buffer_pool_pages_total * 100
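
The "Connections < 80%" threshold from the database diagram can be expressed as alert rules in the same style as the Kafka alerts. This is a sketch under stated assumptions: the group name, alert names, `for` durations, and severities are illustrative, and the PostgreSQL expression assumes both metrics carry an `instance` label with one `pg_settings_max_connections` series per instance.

```yaml
groups:
  - name: database_alerts           # hypothetical group name
    rules:
      # PostgreSQL: connection usage above 80% of max_connections
      - alert: PostgresConnectionsHigh
        expr: sum by (instance) (pg_stat_activity_count) / on (instance) pg_settings_max_connections > 0.8
        for: 10m                    # assumed hold time
        labels:
          severity: warning         # assumed severity
        annotations:
          summary: "PostgreSQL connection usage above 80%"

      # MySQL: connected threads above 80% of max_connections
      - alert: MySQLConnectionsHigh
        expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections > 0.8
        for: 10m                    # assumed hold time
        labels:
          severity: warning         # assumed severity
        annotations:
          summary: "MySQL connection usage above 80%"
```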

Cache (Redis)

Key Metrics

graph TD
    subgraph "Redis Key"
        L["Latency<br>Command time"]
        S["Saturation<br>Memory"]
        T["Traffic<br>Command count"]
        E["Errors<br>Rejects, evictions"]
    end

    S --> |"Important"| MEM["Memory < 80%"]
    L --> |"Important"| HIT["Hit rate > 90%"]

This diagram shows that memory utilization and cache hit rate are the most important monitoring metrics for Redis.

PromQL Queries

# Traffic: Commands per second
rate(redis_commands_total[5m])

# Latency: Average command time
rate(redis_commands_duration_seconds_total[5m])
/ rate(redis_commands_total[5m])

# Saturation: Memory usage
redis_memory_used_bytes / redis_memory_max_bytes * 100

# Latency (proxy): Cache hit rate (%); misses fall through to the slower backing store
rate(redis_keyspace_hits_total[5m])
/ (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]))
* 100

# Saturation: Connection count
redis_connected_clients
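
The two thresholds highlighted in the Redis diagram (Memory < 80%, Hit rate > 90%) map onto alert rules like the following. Treat this as a sketch rather than a definitive configuration: the group name, alert names, `for` durations, and severities are assumptions, and the memory rule assumes the exporter reports `redis_memory_max_bytes` as 0 when no `maxmemory` limit is configured.

```yaml
groups:
  - name: redis_alerts              # hypothetical group name
    rules:
      # Saturation: memory above 80% of the configured limit.
      # The "> 0" filter drops instances without a limit (assumed to report 0),
      # which would otherwise produce a division by zero and a false alert.
      - alert: RedisMemoryHigh
        expr: redis_memory_used_bytes / (redis_memory_max_bytes > 0) > 0.8
        for: 10m                    # assumed hold time
        labels:
          severity: warning         # assumed severity
        annotations:
          summary: "Redis memory usage above 80%"

      # Hit rate below the 90% target from the diagram
      - alert: RedisHitRateLow
        expr: |
          rate(redis_keyspace_hits_total[5m])
          / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]))
          < 0.9
        for: 15m                    # assumed hold time; cold caches recover slowly
        labels:
          severity: warning         # assumed severity
        annotations:
          summary: "Redis cache hit rate below 90%"
```

With no keyspace traffic the ratio evaluates to NaN, which never satisfies the comparison, so an idle instance does not trigger the hit-rate alert.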

Batch Jobs

Key Metrics

graph TD
    subgraph "Batch Key"
        E["Errors<br>Failure rate"]
        T["Traffic<br>Throughput"]
        L["Latency<br>Execution time"]
    end

    E --> |"Important"| FAIL["Failure rate < 1%"]
    T --> |"Important"| DONE["Completion rate 100%"]

This diagram shows that failure rate and completion rate are the most important monitoring metrics for batch jobs.

PromQL Queries

# Traffic: Completed jobs
increase(batch_jobs_completed_total[1h])

# Errors: Failure rate
increase(batch_jobs_failed_total[1h])
/ (increase(batch_jobs_completed_total[1h]) + increase(batch_jobs_failed_total[1h]))

# Latency: Average execution time
rate(batch_job_duration_seconds_sum[1h])
/ rate(batch_job_duration_seconds_count[1h])

# Traffic: Processed items
increase(batch_items_processed_total[1h])

Alert Rules

groups:
  - name: batch_alerts
    rules:
      # Any failed run within the last hour
      - alert: BatchJobFailed
        expr: increase(batch_jobs_failed_total[1h]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Batch job failed"

      # No successful run recorded for more than 24 hours (86400 seconds)
      - alert: BatchJobNotRunning
        expr: time() - batch_job_last_success_timestamp > 86400
        labels:
          severity: critical
        annotations:
          summary: "Batch job hasn't succeeded in 24 hours"

Load Balancer (Nginx/HAProxy)

PromQL Queries (Nginx)

# Traffic: Requests per second
sum(rate(nginx_http_requests_total[5m]))

# Errors: 5xx ratio
sum(rate(nginx_http_requests_total{status=~"5.."}[5m]))
/ sum(rate(nginx_http_requests_total[5m]))

# Saturation: Active connections
nginx_connections_active

# Latency: P99 upstream response time (buckets aggregated across instances)
histogram_quantile(0.99, sum by (le) (rate(nginx_upstream_response_time_seconds_bucket[5m])))

Dashboard Templates

Panel Layout by Service Type

| Service | Row 1 (Summary) | Row 2 (Detail) | Row 3 (Trend) |
| --- | --- | --- | --- |
| Web API | P99, RPS, Error rate | Status code distribution, By endpoint | Time series |
| Kafka | Lag, Throughput, Partitions | By consumer, By topic | Time series |
| DB | QPS, Connections, Slow queries | By query, By table | Time series |
| Redis | Commands, Hit rate, Memory | By command, By keyspace | Time series |

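Key Summary

| Service | Top Priority | 2nd Priority | 3rd Priority |
| --- | --- | --- | --- |
| Web API | Latency | Errors | Traffic |
| Kafka | Traffic (Lag) | Saturation | Errors |
| DB | Saturation | Latency | Traffic |
| Redis | Saturation | Latency | Traffic |
| Batch | Errors | Traffic | Latency |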

Next Steps

| Recommended Order | Document | What You’ll Learn |
| --- | --- | --- |
| 1 | Environment Setup | Practice environment |
| 2 | Kafka Monitoring | Kafka details |