Target Audience: SREs operating various service types
Prerequisites: SRE Golden Signals
After reading this: You’ll be able to apply the four signals tailored to service characteristics

TL;DR#

Key Summary:

  • Each service type calls for a different emphasis among the four signals
  • Web API: focus on Latency and Errors
  • Kafka: focus on Traffic (Lag) and Saturation
  • Database: focus on Latency and Saturation
  • Batch jobs: focus on Errors and Traffic (completion rate)

Key Signals by Service Type#

| Service Type | Key Signals | Reason |
| --- | --- | --- |
| Web API | Latency, Errors | Directly affects user experience |
| Kafka | Traffic (Lag), Saturation | Throughput is key |
| Database | Latency, Saturation | Query performance, connections |
| Cache (Redis) | Latency, Saturation | Response speed, memory |
| Batch jobs | Errors, Traffic | Completion rate, throughput |
| Load balancer | Traffic, Errors | Connection distribution |

Web API / Microservices#

Key Metrics#

graph TD
    subgraph "Web API Key"
        L["Latency<br>P99 response time"]
        E["Errors<br>5xx ratio"]
        T["Traffic<br>RPS"]
        S["Saturation<br>Connections/threads"]
    end

    L --> |"SLA"| SLA["P99 < 500ms"]
    E --> |"SLO"| SLO["Error rate < 0.1%"]

PromQL Queries#

# Latency: P99 response time
histogram_quantile(0.99,
  sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
)

# Traffic: Requests per second
sum by (service) (rate(http_requests_total[5m]))

# Errors: Error rate
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
/ sum by (service) (rate(http_requests_total[5m]))

# Saturation: Concurrent requests
sum by (service) (http_server_requests_active)

Recording Rules#

groups:
  - name: web_api_golden_signals
    rules:
      - record: service:http_request_duration_seconds:p99
        expr: histogram_quantile(0.99, sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))

      - record: service:http_requests:rate5m
        expr: sum by (service) (rate(http_requests_total[5m]))

      - record: service:http_errors:ratio
        expr: sum by (service) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (service) (rate(http_requests_total[5m]))
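Alert Rules (sketch)

The thresholds in the diagram above (P99 < 500ms, error rate < 0.1%) could be turned into alerts on the recorded series. The following is a minimal sketch rather than a definitive rule set: the group name, alert names, `for:` durations, and severities are assumptions and should be tuned per service.

```yaml
groups:
  - name: web_api_alerts            # hypothetical group name
    rules:
      # P99 latency above the 500 ms target (recorded series is in seconds).
      - alert: WebApiLatencyHigh
        expr: service:http_request_duration_seconds:p99 > 0.5
        for: 10m                    # assumed duration
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 500ms for {{ $labels.service }}"

      # 5xx ratio above the 0.1% target (ratio is 0..1, so 0.1% = 0.001).
      - alert: WebApiErrorRateHigh
        expr: service:http_errors:ratio > 0.001
        for: 5m                     # assumed duration
        labels:
          severity: critical
        annotations:
          summary: "5xx error ratio above 0.1% for {{ $labels.service }}"
```

Alerting on the recorded series instead of the raw expressions keeps the alert and the dashboard reading the same numbers.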

Kafka#

Key Metrics#

graph TD
    subgraph "Kafka Key"
        T["Traffic<br>Message throughput"]
        L["Latency<br>Consumer Lag"]
        S["Saturation<br>Disk, partitions"]
        E["Errors<br>Failed messages"]
    end

    L --> |"Important"| LAG["Lag < 10,000"]
    S --> |"Important"| DISK["Disk < 80%"]

PromQL Queries#

# Traffic: Messages per second
sum by (topic) (rate(kafka_server_brokertopicmetrics_messagesin_total[5m]))

# Latency: Consumer Lag
sum by (consumer_group, topic) (kafka_consumer_group_lag)

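# Saturation: Disk usage (percent)
# Assumes the exporter exposes a kafka_log_log_max_size gauge; if it does not,
# substitute host filesystem metrics for the denominator.
kafka_log_log_size / kafka_log_log_max_size * 100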

# Saturation: Under-replicated partitions
kafka_server_replicamanager_underreplicatedpartitions

# Errors: Failed producer records per second
rate(kafka_producer_record_error_total[5m])

Alert Rules#

groups:
  - name: kafka_alerts
    rules:
      - alert: KafkaConsumerLagHigh
        expr: sum by (consumer_group, topic) (kafka_consumer_group_lag) > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Kafka consumer lag high"

      - alert: KafkaUnderReplicated
        expr: kafka_server_replicamanager_underreplicatedpartitions > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Kafka under-replicated partitions"

      - alert: KafkaDiskHigh
        expr: kafka_log_log_size / kafka_log_log_max_size > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Kafka disk usage high"

Database (PostgreSQL/MySQL)#

Key Metrics#

graph TD
    subgraph "DB Key"
        L["Latency<br>Query time"]
        S["Saturation<br>Connections, buffer"]
        E["Errors<br>Deadlocks, failures"]
        T["Traffic<br>QPS"]
    end

    S --> |"Important"| CONN["Connections < 80%"]
    L --> |"Important"| SLOW["Slow queries"]

PromQL Queries (PostgreSQL)#

# Latency: Query time
rate(pg_stat_statements_total_time_seconds_sum[5m])
/ rate(pg_stat_statements_calls_total[5m])

# Traffic: Queries per second (QPS)
sum(rate(pg_stat_statements_calls_total[5m]))

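# Saturation: Connection usage (percent, per instance)
# Both sides are aggregated by the same label so the division can match series.
sum by (instance) (pg_stat_activity_count)
/ sum by (instance) (pg_settings_max_connections) * 100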

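# Saturation: Buffer cache hit rate (percent)
# Assumes the exporter exposes pg_stat_database_blks_hit and pg_stat_database_blks_read counters.
sum(rate(pg_stat_database_blks_hit[5m]))
/ (sum(rate(pg_stat_database_blks_hit[5m])) + sum(rate(pg_stat_database_blks_read[5m]))) * 100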

# Errors: Deadlocks
rate(pg_stat_database_deadlocks_total[5m])

PromQL Queries (MySQL)#

# Traffic: QPS
rate(mysql_global_status_queries[5m])

# Latency (proxy): Slow queries per second
rate(mysql_global_status_slow_queries[5m])

# Saturation: Connection usage
mysql_global_status_threads_connected / mysql_global_variables_max_connections * 100

# Saturation: Buffer pool usage
mysql_global_status_innodb_buffer_pool_pages_data
/ mysql_global_status_innodb_buffer_pool_pages_total * 100
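Alert Rules (sketch)

The "Connections < 80%" target from the diagram can be expressed as alert rules for both engines. This is a sketch under stated assumptions: the metric names are the ones used in the queries above, while the group name, alert names, `for:` durations, and severities are hypothetical.

```yaml
groups:
  - name: database_alerts           # hypothetical group name
    rules:
      # PostgreSQL: connections above 80% of max_connections, per instance.
      - alert: PostgresConnectionsHigh
        expr: |
          sum by (instance) (pg_stat_activity_count)
            / sum by (instance) (pg_settings_max_connections) > 0.8
        for: 5m                     # assumed duration
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL connections above 80% on {{ $labels.instance }}"

      # PostgreSQL: any deadlocks observed over the last 5 minutes.
      - alert: PostgresDeadlocks
        expr: rate(pg_stat_database_deadlocks_total[5m]) > 0
        for: 5m                     # assumed duration
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL deadlocks detected"

      # MySQL: connected threads above 80% of max_connections.
      - alert: MySQLConnectionsHigh
        expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections > 0.8
        for: 5m                     # assumed duration
        labels:
          severity: warning
        annotations:
          summary: "MySQL connections above 80% on {{ $labels.instance }}"
```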

Cache (Redis)#

Key Metrics#

graph TD
    subgraph "Redis Key"
        L["Latency<br>Command time"]
        S["Saturation<br>Memory"]
        T["Traffic<br>Command count"]
        E["Errors<br>Rejects, evictions"]
    end

    S --> |"Important"| MEM["Memory < 80%"]
    L --> |"Important"| HIT["Hit rate > 90%"]

PromQL Queries#

# Traffic: Commands per second
rate(redis_commands_total[5m])

# Latency: Average command time
rate(redis_commands_duration_seconds_total[5m])
/ rate(redis_commands_total[5m])

# Saturation: Memory usage
redis_memory_used_bytes / redis_memory_max_bytes * 100

# Latency (proxy): Cache hit rate in percent; misses fall through to the slower backing store
rate(redis_keyspace_hits_total[5m])
/ (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]))
* 100

# Saturation: Connection count
redis_connected_clients
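Alert Rules (sketch)

The two targets in the Redis diagram (Memory < 80%, Hit rate > 90%) might be wired up as follows. Treat this as a sketch, not a definitive rule set: the group name, alert names, `for:` durations, and severities are assumptions. The memory rule also guards against instances with no `maxmemory` configured, where the limit is reported as 0 and the ratio would otherwise be infinite.

```yaml
groups:
  - name: redis_alerts              # hypothetical group name
    rules:
      # Memory above 80% of maxmemory; skipped when no limit is set (max = 0).
      - alert: RedisMemoryHigh
        expr: |
          redis_memory_used_bytes / redis_memory_max_bytes > 0.8
            and redis_memory_max_bytes > 0
        for: 10m                    # assumed duration
        labels:
          severity: warning
        annotations:
          summary: "Redis memory above 80% on {{ $labels.instance }}"

      # Hit rate below the 90% target over a sustained window.
      - alert: RedisHitRateLow
        expr: |
          rate(redis_keyspace_hits_total[5m])
            / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m])) < 0.9
        for: 15m                    # assumed duration
        labels:
          severity: warning
        annotations:
          summary: "Redis hit rate below 90% on {{ $labels.instance }}"
```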

Batch Jobs#

Key Metrics#

graph TD
    subgraph "Batch Key"
        E["Errors<br>Failure rate"]
        T["Traffic<br>Throughput"]
        L["Latency<br>Execution time"]
    end

    E --> |"Important"| FAIL["Failure rate < 1%"]
    T --> |"Important"| DONE["Completion rate 100%"]

PromQL Queries#

# Traffic: Completed jobs
increase(batch_jobs_completed_total[1h])

# Errors: Failure rate
increase(batch_jobs_failed_total[1h])
/ (increase(batch_jobs_completed_total[1h]) + increase(batch_jobs_failed_total[1h]))

# Latency: Average execution time
rate(batch_job_duration_seconds_sum[1h])
/ rate(batch_job_duration_seconds_count[1h])

# Traffic: Processed items
increase(batch_items_processed_total[1h])

Alert Rules#

groups:
  - name: batch_alerts
    rules:
      - alert: BatchJobFailed
        expr: increase(batch_jobs_failed_total[1h]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Batch job failed"

      # Fires when the last successful run is more than 24 hours (86400 s) old.
      - alert: BatchJobNotRunning
        expr: time() - batch_job_last_success_timestamp > 86400
        labels:
          severity: critical
        annotations:
          summary: "Batch job has not succeeded in 24 hours"

Load Balancer (Nginx/HAProxy)#

PromQL Queries (Nginx)#

# Traffic: Requests per second
sum(rate(nginx_http_requests_total[5m]))

# Errors: 5xx ratio
sum(rate(nginx_http_requests_total{status=~"5.."}[5m]))
/ sum(rate(nginx_http_requests_total[5m]))

# Saturation: Active connections
nginx_connections_active

# Latency: Upstream response time
histogram_quantile(0.99, sum by (le) (rate(nginx_upstream_response_time_seconds_bucket[5m])))
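Recording Rules (sketch)

As with the Web API section, the Nginx queries can be precomputed so dashboards do not re-evaluate them on every refresh. This sketch assumes the same metric names as the queries above, including a `status` label on `nginx_http_requests_total`, which not every Nginx exporter provides. The group name and recorded series names are hypothetical.

```yaml
groups:
  - name: nginx_golden_signals      # hypothetical group name
    rules:
      # Traffic: requests per second, per job.
      - record: job:nginx_http_requests:rate5m
        expr: sum by (job) (rate(nginx_http_requests_total[5m]))

      # Errors: share of requests answered with a 5xx status (0..1).
      - record: job:nginx_http_errors:ratio
        expr: |
          sum by (job) (rate(nginx_http_requests_total{status=~"5.."}[5m]))
            / sum by (job) (rate(nginx_http_requests_total[5m]))

      # Latency: P99 upstream response time in seconds.
      - record: job:nginx_upstream_response_time_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum by (job, le) (rate(nginx_upstream_response_time_seconds_bucket[5m])))
```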

Dashboard Templates#

Panel Layout by Service Type#

| Service | Row 1 (Summary) | Row 2 (Detail) | Row 3 (Trend) |
| --- | --- | --- | --- |
| Web API | P99, RPS, Error rate | Status code distribution, By endpoint | Time series |
| Kafka | Lag, Throughput, Partitions | By consumer, By topic | Time series |
| DB | QPS, Connections, Slow queries | By query, By table | Time series |
| Redis | Commands, Hit rate, Memory | By command, By keyspace | Time series |

Key Summary#

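| Service | Top Priority | 2nd Priority | 3rd Priority |
| --- | --- | --- | --- |
| Web API | Latency | Errors | Traffic |
| Kafka | Traffic (Lag) | Saturation | Errors |
| DB | Saturation | Latency | Traffic |
| Redis | Saturation | Latency | Traffic |
| Batch | Errors | Traffic | Latency |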

Next Steps#

| Recommended Order | Document | What You’ll Learn |
| --- | --- | --- |
| 1 | Environment Setup | Practice environment |
| 2 | Kafka Monitoring | Kafka details |