Target Audience: Developers and SREs setting up monitoring alerts
Prerequisites: Recording Rules
What You’ll Learn: Write rules that reduce false positives and alert only on real issues

TL;DR#

Key Summary:

  • for: Fires when condition is met for specified duration (prevents false positives)
  • labels: Add metadata like severity, team
  • annotations: Include alert message, runbook URL
  • Reference Recording Rule results to keep alert expressions concise

Basic Syntax#

Alerting Rule Structure#

groups:
  - name: <group_name>
    rules:
      - alert: <alert_name>
        expr: <PromQL_condition>
        for: <duration>
        labels:
          <label_name>: <value>
        annotations:
          <annotation_name>: <value>

Basic Example#

groups:
  - name: availability
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is down"
          description: "{{ $labels.job }} has been down for more than 5 minutes."
          runbook_url: "https://wiki.example.com/runbook/service-down"

Core Components#

expr (Condition)#

The condition that triggers the alert.

# Target down
expr: up == 0

# Error rate exceeds 5%
expr: |
  sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
  / sum by (service) (rate(http_requests_total[5m]))
  > 0.05

# P99 response time exceeds 500ms
expr: |
  histogram_quantile(0.99,
    sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
  ) > 0.5

# Using Recording Rule result (recommended)
expr: service:http_requests_errors:ratio_rate5m > 0.05
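The last expression references a precomputed Recording Rule. As a sketch, the rule behind `service:http_requests_errors:ratio_rate5m` might be defined like this (the definition below is an assumption inferred from the raw error-rate expression earlier in this section):

```yaml
# Hypothetical recording rule backing service:http_requests_errors:ratio_rate5m
# (sketch -- inferred from the raw expression above, not taken from this document)
groups:
  - name: recording
    rules:
      - record: service:http_requests_errors:ratio_rate5m
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          / sum by (service) (rate(http_requests_total[5m]))
```

Evaluating the ratio once in a Recording Rule keeps every alert that uses it cheap and consistent.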

for (Duration)#

The alert fires only after the condition has held continuously for the specified duration, which filters out temporary spikes that would otherwise cause false positives.

# Fire after 5 minutes of sustained condition
for: 5m

# Fire immediately (no for)
# Caution: Risk of false positives

graph LR
    subgraph "for: 5m"
        P["Pending<br>(condition met)"]
        F["Firing<br>(5 min elapsed)"]
    end

    P --> |"sustained 5 min"| F
    P --> |"condition cleared"| R["Resolved<br>(alert canceled)"]

Recommended Values:

| Situation | for Value | Reason |
|---|---|---|
| Service down | 1-5m | Quick detection needed |
| Error rate increase | 5-10m | Filter temporary spikes |
| Resource shortage | 10-15m | Wait for auto-recovery |
| Disk shortage | 30m-1h | Increases slowly |

labels (Labels)#

Add metadata to alerts.

labels:
  severity: critical          # Severity
  team: platform              # Responsible team
  service: "{{ $labels.service }}"  # Dynamic label

Severity Levels:

| Level | Description | Response |
|---|---|---|
| critical | Service outage | Immediate response (call) |
| warning | Performance degradation | Response during business hours |
| info | Reference info | Record only |
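The `severity` label is typically what Alertmanager routes on. A minimal routing sketch (the receiver names `slack` and `pagerduty` are hypothetical, for illustration only):

```yaml
# alertmanager.yml (sketch) -- route alerts by the severity label
route:
  receiver: slack              # default: non-critical alerts go to chat
  routes:
    - matchers:
        - severity = "critical"
      receiver: pagerduty      # critical alerts page the on-call engineer
```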

annotations (Annotations)#

Provide detailed alert description.

annotations:
  summary: "High error rate on {{ $labels.service }}"
  description: |
    Error rate is {{ $value | humanizePercentage }}.
    Current threshold: 5%
  runbook_url: "https://wiki.example.com/runbook/high-error-rate"
  dashboard_url: "https://grafana.example.com/d/abc/errors?var-service={{ $labels.service }}"

Template Variables:

| Variable | Description |
|---|---|
| `{{ $labels }}` | All alert labels |
| `{{ $labels.<name> }}` | Specific label value |
| `{{ $value }}` | Expression result value |
| `{{ $value \| humanize }}` | Human-readable number (e.g. 1500000 → 1.5M) |
| `{{ $value \| humanizePercentage }}` | Format as percentage (e.g. 0.05 → 5%) |
| `{{ $value \| humanizeDuration }}` | Format as duration (e.g. 0.5 → 500ms) |

Practical Alert Rules#

Availability Alerts#

groups:
  - name: availability
    rules:
      # Target down
      - alert: TargetDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Target {{ $labels.instance }} is down"
          description: "{{ $labels.job }} target {{ $labels.instance }} has been down for more than 5 minutes."

      # Service availability degradation
      - alert: ServiceAvailabilityLow
        expr: |
          sum by (service) (rate(http_requests_total{status!~"5.."}[5m]))
          / sum by (service) (rate(http_requests_total[5m]))
          < 0.99
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.service }} availability below 99%"
          description: "Current availability: {{ $value | humanizePercentage }}"

Error Rate Alerts#

groups:
  - name: errors
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          / sum by (service) (rate(http_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }}"

      # Critical error rate
      - alert: CriticalErrorRate
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          / sum by (service) (rate(http_requests_total[5m]))
          > 0.10
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Critical error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }}. Immediate action required."

Latency Alerts#

groups:
  - name: latency
    rules:
      # High P99 latency
      - alert: HighP99Latency
        expr: |
          histogram_quantile(0.99,
            sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High P99 latency on {{ $labels.service }}"
          description: "P99 latency is {{ $value | humanizeDuration }}"

      # Using Recording Rule
      - alert: HighP99LatencyFromRule
        expr: service:http_request_duration_seconds:p99 > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High P99 latency on {{ $labels.service }}"

Resource Alerts#

groups:
  - name: resources
    rules:
      # High CPU
      - alert: HighCPUUsage
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | humanize }}%"

      # Low memory
      - alert: LowMemory
        expr: |
          (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low memory on {{ $labels.instance }}"
          description: "Available memory is {{ $value | humanizePercentage }}"

      # Low disk space
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.1
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Available disk space is {{ $value | humanizePercentage }}"
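Because disk usage grows slowly, an alternative to a static threshold is predicting exhaustion with `predict_linear`. A sketch (the 6h lookback and 4h prediction horizon are illustrative choices, not recommendations from this document):

```yaml
# Fire if the linear trend over the last 6h predicts the root filesystem
# will run out of available space within 4 hours (4 * 3600 seconds)
- alert: DiskWillFillSoon
  expr: |
    predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 4 * 3600) < 0
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Disk on {{ $labels.instance }} predicted to fill within 4 hours"
```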

Kafka Alerts#

groups:
  - name: kafka
    rules:
      # High Consumer Lag
      - alert: KafkaConsumerLagHigh
        expr: |
          sum by (consumer_group, topic) (kafka_consumer_group_lag) > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High consumer lag for {{ $labels.consumer_group }}"
          description: "Lag is {{ $value | humanize }} messages on topic {{ $labels.topic }}"

      # Under-replicated partitions
      - alert: KafkaUnderReplicatedPartitions
        expr: kafka_server_replicamanager_underreplicatedpartitions > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Kafka under-replicated partitions detected"

Preventing Alert Fatigue#

1. Appropriate Thresholds#

# ❌ Too sensitive
expr: error_rate > 0.001  # 0.1%

# ✅ Meaningful threshold
expr: error_rate > 0.01   # 1%

2. Sufficient for Time#

# ❌ Fires on temporary spikes
for: 30s

# ✅ Detects sustained issues only
for: 5m

3. Tiered Alerts#

# Warning: 5% error, 5 min sustained
- alert: HighErrorRate
  expr: error_rate > 0.05
  for: 5m
  labels:
    severity: warning

# Critical: 10% error, 2 min sustained (faster)
- alert: CriticalErrorRate
  expr: error_rate > 0.10
  for: 2m
  labels:
    severity: critical

4. Grouping in Alertmanager#

# alertmanager.yml
route:
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
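Grouping pairs well with inhibition: suppress lower-severity alerts while a related critical alert is already firing, so one incident does not page twice. A sketch:

```yaml
# alertmanager.yml (sketch) -- mute warnings while a matching critical fires
inhibit_rules:
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    equal: ['alertname', 'service']  # only inhibit within the same alert and service
```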

Rule Validation#

Syntax Check#

promtool check rules alerts/*.yml

Unit Tests#

# tests/alert_tests.yml
rule_files:
  - ../alerts/errors.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'http_requests_total{service="api", status="500"}'
        values: '0+10x10'
      - series: 'http_requests_total{service="api", status="200"}'
        values: '0+100x10'

    alert_rule_test:
      - eval_time: 10m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              service: api
              severity: warning
            exp_annotations:
              summary: "High error rate on api"

promtool test rules tests/alert_tests.yml

Alert States#

stateDiagram-v2
    [*] --> Inactive: condition not met
    Inactive --> Pending: condition met
    Pending --> Inactive: condition cleared
    Pending --> Firing: for time elapsed
    Firing --> Inactive: condition cleared (Resolved)

| State | Description |
|---|---|
| Inactive | Condition not met |
| Pending | Condition met, waiting for `for` duration |
| Firing | Alert fired |
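Prometheus exposes these states through its built-in `ALERTS` metric, so pending and firing alerts can be inspected with ordinary queries:

```promql
# All alerts currently firing
ALERTS{alertstate="firing"}

# Pending alerts for a specific rule
ALERTS{alertname="HighErrorRate", alertstate="pending"}
```

This is useful for debugging why a rule sits in Pending without ever reaching Firing.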

Key Takeaways#

| Component | Role | Example |
|---|---|---|
| expr | Trigger condition | `error_rate > 0.05` |
| for | Prevent false positives | `5m` |
| labels | Metadata | `severity: critical` |
| annotations | Alert content | `summary`, `runbook_url` |

Good Alert Criteria:

  1. Only situations requiring immediate action
  2. Clear severity classification
  3. Detailed context (runbook, dashboard)
  4. Appropriate for time to prevent false positives

Next Steps#

| Recommended Order | Document | What You’ll Learn |
|---|---|---|
| 1 | SRE Golden Signals | Selecting metrics to alert on |
| 2 | Alert Action Guide | Response after receiving alerts |