Target Audience: Developers and SREs setting up monitoring alerts
Prerequisites: Recording Rules
What You’ll Learn: Write rules that reduce false positives and alert only on real issues

TL;DR#

Key Summary:

  • for: Fires when condition is met for specified duration (prevents false positives)
  • labels: Add metadata like severity, team
  • annotations: Include alert message, runbook URL
  • Reference Recording Rule results to keep alert expressions concise

Basic Syntax#

Alerting Rule Structure#

groups:
  - name: <group_name>
    rules:
      - alert: <alert_name>
        expr: <PromQL_condition>
        for: <duration>
        labels:
          <label_name>: <value>
        annotations:
          <annotation_name>: <value>

Basic Example#

groups:
  - name: availability
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is down"
          description: "{{ $labels.job }} has been down for more than 5 minutes."
          runbook_url: "https://wiki.example.com/runbook/service-down"

Core Components#

expr (Condition)#

The condition that triggers the alert.

# Target down
expr: up == 0

# Error rate exceeds 5%
expr: |
  sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
  / sum by (service) (rate(http_requests_total[5m]))
  > 0.05

# P99 response time exceeds 500ms
expr: |
  histogram_quantile(0.99,
    sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
  ) > 0.5

# Using Recording Rule result (recommended)
expr: service:http_requests_errors:ratio_rate5m > 0.05
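The last expression references a precomputed Recording Rule. As a sketch, the rule behind `service:http_requests_errors:ratio_rate5m` might be defined like this (the definition below is an assumption inferred from the raw error-rate expression earlier in this section):

```yaml
# Hypothetical recording rule backing service:http_requests_errors:ratio_rate5m
# (sketch -- inferred from the raw expression above, not taken from this document)
groups:
  - name: recording
    rules:
      - record: service:http_requests_errors:ratio_rate5m
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          / sum by (service) (rate(http_requests_total[5m]))
```

Evaluating the ratio once in a Recording Rule keeps every alert that uses it cheap and consistent.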

for (Duration)#

The alert fires only after the condition has held continuously for the specified duration, which filters out temporary spikes that would otherwise cause false positives.

# Fire after 5 minutes of sustained condition
for: 5m

# Fire immediately (no for)
# Caution: Risk of false positives

graph LR
    subgraph "for: 5m"
        P["Pending<br>(condition met)"]
        F["Firing<br>(5 min elapsed)"]
    end

    P --> |"sustained 5 min"| F
    P --> |"condition cleared"| R["Resolved<br>(alert canceled)"]

Recommended Values:

| Situation | for Value | Reason |
|---|---|---|
| Service down | 1-5m | Quick detection needed |
| Error rate increase | 5-10m | Filter temporary spikes |
| Resource shortage | 10-15m | Wait for auto-recovery |
| Disk shortage | 30m-1h | Increases slowly |

labels (Labels)#

Add metadata to alerts.

labels:
  severity: critical          # Severity
  team: platform              # Responsible team
  service: "{{ $labels.service }}"  # Dynamic label

Severity Levels:

| Level | Description | Response |
|---|---|---|
| critical | Service outage | Immediate response (call) |
| warning | Performance degradation | Response during business hours |
| info | Reference info | Record only |
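The `severity` label is typically what Alertmanager routes on. A minimal routing sketch (the receiver names `slack` and `pagerduty` are hypothetical, for illustration only):

```yaml
# alertmanager.yml (sketch) -- route alerts by the severity label
route:
  receiver: slack              # default: non-critical alerts go to chat
  routes:
    - matchers:
        - severity = "critical"
      receiver: pagerduty      # critical alerts page the on-call engineer
```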

annotations (Annotations)#

Provide detailed alert description.

annotations:
  summary: "High error rate on {{ $labels.service }}"
  description: |
    Error rate is {{ $value | humanizePercentage }}.
    Current threshold: 5%
  runbook_url: "https://wiki.example.com/runbook/high-error-rate"
  dashboard_url: "https://grafana.example.com/d/abc/errors?var-service={{ $labels.service }}"

Template Variables:

| Variable | Description |
|---|---|
| `{{ $labels }}` | All alert labels |
| `{{ $labels.<name> }}` | Specific label value |
| `{{ $value }}` | Expression result value |
| `{{ $value \| humanize }}` | Human-readable number (e.g. 1500000 → 1.5M) |
| `{{ $value \| humanizePercentage }}` | Format as percentage (e.g. 0.05 → 5%) |
| `{{ $value \| humanizeDuration }}` | Format as duration (e.g. 0.5 → 500ms) |

Practical Alert Rules#

Availability Alerts#

groups:
  - name: availability
    rules:
      # Target down
      - alert: TargetDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Target {{ $labels.instance }} is down"
          description: "{{ $labels.job }} target {{ $labels.instance }} has been down for more than 5 minutes."

      # Service availability degradation
      - alert: ServiceAvailabilityLow
        expr: |
          sum by (service) (rate(http_requests_total{status!~"5.."}[5m]))
          / sum by (service) (rate(http_requests_total[5m]))
          < 0.99
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.service }} availability below 99%"
          description: "Current availability: {{ $value | humanizePercentage }}"

Error Rate Alerts#

groups:
  - name: errors
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          / sum by (service) (rate(http_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }}"

      # Critical error rate
      - alert: CriticalErrorRate
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          / sum by (service) (rate(http_requests_total[5m]))
          > 0.10
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Critical error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }}. Immediate action required."

Latency Alerts#

groups:
  - name: latency
    rules:
      # High P99 latency
      - alert: HighP99Latency
        expr: |
          histogram_quantile(0.99,
            sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High P99 latency on {{ $labels.service }}"
          description: "P99 latency is {{ $value | humanizeDuration }}"

      # Using Recording Rule
      - alert: HighP99LatencyFromRule
        expr: service:http_request_duration_seconds:p99 > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High P99 latency on {{ $labels.service }}"

Resource Alerts#

groups:
  - name: resources
    rules:
      # High CPU
      - alert: HighCPUUsage
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | humanize }}%"

      # Low memory
      - alert: LowMemory
        expr: |
          (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low memory on {{ $labels.instance }}"
          description: "Available memory is {{ $value | humanizePercentage }}"

      # Low disk space
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.1
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Available disk space is {{ $value | humanizePercentage }}"
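Because disk usage grows slowly, an alternative to a static threshold is predicting exhaustion with `predict_linear`. A sketch (the 6h lookback and 4h prediction horizon are illustrative choices, not recommendations from this document):

```yaml
# Fire if the linear trend over the last 6h predicts the root filesystem
# will run out of available space within 4 hours (4 * 3600 seconds)
- alert: DiskWillFillSoon
  expr: |
    predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 4 * 3600) < 0
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Disk on {{ $labels.instance }} predicted to fill within 4 hours"
```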

Kafka Alerts#

groups:
  - name: kafka
    rules:
      # High Consumer Lag
      - alert: KafkaConsumerLagHigh
        expr: |
          sum by (consumer_group, topic) (kafka_consumer_group_lag) > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High consumer lag for {{ $labels.consumer_group }}"
          description: "Lag is {{ $value | humanize }} messages on topic {{ $labels.topic }}"

      # Under-replicated partitions
      - alert: KafkaUnderReplicatedPartitions
        expr: kafka_server_replicamanager_underreplicatedpartitions > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Kafka under-replicated partitions detected"

Preventing Alert Fatigue#

1. Appropriate Thresholds#

# ❌ Too sensitive
expr: error_rate > 0.001  # 0.1%

# ✅ Meaningful threshold
expr: error_rate > 0.01   # 1%

2. Sufficient for Time#

# ❌ Fires on temporary spikes
for: 30s

# ✅ Detects sustained issues only
for: 5m

3. Tiered Alerts#

# Warning: 5% error, 5 min sustained
- alert: HighErrorRate
  expr: error_rate > 0.05
  for: 5m
  labels:
    severity: warning

# Critical: 10% error, 2 min sustained (faster)
- alert: CriticalErrorRate
  expr: error_rate > 0.10
  for: 2m
  labels:
    severity: critical

4. Grouping in Alertmanager#

# alertmanager.yml
route:
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
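Grouping pairs well with inhibition: suppress lower-severity alerts while a related critical alert is already firing, so one incident does not page twice. A sketch:

```yaml
# alertmanager.yml (sketch) -- mute warnings while a matching critical fires
inhibit_rules:
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    equal: ['alertname', 'service']  # only inhibit within the same alert and service
```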

Rule Validation#

Syntax Check#

promtool check rules alerts/*.yml

Unit Tests#

# tests/alert_tests.yml
rule_files:
  - ../alerts/errors.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'http_requests_total{service="api", status="500"}'
        values: '0+10x10'
      - series: 'http_requests_total{service="api", status="200"}'
        values: '0+100x10'

    alert_rule_test:
      - eval_time: 10m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              service: api
              severity: warning
            exp_annotations:
              summary: "High error rate on api"

promtool test rules tests/alert_tests.yml

Alert States#

stateDiagram-v2
    [*] --> Inactive: condition not met
    Inactive --> Pending: condition met
    Pending --> Inactive: condition cleared
    Pending --> Firing: for time elapsed
    Firing --> Inactive: condition cleared (Resolved)

| State | Description |
|---|---|
| Inactive | Condition not met |
| Pending | Condition met, waiting for `for` duration |
| Firing | Alert fired |
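Prometheus exposes these states through its built-in `ALERTS` metric, so pending and firing alerts can be inspected with ordinary queries:

```promql
# All alerts currently firing
ALERTS{alertstate="firing"}

# Pending alerts for a specific rule
ALERTS{alertname="HighErrorRate", alertstate="pending"}
```

This is useful for debugging why a rule sits in Pending without ever reaching Firing.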

Key Takeaways#

| Component | Role | Example |
|---|---|---|
| expr | Trigger condition | `error_rate > 0.05` |
| for | Prevent false positives | `5m` |
| labels | Metadata | `severity: critical` |
| annotations | Alert content | `summary`, `runbook_url` |

Good Alert Criteria:

  1. Only situations requiring immediate action
  2. Clear severity classification
  3. Detailed context (runbook, dashboard)
  4. Appropriate for time to prevent false positives

Next Steps#

| Recommended Order | Document | What You’ll Learn |
|---|---|---|
| 1 | SRE Golden Signals | Selecting metrics to alert on |
| 2 | Alert Action Guide | Response after receiving alerts |