Target Audience: Developers and SREs designing Grafana dashboards
Prerequisites: SRE Golden Signals
After Reading: You'll be able to design effective dashboards and quickly identify problems

TL;DR#

Key Summary:

  • Hierarchical Structure: Drill down from Overview → Service → Detail
  • 5-Second Rule: A viewer should be able to tell within 5 seconds whether there is a problem
  • Golden Signals First: Latency, Traffic, Errors, Saturation
  • Remove Unnecessary Info: Exclude metrics that don’t lead to action

Dashboard Design Principles#

1. 5-Second Rule#

Within 5 seconds of viewing the dashboard, you should know:

  • Is there a problem?
  • Where is the problem?
  • How severe is it?

2. Hierarchical Structure#

graph TD
    L1["Level 1: Overview<br>Overall System Status"]
    L2["Level 2: Service<br>Individual Service Status"]
    L3["Level 3: Detail<br>Detailed Metrics"]

    L1 --> |"Problem Detection"| L2
    L2 --> |"Root Cause Analysis"| L3

3. Color Rules#

| Color | Meaning | Usage |
|---|---|---|
| 🟢 Green | Normal | Below threshold |
| 🟡 Yellow | Warning | Warning threshold |
| 🔴 Red | Critical | Critical threshold |
| ⚪ Gray | No data | N/A |
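
The color rules above can be read as a small classification step followed by a "worst signal wins" rollup, which is one way to fill a per-service Status column. The following is a minimal sketch, not Grafana's own logic: the `Status` enum, the `classify` and `rollup` helpers, and every threshold value are hypothetical and chosen only for illustration.

```python
from enum import IntEnum


class Status(IntEnum):
    """Colors from the table, ordered by severity."""
    GRAY = 0    # no data
    GREEN = 1   # normal
    YELLOW = 2  # warning
    RED = 3     # critical


def classify(value, warning, critical):
    """Map one reading to a color; None means the metric reported no data."""
    if value is None:
        return Status.GRAY
    if value >= critical:
        return Status.RED
    if value >= warning:
        return Status.YELLOW
    return Status.GREEN


def rollup(statuses):
    """A service is as unhealthy as its most severe signal."""
    return max(statuses, default=Status.GRAY)


# Hypothetical thresholds: error rate in percent, P99 latency in milliseconds.
payment = rollup([
    classify(0.5, warning=0.1, critical=1.0),   # error rate
    classify(250, warning=200, critical=500),   # P99
])
print(payment.name)
```

Ranking gray below green is a design choice of this sketch: a missing metric does not hide a warning from another signal, but it also does not raise one. Teams that treat missing data as a problem would rank it higher.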

Level 1: Overview Dashboard#

Purpose#

Grasp overall system status at a glance

Layout#

┌─────────────────────────────────────────────────────────────┐
│ Row 1: Key Metrics (Stat Panels)                             │
│ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐      │
│ │ P99    │ │ RPS    │ │ Error  │ │ CPU    │ │ Memory │      │
│ │ 120ms  │ │ 5,234  │ │ 0.1%   │ │ 45%    │ │ 62%    │      │
│ └────────┘ └────────┘ └────────┘ └────────┘ └────────┘      │
├─────────────────────────────────────────────────────────────┤
│ Row 2: Per-Service Status (Table/Heatmap)                    │
│ ┌───────────────────────────────────────────────────────┐   │
│ │ Service    │ RPS   │ P99    │ Errors │ Status        │   │
│ │ order      │ 1,234 │ 100ms  │ 0.05%  │ 🟢            │   │
│ │ payment    │ 890   │ 250ms  │ 0.5%   │ 🟡            │   │
│ │ inventory  │ 2,100 │ 50ms   │ 0.01%  │ 🟢            │   │
│ └───────────────────────────────────────────────────────┘   │
├─────────────────────────────────────────────────────────────┤
│ Row 3: Trends (Time Series)                                  │
│ ┌───────────────────────────────────────────────────────┐   │
│ │ 📈 RPS / Error Rate Trends (24 hours)                 │   │
│ └───────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘

Grafana Query Examples#

# Stat: P99 Response Time
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Stat: Total RPS
sum(rate(http_requests_total[5m]))

# Stat: Error Rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100

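# Table: Per-Service Status
# Run each query as Instant with Format = Table, then combine the results
# on the service label with the Merge transformation.
# Assumes every metric carries the service label.
sum by (service) (rate(http_requests_total[5m]))
histogram_quantile(0.99, sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
/ sum by (service) (rate(http_requests_total[5m])) * 100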
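
The P99 query above relies on `histogram_quantile`, which estimates the percentile from cumulative buckets rather than from raw observations. The sketch below shows the idea in simplified form, assuming sorted buckets that end with `+Inf` and a quantile `q` in (0, 1]. The bucket values are made up, and the real function handles more edge cases than this.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Simplified: buckets must be sorted by upper bound and end with +Inf.
    """
    total = buckets[-1][1]  # the +Inf bucket counts every observation
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # The quantile lies beyond the last finite bucket: report that bound.
        return buckets[-2][0]
    lower, below = (0.0, 0.0) if i == 0 else buckets[i - 1]
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical latency buckets in seconds: (le, cumulative count).
buckets = [(0.1, 900), (0.25, 980), (0.5, 995), (math.inf, 1000)]
p99 = histogram_quantile(0.99, buckets)
print(f"P99 is roughly {p99:.3f}s")
```

Because the result is interpolated inside one bucket, the P99 shown on a Stat panel is only as precise as the bucket boundaries around it. Wide buckets near the latency target make the panel look more exact than the data allows.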

Level 2: Service Dashboard#

Purpose#

Check detailed status of individual services

Layout (RED Pattern)#

┌─────────────────────────────────────────────────────────────┐
│ Variable: $service = order-service                           │
├─────────────────────────────────────────────────────────────┤
│ Row 1: Rate (Traffic)                                        │
│ ┌──────────────────────┐ ┌──────────────────────┐           │
│ │ RPS Trend            │ │ RPS by Endpoint      │           │
│ └──────────────────────┘ └──────────────────────┘           │
├─────────────────────────────────────────────────────────────┤
│ Row 2: Errors                                                │
│ ┌──────────────────────┐ ┌──────────────────────┐           │
│ │ Error Rate Trend     │ │ Status Code Dist.    │           │
│ └──────────────────────┘ └──────────────────────┘           │
├─────────────────────────────────────────────────────────────┤
│ Row 3: Duration (Latency)                                    │
│ ┌──────────────────────┐ ┌──────────────────────┐           │
│ │ P50/P95/P99 Trend    │ │ Response Time Heatmap│           │
│ └──────────────────────┘ └──────────────────────┘           │
├─────────────────────────────────────────────────────────────┤
│ Row 4: Resources                                             │
│ ┌──────────────────────┐ ┌──────────────────────┐           │
│ │ CPU Usage            │ │ Memory Usage         │           │
│ └──────────────────────┘ └──────────────────────┘           │
└─────────────────────────────────────────────────────────────┘

Variable Configuration#

# Grafana Variables
- name: service
  type: query
  query: label_values(http_requests_total, service)
  refresh: on_time_range_change

- name: instance
  type: query
  query: label_values(http_requests_total{service="$service"}, instance)
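
To show how the `$service` variable turns one set of panels into a dashboard for any service, the sketch below renders the RED queries for a chosen value. It uses Python's `string.Template` only because it happens to share the `$name` placeholder syntax. Grafana's own interpolation also handles multi-value variables and escaping, which this does not attempt. Scoping the Overview queries with a `service` label is an assumption here, and `RED_QUERIES` and `render` are hypothetical names.

```python
from string import Template

# Query templates for the Rate, Errors and Duration rows, written with the
# same $service placeholder a panel query would use.
RED_QUERIES = {
    "rate": Template('sum(rate(http_requests_total{service="$service"}[5m]))'),
    "errors": Template(
        'sum(rate(http_requests_total{service="$service",status=~"5.."}[5m]))'
        ' / sum(rate(http_requests_total{service="$service"}[5m])) * 100'
    ),
    "duration": Template(
        "histogram_quantile(0.99, sum by (le) "
        '(rate(http_request_duration_seconds_bucket{service="$service"}[5m])))'
    ),
}


def render(service):
    """Substitute one selected service into every panel query."""
    return {name: tpl.substitute(service=service) for name, tpl in RED_QUERIES.items()}


for name, query in render("order-service").items():
    print(f"{name}: {query}")
```

Keeping the service name out of the query text is what lets a single Level 2 dashboard serve every service, instead of one hardcoded copy per service.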

Level 3: Detail Dashboard#

Purpose#

Deep analysis for root cause identification

Layout Example (Database)#

┌─────────────────────────────────────────────────────────────┐
│ Row 1: Connection Pool                                       │
│ ┌──────────────────────┐ ┌──────────────────────┐           │
│ │ Active Connections   │ │ Pending Connections  │           │
│ └──────────────────────┘ └──────────────────────┘           │
├─────────────────────────────────────────────────────────────┤
│ Row 2: Query Performance                                     │
│ ┌──────────────────────┐ ┌──────────────────────┐           │
│ │ Slow Query Count     │ │ Query Execution Time │           │
│ └──────────────────────┘ └──────────────────────┘           │
├─────────────────────────────────────────────────────────────┤
│ Row 3: Resource Usage                                        │
│ ┌──────────────────────┐ ┌──────────────────────┐           │
│ │ Buffer Hit Rate      │ │ Disk I/O             │           │
│ └──────────────────────┘ └──────────────────────┘           │
└─────────────────────────────────────────────────────────────┘

Panel Type Selection#

| Panel Type | Suitable Data | Examples |
|---|---|---|
| Stat | Single value, current state | RPS, Error rate |
| Gauge | Percentage, threshold | CPU usage |
| Time Series | Trends over time | Request count trends |
| Bar Chart | Comparison, ranking | Traffic by endpoint |
| Heatmap | Distribution, density | Response time distribution |
| Table | Detailed lists | Service list |
| Pie Chart | Proportions | Status code distribution |

Threshold Configuration#

Stat Panel#

{
  "thresholds": {
    "mode": "absolute",
    "steps": [
      { "color": "green", "value": null },
      { "color": "yellow", "value": 80 },
      { "color": "red", "value": 90 }
    ]
  }
}
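
In absolute mode, each step sets the lower bound from which its color applies, and the first step with `"value": null` acts as the base color. The sketch below illustrates that reading of the configuration above. It assumes the steps are sorted in ascending order, and `resolve_color` is a hypothetical helper, not a Grafana API.

```python
import json

# The Stat panel threshold configuration from above, parsed from JSON.
CONFIG = json.loads("""
{
  "thresholds": {
    "mode": "absolute",
    "steps": [
      { "color": "green", "value": null },
      { "color": "yellow", "value": 80 },
      { "color": "red", "value": 90 }
    ]
  }
}
""")


def resolve_color(steps, value):
    """Return the color of the last step whose lower bound is <= value."""
    color = steps[0]["color"]  # base step: JSON null becomes None
    for step in steps[1:]:
        if value >= step["value"]:
            color = step["color"]
    return color


steps = CONFIG["thresholds"]["steps"]
for reading in (45, 80, 95):
    print(reading, resolve_color(steps, reading))
```

With these steps, a CPU reading of 45 stays green, 80 turns yellow, and anything from 90 upward is red, which matches the color rules used across the dashboards.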

Time Series Threshold Display#

{
  "fieldConfig": {
    "defaults": {
      "custom": {
        "thresholdsStyle": {
          "mode": "line"
        }
      },
      "thresholds": {
        "steps": [
          { "color": "green", "value": null },
          { "color": "red", "value": 500 }
        ]
      }
    }
  }
}

Best Practices#

✅ Dashboards with clear purpose
✅ Hierarchical drill-down structure
✅ Consistent color rules
✅ Filtering with variables
✅ Add descriptions to panels
✅ Visualization connected to alerts
❌ Everything in one dashboard
❌ Listing metrics without action
❌ Meaningless colors
❌ Hardcoded values
❌ Complex queries without explanation

Dashboard Sharing Configuration#

JSON Export#

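# Export dashboard (the response wraps the model in {"meta": ..., "dashboard": ...})
curl -sS -H "Authorization: Bearer $TOKEN" \
  "http://grafana:3000/api/dashboards/uid/abc123" \
  | jq '.dashboard' > dashboard.json

# Import dashboard: the endpoint expects the model under a "dashboard" key.
# Setting id to null lets the target instance assign its own numeric id.
jq '{dashboard: (. + {id: null}), overwrite: true}' dashboard.json \
  | curl -sS -X POST -H "Authorization: Bearer $TOKEN" \
      -H "Content-Type: application/json" \
      -d @- \
      "http://grafana:3000/api/dashboards/db"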

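The export call returns the dashboard model inside a response envelope, while the import endpoint expects that model wrapped under a `dashboard` key. A minimal sketch of the wrapping step follows. `to_import_payload` is a hypothetical helper and the sample dashboard is made up. Only the `dashboard` and `overwrite` fields of the request body are used.

```python
import copy
import json


def to_import_payload(dashboard, overwrite=True):
    """Wrap an exported dashboard model for POST /api/dashboards/db."""
    model = copy.deepcopy(dashboard)  # leave the caller's object untouched
    model["id"] = None  # numeric ids are instance-specific; uid stays the same
    return {"dashboard": model, "overwrite": overwrite}


# Made-up stand-in for the contents of dashboard.json.
exported = {"id": 42, "uid": "abc123", "title": "Overview", "panels": []}
payload = to_import_payload(exported)
print(json.dumps(payload, indent=2))
```

Keeping the `uid` while clearing the `id` means the same dashboard can be pushed to several Grafana instances and still be addressed by one stable identifier.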
Provisioning#

# grafana/provisioning/dashboards/default.yaml
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    folder: 'Observability'
    type: file
    options:
      path: /var/lib/grafana/dashboards

Key Summary#

| Principle | Description |
|---|---|
| 5-Second Rule | Quick problem identification |
| Hierarchy | Overview → Service → Detail |
| Golden Signals | Latency, Traffic, Errors, Saturation |
| Color Consistency | Green/Yellow/Red |
| Action Connection | Alerts, runbook links |

Next Steps#

| Recommended Order | Document | What You'll Learn |
|---|---|---|
| 1 | Environment Setup | Grafana hands-on |
| 2 | Spring Boot Example | Dashboard application |