What is Observability?

Observability is the ability to understand a system's internal state from its external outputs. It is not just about installing monitoring tools, but about building a system that can answer the question “Why did this problem occur?”

Why is Observability Necessary?

As microservices and distributed systems have become prevalent, the limitations of traditional monitoring have become clear.

| Limitations of Traditional Monitoring | After Adopting Observability |
| --- | --- |
| Only know “server CPU is high” | Can trace which requests are causing high CPU |
| Can only check pre-defined metrics | Can analyze unexpected problems with data |
| Difficult to understand service connections | Visualize entire flow with distributed tracing |
| Hours to trace root cause after failure | Immediately identify root cause with trace ID |
| Logs, metrics, traces are separated | Integrated analysis by connecting all three pillars |

The Three Pillars of Observability

graph LR
    subgraph "Three Pillars of Observability"
        M["Metrics<br>Numeric Data"]
        L["Logs<br>Event Records"]
        T["Traces<br>Request Flow"]
    end

    M --> |"Anomaly Detection"| A["Alert"]
    A --> |"Detail Check"| L
    L --> |"Flow Tracking"| T
    T --> |"Performance Analysis"| M

| Pillar | Role | Representative Tools | Examples |
| --- | --- | --- | --- |
| Metrics | Numeric-based state measurement | Prometheus, Micrometer | CPU 80%, Response time 200ms |
| Logs | Detailed event records | Loki, Elasticsearch | “Order creation failed: Out of stock” |
| Traces | Track entire request path | Jaeger, Tempo | Order → Payment → Shipping service flow |
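
To make the connection between the three pillars concrete, here is a minimal sketch in plain Python. It does not use Prometheus, Loki, or Tempo; the in-memory `metrics`, `logs`, and `spans` stores and the `handle_order` function are hypothetical stand-ins. The point is only that one request emits all three signals and that a shared trace ID lets you move from one signal to the next.

```python
import json
import time
import uuid
from collections import Counter

# Hypothetical in-memory stand-ins for a metrics store, a log store, and a trace store.
metrics = Counter()
logs = []
spans = []


def handle_order(order_id, in_stock):
    """Handle one request and emit all three signals tagged with one trace ID."""
    trace_id = uuid.uuid4().hex  # 32 hex characters, shared by every signal below
    start = time.perf_counter()
    status = "ok" if in_stock else "error"

    # Metric: a numeric aggregate that is cheap to store and alert on.
    metrics[("orders_total", status)] += 1

    # Log: a detailed event record that carries the trace ID.
    if not in_stock:
        logs.append(json.dumps({
            "level": "ERROR",
            "message": "Order creation failed: Out of stock",
            "order_id": order_id,
            "trace_id": trace_id,
        }))

    # Trace: one span describing what the request did and how long it took.
    spans.append({
        "trace_id": trace_id,
        "name": "create-order",
        "duration_s": time.perf_counter() - start,
        "status": status,
    })
    return trace_id


def spans_for_error_logs():
    """Return the spans whose trace ID appears in an error log."""
    error_ids = {json.loads(line)["trace_id"] for line in logs}
    return [span for span in spans if span["trace_id"] in error_ids]
```

After calling `handle_order("A-1", in_stock=True)` and `handle_order("A-2", in_stock=False)`, the error counter shows that something failed, the log says what failed, and `spans_for_error_logs()` returns the one span that belongs to the failed request.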

When Should You Adopt Observability?

Suitable cases:

  • Operating microservices architecture
  • When identifying failure causes takes too long
  • When SLA/SLO-based operations are needed
  • When you want to systematically analyze performance bottlenecks

May be overkill:

  • Single monolithic applications
  • Internal tools with very low traffic
  • Small teams with insufficient operational staff
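
For the SLA/SLO case listed above, it can help to see what an SLO means in numbers. The sketch below assumes a simple availability SLO expressed as a percentage; the function names and the 30-day default window are illustrative choices, not something this guide prescribes.

```python
def _budget_fraction(slo_percent):
    """Return the fraction of time or requests that an SLO allows to fail."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be between 0 and 100, exclusive")
    return (100 - slo_percent) / 100


def error_budget_minutes(slo_percent, window_days=30):
    """Return the allowed downtime in minutes for an availability SLO over a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * _budget_fraction(slo_percent)


def budget_remaining(slo_percent, total_requests, failed_requests):
    """Return the unspent share of a request-based error budget (negative if overspent)."""
    allowed_failures = total_requests * _budget_fraction(slo_percent)
    return 1 - failed_requests / allowed_failures
```

Under these assumptions, a 99.9% SLO over 30 days leaves roughly 43 minutes of downtime, and 250 failures out of 1,000,000 requests spends about a quarter of the budget. If tracking numbers like these is not yet useful to your team, that is a hint the system may fall into the "overkill" category.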

What This Guide Covers

Quick Start

Build a Prometheus + Grafana environment in 10 minutes and verify your first metrics.
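
As a preview of what "your first metrics" look like, the sketch below renders samples in the Prometheus text exposition format, which is the plain-text payload Prometheus pulls from a target over HTTP (conventionally at `/metrics`). The `render_metric` helper and the metric names are hypothetical, and label-value escaping is left out to keep the example short.

```python
def render_metric(name, metric_type, help_text, samples):
    """Render one metric family in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    Note: this sketch does not escape quotes, backslashes, or newlines in label values.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            # Sort label names so the output is stable between calls.
            pairs = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metrics: a counter with labels and a gauge without labels.
payload = render_metric(
    "http_requests_total", "counter", "Total HTTP requests.",
    [({"method": "GET", "status": "200"}, 3), ({"method": "GET", "status": "500"}, 1)],
) + render_metric("app_active_sessions", "gauge", "Sessions currently open.", [({}, 12)])
```

Printing `payload` gives a `# HELP` line, a `# TYPE` line, and one line per sample for each metric. In a real setup a client library such as Micrometer produces this text for you.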

Concepts

Explains not just how to use each tool, but why it was designed this way.

| Topic | What You’ll Learn |
| --- | --- |
| Three Pillars of Observability | Roles of Metrics, Logs, Traces and their interconnections |
| Metrics Fundamentals | Understanding Counter, Gauge, Histogram, Summary types |
| Prometheus Architecture | Pull model, time series DB, service discovery |
| PromQL | Query language from basics to advanced (7 documents) |
| SRE Golden Signals | Deep dive into Latency, Traffic, Errors, Saturation (6 documents) |
| Log Aggregation | Loki vs ELK comparison, log design patterns |
| Distributed Tracing | Span, Trace ID, Context Propagation |
| OpenTelemetry | Observability standards and integration methods |
| Dashboard Design | Effective visualization principles |
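
To give a feel for the Golden Signals topic in the table above, here is a rough sketch that computes all four signals for one time window from raw request records. The input shape, the use of CPU cores as the saturation measure, and the nearest-rank percentile are assumptions made for illustration. In practice these values would usually come from PromQL queries over histograms and counters rather than from raw lists.

```python
def golden_signals(requests, window_seconds, cpu_used_cores, cpu_limit_cores):
    """Summarize the four golden signals for one time window.

    requests is a list of (duration_seconds, http_status) tuples.
    """
    if not requests:
        raise ValueError("need at least one request")
    durations = sorted(duration for duration, _ in requests)
    count = len(durations)
    # Nearest-rank 95th percentile: ceil(0.95 * count), computed with integers.
    rank = (95 * count + 99) // 100
    errors = sum(1 for _, status in requests if status >= 500)
    return {
        "latency_p95_s": durations[rank - 1],            # Latency
        "traffic_rps": count / window_seconds,           # Traffic
        "error_ratio": errors / count,                   # Errors
        "saturation": cpu_used_cores / cpu_limit_cores,  # Saturation
    }
```

For example, 20 requests in a 10-second window, one of which returned a 500, give a traffic of 2 requests per second and an error ratio of 0.05.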

Examples

Hands-on practice with runnable code.

How-To Guides

Solutions to problems that commonly come up in practice.

Appendix

  • Glossary - Quick reference for Observability terms
  • FAQ - Frequently asked questions
  • Alerting Actions Guide - Response strategies after PromQL detection
  • References - Official documentation and additional learning resources

Prerequisites

  • Required: Basic Docker usage, HTTP/REST API understanding
  • Helpful: Spring Boot experience, Kubernetes basics, time series data concepts

Tech Stack

Tools used in this guide.

| Component | Tool | Version |
| --- | --- | --- |
| Metrics Collection | Prometheus | 2.50+ |
| Metrics Exposure | Micrometer | 1.12+ |
| Visualization | Grafana | 10.x |
| Log Collection | Loki | 2.9+ |
| Distributed Tracing | Tempo | 2.x |
| Standardization | OpenTelemetry | 1.x |

Suggested Learning Path

graph TD
    START["Start"] --> Q1{"Experience Level?"}

    Q1 --> |"Beginner"| P1["Quick Start<br>→ Three Pillars<br>→ Metrics Fundamentals<br>→ Spring Boot Example"]
    Q1 --> |"Intermediate"| P2["PromQL Advanced<br>→ Golden Signals<br>→ Full-Stack Example"]
    Q1 --> |"Operations"| P3["Distributed Tracing<br>→ Alerting Actions<br>→ Cardinality Optimization"]

    P1 --> NEXT["Next Steps"]
    P2 --> NEXT
    P3 --> NEXT

If you’re new:

Quick Start → Three Pillars → Metrics Fundamentals → Prometheus Architecture → Spring Boot Example

If you want to learn PromQL deeply:

PromQL Basics → Aggregation Operators → rate vs increase → histogram_quantile → Recording Rules → Alerting Rules

From an SRE/Operations perspective:

Golden Signals Overview → Latency/Errors/Saturation → Application by Service Type → Alerting Actions Guide

Each document can be read independently, but if you’re new, we recommend following the order above.
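
As a small preview of the "rate vs increase" step in the PromQL path, the sketch below shows the core idea behind both functions on a list of counter samples: sum the growth between consecutive samples, and treat any drop in value as a counter reset. This is a deliberate simplification. Prometheus's actual `rate()` and `increase()` also extrapolate to the edges of the query window, so real results can differ from what this code returns.

```python
def increase(samples):
    """Total counter growth across samples, treating any drop as a reset to zero.

    samples is a list of (timestamp_seconds, counter_value) pairs in time order.
    """
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so the new value is all growth.
        total += (curr - prev) if curr >= prev else curr
    return total


def rate(samples):
    """Average per-second growth between the first and last sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    return increase(samples) / elapsed


# Hypothetical scrape results taken every 15 seconds; the process restarts before t=45.
samples = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
```

With these samples, `increase(samples)` counts the growth on both sides of the restart, and `rate(samples)` divides that total by the 60 seconds between the first and last sample. The two differ only by the time factor, which is why the choice between them is mostly about the unit you want on a graph or in an alert.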