Target Audience: Developers and SREs designing log systems
Prerequisites: Three Pillars of Observability
After Reading: You’ll be able to select a log collection system and design effective logs

TL;DR#

Key Summary:

  • Loki: Label-based, lightweight, excellent Grafana integration
  • ELK: Powerful full-text search, suitable for large-scale analysis
  • Structured Logs: JSON format for easy field-by-field search
  • Log Levels: Long-term retention only for ERROR and above recommended

Why Is Log Aggregation Necessary?#

In microservices environments, applications run across dozens or hundreds of containers. If each container generates its own log file, which server’s log should you look at when an incident occurs?

Analogy: Book Management in a Large Library

Imagine a library with millions of books. If each book is randomly placed in different locations, finding the book you want is nearly impossible. But if all book locations are recorded in a central database, you can find any book instantly just by searching the title or author.

Log aggregation works the same way. By gathering logs from distributed servers in one place and making them searchable, you can find “the payment error that occurred at 3 PM yesterday” in seconds.

Advantages of Centralized Log Management#

| Problem Situation | Distributed Logs | Centralized Logs |
|---|---|---|
| Incident occurs | SSH into 20 servers one by one | Search from one screen |
| Log retention | Possible loss on server restart | Stored in permanent storage |
| Correlation analysis | Time sync difficult | Connect via trace_id |
| Access control | Per-server permission management | Unified permission management |
graph LR
    subgraph "Distributed Logs (Inefficient)"
        S1["Server 1<br>/var/log/app.log"]
        S2["Server 2<br>/var/log/app.log"]
        S3["Server 3<br>/var/log/app.log"]
        ADMIN["Admin"]
        ADMIN --> |"SSH"| S1
        ADMIN --> |"SSH"| S2
        ADMIN --> |"SSH"| S3
    end

graph LR
    subgraph "Centralized Logs (Efficient)"
        A1["Server 1"]
        A2["Server 2"]
        A3["Server 3"]
        CENTRAL["Log Collection System<br>(Loki/ELK)"]
        DASH["Unified Dashboard"]
        A1 --> CENTRAL
        A2 --> CENTRAL
        A3 --> CENTRAL
        CENTRAL --> DASH
    end
Core Principle: Logs should be searchable from one place. Incident response time is proportional to “time spent finding logs.”
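The trace_id correlation mentioned above can be sketched as a per-thread context, similar in spirit to SLF4J's MDC. This is a dependency-free illustration only; a real project would use MDC or OpenTelemetry context propagation rather than a hand-rolled class:

```java
import java.util.UUID;

// Minimal per-thread trace context, analogous in spirit to SLF4J's MDC.
// Illustrative sketch only; production code would use MDC or OpenTelemetry.
public class TraceContext {
    private static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    // Generate a new trace_id at the edge of a request.
    public static void start() {
        TRACE_ID.set(UUID.randomUUID().toString().replace("-", ""));
    }

    // Every log line on this thread can carry this same id,
    // letting the central store join entries across services.
    public static String current() {
        return TRACE_ID.get();
    }

    // Clear at request end to avoid leaking ids across pooled threads.
    public static void clear() {
        TRACE_ID.remove();
    }

    public static void main(String[] args) {
        start();
        System.out.println("trace_id=" + current());
        clear();
    }
}
```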

Loki vs ELK Comparison#

Architecture Comparison#

graph TB
    subgraph "Loki Stack"
        APP1["Application"] --> |"stdout"| PROM1["Promtail"]
        PROM1 --> |"push"| LOKI["Loki"]
        LOKI --> GF["Grafana"]
    end

    subgraph "ELK Stack"
        APP2["Application"] --> |"file/stdout"| FB["Filebeat"]
        FB --> LS["Logstash"]
        LS --> ES["Elasticsearch"]
        ES --> KI["Kibana"]
    end

Detailed Comparison#

| Item | Loki | ELK |
|---|---|---|
| Indexing | Labels only | Full-text indexing |
| Search | Label filter + grep | Full-text search (Lucene) |
| Storage Cost | Low (compressed raw) | High (index size) |
| Query Language | LogQL | KQL, Lucene |
| Installation Complexity | Low | High |
| Grafana Integration | Native | Plugin required |
| Alert Integration | Grafana alerts | Kibana alerts |

Selection Guide#

graph TD
    Q1{"Is full-text search<br>important?"}
    Q1 --> |"Yes"| ELK["ELK Stack"]
    Q1 --> |"No"| Q2{"Already using<br>Grafana?"}
    Q2 --> |"Yes"| LOKI["Loki"]
    Q2 --> |"No"| Q3{"Sufficient<br>ops staff?"}
    Q3 --> |"Yes"| ELK
    Q3 --> |"No"| LOKI
| Situation | Recommendation |
|---|---|
| Already using Grafana | Loki |
| Full-text search required | Elasticsearch |
| Low cost needed | Loki |
| Large-scale analysis | Elasticsearch |
| Quick setup | Loki |

Structured Log Design#

Unstructured vs Structured#

# ❌ Unstructured (hard to parse)
2026-01-12 10:30:00 ERROR OrderService - Failed to create order for user 123: insufficient stock

# ✅ Structured JSON
{
  "timestamp": "2026-01-12T10:30:00Z",
  "level": "ERROR",
  "service": "order-service",
  "message": "Failed to create order",
  "user_id": "123",
  "error": "insufficient stock",
  "trace_id": "abc123def456"
}
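In Java, a line like the one above would normally come from a JSON encoder such as logstash-logback-encoder; a dependency-free sketch of the same field layout (field names follow the example above) looks like this:

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

// Dependency-free sketch of emitting one structured log line.
// A real service would use a JSON log encoder, not hand-built strings
// (this version does no escaping of quotes inside values).
public class StructuredLog {
    public static String line(String level, String service,
                              String message, Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("{");
        sb.append("\"timestamp\":\"").append(Instant.now()).append("\",");
        sb.append("\"level\":\"").append(level).append("\",");
        sb.append("\"service\":\"").append(service).append("\",");
        sb.append("\"message\":\"").append(message).append("\"");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append(",\"").append(e.getKey()).append("\":\"")
              .append(e.getValue()).append("\"");
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        Map<String, String> extra = new LinkedHashMap<>();
        extra.put("user_id", "123");
        extra.put("trace_id", "abc123def456");
        System.out.println(line("ERROR", "order-service",
                                "Failed to create order", extra));
    }
}
```

The point is the shape, not the string building: every entry is a flat JSON object with a fixed set of core fields plus free-form context fields, which is what makes per-field search possible downstream.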

Required Fields#

| Field | Description | Example |
|---|---|---|
| timestamp | ISO 8601 format | 2026-01-12T10:30:00Z |
| level | Log level | INFO, ERROR |
| service | Service name | order-service |
| message | Log message | Order created |
| trace_id | Distributed trace ID | abc123def456 |

Recommended additional fields:

| Field | Purpose |
|---|---|
| user_id | Per-user filtering |
| request_id | Per-request tracking |
| duration_ms | Performance analysis |
| error_code | Error classification |
| stack_trace | Debugging |

Spring Boot Configuration#

# application.yml
logging:
  pattern:
    console: '{"timestamp":"%d{ISO8601}","level":"%level","service":"${spring.application.name}","message":"%message","logger":"%logger","thread":"%thread"}%n'

# Logback (logback-spring.xml)
<!-- logback-spring.xml -->
<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder class="net.logstash.logback.encoder.LogstashEncoder">
      <customFields>{"service":"order-service"}</customFields>
    </encoder>
  </appender>
</configuration>

Log Level Strategy#

Level Definitions#

| Level | Purpose | Retention Period |
|---|---|---|
| TRACE | Detailed debugging | Don’t collect |
| DEBUG | Development debugging | 1-3 days |
| INFO | Normal operation | 7-14 days |
| WARN | Potential issues | 30 days |
| ERROR | Errors occurred | 90+ days |
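The per-level retention above can be encoded as a simple lookup. The day counts mirror the table and are suggestions, not a standard; adjust to your compliance requirements:

```java
// Retention lookup mirroring the level table above.
// The exact day counts are suggestions, not a standard.
public class Retention {
    public static int days(String level) {
        switch (level) {
            case "TRACE": return 0;   // don't collect at all
            case "DEBUG": return 3;   // upper end of 1-3 days
            case "INFO":  return 14;  // upper end of 7-14 days
            case "WARN":  return 30;
            case "ERROR": return 90;  // "90+ days": minimum shown here
            default:
                throw new IllegalArgumentException("unknown level: " + level);
        }
    }
}
```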

Environment-Specific Settings#

# application.yml
spring:
  profiles:
    active: production

---
spring:
  config:
    activate:
      on-profile: development
logging:
  level:
    root: DEBUG
    com.example: TRACE

---
spring:
  config:
    activate:
      on-profile: production
logging:
  level:
    root: INFO
    com.example: INFO

Loki Configuration#

Promtail Configuration#

# promtail.yml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: containers
    static_configs:
      - targets:
          - localhost
        labels:
          job: containerlogs
          __path__: /var/log/containers/*.log

    pipeline_stages:
      - json:
          expressions:
            output: log
            stream: stream
            timestamp: time
      - labels:
          stream:
      - timestamp:
          source: timestamp
          format: RFC3339Nano
      - output:
          source: output

LogQL Queries#

# Filter by service
{service="order-service"}

# Filter lines containing "ERROR" (substring match, not a label filter)
{service="order-service"} |= "ERROR"

# JSON parsing
{service="order-service"} | json | level="ERROR"

# Regex
{service="order-service"} |~ "user_id=123"

# Error count aggregation
sum(count_over_time({service="order-service"} |= "ERROR" [5m]))

ELK Configuration#

Filebeat Configuration#

# filebeat.yml
filebeat.inputs:
  - type: container
    paths:
      - '/var/lib/docker/containers/*/*.log'
    processors:
      - add_kubernetes_metadata:
          host: ${NODE_NAME}
          matchers:
            - logs_path:
                logs_path: "/var/lib/docker/containers/"

output.logstash:
  hosts: ["logstash:5044"]

Logstash Pipeline#

# logstash.conf
input {
  beats {
    port => 5044
  }
}

filter {
  json {
    source => "message"
  }
  date {
    match => ["timestamp", "ISO8601"]
    target => "@timestamp"
  }
  mutate {
    remove_field => ["message"]
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "logs-%{[service]}-%{+YYYY.MM.dd}"
  }
}

Log Retention Policy#

Cost Optimization#

graph LR
    HOT["Hot<br>7 days<br>SSD"]
    WARM["Warm<br>30 days<br>HDD"]
    COLD["Cold<br>90 days<br>Object Storage"]
    DELETE["Delete"]

    HOT --> WARM --> COLD --> DELETE
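The lifecycle above amounts to an age-based tier decision. The day thresholds follow the diagram; the storage classes in the comments are typical examples, not requirements:

```java
// Age-based tier decision following the Hot/Warm/Cold diagram above.
// Thresholds are the diagram's example values, not a standard.
public class LogTier {
    public enum Tier { HOT, WARM, COLD, DELETE }

    public static Tier forAgeDays(long ageDays) {
        if (ageDays <= 7)  return Tier.HOT;   // e.g. SSD, fast queries
        if (ageDays <= 30) return Tier.WARM;  // e.g. HDD, slower queries
        if (ageDays <= 90) return Tier.COLD;  // e.g. object storage, archive
        return Tier.DELETE;                   // past retention window
    }
}
```

In practice the tiering is performed by the storage backend itself (Loki's compactor, Elasticsearch ILM, as configured below), not by application code; the function only makes the policy explicit.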

Loki Retention Settings#

# loki.yml
schema_config:
  configs:
    - from: 2026-01-01
      store: boltdb-shipper
      object_store: s3
      schema: v11
      index:
        prefix: loki_index_
        period: 24h

limits_config:
  retention_period: 720h  # 30 days

compactor:
  retention_enabled: true
  retention_delete_delay: 2h

Elasticsearch ILM#

{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50GB",
            "max_age": "7d"
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": {
            "number_of_shards": 1
          }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

Best Practices#

// ✅ Include structured context
log.info("Order created",
    kv("order_id", orderId),
    kv("user_id", userId),
    kv("amount", amount));

// ✅ Use appropriate levels
log.debug("Processing step completed");
log.error("Failed to process order", exception);

// ❌ Logging sensitive information
log.info("User login: password={}", password);

// ❌ Excessive logging
for (var item : items) {
    log.info("Processing item: {}", item);  // What if 100,000 items?
}

// ❌ Swallowing exceptions in logs
try { ... } catch (Exception e) {
    log.error("Error");  // No stack trace
}

Key Summary#

| Item | Loki | ELK |
|---|---|---|
| Suitable for | Lightweight, Grafana integration | Full-text search, large-scale |
| Query | LogQL | KQL |
| Cost | Low | High |

Log Design Principles:

  1. JSON structured mandatory
  2. Include trace_id (distributed tracing connection)
  3. Use appropriate levels
  4. Exclude sensitive information

Next Steps#

| Recommended Order | Document | What You’ll Learn |
|---|---|---|
| 1 | Distributed Tracing | Connecting logs and traces |
| 2 | Environment Setup | Loki hands-on |