Time required: 20 minutes
Prerequisites: Kafka basics, environment setup
After reading this document: you can monitor a Kafka cluster and consumer lag

Key Kafka Monitoring Metrics

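| Category | Metric | Meaning |
|----------|--------|---------|
| Consumer | Lag | Processing delay (most important) |
| Broker | Under-replicated Partitions | Replication problems |
| Producer | Record Error Rate | Send failure rate |
| Broker | Disk Usage | Storage space |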
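
Consumer lag is the per-partition gap between the latest offset written to the log and the offset the consumer group has committed. A minimal Python sketch of that arithmetic follows. The function name and the sample offsets are hypothetical, and treating a partition with no committed offset as fully unread is a simplifying assumption.

```python
def partition_lag(log_end_offsets, committed_offsets):
    """Return lag per partition: log end offset minus committed offset."""
    lag = {}
    for partition, end_offset in log_end_offsets.items():
        # Assumption: a partition with no committed offset counts as fully unread.
        committed = committed_offsets.get(partition, 0)
        # Clamp at zero: offsets sampled at different moments can briefly cross.
        lag[partition] = max(end_offset - committed, 0)
    return lag


# Hypothetical offsets for a three-partition topic.
log_end = {0: 1500, 1: 1200, 2: 900}
committed = {0: 1450, 1: 1200}

lags = partition_lag(log_end, committed)
total_lag = sum(lags.values())
print(lags, total_lag)
```

The per-partition values correspond to individual `kafka_consumergroup_lag` series, and the total corresponds to summing them for a group.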

Step 1: Configure Kafka + JMX Exporter

# Add to docker-compose.yml
services:
  kafka:
    image: confluentinc/cp-kafka:7.5.0
    container_name: kafka
    ports:
      - "9092:9092"
      - "9101:9101"  # JMX
    environment:
      KAFKA_NODE_ID: 1
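      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      # Containers on the Compose network connect to kafka:29092; clients on the host use localhost:9092
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka:29093
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9092
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      # Single broker: the consumer offsets topic cannot use the default replication factor of 3
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      CLUSTER_ID: MkU3OEVBNTcwNTJENDM2Qk
      KAFKA_JMX_PORT: 9101
      KAFKA_JMX_HOSTNAME: localhost
      # JMX Exporter agent serves Prometheus metrics on port 7071
      KAFKA_OPTS: -javaagent:/opt/jmx-exporter/jmx_prometheus_javaagent.jar=7071:/opt/jmx-exporter/kafka.yml
    volumes:
      - ./jmx-exporter:/opt/jmx-exporter

  kafka-exporter:
    image: danielqsj/kafka-exporter:latest
    container_name: kafka-exporter
    ports:
      - "9308:9308"
    command:
      # Use the internal listener; the address advertised for 9092 is localhost, which is unreachable from this container
      - --kafka.server=kafka:29092
      - --topic.filter=.*
      - --group.filter=.*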

Step 2: Configure JMX Exporter

mkdir -p jmx-exporter
curl -L https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.19.0/jmx_prometheus_javaagent-0.19.0.jar -o jmx-exporter/jmx_prometheus_javaagent.jar

# jmx-exporter/kafka.yml
lowercaseOutputName: true
rules:
  # Broker metrics
  - pattern: kafka.server<type=(.+), name=(.+), clientId=(.+), topic=(.+), partition=(.*)><>Value
    name: kafka_server_$1_$2
    type: GAUGE
    labels:
      clientId: "$3"
      topic: "$4"
      partition: "$5"

  - pattern: kafka.server<type=(.+), name=(.+)><>Value
    name: kafka_server_$1_$2
    type: GAUGE

  # Under-replicated Partitions
  - pattern: kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value
    name: kafka_server_replicamanager_underreplicatedpartitions
    type: GAUGE

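  # Messages In Per Sec (Count is a cumulative counter, so rate() can be applied in PromQL)
  - pattern: kafka.server<type=BrokerTopicMetrics, name=MessagesInPerSec, topic=(.+)><>Count
    name: kafka_server_brokertopicmetrics_messagesin_total
    type: COUNTER
    labels:
      topic: "$1"

  # Bytes In Per Sec
  - pattern: kafka.server<type=BrokerTopicMetrics, name=BytesInPerSec, topic=(.+)><>Count
    name: kafka_server_brokertopicmetrics_bytesin_total
    type: COUNTER
    labels:
      topic: "$1"

  # ISR shrinks
  - pattern: kafka.server<type=ReplicaManager, name=IsrShrinksPerSec><>Count
    name: kafka_server_replicamanager_isrshrinks_total
    type: COUNTER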
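
To see how a rule turns an MBean attribute into a metric name, the generic broker rule can be imitated in Python. This is only a sketch: the exporter uses Java regular expressions, Python's `re` stands in for them here, the input strings approximate how the exporter renders a bean attribute, and `apply_rule` is a hypothetical helper.

```python
import re

# Pattern and name template copied from the generic broker rule in kafka.yml.
PATTERN = r"kafka.server<type=(.+), name=(.+)><>Value"
NAME_TEMPLATE = "kafka_server_$1_$2"


def apply_rule(bean_attribute, lowercase_output_name=True):
    """Return the metric name the rule would produce, or None if it does not match."""
    # JMX Exporter patterns are not anchored, so search instead of fullmatch.
    match = re.search(PATTERN, bean_attribute)
    if match is None:
        return None
    name = NAME_TEMPLATE
    for index, value in enumerate(match.groups(), start=1):
        name = name.replace(f"${index}", value)
    return name.lower() if lowercase_output_name else name


gauge_bean = "kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value"
meter_bean = "kafka.server<type=BrokerTopicMetrics, name=MessagesInPerSec, topic=orders><>Count"

print(apply_rule(gauge_bean))
print(apply_rule(meter_bean))
```

The second call returns `None` because the rule only matches the `Value` attribute. Rules are tried in order and processing stops at the first match, so a generic rule placed early takes precedence over a more specific rule that follows it.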

Step 3: Add Prometheus Configuration

# Add to prometheus/prometheus.yml
scrape_configs:
  - job_name: 'kafka'
    static_configs:
      - targets: ['kafka:7071']

  - job_name: 'kafka-exporter'
    static_configs:
      - targets: ['kafka-exporter:9308']

Step 4: Key PromQL Queries

Consumer Lag

# Lag by consumer group
sum by (consumergroup, topic) (kafka_consumergroup_lag)

# Groups with lag above 10000
sum by (consumergroup, topic) (kafka_consumergroup_lag) > 10000

# Lag trend
sum(kafka_consumergroup_lag) by (consumergroup)
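
The aggregation these queries perform can be sketched in Python: kafka-exporter reports one `kafka_consumergroup_lag` series per partition, and `sum by (...)` collapses them onto the listed labels. The sample values and the `sum_by` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical per-partition samples of kafka_consumergroup_lag.
samples = [
    {"consumergroup": "orders-app", "topic": "orders", "partition": "0", "value": 7000},
    {"consumergroup": "orders-app", "topic": "orders", "partition": "1", "value": 4500},
    {"consumergroup": "billing", "topic": "invoices", "partition": "0", "value": 120},
]


def sum_by(samples, labels):
    """Group samples by the given labels and sum their values."""
    totals = defaultdict(float)
    for sample in samples:
        key = tuple(sample[label] for label in labels)
        totals[key] += sample["value"]
    return dict(totals)


lag_by_group_topic = sum_by(samples, ["consumergroup", "topic"])
# Equivalent of appending "> 10000" to the query: keep only series above the threshold.
high_lag = {key: value for key, value in lag_by_group_topic.items() if value > 10000}
print(high_lag)
```

Neither partition of `orders` exceeds 10000 on its own, yet the group does after summing, which is why the threshold is applied to the aggregated value.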

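Broker Status

# Under-replicated Partitions
kafka_server_replicamanager_underreplicatedpartitions

# ISR (In-Sync Replicas) shrink rate
# Requires a JMX Exporter rule that exports the IsrShrinksPerSec Count attribute under this name
rate(kafka_server_replicamanager_isrshrinks_total[5m])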

Traffic

# Messages per second by topic
sum by (topic) (rate(kafka_server_brokertopicmetrics_messagesin_total[5m]))

# Bytes per second by topic
sum by (topic) (rate(kafka_server_brokertopicmetrics_bytesin_total[5m]))
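
`rate()` expects a cumulative counter and returns its average per-second increase over the window. The Python sketch below approximates that calculation, including counter resets. It is a simplification: PromQL's `rate()` also extrapolates to the window boundaries, which this sketch omits. The function name and scrape values are hypothetical.

```python
def simple_rate(samples):
    """Approximate per-second increase of a counter over (timestamp, value) samples.

    Simplification: PromQL's rate() also extrapolates to the window edges.
    """
    if len(samples) < 2:
        return None  # at least two samples are needed to compute a rate
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        # A drop means the counter reset (for example, a broker restart);
        # count the new value as the increase since the reset.
        increase += current - previous if current >= previous else current
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Hypothetical scrapes, 60 seconds apart, of a messages-in counter with one reset.
window = [(0, 1000), (60, 1600), (120, 2200), (180, 100), (240, 700)]
print(simple_rate(window))
```

Applying the same calculation to a gauge that already holds a per-second rate would produce a meaningless result, so the metric behind these queries needs to be a counter.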

Step 5: Alert Rules

# prometheus/rules/kafka-alerts.yml
groups:
  - name: kafka
    rules:
      - alert: KafkaConsumerLagHigh
        expr: sum by (consumergroup, topic) (kafka_consumergroup_lag) > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High consumer lag: {{ $labels.consumergroup }}"
          description: "Lag is {{ $value }} on topic {{ $labels.topic }}"

      - alert: KafkaUnderReplicatedPartitions
        expr: kafka_server_replicamanager_underreplicatedpartitions > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Kafka under-replicated partitions"
          description: "{{ $value }} partitions are under-replicated"

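      - alert: KafkaBrokerDown
        # 3 is the expected broker count; use 1 for the single-broker Compose setup in Step 1.
        # count() returns no result when every broker is down, so this only detects partial outages.
        expr: count(kafka_server_kafkaserver_brokerstate) < 3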
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Kafka broker down"
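
The `for:` clause keeps an alert in the pending state until its expression has been true at every evaluation for the full duration. A minimal Python sketch of that state machine follows. The function name and evaluation timeline are hypothetical, and details such as `keep_firing_for` or missing evaluations are not modelled.

```python
def alert_states(evaluations, for_seconds):
    """Return (timestamp, state) pairs for a rule with a `for:` duration.

    evaluations: list of (timestamp_seconds, condition_is_true) in time order.
    """
    active_since = None
    states = []
    for timestamp, condition in evaluations:
        if not condition:
            active_since = None  # any false evaluation resets the timer
            states.append((timestamp, "inactive"))
            continue
        if active_since is None:
            active_since = timestamp
        held = timestamp - active_since
        states.append((timestamp, "firing" if held >= for_seconds else "pending"))
    return states


# Hypothetical evaluations every 60 seconds for a rule with `for: 5m`.
timeline = [(0, True), (60, True), (120, False), (180, True), (240, True),
            (300, True), (360, True), (420, True), (480, True)]
states = alert_states(timeline, for_seconds=300)
print(states[-1])
```

The dip at 120 seconds restarts the timer, so the alert fires at 480 seconds rather than 300. This is how a `for:` duration filters out short spikes.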

Step 6: Recording Rules

# prometheus/rules/kafka-recording.yml
groups:
  - name: kafka_recording
    rules:
      - record: topic:kafka_messages:rate5m
        expr: sum by (topic) (rate(kafka_server_brokertopicmetrics_messagesin_total[5m]))

      - record: consumergroup:kafka_lag:sum
        expr: sum by (consumergroup) (kafka_consumergroup_lag)

      - record: :kafka_underreplicated:sum
        expr: sum(kafka_server_replicamanager_underreplicatedpartitions)

Step 7: Grafana Dashboard

Row 1: Overview

| Panel | Query | Type |
|-------|-------|------|
| Total Lag | sum(kafka_consumergroup_lag) | Stat |
| Messages/sec | sum(rate(kafka_server_brokertopicmetrics_messagesin_total[5m])) | Stat |
| Under-replicated | sum(kafka_server_replicamanager_underreplicatedpartitions) | Stat |

Row 2: Consumer Lag

# Time Series: lag trend
sum by (consumergroup) (kafka_consumergroup_lag)

# Table: details
sum by (consumergroup, topic, partition) (kafka_consumergroup_lag)

Row 3: Traffic

# Messages per second by topic
sum by (topic) (rate(kafka_server_brokertopicmetrics_messagesin_total[5m]))

Spring Kafka Metrics

# application.yml
management:
  metrics:
    enable:
      kafka: true

# Producer metrics
kafka_producer_record_send_total
kafka_producer_record_error_total

# Consumer metrics
kafka_consumer_fetch_manager_records_consumed_total
kafka_consumer_fetch_manager_records_lag

Verification Checklist

  • Confirm that kafka-exporter exposes metrics
  • Confirm that the Consumer Lag queries return data
  • Build the Grafana dashboard
  • Test the alert rules
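
The first checklist item can be partly automated by scanning the exporter's text output for the metric names the queries rely on. The sketch below is one way to do that in Python. The helper names and the sample output are hypothetical, and `fetch_metrics` assumes the exporter is reachable at the URL you pass in.

```python
import urllib.request


def fetch_metrics(url):
    """Download the Prometheus text-format output from an exporter endpoint."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")


def metric_names(exposition_text):
    """Return the set of metric names found in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The name ends at the first "{" (labels) or space (value).
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return names


def missing_metrics(exposition_text, required):
    """Return the required metric names that are absent, sorted."""
    return sorted(set(required) - metric_names(exposition_text))


# Hypothetical sample of kafka-exporter output, used instead of a live request.
SAMPLE = """\
# HELP kafka_brokers Number of Brokers in the Kafka Cluster.
# TYPE kafka_brokers gauge
kafka_brokers 1
kafka_consumergroup_lag{consumergroup="orders-app",partition="0",topic="orders"} 42
"""

print(missing_metrics(SAMPLE, ["kafka_brokers", "kafka_consumergroup_lag"]))
```

Against a running stack, `missing_metrics(fetch_metrics("http://localhost:9308/metrics"), ["kafka_consumergroup_lag"])` should return an empty list once at least one consumer group has committed offsets.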

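Next Steps

| Recommended order | Document | What you learn |
|-------------------|----------|----------------|
| 1 | Full-Stack Example | Integrated example |
| 2 | Applying by Service Type | Kafka golden signals |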