A step-by-step guide to diagnosing and resolving Consumer Lag issues.

TL;DR
  • Consumer Lag: Number of unread messages (Producer rate > Consumer processing rate)
  • Diagnosis: Check Lag with kafka-consumer-groups.sh, analyze trends with monitoring tools
  • Resolution: Increase Consumer count, optimize processing logic, adjust Partition count, tune batch settings

What is Consumer Lag?#

Consumer Lag is the number of messages published by the Producer that the Consumer has not yet read. If Lag increases continuously, message processing falls further and further behind, and in the worst case messages are deleted unread once they exceed the retention period.

flowchart LR
    P[Producer] -->|1000 msgs/sec| T[Topic]
    T -->|800 msgs/sec| C[Consumer]

    T -.->|"Lag: 200 msgs/sec accumulating"| L[Warning: Lag Increasing]

Diagram: If the Producer publishes 1000 messages per second but the Consumer only processes 800, Lag accumulates at 200 messages per second.


Step 1: Check Current Lag Status#

1.1 Using kafka-consumer-groups.sh#

Check the current Lag of your Consumer Group using the Kafka CLI tool:

kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group order-service-group

Expected output:

GROUP              TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
order-service-group orders         0          1000            1500            500
order-service-group orders         1          950             1480            530
order-service-group orders         2          980             1490            510

| Column | Description |
| --- | --- |
| CURRENT-OFFSET | Last position read by the Consumer |
| LOG-END-OFFSET | Last message position written to the Partition |
| LAG | Number of unread messages (LOG-END-OFFSET - CURRENT-OFFSET) |
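The LAG column is just the difference of the other two, and summing the per-partition differences gives the group's total lag. A minimal sketch of that arithmetic, using the offsets from the example output above:

```java
public class LagMath {
    // LAG = LOG-END-OFFSET - CURRENT-OFFSET, summed across all partitions
    static long totalLag(long[] currentOffsets, long[] logEndOffsets) {
        long total = 0;
        for (int i = 0; i < currentOffsets.length; i++) {
            total += logEndOffsets[i] - currentOffsets[i];
        }
        return total;
    }

    public static void main(String[] args) {
        // Values taken from the example output above (partitions 0, 1, 2)
        long[] current = {1000, 950, 980};
        long[] end = {1500, 1480, 1490};
        System.out.println(totalLag(current, end)); // prints 1540 (500 + 530 + 510)
    }
}
```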

1.2 Assessing Lag Severity#

| Lag Status | Criteria | Action |
| --- | --- | --- |
| Normal | Lag < 100, temporary | Continue monitoring |
| Caution | Lag 100-1000, persistent | Begin root cause analysis |
| Warning | Lag > 1000, increasing trend | Immediate action required |
| Critical | Lag spiking, unable to process | Emergency response |

Caution
Don’t judge by the Lag number alone. The trend is more important. If Lag is 1000 but decreasing, it’s fine. But if Lag is 100 and continuously increasing, that’s a problem.
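The trend-plus-threshold rule can be sketched in plain Java. The thresholds mirror the criteria above; the sampling interval and window size are up to you, and this simple first-vs-latest slope check is illustrative, not canonical:

```java
import java.util.List;

public class LagTrend {
    // Classify using both the absolute value and the trend of recent samples.
    // A high but decreasing Lag is fine; a low but climbing Lag is a problem.
    static String assess(List<Long> recentLagSamples) {
        long latest = recentLagSamples.get(recentLagSamples.size() - 1);
        long first = recentLagSamples.get(0);
        boolean increasing = latest > first;
        if (latest > 1000 && increasing) return "WARNING";
        if (latest >= 100 && increasing) return "CAUTION";
        return "NORMAL";
    }

    public static void main(String[] args) {
        System.out.println(assess(List.of(1200L, 1100L, 1000L))); // prints NORMAL (high but decreasing)
        System.out.println(assess(List.of(100L, 150L, 220L)));    // prints CAUTION (low but climbing)
    }
}
```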

1.3 Checking Lag in Spring Boot#

You can check Lag within your application using AdminClient from the kafka-clients library:

@Slf4j  // Lombok: provides the log field used in the catch block
@Service
@RequiredArgsConstructor
public class LagMonitorService {

    private final AdminClient adminClient;

    public Map<TopicPartition, Long> getConsumerLag(String groupId) {
        Map<TopicPartition, Long> lagMap = new HashMap<>();

        try {
            // Get current offsets for Consumer Group
            Map<TopicPartition, OffsetAndMetadata> offsets =
                adminClient.listConsumerGroupOffsets(groupId)
                    .partitionsToOffsetAndMetadata()
                    .get();

            // Get end offsets for Topics
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> endOffsets =
                adminClient.listOffsets(offsets.keySet().stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest())))
                    .all()
                    .get();

            // Calculate Lag
            offsets.forEach((tp, offsetMeta) -> {
                long endOffset = endOffsets.get(tp).offset();
                long currentOffset = offsetMeta.offset();
                lagMap.put(tp, endOffset - currentOffset);
            });

        } catch (Exception e) {
            log.error("Failed to retrieve Lag", e);
        }

        return lagMap;
    }
}

Step 2: Diagnose the Cause#

Lag can be categorized into four main causes. Systematically check each one.

2.1 Insufficient Consumer Processing Speed#

Symptoms:

  • Lag increases uniformly across all Partitions
  • Consumer CPU usage is high

Diagnostic command:

# Check Consumer process CPU (Linux)
top -p $(pgrep -f "your-consumer-app")

Check points:

  • Is there a bottleneck in business logic?
  • Are external API calls slow?
  • Are database queries slow?
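To pinpoint which of these stages is the bottleneck, wrap each suspect step in a timer and compare. A minimal standalone sketch (the stage names and the `sleep` stand-in for real work are illustrative):

```java
import java.util.function.Supplier;

public class StageTimer {
    // Wrap each suspect stage (business logic, external API call, DB query)
    // and log its duration; the slowest stage is the bottleneck.
    static <T> T timed(String stage, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println(stage + " took " + ms + " ms");
        }
    }

    public static void main(String[] args) {
        String order = timed("parse", () -> "order-1");
        timed("db-save", () -> { sleep(50); return null; }); // simulated slow stage
    }

    static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```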

2.2 Insufficient Consumer Count#

Symptoms:

  • Consumer count is less than Partition count
  • Only some Consumers are overloaded

Diagnosis:

# Check Consumer Group members
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group order-service-group --members

# Example output
GROUP              CONSUMER-ID                          HOST          CLIENT-ID  #PARTITIONS
order-service-group consumer-1-abc123                  /192.168.1.10 consumer-1 3
order-service-group consumer-2-def456                  /192.168.1.11 consumer-2 3

Optimal Consumer Count
Consumer count = Partition count is most efficient. Having more Consumers than Partitions results in idle Consumers.
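The assignment arithmetic behind this rule can be sketched directly (the consumer and partition counts here are illustrative):

```java
public class ConsumerAssignment {
    // With C consumers and P partitions, each consumer is assigned at most
    // ceil(P / C) partitions; any consumer beyond P partitions sits idle.
    static int maxPartitionsPerConsumer(int consumers, int partitions) {
        return (partitions + consumers - 1) / consumers; // ceiling division
    }

    static int idleConsumers(int consumers, int partitions) {
        return Math.max(0, consumers - partitions);
    }

    public static void main(String[] args) {
        // 2 consumers, 6 partitions -> 3 each, matching the example output above
        System.out.println(maxPartitionsPerConsumer(2, 6)); // prints 3
        // 8 consumers, 6 partitions -> 2 consumers do nothing
        System.out.println(idleConsumers(8, 6));            // prints 2
    }
}
```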

2.3 Partition Imbalance#

Symptoms:

  • Lag concentrated on specific Partitions
  • Uneven key distribution

Diagnosis:

# Check message count per Partition
kafka-run-class.sh kafka.tools.GetOffsetShell \
  --broker-list localhost:9092 \
  --topic orders \
  --time -1

Cause: If messages are concentrated on a specific key, load occurs only on the Partition to which that key is assigned.
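Kafka's default partitioner routes keyed records with murmur2(key) modulo the partition count, so a single hot key always lands on the same partition. The sketch below uses `String.hashCode()` as a stand-in hash (an assumption, not Kafka's actual murmur2) purely to illustrate the skew:

```java
import java.util.HashMap;
import java.util.Map;

public class KeySkewDemo {
    // Stand-in for Kafka's murmur2(key) % numPartitions routing;
    // String.hashCode() is used here only for illustration.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 3;
        Map<Integer, Integer> counts = new HashMap<>();
        // 90% of traffic carries the same hot key ("hot-customer" is hypothetical)
        for (int i = 0; i < 100; i++) {
            String key = (i < 90) ? "hot-customer" : "customer-" + i;
            counts.merge(partitionFor(key, numPartitions), 1, Integer::sum);
        }
        // One partition receives at least 90 of the 100 messages
        System.out.println(counts);
    }
}
```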

2.4 Frequent Rebalancing#

Symptoms:

  • Consumer Lag spikes periodically then decreases
  • Many Rebalancing messages in Consumer logs

Diagnosis: Check for Rebalancing-related messages in Consumer logs:

INFO  o.a.k.c.c.i.ConsumerCoordinator : [Consumer clientId=consumer-1] Revoke partitions
INFO  o.a.k.c.c.i.ConsumerCoordinator : [Consumer clientId=consumer-1] Assigned partitions

Causes:

  • Exceeding session.timeout.ms (Consumer fails to send heartbeat)
  • Exceeding max.poll.interval.ms (poll() call interval too long)

Step 3: Apply Solutions#

3.1 Increase Consumer Count (Fastest Solution)#

Increase the number of Consumer threads in Spring Boot:

@KafkaListener(
    topics = "orders",
    groupId = "order-service-group",
    concurrency = "6"  // Increased from 3 to 6
)
public void consume(String message) {
    processMessage(message);
}

Or deploy additional application instances:

# Increase replicas in Kubernetes
kubectl scale deployment order-consumer --replicas=6

Caution
Increasing Consumer count beyond the Partition count has no effect. If you need more Consumers than Partitions, increase Partitions first.

3.2 Optimize Processing Logic#

Apply asynchronous processing:

@KafkaListener(topics = "orders")
public void consume(String message, Acknowledgment ack) {
    // Synchronous processing (slow)
    // processMessage(message);

    // Asynchronous processing (fast) - note: the offset is acknowledged before
    // processing finishes, so a crash can lose in-flight messages
    CompletableFuture.runAsync(() -> processMessage(message), executor);
    ack.acknowledge();
}

Apply batch processing:

@KafkaListener(topics = "orders", batch = "true")
public void consumeBatch(List<String> messages) {
    // Batch processing instead of individual processing
    batchProcessor.processAll(messages);
}

3.3 Tune Consumer Settings#

Optimize application.yml settings:

spring:
  kafka:
    consumer:
      # Max records fetched per poll (default: 500; raise for higher throughput)
      max-poll-records: 500

      # Increase poll interval limit (for long-running processing)
      properties:
        max.poll.interval.ms: 300000  # 5 minutes

        # Increase fetch data size
        fetch.min.bytes: 1048576  # 1MB
        fetch.max.wait.ms: 500

        # Adjust session timeout (prevent Rebalancing)
        session.timeout.ms: 30000
        heartbeat.interval.ms: 10000

| Setting | Default | Recommended | Description |
| --- | --- | --- | --- |
| max.poll.records | 500 | 500-1000 | Improve throughput with larger batches |
| fetch.min.bytes | 1 | 1MB | Reduce frequency of small requests |
| max.poll.interval.ms | 5 min | Processing time + buffer | Prevent Rebalancing |
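A quick way to sanity-check max.poll.interval.ms against max.poll.records: the limit must exceed the worst-case time to process one full poll batch. A small sketch of that budget (the per-record time and buffer values are illustrative):

```java
public class PollBudget {
    // max.poll.interval.ms must exceed the worst-case time to process one
    // poll batch: max.poll.records * per-record time, plus a safety buffer.
    static long requiredPollIntervalMs(int maxPollRecords, long perRecordMillis, long bufferMillis) {
        return (long) maxPollRecords * perRecordMillis + bufferMillis;
    }

    public static void main(String[] args) {
        // 500 records * 400 ms each + 60 s buffer = 260,000 ms; this fits under
        // the 300,000 ms (5 min) default, but slower records would not.
        System.out.println(requiredPollIntervalMs(500, 400, 60_000)); // prints 260000
    }
}
```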

3.4 Increase Partition Count#

If Partition count is insufficient, parallel processing is limited:

# Check current Partition count
kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --topic orders

# Increase Partition count (6 to 12)
kafka-topics.sh --bootstrap-server localhost:9092 \
  --alter --topic orders \
  --partitions 12

Caution
Increasing Partition count changes existing key-based routing. Make this decision carefully if message ordering is important.

Step 4: Set Up Monitoring#

4.1 Prometheus + Grafana Setup#

Use Spring Boot Actuator and Micrometer to collect Lag metrics:

# application.yml
management:
  endpoints:
    web:
      exposure:
        include: prometheus, health
  metrics:
    tags:
      application: order-consumer

// build.gradle.kts
dependencies {
    implementation("org.springframework.boot:spring-boot-starter-actuator")
    implementation("io.micrometer:micrometer-registry-prometheus")
}

4.2 Alert Configuration#

Set up Prometheus AlertManager or Grafana Alert:

# Prometheus alert rule example
groups:
  - name: kafka-consumer-alerts
    rules:
      - alert: HighConsumerLag
        expr: kafka_consumer_lag > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Consumer Lag exceeds 1000"
          description: "Lag for {{ $labels.group }} group's {{ $labels.topic }}: {{ $value }}"

Checklist#

Follow this order when troubleshooting Lag issues:

  • 1. Check Lag status: kafka-consumer-groups.sh --describe
  • 2. Analyze trends: Check if Lag is increasing or stable
  • 3. Check Consumer status: Member count, CPU usage
  • 4. Diagnose cause: Processing speed, Consumer count, Partition balance, Rebalancing
  • 5. Apply solution: Increase Consumers, optimize logic, tune settings
  • 6. Set up monitoring: Continuous Lag tracking