Learn how to diagnose and resolve OutOfMemoryError and GC issues.

Duration: Approximately 20-40 minutes (additional 10 minutes for GC log analysis)

Scope of This Guide

Covered: Heap memory settings, Circuit Breaker, Field Data optimization, GC tuning

Not Covered: Adding nodes, hardware upgrades - see Cluster Management

TL;DR
  • Heap memory: 50% or less of total memory, maximum 31GB
  • Circuit Breaker: Check settings to prevent memory overuse
  • Field data: Avoid aggregations on text fields, use doc_values
  • GC tuning: Use G1GC, analyze logs to identify issues

Before You Begin#

Verify the following requirements:

Item                         Requirement                  How to Check
Server access                SSH or console access        Able to log in to server
jvm.options edit permission  root or elasticsearch user   Check paths below
ES restart permission        Able to restart service      systemctl restart elasticsearch

jvm.options file locations:

Installation Method   Path
Debian/Ubuntu (apt)   /etc/elasticsearch/jvm.options
RPM/CentOS (yum)      /etc/elasticsearch/jvm.options
tar.gz extraction     {ES_HOME}/config/jvm.options
Docker                Use environment variable ES_JAVA_OPTS
# Find jvm.options file location
ls -la /etc/elasticsearch/jvm.options 2>/dev/null || \
ls -la $ES_HOME/config/jvm.options 2>/dev/null || \
echo "Cannot find jvm.options file"
Note
Elasticsearch must be restarted after changing jvm.options. Rolling restart is recommended for production environments.
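
The rolling restart mentioned in the note can be scripted. Below is a minimal sketch, assuming a cluster reachable on localhost:9200 and systemd-managed nodes; the function name is illustrative, but the `cluster.routing.allocation.enable` setting is the documented way to pause shard rebalancing during a restart:

```shell
# Sketch: rolling restart of one node. Run per node; wait for green between nodes.
rolling_restart_node() {
  # 1. Restrict shard allocation so the cluster does not rebalance during the restart
  curl -s -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' \
    -d '{ "persistent": { "cluster.routing.allocation.enable": "primaries" } }'
  # 2. Restart the local node
  sudo systemctl restart elasticsearch
  # 3. Re-enable allocation (null restores the default) and let the cluster recover
  curl -s -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' \
    -d '{ "persistent": { "cluster.routing.allocation.enable": null } }'
  # 4. Before moving to the next node:
  #    curl "localhost:9200/_cluster/health?wait_for_status=green&timeout=60s"
}
```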

Symptoms#

The following issues occur:

OutOfMemoryError:

java.lang.OutOfMemoryError: Java heap space

Circuit Breaker triggered:

{
  "error": {
    "type": "circuit_breaking_exception",
    "reason": "[parent] Data too large, data for [<query>] would be larger than limit of [xxx/yyy]"
  }
}

GC overhead:

GC overhead limit exceeded

Step 1: Check Current Memory Status#

1.1 Memory Usage by Node#

# Heap memory status by node
curl -X GET "localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max"

# Example output:
# name    heap.percent heap.current heap.max
# node-1  75           11.2gb       16gb

1.2 Detailed Memory Analysis#

# Full node statistics
curl -X GET "localhost:9200/_nodes/stats/jvm?pretty"

# Circuit Breaker status
curl -X GET "localhost:9200/_nodes/stats/breaker?pretty"

Key points to check:

  • heap.percent > 85%: Danger level
  • High fielddata usage indicates aggregation query issues
  • Frequent request breaker trips indicate need for query optimization
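
Breaker trip counts can also be inspected offline. The sketch below parses a hypothetical, abridged `_nodes/stats/breaker` response saved to a file; the compact JSON layout is illustrative (real responses are pretty-printed and nested under node IDs, so prefer jq for real output):

```shell
# Save a sample (abridged, hypothetical) breaker stats response
cat > /tmp/breaker_stats.json <<'EOF'
{
  "breakers": {
    "fielddata": { "tripped": 0 },
    "request": { "tripped": 3 },
    "parent": { "tripped": 1 }
  }
}
EOF

# Print only the breakers that have tripped at least once
grep -o '"[a-z_]*": { "tripped": [0-9]*' /tmp/breaker_stats.json \
  | awk -F'[:{ ]+' '$NF > 0 { print $1, "tripped", $NF, "times" }'
```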

Step 2: Optimize Heap Memory Settings#

2.1 Appropriate Heap Size#

# Edit jvm.options file
# Location: /etc/elasticsearch/jvm.options or config/jvm.options

# Recommended settings (example for 16GB system)
-Xms8g
-Xmx8g

Heap memory guidelines:

System Memory   Recommended Heap   Remaining Memory For
8GB             4GB                OS cache, Lucene
16GB            8GB                OS cache, Lucene
32GB            16GB               OS cache, Lucene
64GB            31GB               OS cache, Lucene
Warning
Setting heap above 32GB disables Compressed OOPs, which actually degrades performance. Maximum recommended is 31GB.

2.2 Set Xms and Xmx Equal#

# Incorrect: Variable heap size increases GC burden
-Xms4g
-Xmx16g

# Correct: Fixed size
-Xms8g
-Xmx8g
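
The sizing rule (half of RAM, equal Xms/Xmx, 31GB cap) can be computed rather than guessed. A minimal sketch, assuming Linux with /proc/meminfo available:

```shell
# Sketch: recommend a heap size of half the system RAM, capped at 31 GB
total_kb=$(awk '/MemTotal/ { print $2 }' /proc/meminfo)   # MemTotal is in kB
half_gb=$(( total_kb / 1024 / 1024 / 2 ))
heap_gb=$(( half_gb > 31 ? 31 : half_gb ))                # respect the 31 GB cap
[ "$heap_gb" -lt 1 ] && heap_gb=1                         # floor for tiny machines

# Emit matching Xms/Xmx lines for jvm.options
echo "-Xms${heap_gb}g"
echo "-Xmx${heap_gb}g"
```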

Step 3: Adjust Circuit Breaker#

3.1 Check Breaker Settings#

curl -X GET "localhost:9200/_cluster/settings?include_defaults=true&filter_path=**.breaker"

3.2 Adjust Breaker Limits#

curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "indices.breaker.total.limit": "70%",
    "indices.breaker.fielddata.limit": "40%",
    "indices.breaker.request.limit": "40%"
  }
}'
Breaker              Default   Role
total                70%       Total memory limit
fielddata            40%       Field data cache
request              60%       Single request memory
in_flight_requests   100%      Requests in transit
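
If tightened limits later cause too many rejected queries, the overrides can be removed by sending null for each setting to the same `_cluster/settings` endpoint (a hedged example; null restores the built-in default):

```json
{
  "persistent": {
    "indices.breaker.total.limit": null,
    "indices.breaker.fielddata.limit": null,
    "indices.breaker.request.limit": null
  }
}
```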

Step 4: Optimize Field Data#

4.1 Problem Cause#

Aggregations or sorting on a text field rely on fielddata, which is built entirely on the heap. (On Elasticsearch 5.x and later, fielddata is disabled for text fields unless explicitly enabled; enabling it is exactly what creates this risk.)

// Dangerous: Aggregating on text field
{
  "aggs": {
    "categories": {
      "terms": { "field": "category" }  // Problem if category is text
    }
  }
}

4.2 Solutions#

Method 1: Use keyword field

// Mapping configuration
{
  "mappings": {
    "properties": {
      "category": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword" }
        }
      }
    }
  }
}

// Use keyword for aggregation
{
  "aggs": {
    "categories": {
      "terms": { "field": "category.keyword" }
    }
  }
}

Method 2: Use doc_values

// Enable doc_values in mapping (enabled by default for keyword)
{
  "mappings": {
    "properties": {
      "status": {
        "type": "keyword",
        "doc_values": true
      }
    }
  }
}

4.3 Limit Field Data Cache#

curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "indices.fielddata.cache.size": "20%"
  }
}'

Step 5: GC Optimization#

5.1 Recommended GC Settings#

# jvm.options
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30

5.2 GC Log Analysis#

# Check GC log location
ls /var/log/elasticsearch/gc.log*

# Enable GC logging (jvm.options)
-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m

5.3 GC Problem Patterns#

Symptom                    Cause                  Solution
Frequent Young GC          High object creation   Query optimization, adjust batch size
Long Full GC               Heap shortage          Increase heap or clean up data
Memory shortage after GC   Memory leak            Analyze heap dump
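
The long-Full-GC pattern can be spotted directly in the log. The sketch below summarizes pause times from sample lines in the unified-logging format enabled in 5.2; the sample lines stand in for the real /var/log/elasticsearch/gc.log:

```shell
# Hypothetical sample of unified GC log lines (format is illustrative)
cat > /tmp/gc_sample.log <<'EOF'
[2024-01-01T00:00:01.000+0000][1234][gc] GC(10) Pause Young (Normal) (G1 Evacuation Pause) 512M->128M(8192M) 12.345ms
[2024-01-01T00:01:02.000+0000][1234][gc] GC(11) Pause Full (G1 Compaction Pause) 7900M->6000M(8192M) 2345.678ms
EOF

# Count pauses and flag any longer than 1000 ms (a likely Full GC problem)
awk 'match($0, /Pause (Young|Full)/) {
       n++
       ms = $NF; sub(/ms/, "", ms)
       if (ms + 0 > 1000) long++
     }
     END { printf "pauses=%d long_pauses=%d\n", n, long }' /tmp/gc_sample.log
```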

Step 6: Query-Level Optimization#

6.1 Avoid Large Result Sets#

// Dangerous: Too many results
{ "size": 10000 }

// Safe: Use pagination
{ "size": 100, "from": 0 }

# For bulk data: use the Scroll API
curl -X POST "localhost:9200/products/_search?scroll=1m" -H 'Content-Type: application/json' -d'
{
  "size": 1000,
  "query": { "match_all": {} }
}'

6.2 Optimize Aggregations#

// Dangerous: High cardinality aggregation
{
  "aggs": {
    "all_users": {
      "terms": { "field": "user_id", "size": 1000000 }
    }
  }
}

// Safe: Appropriate size limit
{
  "aggs": {
    "top_users": {
      "terms": { "field": "user_id", "size": 100 }
    }
  }
}
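
When every bucket of a high-cardinality field really is needed, the composite aggregation pages through buckets instead of building them all in one request. A sketch, reusing the `user_id` field from above (subsequent pages pass the `after` key returned by each response):

```json
{
  "size": 0,
  "aggs": {
    "all_users": {
      "composite": {
        "size": 1000,
        "sources": [
          { "user": { "terms": { "field": "user_id" } } }
        ]
      }
    }
  }
}
```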

Checklist#

Items to check when troubleshooting memory issues:

  • Is heap size appropriate? - 50% of system memory, maximum 31GB
  • Are Xms and Xmx equal? - Prevent heap size fluctuation
  • Not aggregating on text fields? - Use keyword or doc_values
  • Circuit Breaker appropriate? - Too high causes OOM, too low causes query failures
  • Analyzed GC logs? - Identify patterns
  • Any unnecessary indices? - Clean up old indices

Verify Success#

Confirm memory issues are resolved with these methods:

  1. Check heap usage: Verify heap.percent stays stable below 75%

    # Monitor heap usage (10 times at 5-second intervals)
    for i in {1..10}; do
      curl -s "localhost:9200/_cat/nodes?v&h=name,heap.percent" && sleep 5
    done
  2. Check Circuit Breaker: Verify breaker no longer trips

    curl -X GET "localhost:9200/_nodes/stats/breaker?pretty" | grep tripped
  3. Check OOM logs: Verify no new OutOfMemoryError occurs

    # Search for OOM in recent logs
    grep -i "OutOfMemory" /var/log/elasticsearch/*.log | tail -5
Success Criteria
  • heap.percent stable below 75%
  • Circuit Breaker tripped count not increasing
  • No OOM for 24 hours

Common Errors#

jvm.options Syntax Error#

Symptom: Elasticsearch won’t start

Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.

Cause: Invalid option in jvm.options file

Solution:

  1. Check jvm.options file syntax
  2. Ensure each option is on a new line
  3. Check for spaces or special characters
# Correct format
-Xms8g
-Xmx8g

# Incorrect format (space or '=' between option and value)
-Xms 8g
-Xmx=8g

Elasticsearch Startup Failure (Insufficient Memory)#

Symptom: Service won’t start

[ERROR] bootstrap checks failed
max virtual memory areas vm.max_map_count [65530] is too low

Solution:

# Temporary setting
sudo sysctl -w vm.max_map_count=262144

# Permanent setting
echo "vm.max_map_count=262144" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

Memory Limits in Docker Environment#

Symptom: Container terminates due to OOM

Solution: Set both memory limit and ES_JAVA_OPTS when running Docker:

docker run -d \
  --memory="4g" \
  -e ES_JAVA_OPTS="-Xms2g -Xmx2g" \
  elasticsearch:8.x
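
The heap rule of thumb from Step 2 applies inside containers as well: set the heap to half of the container limit so the rest is left for off-heap structures and page cache. A minimal sketch (the 4g value is hypothetical):

```shell
# Sketch: derive ES_JAVA_OPTS from an intended container memory limit
mem_limit_gb=4                              # hypothetical --memory value
es_heap_gb=$(( mem_limit_gb / 2 ))          # heap = half of the container limit

# Print the resulting docker run command
echo "docker run -d --memory=\"${mem_limit_gb}g\" \\"
echo "  -e ES_JAVA_OPTS=\"-Xms${es_heap_gb}g -Xmx${es_heap_gb}g\" \\"
echo "  elasticsearch:8.x"
```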