This guide walks you through efficiently rebuilding large indices.
Estimated time: 30-60 minutes of hands-on work; the reindex itself may run for several hours depending on data size
Scope of This Guide
Covers: _reindex API, Snapshot/Restore, Logstash comparison, large-scale processing strategies, and performance optimization
Does not cover: simple mapping changes (see Mapping Migration) and cluster scaling (see Cluster Scaling)
TL;DR
- _reindex API: Simplest option, best for rebuilds within the same cluster
- Snapshot/Restore: Best for cross-cluster migration and very large datasets
- Logstash: Best for complex transformations or external source integration
- Performance optimization: use refresh_interval: -1, number_of_replicas: 0, and sliced scroll
Before You Begin#
Verify the following prerequisites:
| Item | Requirement | How to Verify |
|---|---|---|
| Elasticsearch version | 7.x or higher | curl -X GET "localhost:9200" |
| Cluster status | green | curl -X GET "localhost:9200/_cluster/health" |
| Disk space | At least 2x the index size available | curl -X GET "localhost:9200/_cat/allocation?v" |
| Index permissions | Read + Write + Admin | Test with the commands below |
# Check index size
curl -X GET "localhost:9200/_cat/indices/products?v&h=index,docs.count,store.size"
# Check available disk space
curl -X GET "localhost:9200/_cat/allocation?v&h=node,disk.avail,disk.total,disk.percent"
Warning
Disk I/O and CPU usage increase significantly during index rebuilds. Schedule the operation outside peak hours or configure resource limits.
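The 2x-disk rule of thumb above can be sanity-checked before you start; a minimal sketch (the function name and byte figures are illustrative, not part of any Elasticsearch API):

```python
def enough_disk(index_size_bytes: int, disk_avail_bytes: int, factor: float = 2.0) -> bool:
    """Apply the rule of thumb above: keep at least `factor` x the index size free."""
    return disk_avail_bytes >= factor * index_size_bytes

# Example: a 40 GiB index needs at least 80 GiB of free disk.
gib = 1024 ** 3
print(enough_disk(40 * gib, 100 * gib))  # True
print(enough_disk(40 * gib, 60 * gib))   # False
```

Feed it the `store.size` and `disk.avail` figures from the `_cat` commands above.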
Symptoms#
You need an index rebuild in the following situations:
- When you need to change the number of shards (cannot be changed after creation)
- When you need to make large-scale mapping changes
- When you need to clean up old data or restructure the index
- When index performance is consistently degrading
# Check shard count (immutable setting)
curl -X GET "localhost:9200/products/_settings?pretty&filter_path=**.number_of_shards"
# Response: "number_of_shards": "1" <- Rebuild required to change this
Method Selection Guide#
flowchart TD
A["Index rebuild needed"] --> B{Within the<br>same cluster?}
B -->|Yes| C{Data transformation<br>needed?}
B -->|No| D{Data size?}
C -->|Simple transformation| E["_reindex API<br>+ Script"]
C -->|Complex transformation| F["Logstash<br>Pipeline"]
C -->|No transformation| G["_reindex API"]
D -->|Under 100GB| H["_reindex API<br>+ Remote"]
D -->|Over 100GB| I["Snapshot /<br>Restore"]
style E fill:#e8f5e9,stroke:#4caf50
style F fill:#fff3e0,stroke:#ff9800
style G fill:#e8f5e9,stroke:#4caf50
style H fill:#e8f5e9,stroke:#4caf50
style I fill:#e3f2fd,stroke:#2196f3
Method Comparison#
| Item | _reindex API | Snapshot/Restore | Logstash |
|---|---|---|---|
| Difficulty | Easy | Moderate | Moderate |
| Speed | Fast | Very fast | Moderate |
| Data transformation | Script (simple) | Not possible | Very flexible |
| Cross-cluster | remote option | Possible | Possible |
| Resource usage | High | Low | Moderate |
| Suitable size | Up to hundreds of GB | Unlimited | Up to hundreds of GB |
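The decision flow and comparison above can be condensed into a small helper for quick reference; this is an illustrative sketch, not an official API (the function name and the 100GB cutoff simply mirror the flowchart):

```python
def pick_rebuild_method(same_cluster: bool, transform: str = "none", size_gb: float = 0) -> str:
    """Condense the selection flowchart. `transform` is one of: none, simple, complex."""
    if same_cluster:
        if transform == "complex":
            return "Logstash pipeline"
        if transform == "simple":
            return "_reindex API + script"
        return "_reindex API"
    # Cross-cluster: choose by data size.
    return "_reindex API + remote" if size_gb < 100 else "Snapshot/Restore"

print(pick_rebuild_method(True, "none"))        # _reindex API
print(pick_rebuild_method(False, size_gb=500))  # Snapshot/Restore
```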
Method 1: _reindex API#
1.1 Prepare the New Index#
# Create new index with updated settings
curl -X PUT "localhost:9200/products-v2" -H 'Content-Type: application/json' -d'
{
"settings": {
"number_of_shards": 5,
"number_of_replicas": 0,
"refresh_interval": "-1"
},
"mappings": {
"properties": {
"name": { "type": "text", "analyzer": "standard" },
"price": { "type": "double" },
"category": { "type": "keyword" },
"created_at": { "type": "date" }
}
}
}'
1.2 Basic Reindex#
curl -X POST "localhost:9200/_reindex?wait_for_completion=false" -H 'Content-Type: application/json' -d'
{
"source": {
"index": "products-v1",
"size": 5000
},
"dest": {
"index": "products-v2"
}
}'
# Response: {"task": "node-1:54321"}
1.3 Parallel Processing with Sliced Scroll#
For large indices, use sliced scroll for parallel Reindex:
# Manual slicing (divide into a specific number of slices)
curl -X POST "localhost:9200/_reindex?wait_for_completion=false" -H 'Content-Type: application/json' -d'
{
"source": {
"index": "products-v1",
"size": 5000,
"slice": {
"id": 0,
"max": 5
}
},
"dest": {
"index": "products-v2"
}
}'
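Manual slicing requires submitting one request per slice (ids 0 through max-1). A small sketch that generates all the request bodies — the function name is an assumption; each body would be POSTed to _reindex with wait_for_completion=false so the slices run in parallel:

```python
import json

def sliced_reindex_bodies(source: str, dest: str, slices: int, batch: int = 5000):
    """Yield one _reindex request body per slice of the source index."""
    for slice_id in range(slices):
        yield json.dumps({
            "source": {"index": source, "size": batch,
                       "slice": {"id": slice_id, "max": slices}},
            "dest": {"index": dest},
        })

bodies = list(sliced_reindex_bodies("products-v1", "products-v2", 5))
print(len(bodies))  # 5
```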
# Or automatic slicing (ES 7.x+)
curl -X POST "localhost:9200/_reindex?slices=auto&wait_for_completion=false" -H 'Content-Type: application/json' -d'
{
"source": { "index": "products-v1" },
"dest": { "index": "products-v2" }
}'
1.4 Monitoring Progress#
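Because wait_for_completion=false returns immediately with a task ID, completion is typically detected by polling the task API, as the curl commands below do. A minimal polling sketch, with the HTTP call injected as a fetch_status callable (an assumption made so the logic runs without a live cluster):

```python
import time

def wait_for_task(fetch_status, poll_interval=5.0, timeout=3600.0):
    """Poll until the task reports completed=True.
    `fetch_status` is an injected callable returning the parsed
    GET /_tasks/<task_id> response (e.g. via requests or curl)."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status()
        if status.get("completed"):
            return status
        time.sleep(poll_interval)
    raise TimeoutError("reindex task did not finish in time")

# Simulated run: the task completes on the second poll.
responses = iter([{"completed": False}, {"completed": True}])
result = wait_for_task(lambda: next(responses), poll_interval=0.0)
print(result["completed"])  # True
```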
# Check task status
curl -X GET "localhost:9200/_tasks/node-1:54321?pretty"
# List all Reindex tasks
curl -X GET "localhost:9200/_tasks?actions=*reindex&detailed&pretty"
# Cancel a task (if needed)
curl -X POST "localhost:9200/_tasks/node-1:54321/_cancel"
Method 2: Snapshot/Restore#
Useful for large datasets or cross-cluster migration.
2.1 Register a Repository#
# Shared filesystem repository
curl -X PUT "localhost:9200/_snapshot/my_backup" -H 'Content-Type: application/json' -d'
{
"type": "fs",
"settings": {
"location": "/mnt/backups/elasticsearch"
}
}'
2.2 Create a Snapshot#
# Snapshot a specific index
curl -X PUT "localhost:9200/_snapshot/my_backup/products_snapshot?wait_for_completion=true" -H 'Content-Type: application/json' -d'
{
"indices": "products-v1",
"ignore_unavailable": true,
"include_global_state": false
}'
2.3 Restore with a Different Name#
# Restore with a renamed index
curl -X POST "localhost:9200/_snapshot/my_backup/products_snapshot/_restore" -H 'Content-Type: application/json' -d'
{
"indices": "products-v1",
"rename_pattern": "products-v1",
"rename_replacement": "products-v2",
"index_settings": {
"index.number_of_replicas": 0
}
}'
Note
Snapshot/Restore cannot change mappings or settings. If you need structural changes, perform an additional _reindex after the Restore.
Method 3: Logstash#
Use this when you need complex data transformations.
3.1 Logstash Pipeline Configuration#
# logstash-reindex.conf
input {
elasticsearch {
hosts => ["localhost:9200"]
index => "products-v1"
query => '{ "query": { "match_all": {} } }'
size => 5000
scroll => "5m"
docinfo => true
}
}
filter {
# Complex transformation logic
mutate {
convert => { "price" => "float" }
rename => { "old_field" => "new_field" }
remove_field => ["unwanted_field"]
}
}
output {
elasticsearch {
hosts => ["localhost:9200"]
index => "products-v2"
document_id => "%{[@metadata][_id]}"
action => "index"
}
}
3.2 Execution#
bin/logstash -f logstash-reindex.conf
Performance Optimization#
Target Index Settings#
Optimize the target index settings before running Reindex:
# 1. Disable refresh
curl -X PUT "localhost:9200/products-v2/_settings" -H 'Content-Type: application/json' -d'
{ "refresh_interval": "-1" }'
# 2. Set replicas to 0
curl -X PUT "localhost:9200/products-v2/_settings" -H 'Content-Type: application/json' -d'
{ "number_of_replicas": 0 }'
# 3. Adjust translog settings (optional)
curl -X PUT "localhost:9200/products-v2/_settings" -H 'Content-Type: application/json' -d'
{
"index.translog.durability": "async",
"index.translog.sync_interval": "30s"
}'
Warning
translog.durability: async carries a risk of data loss. Use it only during the rebuild and be sure to restore it to request afterward.
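To keep the bulk-load settings and their post-rebuild counterparts paired so nothing is forgotten, it can help to define them side by side; a sketch (the constant names are illustrative — Elasticsearch accepts multiple settings in a single PUT /&lt;index&gt;/_settings request, so each phase needs only one call):

```python
import json

# Settings applied before the rebuild (speed over durability).
BULK_LOAD_SETTINGS = {
    "refresh_interval": "-1",
    "number_of_replicas": 0,
    "index.translog.durability": "async",
    "index.translog.sync_interval": "30s",
}

# Settings restored after the rebuild completes.
RESTORE_SETTINGS = {
    "refresh_interval": "1s",
    "number_of_replicas": 1,
    "index.translog.durability": "request",
}

print(json.dumps(BULK_LOAD_SETTINGS))
```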
Restore Settings After Completion#
# Restore refresh
curl -X PUT "localhost:9200/products-v2/_settings" -H 'Content-Type: application/json' -d'
{ "refresh_interval": "1s" }'
# Restore replicas
curl -X PUT "localhost:9200/products-v2/_settings" -H 'Content-Type: application/json' -d'
{ "number_of_replicas": 1 }'
# Restore translog
curl -X PUT "localhost:9200/products-v2/_settings" -H 'Content-Type: application/json' -d'
{ "index.translog.durability": "request" }'
# Manual refresh
curl -X POST "localhost:9200/products-v2/_refresh"
# Force merge (optional: segment optimization)
curl -X POST "localhost:9200/products-v2/_forcemerge?max_num_segments=1"
Checklist#
Items to verify during an index rebuild:
- Is there enough disk space? - At least 2x the index size available
- Is the cluster status green? - If yellow, resolve existing issues first
- Have you chosen a rebuild method? - _reindex / Snapshot / Logstash
- Have you applied performance optimizations? - refresh, replica, translog settings
- Do the document counts match? - Compare the source and new index
- Have you restored the settings? - refresh_interval, replica, translog
Verifying Success#
Confirm the rebuild succeeded using the following methods:
Document count comparison: Verify it matches the source
echo "Source:" && curl -s "localhost:9200/products-v1/_count" | python3 -m json.tool
echo "New index:" && curl -s "localhost:9200/products-v2/_count" | python3 -m json.tool
Cluster status check: Verify the cluster is green after the rebuild
curl -X GET "localhost:9200/_cluster/health?pretty"
Sample query test: Verify that key queries work correctly
curl -X GET "localhost:9200/products-v2/_search?pretty" -H 'Content-Type: application/json' -d'
{
"query": { "match_all": {} },
"size": 5
}'
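The first two checks can be folded into a single pass/fail step; a sketch with the counts and cluster status passed in as plain values (fetching them from the cluster, as the curl commands above do, is deliberately left out):

```python
def verify_rebuild(source_count: int, dest_count: int, cluster_status: str) -> list:
    """Return a list of problems; an empty list means the rebuild looks good."""
    problems = []
    if source_count != dest_count:
        problems.append(f"doc count mismatch: {source_count} vs {dest_count}")
    if cluster_status != "green":
        problems.append(f"cluster status is {cluster_status}, expected green")
    return problems

print(verify_rebuild(1_000_000, 1_000_000, "green"))  # []
print(verify_rebuild(1_000_000, 999_998, "yellow"))
```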
Success Criteria
- Document count matches the source
- Cluster status is green
- Key queries work correctly
- Settings (refresh, replica, translog) are restored
Common Errors#
“circuit_breaking_exception” during reindex#
{
"error": {
"type": "circuit_breaking_exception",
"reason": "Data too large"
}
}
Cause: Out of memory during Reindex
Solution: Reduce the source.size value:
"source": { "index": "products-v1", "size": 1000 }
Task Disappeared#
Cause: The task was lost due to a node restart
Solution: Check completed tasks in the .tasks index:
curl -X GET "localhost:9200/.tasks/_search?pretty" -H 'Content-Type: application/json' -d'
{
"query": { "match": { "task.action": "indices:data/write/reindex" } }
}'
Version Conflict#
{
"failures": [{
"cause": { "type": "version_conflict_engine_exception" }
}]
}
Cause: Concurrent writes to the source index
Solution: Use conflicts: proceed to ignore conflicts, or set the source index to read-only:
# Ignore conflicts
"conflicts": "proceed"
# Or set the source index to read-only
curl -X PUT "localhost:9200/products-v1/_settings" -H 'Content-Type: application/json' -d'
{ "index.blocks.write": true }'
Related Documents#
- Mapping Migration - Zero-downtime mapping migration
- Cluster Scaling - Cluster-level scaling
- Memory Troubleshooting - Handling memory issues during rebuilds