Indexing Strategy

Prerequisites
Before reading this document, understand these concepts first:
Core Components - Shard, Segment concepts
Data Modeling - Mapping, Analyzer basics

Learn how Bulk indexing, Refresh, and Index Lifecycle Management (ILM) help you store large volumes of data efficiently.

Elasticsearch indexing goes beyond simply storing data. When a document is indexed, it passes through text analysis, inverted index creation, and segment management. Understanding this process explains why Bulk indexing can be roughly 10x faster than single-document indexing, why documents are not searchable immediately after indexing, and why disabling Refresh during bulk indexing helps. The right indexing strategy directly affects system performance and operational cost.

Indexing Basics

Indexing Process

flowchart LR
    A[Document Received] --> B[Analyze]
    B --> C[Create Inverted Index]
    C --> D[Memory Buffer]
    D --> E[Refresh]
    E --> F[Segment]
    F --> G[Flush]
    G --> H[Disk]

This diagram shows the indexing process: a document is received, analyzed into terms for the inverted index, and held in a memory buffer. Refresh writes the buffer to a new Segment, which makes the document searchable, and Flush commits the segments durably to disk.

Stage	Description
Analyze	Split text into tokens
Memory Buffer	Temporary storage in memory
Refresh	Write the buffer to a new, searchable Segment (default: every 1 second)
Segment	Immutable index piece
Flush	Commit Segments to disk and trim the Translog

Single vs Bulk Indexing

Why use Bulk indexing? Indexing 10,000 documents one by one requires 10,000 network round trips, which can take around 30 seconds. Sending the same data in batches of 1,000 needs only 10 requests and can finish in around 3 seconds. Actual timings depend on document size, hardware, and network latency, but because Bulk indexing removes most of the per-request overhead, it can speed up large-scale loads by 10x or more.

Single Document Indexing

PUT /products/_doc/1
{
  "name": "MacBook Pro",
  "price": 2390000
}

Bulk Indexing

Process multiple documents at once:

POST /_bulk
{"index": {"_index": "products", "_id": "1"}}
{"name": "MacBook Pro", "price": 2390000}
{"index": {"_index": "products", "_id": "2"}}
{"name": "MacBook Air", "price": 1390000}
{"index": {"_index": "products", "_id": "3"}}
{"name": "iPad", "price": 1499000}

NDJSON format: every line, including the last one, must end with a newline (\n). Each document takes two lines: an action line followed by its source.
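The NDJSON rule can be sketched in plain Java. This is a minimal illustration, not a client library: `BulkDoc` and `BulkBodyBuilder` are hypothetical names, each document is assumed to be already serialized as single-line JSON, and the index name and ID are assumed to need no JSON escaping.

```java
import java.util.List;

/** Hypothetical holder: a document ID plus its source, already serialized as one line of JSON. */
record BulkDoc(String id, String json) {}

final class BulkBodyBuilder {
    /**
     * Builds an NDJSON body for POST /_bulk: one action line and one source line
     * per document, each terminated by '\n' (including the very last line).
     */
    static String build(String index, List<BulkDoc> docs) {
        StringBuilder body = new StringBuilder();
        for (BulkDoc doc : docs) {
            // Action line: tells Elasticsearch which index and ID to use.
            body.append("{\"index\":{\"_index\":\"").append(index)
                .append("\",\"_id\":\"").append(doc.id()).append("\"}}\n");
            // Source line: the document itself, which must not contain raw newlines.
            body.append(doc.json()).append('\n');
        }
        return body.toString();
    }
}
```

Appending the newline inside the loop, rather than joining lines with a separator, is what guarantees the trailing newline the Bulk API requires.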

Performance Comparison

Method	Time for 10K docs	Network Requests
Single	~30 seconds	10,000
Bulk (1000 per batch)	~3 seconds	10

Recommended Bulk Settings

POST /_bulk
// Recommended size: 5-15MB per request
// Recommended doc count: 1,000-5,000
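One way to stay within both recommendations is to cut batches on whichever limit is reached first. The sketch below assumes documents are already serialized to JSON strings; `BulkBatcher` is a hypothetical helper, and the byte count covers only the source lines, not the action metadata lines.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class BulkBatcher {
    /** Splits serialized documents into batches capped by document count and UTF-8 byte size. */
    static List<List<String>> partition(List<String> jsonDocs, int maxDocs, long maxBytes) {
        List<List<String>> batches = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentBytes = 0;
        for (String doc : jsonDocs) {
            long docBytes = doc.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for the newline
            boolean full = current.size() >= maxDocs || currentBytes + docBytes > maxBytes;
            if (full && !current.isEmpty()) {
                // Close the current batch before it would exceed either limit.
                batches.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(doc);
            currentBytes += docBytes;
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }
}
```

For example, `BulkBatcher.partition(docs, 1_000, 10L * 1024 * 1024)` turns 10,000 small documents into 10 batches, which is the 10-request case from the comparison table. A single document larger than `maxBytes` still gets a batch of its own.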

Bulk Indexing in Spring

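import java.util.List;

import org.springframework.data.elasticsearch.core.ElasticsearchOperations;
import org.springframework.data.elasticsearch.core.query.IndexQuery;
import org.springframework.data.elasticsearch.core.query.IndexQueryBuilder;
import org.springframework.stereotype.Service;

@Service
public class ProductBulkService {

    private final ElasticsearchOperations operations;

    // Constructor injection initializes the final field.
    public ProductBulkService(ElasticsearchOperations operations) {
        this.operations = operations;
    }

    // Sends all products to Elasticsearch in a single bulk request.
    public void bulkIndex(List<Product> products) {
        List<IndexQuery> queries = products.stream()
            .map(product -> new IndexQueryBuilder()
                .withId(product.getId()) // assumes getId() returns a String
                .withObject(product)
                .build())
            .toList();

        operations.bulkIndex(queries, Product.class);
    }
}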

Refresh

What is Refresh?

Refresh is the operation that writes the Memory Buffer to a new Segment, making recently indexed documents searchable.

flowchart LR
    A[Memory Buffer] -->|Refresh| B[Segment<br>Searchable]

This diagram shows how data in the memory buffer is converted into a searchable Segment through the Refresh operation.

Refresh Interval

PUT /products/_settings
{
  "index": {
    "refresh_interval": "30s"    // Default: 1s
  }
}

Setting	Meaning	Use Case
`1s`	Every 1 second (default)	Real-time search
`30s`	Every 30 seconds	Typical service
`-1`	Disabled	During bulk indexing

Optimization for Bulk Indexing

// 1. Disable Refresh
PUT /products/_settings
{ "refresh_interval": "-1" }

// 2. Perform Bulk indexing
POST /_bulk
...

// 3. Manual Refresh
POST /products/_refresh

// 4. Restore Refresh
PUT /products/_settings
{ "refresh_interval": "1s" }
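When these steps run from application code, it helps to guarantee that step 4 happens even if the bulk load fails. The sketch below shows that pattern with try/finally. `IndexAdmin` is a hypothetical interface standing in for whichever client issues the two REST calls; it is not part of any real Elasticsearch library.

```java
/** Hypothetical abstraction over the two REST calls used here; not a real client API. */
interface IndexAdmin {
    void putSetting(String index, String key, String value); // PUT /{index}/_settings
    void refresh(String index);                              // POST /{index}/_refresh
}

final class BulkLoad {
    /**
     * Disables Refresh, runs the bulk work, refreshes once, and always
     * restores the given refresh interval, even when the bulk work throws.
     */
    static void withRefreshDisabled(IndexAdmin admin, String index,
                                    String restoreInterval, Runnable bulkWork) {
        admin.putSetting(index, "refresh_interval", "-1");      // 1. Disable Refresh
        try {
            bulkWork.run();                                      // 2. Perform Bulk indexing
            admin.refresh(index);                                // 3. Manual Refresh
        } finally {
            admin.putSetting(index, "refresh_interval", restoreInterval); // 4. Restore Refresh
        }
    }
}
```

Passing the interval to restore as a parameter avoids hardcoding `1s` for indices that normally run with a different value.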

Flush and Translog

Translog

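The Translog is Elasticsearch's write-ahead log for preventing data loss. Every operation is recorded in it before the request is acknowledged, so changes that have not yet been committed to a Lucene Segment on disk can be replayed after a crash. → Lucene Internals Details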

flowchart LR
    A[Document] --> B[Translog]
    A --> C[Memory Buffer]
    B -->|Crash Recovery| D[Data Restore]
    C -->|Flush| E[Disk Segment]

This diagram shows how a document is simultaneously written to both the Translog and memory buffer, enabling crash recovery from the Translog while Flush permanently stores data to disk.

Flush

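Flush performs a Lucene commit: data in the Memory Buffer is written to Segments on disk, and the Translog entries that are no longer needed are discarded: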

POST /products/_flush

Note: Manual Flush is usually unnecessary. Elasticsearch manages it automatically.

Flush Settings

PUT /products/_settings
{
  "index": {
    "translog": {
      "durability": "async",        // async: faster, but writes since the last sync can be lost on a crash; request (default): fsync before acknowledging
      "sync_interval": "5s",
      "flush_threshold_size": "512mb"
    }
  }
}

Index Template

Why use Index Templates? If you need to create indices like logs-2024-01-01, logs-2024-01-02 every day, do you have to manually define shard count, replicas, and mapping each time? If you accidentally miss a setting, each index ends up with a different structure, causing search and operational problems. Index Templates automatically apply settings when indices matching a pattern are created.

The following template defines settings and mappings that are applied automatically to new matching indices:

PUT /_index_template/products_template
{
  "index_patterns": ["products-*"],
  "priority": 1,
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "refresh_interval": "5s"
    },
    "mappings": {
      "properties": {
        "name": { "type": "text" },
        "price": { "type": "integer" },
        "created_at": { "type": "date" }
      }
    }
  }
}

The template is now applied automatically whenever an index such as products-2024 or products-2025 is created.
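To see which index names a pattern such as `products-*` would cover, the wildcard can be emulated in a few lines of Java. This is a simplified sketch of the matching idea only: `IndexPatternMatcher` is a hypothetical helper, it treats `*` as "any sequence of characters", and it ignores template priority and every other rule Elasticsearch applies.

```java
import java.util.regex.Pattern;

final class IndexPatternMatcher {
    /** Returns true if indexName matches a pattern in which '*' stands for any character sequence. */
    static boolean matches(String pattern, String indexName) {
        // Split on '*' (keeping trailing empty parts) and quote each literal piece.
        String[] parts = pattern.split("\\*", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                regex.append(".*"); // each '*' becomes "match anything"
            }
            regex.append(Pattern.quote(parts[i]));
        }
        return Pattern.matches(regex.toString(), indexName);
    }
}
```

With this helper, `matches("products-*", "products-2024")` is true, while an index named `products` (no hyphen) does not match and would not receive the template.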

Index Lifecycle Management (ILM)

Why automate index lifecycle management? If log data accumulates tens of GBs daily and you have to manually delete old indices and move infrequently accessed indices to lower-cost nodes, what happens? Operator mistakes can fill up disks or accidentally delete important data. ILM automatically executes Hot → Warm → Cold → Delete policies, eliminating this operational burden and risk.

ILM automatically manages the lifecycle of time-series data and is especially useful for log data. → ILM Practical Example

Lifecycle Phases

flowchart LR
    A[Hot<br>Active write/read] --> B[Warm<br>Read-heavy]
    B --> C[Cold<br>Occasional reads]
    C --> D[Frozen<br>Rarely read]
    D --> E[Delete<br>Remove]

This diagram shows the five phases of the index lifecycle. Data starts in the Hot phase with active usage, gradually moves through Warm, Cold, and Frozen as access frequency decreases, and is finally Deleted.

Creating ILM Policy

PUT /_ilm/policy/logs_policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "7d"
          },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 },
          "set_priority": { "priority": 50 }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "set_priority": { "priority": 0 }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
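The `min_age` thresholds in this policy can be read as a simple lookup from index age to phase. The sketch below encodes only the four thresholds of `logs_policy` (0, 7d, 30d, 90d); `LogsPolicyPhases` is a hypothetical name. It is a simplification: ILM evaluates policies periodically rather than instantly, an index leaves the hot phase only after rollover, and for rolled-over indices the age is counted from the rollover time.

```java
import java.time.Duration;

final class LogsPolicyPhases {
    /** Maps an index age to the phase whose min_age it has most recently reached. */
    static String phaseFor(Duration age) {
        if (age.compareTo(Duration.ofDays(90)) >= 0) {
            return "delete"; // min_age: 90d
        }
        if (age.compareTo(Duration.ofDays(30)) >= 0) {
            return "cold";   // min_age: 30d
        }
        if (age.compareTo(Duration.ofDays(7)) >= 0) {
            return "warm";   // min_age: 7d
        }
        return "hot";        // min_age: 0ms
    }
}
```

Under these assumptions, an index that rolled over 10 days ago is in the warm phase, and one that rolled over 45 days ago is in the cold phase.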

Applying ILM Policy

PUT /_index_template/logs_template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "logs_policy",
      "index.lifecycle.rollover_alias": "logs"
    }
  }
}

Note: when index.lifecycle.rollover_alias is used, rollover only works after the first index has been created manually with the logs alias set as its write index, and the index name must end in a number (for example logs-000001).

Reindex

Reindex copies documents from an existing index into a new index, optionally filtering or transforming them along the way:

Basic Reindex

POST /_reindex
{
  "source": { "index": "products-old" },
  "dest": { "index": "products-new" }
}

Filtered Reindex

POST /_reindex
{
  "source": {
    "index": "products-old",
    "query": {
      "term": { "in_stock": true }
    }
  },
  "dest": { "index": "products-active" }
}

Field Transformation

POST /_reindex
{
  "source": { "index": "products-old" },
  "dest": { "index": "products-new" },
  "script": {
    "source": "ctx._source.price_krw = ctx._source.price * 1000"
  }
}

Async Reindex

POST /_reindex?wait_for_completion=false
{
  "source": { "index": "large-index" },
  "dest": { "index": "large-index-new" }
}

Check progress:

GET /_tasks?actions=*reindex&detailed

Alias

Why use Aliases? If your application code hardcodes the index name products-v1 and you need to switch to products-v2 after a mapping change, you’d have to modify and redeploy all code, causing downtime during the transition. An Alias gives an index a nickname so the application only references the alias, allowing the actual index to be swapped with zero downtime.

Give indices alternative names for flexible management:

Create Alias

POST /_aliases
{
  "actions": [
    { "add": { "index": "products-v1", "alias": "products" } }
  ]
}

Zero Downtime Reindexing

// 1. Create new index and copy data
PUT /products-v2
POST /_reindex
{
  "source": { "index": "products-v1" },
  "dest": { "index": "products-v2" }
}

// 2. Switch Alias (atomic)
POST /_aliases
{
  "actions": [
    { "remove": { "index": "products-v1", "alias": "products" } },
    { "add": { "index": "products-v2", "alias": "products" } }
  ]
}

Because the application references only the products alias, the switch happens with zero downtime.
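If the swap is triggered from application code, the key point is that both actions travel in one request body, because that is what makes the switch atomic. The sketch below only builds that body as a string; `AliasSwap` is a hypothetical helper, sending it to `POST /_aliases` is left to whichever HTTP or Elasticsearch client you use, and the names are assumed to need no JSON escaping.

```java
final class AliasSwap {
    /**
     * Builds the body for POST /_aliases that moves an alias from oldIndex to
     * newIndex. Both actions are in one request so the swap is applied atomically.
     */
    static String body(String alias, String oldIndex, String newIndex) {
        return """
            {
              "actions": [
                { "remove": { "index": "%s", "alias": "%s" } },
                { "add": { "index": "%s", "alias": "%s" } }
              ]
            }
            """.formatted(oldIndex, alias, newIndex, alias);
    }
}
```

Calling `AliasSwap.body("products", "products-v1", "products-v2")` reproduces the request shown in step 2. Sending the remove and add as two separate requests would leave a short window in which the alias points at no index.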

Indexing Performance Optimization

Bulk Indexing Checklist

// 1. Disable Replicas
PUT /products/_settings
{ "number_of_replicas": 0 }

// 2. Disable Refresh
PUT /products/_settings
{ "refresh_interval": "-1" }

// 3. Perform Bulk indexing
POST /_bulk
...

// 4. Refresh
POST /products/_refresh

// 5. Restore settings
PUT /products/_settings
{
  "number_of_replicas": 1,
  "refresh_interval": "1s"
}

Optimal Bulk Size

Item	Recommended
Request size	5-15 MB
Document count	1,000-5,000
Concurrent requests	2-3 (per node)

Indexing Slow Log

Log indexing operations that exceed a time threshold so that slow documents can be identified:

PUT /products/_settings
{
  "index": {
    "indexing": {
      "slowlog": {
        "threshold": {
          "index": {
            "warn": "10s",
            "info": "5s"
          }
        }
      }
    }
  }
}

Next Steps

Goal	Recommended Document
Cluster configuration	Cluster Management
Search optimization	Performance Tuning
Failure response	High Availability