Prerequisites

Before reading this document, understand these concepts first:

Learn Bulk indexing, Refresh, and Index Lifecycle Management for efficiently storing large volumes of data.

Elasticsearch indexing goes beyond simply storing data. When a document is indexed, it passes through text analysis, inverted index creation, and segment management. Understanding this process explains why Bulk indexing can be roughly 10x faster than single-document indexing, why documents aren’t immediately searchable after indexing, and why disabling Refresh during bulk indexing speeds it up. The right indexing strategy directly affects system performance and operational costs.

Indexing Basics#

Indexing Process#

flowchart LR
    A[Document Received] --> B[Analyze]
    B --> C[Create Inverted Index]
    C --> D[Memory Buffer]
    D --> E[Refresh]
    E --> F[Segment]
    F --> G[Flush]
    G --> H[Disk]

This diagram shows the indexing process where a document is received, analyzed, stored in an inverted index, held in a memory buffer, made searchable via Refresh, written to a Segment, and permanently stored to disk via Flush.

| Stage | Description |
|---|---|
| Analyze | Split text into tokens |
| Memory Buffer | Temporary storage in memory |
| Refresh | Make searchable (default 1 second) |
| Segment | Immutable index piece |
| Flush | Permanent storage to disk |

Single vs Bulk Indexing#

Why use Bulk indexing? Indexing 10,000 documents one by one results in 10,000 network round trips, taking about 30 seconds. Sending the same data in batches of 1,000 completes in about 3 seconds with just 10 requests. Bulk indexing dramatically reduces network overhead, improving large-scale data processing speed by more than 10x.
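The request counts in this comparison follow from a ceiling division of the document count by the batch size. A minimal Java sketch of that arithmetic, where the class and method names are illustrative and not part of any Elasticsearch client:

```java
final class BulkRequestMath {
    /** Number of HTTP requests needed to send totalDocs in batches of batchSize. */
    static long requestCount(long totalDocs, long batchSize) {
        if (totalDocs < 0 || batchSize <= 0) {
            throw new IllegalArgumentException("totalDocs must be >= 0 and batchSize > 0");
        }
        // Ceiling division: a partially filled last batch still costs one request.
        return (totalDocs + batchSize - 1) / batchSize;
    }

    public static void main(String[] args) {
        System.out.println("single-document requests: " + requestCount(10_000, 1));
        System.out.println("bulk requests (1,000 per batch): " + requestCount(10_000, 1_000));
    }
}
```

The sketch only counts round trips. Actual timings depend on network latency, document size, and cluster load.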

Single Document Indexing#

PUT /products/_doc/1
{
  "name": "MacBook Pro",
  "price": 2390000
}

Bulk Indexing#

Process multiple documents at once:

POST /_bulk
{"index": {"_index": "products", "_id": "1"}}
{"name": "MacBook Pro", "price": 2390000}
{"index": {"_index": "products", "_id": "2"}}
{"name": "MacBook Air", "price": 1390000}
{"index": {"_index": "products", "_id": "3"}}
{"name": "iPad", "price": 1499000}

NDJSON format: every action line and every document line must end with a newline (\n), including the last line of the request.
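As a rough illustration of the NDJSON layout, the following Java sketch assembles a _bulk body from documents that are already serialized as single-line JSON. BulkBodyBuilder is a hypothetical helper, and it assumes index names and IDs contain no characters that need JSON escaping.

```java
import java.util.Map;

final class BulkBodyBuilder {
    /**
     * Builds an NDJSON body for POST /_bulk.
     * Keys are document IDs; values are single-line JSON documents.
     * Assumption: index and IDs need no JSON escaping.
     */
    static String build(String index, Map<String, String> jsonById) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> entry : jsonById.entrySet()) {
            if (entry.getValue().contains("\n")) {
                throw new IllegalArgumentException("documents must be single-line JSON");
            }
            // Action line, then the document source, each terminated by \n.
            body.append("{\"index\":{\"_index\":\"").append(index)
                .append("\",\"_id\":\"").append(entry.getKey()).append("\"}}\n");
            body.append(entry.getValue()).append('\n');
        }
        return body.toString();
    }
}
```

Appending the newline after every line, rather than joining lines with a separator, is what guarantees the trailing newline on the last line.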

Performance Comparison#

| Method | Time for 10K docs | Network Requests |
|---|---|---|
| Single | ~30 seconds | 10,000 |
| Bulk (1000 per batch) | ~3 seconds | 10 |

POST /_bulk
// Recommended size: 5-15MB per request
// Recommended doc count: 1,000-5,000
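One way to respect both limits at once is to close a batch as soon as either the document count or the byte budget would be exceeded. The sketch below is a simplified illustration: BulkBatcher is a hypothetical name, and the size estimate counts only the document lines, not the action lines.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class BulkBatcher {
    /** Splits single-line JSON documents into batches bounded by count and by UTF-8 size. */
    static List<List<String>> split(List<String> jsonDocs, int maxDocs, long maxBytes) {
        List<List<String>> batches = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentBytes = 0;
        for (String doc : jsonDocs) {
            long docBytes = doc.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for the newline
            boolean full = !current.isEmpty()
                && (current.size() >= maxDocs || currentBytes + docBytes > maxBytes);
            if (full) {
                batches.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            // A single oversized document still gets a batch of its own.
            current.add(doc);
            currentBytes += docBytes;
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }
}
```

For example, split(docs, 1_000, 10L * 1024 * 1024) keeps each request at or below 1,000 documents and roughly 10 MB.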

Bulk Indexing in Spring#

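import java.util.List;

import org.springframework.data.elasticsearch.core.ElasticsearchOperations;
import org.springframework.data.elasticsearch.core.query.IndexQuery;
import org.springframework.data.elasticsearch.core.query.IndexQueryBuilder;
import org.springframework.stereotype.Service;

@Service
public class ProductBulkService {

    private final ElasticsearchOperations operations;

    // Constructor injection; required because the field is final.
    public ProductBulkService(ElasticsearchOperations operations) {
        this.operations = operations;
    }

    // Sends all given products in a single bulk request.
    public void bulkIndex(List<Product> products) {
        List<IndexQuery> queries = products.stream()
            .map(product -> new IndexQueryBuilder()
                .withId(product.getId())   // assumes Product exposes a String ID
                .withObject(product)
                .build())
            .toList();

        operations.bulkIndex(queries, Product.class);
    }
}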

Refresh#

What is Refresh?#

Refresh is the operation that writes the contents of the Memory Buffer into a new Segment, which makes those documents searchable.

flowchart LR
    A[Memory Buffer] -->|Refresh| B[Segment<br>Searchable]

This diagram shows how data in the memory buffer is converted into a searchable Segment through the Refresh operation.
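The visibility rule can be illustrated with a toy in-memory model. This is a deliberately simplified sketch, not how Elasticsearch or Lucene is implemented; ToyRefreshModel and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class ToyRefreshModel {
    private final List<String> memoryBuffer = new ArrayList<>();
    private final List<List<String>> segments = new ArrayList<>();

    /** Indexed documents land in the buffer and are not yet searchable. */
    void index(String doc) {
        memoryBuffer.add(doc);
    }

    /** Refresh turns the buffer into a new immutable segment. */
    void refresh() {
        if (memoryBuffer.isEmpty()) {
            return;
        }
        segments.add(List.copyOf(memoryBuffer));
        memoryBuffer.clear();
    }

    /** Search only looks at segments, never at the buffer. */
    boolean isSearchable(String doc) {
        return segments.stream().anyMatch(segment -> segment.contains(doc));
    }

    int segmentCount() {
        return segments.size();
    }
}
```

In this model every refresh that finds buffered documents adds one segment, which hints at why very frequent refreshes during heavy indexing create many small segments.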

Refresh Interval#

PUT /products/_settings
{
  "index": {
    "refresh_interval": "30s"    // Default: 1s
  }
}

| Setting | Meaning | Use Case |
|---|---|---|
| 1s | Every 1 second (default) | Real-time search |
| 30s | Every 30 seconds | Typical service |
| -1 | Disabled | During bulk indexing |

Optimization for Bulk Indexing#

// 1. Disable Refresh
PUT /products/_settings
{ "refresh_interval": "-1" }

// 2. Perform Bulk indexing
POST /_bulk
...

// 3. Manual Refresh
POST /products/_refresh

// 4. Restore Refresh
PUT /products/_settings
{ "refresh_interval": "1s" }
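When these steps are driven from application code, it helps to restore the refresh interval even if the bulk load fails. The sketch below shows that ordering with try/finally. IndexAdmin is a hypothetical abstraction over the REST calls above, not an Elasticsearch client API, and RecordingAdmin is an in-memory stand-in that only records calls.

```java
import java.util.ArrayList;
import java.util.List;

final class BulkLoadRefreshGuard {
    /** Hypothetical wrapper around PUT /{index}/_settings and POST /{index}/_refresh. */
    interface IndexAdmin {
        void setRefreshInterval(String index, String interval);
        void refresh(String index);
    }

    static void runWithRefreshDisabled(IndexAdmin admin, String index,
                                       String restoreInterval, Runnable bulkLoad) {
        admin.setRefreshInterval(index, "-1");                 // 1. disable refresh
        try {
            bulkLoad.run();                                    // 2. bulk indexing
            admin.refresh(index);                              // 3. manual refresh
        } finally {
            admin.setRefreshInterval(index, restoreInterval);  // 4. always restore
        }
    }

    /** In-memory stand-in used for illustration; it performs no network calls. */
    static final class RecordingAdmin implements IndexAdmin {
        final List<String> calls = new ArrayList<>();

        public void setRefreshInterval(String index, String interval) {
            calls.add("settings " + index + " refresh_interval=" + interval);
        }

        public void refresh(String index) {
            calls.add("refresh " + index);
        }
    }
}
```

Without the finally block, a failed bulk load would leave the index with refresh disabled, and new documents would stay invisible to search until someone noticed.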

Flush and Translog#

Translog#

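Write-ahead log that prevents data loss. Lucene persists changes only at commit time, so Elasticsearch records every operation in the Translog and replays the uncommitted operations after a crash. → Lucene Internals Details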

flowchart LR
    A[Document] --> B[Translog]
    A --> C[Memory Buffer]
    B -->|Crash Recovery| D[Data Restore]
    C -->|Flush| E[Disk Segment]

This diagram shows how a document is simultaneously written to both the Translog and memory buffer, enabling crash recovery from the Translog while Flush permanently stores data to disk.
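The interplay between Translog, memory buffer, and Flush can be sketched as a toy model. It is a conceptual illustration only, with hypothetical names, and it ignores fsync timing, refresh, and everything else a real shard does.

```java
import java.util.ArrayList;
import java.util.List;

final class ToyTranslogModel {
    private final List<String> translog = new ArrayList<>();      // modeled as surviving a crash
    private final List<String> memoryBuffer = new ArrayList<>();  // lost on a crash
    private final List<String> committed = new ArrayList<>();     // flushed to disk segments

    /** Every operation goes to both the translog and the memory buffer. */
    void index(String doc) {
        translog.add(doc);
        memoryBuffer.add(doc);
    }

    /** Flush persists buffered documents; the translog entries are then no longer needed. */
    void flush() {
        committed.addAll(memoryBuffer);
        memoryBuffer.clear();
        translog.clear();
    }

    /** A crash wipes everything that lived only in memory. */
    void crash() {
        memoryBuffer.clear();
    }

    /** Recovery replays the operations that were never flushed. */
    void recover() {
        memoryBuffer.clear();
        memoryBuffer.addAll(translog);
    }

    List<String> allDocs() {
        List<String> all = new ArrayList<>(committed);
        all.addAll(memoryBuffer);
        return all;
    }
}
```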

Flush#

Flush commits the Memory Buffer contents to Segments on disk and then truncates the Translog, since those operations no longer need to be replayed:

POST /products/_flush

Note: Manual Flush is usually unnecessary. Elasticsearch manages it automatically.

Flush Settings#

PUT /products/_settings
{
  "index": {
    "translog": {
      "durability": "async",        // async: faster, but writes since the last sync can be lost on a crash; request: fsync before acknowledging each request
      "sync_interval": "5s",
      "flush_threshold_size": "512mb"
    }
  }
}

Index Template#

Why use Index Templates? If you need to create indices like logs-2024-01-01, logs-2024-01-02 every day, do you have to manually define shard count, replicas, and mapping each time? If you accidentally miss a setting, each index ends up with a different structure, causing search and operational problems. Index Templates automatically apply settings when indices matching a pattern are created.

Settings automatically applied when creating new indices:

PUT /_index_template/products_template
{
  "index_patterns": ["products-*"],
  "priority": 1,
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "refresh_interval": "5s"
    },
    "mappings": {
      "properties": {
        "name": { "type": "text" },
        "price": { "type": "integer" },
        "created_at": { "type": "date" }
      }
    }
  }
}

These settings are now applied automatically whenever an index such as products-2024 or products-2025 is created.
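To make the matching rule concrete, here is a small Java sketch of wildcard matching plus priority selection. It is a simplified illustration under stated assumptions: only * is treated as a wildcard, and Template is a hypothetical record rather than a client class.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class TemplateMatcher {
    /** Hypothetical, simplified view of an index template. */
    record Template(String name, List<String> indexPatterns, int priority) {}

    /** Treats '*' as "any characters" and everything else as a literal. */
    static boolean matches(String pattern, String indexName) {
        String regex = Arrays.stream(pattern.split("\\*", -1))
            .map(Pattern::quote)
            .collect(Collectors.joining(".*"));
        return indexName.matches(regex);
    }

    /** Among the templates whose patterns match, the highest priority wins. */
    static Optional<Template> select(List<Template> templates, String indexName) {
        return templates.stream()
            .filter(t -> t.indexPatterns().stream().anyMatch(p -> matches(p, indexName)))
            .max(Comparator.comparingInt(Template::priority));
    }
}
```

With a catch-all template at priority 0 and products_template at priority 1, select would pick products_template for products-2025.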


Index Lifecycle Management (ILM)#

Why automate index lifecycle management? If log data accumulates tens of GBs daily and you have to manually delete old indices and move infrequently accessed indices to lower-cost nodes, what happens? Operator mistakes can fill up disks or accidentally delete important data. ILM automatically executes Hot → Warm → Cold → Delete policies, eliminating this operational burden and risk.

Automatically manage the lifecycle of time-series data. Especially useful for managing log data. → ILM Practical Example

Lifecycle Phases#

flowchart LR
    A[Hot<br>Active write/read] --> B[Warm<br>Read-heavy]
    B --> C[Cold<br>Occasional reads]
    C --> D[Frozen<br>Rarely read]
    D --> E[Delete<br>Remove]

This diagram shows the five phases of the index lifecycle. Data starts in the Hot phase with active usage, gradually moves through Warm, Cold, and Frozen as access frequency decreases, and is finally Deleted.

Creating ILM Policy#

PUT /_ilm/policy/logs_policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "7d"
          },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 },
          "set_priority": { "priority": 50 }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "set_priority": { "priority": 0 }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
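The thresholds in logs_policy can be read as a simple decision table. The Java sketch below mirrors the min_age values and rollover conditions from the policy above; it is a simplified model with hypothetical names. Real ILM evaluates policies periodically, waits for each phase's actions to complete, and, for an index that has rolled over, measures min_age from the rollover time.

```java
import java.time.Duration;
import java.util.LinkedHashMap;
import java.util.Map;

final class IlmPhaseSketch {
    // min_age values copied from the logs_policy example, in phase order.
    private static final Map<String, Duration> MIN_AGE = new LinkedHashMap<>();
    static {
        MIN_AGE.put("hot", Duration.ZERO);
        MIN_AGE.put("warm", Duration.ofDays(7));
        MIN_AGE.put("cold", Duration.ofDays(30));
        MIN_AGE.put("delete", Duration.ofDays(90));
    }

    /** Returns the last phase whose min_age the index has reached. */
    static String phaseFor(Duration age) {
        String phase = "hot";
        for (Map.Entry<String, Duration> entry : MIN_AGE.entrySet()) {
            if (age.compareTo(entry.getValue()) >= 0) {
                phase = entry.getKey();
            }
        }
        return phase;
    }

    /** Rollover triggers when any configured condition is met (max_size 50gb or max_age 7d). */
    static boolean shouldRollover(long sizeBytes, Duration ageSinceCreation) {
        long maxSizeBytes = 50L * 1024 * 1024 * 1024;
        return sizeBytes >= maxSizeBytes
            || ageSinceCreation.compareTo(Duration.ofDays(7)) >= 0;
    }
}
```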

Applying ILM Policy#

PUT /_index_template/logs_template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "logs_policy",
      "index.lifecycle.rollover_alias": "logs"
    }
  }
}

Note: For alias-based rollover, the first index (for example logs-000001) must be created manually with logs set as its write alias (is_write_index: true). Subsequent indices are created by rollover.

Reindex#

Copy/transform existing index to new index:

Basic Reindex#

POST /_reindex
{
  "source": { "index": "products-old" },
  "dest": { "index": "products-new" }
}

Filtered Reindex#

POST /_reindex
{
  "source": {
    "index": "products-old",
    "query": {
      "term": { "in_stock": true }
    }
  },
  "dest": { "index": "products-active" }
}

Field Transformation#

POST /_reindex
{
  "source": { "index": "products-old" },
  "dest": { "index": "products-new" },
  "script": {
    "source": "ctx._source.price_krw = ctx._source.price * 1000"
  }
}

Async Reindex#

POST /_reindex?wait_for_completion=false
{
  "source": { "index": "large-index" },
  "dest": { "index": "large-index-new" }
}

Check progress:

GET /_tasks?actions=*reindex&detailed
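From application code, the asynchronous flow is two requests: start the reindex with wait_for_completion=false, read the task ID from the response, then poll GET /_tasks/<task_id>. The sketch below only builds the requests with the JDK HTTP client and sends nothing. The base URL and class name are assumptions, and index names are assumed to need no JSON escaping.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class AsyncReindexRequest {
    /** Builds POST /_reindex?wait_for_completion=false for the given indices. */
    static HttpRequest start(String baseUrl, String sourceIndex, String destIndex) {
        String body = "{\"source\":{\"index\":\"" + sourceIndex + "\"},"
                    + "\"dest\":{\"index\":\"" + destIndex + "\"}}";
        return HttpRequest.newBuilder(URI.create(baseUrl + "/_reindex?wait_for_completion=false"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
    }

    /** Builds GET /_tasks/{taskId}; taskId is the "task" value returned by the start request. */
    static HttpRequest taskStatus(String baseUrl, String taskId) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/_tasks/" + taskId))
            .GET()
            .build();
    }
}
```

A caller would pass these requests to java.net.http.HttpClient.send and repeat the status request until the task reports completion. Authentication and TLS settings are left out of the sketch.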

Alias#

Why use Aliases? If your application code hardcodes the index name products-v1 and you need to switch to products-v2 after a mapping change, you’d have to modify and redeploy all code, causing downtime during the transition. An Alias gives an index a nickname so the application only references the alias, allowing the actual index to be swapped with zero downtime.

Give indices alternative names for flexible management:

Create Alias#

POST /_aliases
{
  "actions": [
    { "add": { "index": "products-v1", "alias": "products" } }
  ]
}

Zero Downtime Reindexing#

// 1. Create new index and copy data
PUT /products-v2
POST /_reindex
{
  "source": { "index": "products-v1" },
  "dest": { "index": "products-v2" }
}

// 2. Switch Alias (atomic)
POST /_aliases
{
  "actions": [
    { "remove": { "index": "products-v1", "alias": "products" } },
    { "add": { "index": "products-v2", "alias": "products" } }
  ]
}

Application uses only products alias → Zero-downtime switch


Indexing Performance Optimization#

Bulk Indexing Checklist#

// 1. Disable Replicas
PUT /products/_settings
{ "number_of_replicas": 0 }

// 2. Disable Refresh
PUT /products/_settings
{ "refresh_interval": "-1" }

// 3. Perform Bulk indexing
POST /_bulk
...

// 4. Refresh
POST /products/_refresh

// 5. Restore settings
PUT /products/_settings
{
  "number_of_replicas": 1,
  "refresh_interval": "1s"
}

Optimal Bulk Size#

| Item | Recommended |
|---|---|
| Request size | 5-15 MB |
| Document count | 1,000-5,000 |
| Concurrent requests | 2-3 (per node) |
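A fixed-size thread pool is one straightforward way to cap the number of in-flight bulk requests. In this sketch the sender callback stands in for whatever actually issues the bulk request (for example a call into a service like ProductBulkService); ConcurrentBulkSender itself is a hypothetical helper, and choosing maxConcurrent from the node count is left to the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

final class ConcurrentBulkSender {
    /** Sends every batch through sender, with at most maxConcurrent batches in flight. */
    static <T> void sendAll(List<List<T>> batches, int maxConcurrent, Consumer<List<T>> sender) {
        ExecutorService pool = Executors.newFixedThreadPool(maxConcurrent);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (List<T> batch : batches) {
                pending.add(pool.submit(() -> sender.accept(batch)));
            }
            for (Future<?> result : pending) {
                result.get(); // wait for completion and surface the first failure
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while sending bulk batches", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a bulk batch failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The pool size is the concurrency limit: extra batches wait in the executor's queue until a worker thread is free.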

Indexing Slow Log#

PUT /products/_settings
{
  "index": {
    "indexing": {
      "slowlog": {
        "threshold": {
          "index": {
            "warn": "10s",
            "info": "5s"
          }
        }
      }
    }
  }
}

Next Steps#

| Goal | Recommended Document |
|---|---|
| Cluster configuration | Cluster Management |
| Search optimization | Performance Tuning |
| Failure response | High Availability |