Prerequisites
Before reading this document, make sure you understand these concepts:
- Core Components - Shard, Segment concepts
- Data Modeling - Mapping, Analyzer basics
Learn how to use Bulk indexing, Refresh, and Index Lifecycle Management to store large volumes of data efficiently.
Indexing in Elasticsearch is more than storing data. Each indexed document goes through text analysis, inverted index creation, and segment management. Understanding this process explains why Bulk indexing can be roughly 10x faster than single-document indexing (far fewer network round trips), why documents are not searchable immediately after indexing, and why disabling Refresh during bulk loads helps. The right indexing strategy directly affects system performance and operational cost.
Indexing Basics#
Indexing Process#
flowchart LR
A[Document Received] --> B[Analyze]
B --> C[Create Inverted Index]
C --> D[Memory Buffer]
D --> E[Refresh]
E --> F[Segment]
F --> G[Flush]
G --> H[Disk]
| Stage | Description |
|---|---|
| Analyze | Split text into tokens |
| Memory Buffer | Temporary storage in memory |
| Refresh | Make searchable (default 1 second) |
| Segment | Immutable index piece |
| Flush | Permanent storage to disk |
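The gap between "indexed" and "searchable" in the stages above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: `RefreshModel` and its methods are hypothetical names, and it models nothing beyond the visibility rule (no analysis, segments on disk, or translog).

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of near-real-time visibility. Not Elasticsearch code. */
class RefreshModel {
    private final List<String> buffer = new ArrayList<>();     // indexed, not yet searchable
    private final List<String> searchable = new ArrayList<>(); // exposed by a refresh

    /** Indexing only appends to the in-memory buffer. */
    void index(String doc) {
        buffer.add(doc);
    }

    /** Refresh moves everything buffered so far into the searchable set. */
    void refresh() {
        searchable.addAll(buffer);
        buffer.clear();
    }

    /** Search sees only documents that a refresh has exposed. */
    boolean isSearchable(String doc) {
        return searchable.contains(doc);
    }

    public static void main(String[] args) {
        RefreshModel index = new RefreshModel();
        index.index("MacBook Pro");
        System.out.println(index.isSearchable("MacBook Pro")); // false: no refresh yet
        index.refresh();
        System.out.println(index.isSearchable("MacBook Pro")); // true
    }
}
```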
Single vs Bulk Indexing#
Single Document Indexing#
PUT /products/_doc/1
{
"name": "MacBook Pro",
"price": 2390000
}
Bulk Indexing#
Process multiple documents at once:
POST /_bulk
{"index": {"_index": "products", "_id": "1"}}
{"name": "MacBook Pro", "price": 2390000}
{"index": {"_index": "products", "_id": "2"}}
{"name": "MacBook Air", "price": 1390000}
{"index": {"_index": "products", "_id": "3"}}
{"name": "iPad", "price": 1499000}
NDJSON format: every line ends with a newline (\n), including the last line.
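The trailing-newline rule is easy to get wrong when the body is assembled by hand. The following is a minimal sketch of such an assembly step, not a client API: `BulkBody` and `Doc` are hypothetical names, and it assumes index names and ids need no JSON escaping and that each source document is already serialized on a single line.

```java
import java.util.List;

/** Builds a _bulk request body in NDJSON form (hypothetical helper). */
class BulkBody {
    /** Target index, document id, and the already-serialized single-line JSON source. */
    record Doc(String index, String id, String sourceJson) {}

    static String build(List<Doc> docs) {
        StringBuilder body = new StringBuilder();
        for (Doc doc : docs) {
            // Action line first, then the source line; each one ends with '\n'.
            body.append("{\"index\":{\"_index\":\"").append(doc.index())
                .append("\",\"_id\":\"").append(doc.id()).append("\"}}\n");
            body.append(doc.sourceJson()).append('\n');
        }
        return body.toString(); // the last line therefore also ends with '\n'
    }

    public static void main(String[] args) {
        System.out.print(build(List.of(
            new Doc("products", "1", "{\"name\":\"MacBook Pro\",\"price\":2390000}"),
            new Doc("products", "2", "{\"name\":\"MacBook Air\",\"price\":1390000}"))));
    }
}
```

In practice a client library builds this body for you; the sketch only makes the line structure visible.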
Performance Comparison#
| Method | Time for 10K docs | Network Requests |
|---|---|---|
| Single | ~30 seconds | 10,000 |
| Bulk (1000 per batch) | ~3 seconds | 10 |
Recommended Bulk Settings#
POST /_bulk
// Recommended size: 5-15MB per request
// Recommended doc count: 1,000-5,000
Bulk Indexing in Spring#
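import java.util.List;
import org.springframework.data.elasticsearch.core.ElasticsearchOperations;
import org.springframework.data.elasticsearch.core.query.IndexQuery;
import org.springframework.data.elasticsearch.core.query.IndexQueryBuilder;
import org.springframework.stereotype.Service;

@Service
public class ProductBulkService {
    private final ElasticsearchOperations operations;

    public ProductBulkService(ElasticsearchOperations operations) {
        this.operations = operations;
    }

    // Sends all products in one bulk request; Product is assumed to expose a String id.
    public void bulkIndex(List<Product> products) {
        List<IndexQuery> queries = products.stream()
            .map(product -> new IndexQueryBuilder()
                .withId(product.getId())
                .withObject(product)
                .build())
            .toList();
        operations.bulkIndex(queries, Product.class);
    }
}
Refresh#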
What is Refresh?#
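Refresh writes the documents held in the Memory Buffer into a new segment and opens that segment for search. Until a refresh happens, newly indexed documents are not searchable.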
flowchart LR
A[Memory Buffer] -->|Refresh| B[Segment<br>Searchable]
Refresh Interval#
PUT /products/_settings
{
"index": {
"refresh_interval": "30s" // Default: 1s
}
}
| Setting | Meaning | Use Case |
|---|---|---|
| 1s | Every 1 second (default) | Real-time search |
| 30s | Every 30 seconds | Typical service |
| -1 | Disabled | During bulk indexing |
Optimization for Bulk Indexing#
// 1. Disable Refresh
PUT /products/_settings
{ "refresh_interval": "-1" }
// 2. Perform Bulk indexing
POST /_bulk
...
// 3. Manual Refresh
POST /products/_refresh
// 4. Restore Refresh
PUT /products/_settings
{ "refresh_interval": "1s" }
Flush and Translog#
Translog#
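The translog is Elasticsearch's write-ahead log. Every indexing operation is recorded in it, so operations that have not yet been persisted by a Lucene commit can be replayed after a crash. → Lucene Internals Details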
flowchart LR
A[Document] --> B[Translog]
A --> C[Memory Buffer]
B -->|Crash Recovery| D[Data Restore]
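C -->|Flush| E[Disk Segment]
Flush#
A flush performs a Lucene commit: it persists the segments to disk and starts a new translog generation, since the committed operations no longer need to be replayed:
POST /products/_flush
Note: Manual Flush is usually unnecessary. Elasticsearch manages it automatically.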
Flush Settings#
PUT /products/_settings
{
"index": {
"translog": {
"durability": "async", // async: faster, but writes since the last sync can be lost on a crash; request (default): fsync before acknowledging
"sync_interval": "5s",
"flush_threshold_size": "512mb"
}
}
}
Index Template#
Settings and mappings that are applied automatically when a new index name matches the template's index_patterns:
PUT /_index_template/products_template
{
"index_patterns": ["products-*"],
"priority": 1,
"template": {
"settings": {
"number_of_shards": 3,
"number_of_replicas": 1,
"refresh_interval": "5s"
},
"mappings": {
"properties": {
"name": { "type": "text" },
"price": { "type": "integer" },
"created_at": { "type": "date" }
}
}
}
}
The template is now applied automatically when indices such as products-2024 or products-2025 are created.
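To reason about which index names a pattern such as `products-*` covers, a simplified wildcard matcher can help. This is a sketch only: `IndexPatternMatch` is a hypothetical helper, it treats `*` as "any run of characters", and it does not reproduce Elasticsearch's own matching code or the priority rules that apply when several templates match.

```java
import java.util.regex.Pattern;

/** Simplified wildcard matching for index patterns (hypothetical helper). */
class IndexPatternMatch {
    /** Returns true if indexName matches pattern, where '*' matches any run of characters. */
    static boolean matches(String pattern, String indexName) {
        String[] literals = pattern.split("\\*", -1); // keep empty pieces around each '*'
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < literals.length; i++) {
            if (i > 0) {
                regex.append(".*"); // each '*' in the pattern becomes ".*"
            }
            regex.append(Pattern.quote(literals[i])); // everything else is matched literally
        }
        return Pattern.matches(regex.toString(), indexName);
    }

    public static void main(String[] args) {
        System.out.println(matches("products-*", "products-2024")); // true
        System.out.println(matches("products-*", "products"));      // false: the "-" is missing
    }
}
```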
Index Lifecycle Management (ILM)#
Automatically manage the lifecycle of time-series data. Especially useful for managing log data. → ILM Practical Example
Lifecycle Phases#
flowchart LR
A[Hot<br>Active write/read] --> B[Warm<br>Read-heavy]
B --> C[Cold<br>Occasional reads]
C --> D[Frozen<br>Rarely read]
D --> E[Delete<br>Remove]
Creating ILM Policy#
PUT /_ilm/policy/logs_policy
{
"policy": {
"phases": {
"hot": {
"min_age": "0ms",
"actions": {
"rollover": {
"max_size": "50gb",
"max_age": "7d"
},
"set_priority": { "priority": 100 }
}
},
"warm": {
"min_age": "7d",
"actions": {
"shrink": { "number_of_shards": 1 },
"forcemerge": { "max_num_segments": 1 },
"set_priority": { "priority": 50 }
}
},
"cold": {
"min_age": "30d",
"actions": {
"set_priority": { "priority": 0 }
}
},
"delete": {
"min_age": "90d",
"actions": {
"delete": {}
}
}
}
}
}
Applying ILM Policy#
PUT /_index_template/logs_template
{
"index_patterns": ["logs-*"],
"template": {
"settings": {
"index.lifecycle.name": "logs_policy",
"index.lifecycle.rollover_alias": "logs"
}
}
}
Reindex#
Copy documents from an existing index into a new index, optionally filtering or transforming them:
Basic Reindex#
POST /_reindex
{
"source": { "index": "products-old" },
"dest": { "index": "products-new" }
}
Filtered Reindex#
POST /_reindex
{
"source": {
"index": "products-old",
"query": {
"term": { "in_stock": true }
}
},
"dest": { "index": "products-active" }
}
Field Transformation#
POST /_reindex
{
"source": { "index": "products-old" },
"dest": { "index": "products-new" },
"script": {
"source": "ctx._source.price_krw = ctx._source.price * 1000"
}
}
Async Reindex#
POST /_reindex?wait_for_completion=false
{
"source": { "index": "large-index" },
"dest": { "index": "large-index-new" }
}
Check progress:
GET /_tasks?actions=*reindex&detailed
Alias#
Give indices alternative names for flexible management:
Create Alias#
POST /_aliases
{
"actions": [
{ "add": { "index": "products-v1", "alias": "products" } }
]
}
Zero Downtime Reindexing#
// 1. Create new index and copy data
PUT /products-v2
POST /_reindex
{
"source": { "index": "products-v1" },
"dest": { "index": "products-v2" }
}
// 2. Switch Alias (atomic)
POST /_aliases
{
"actions": [
{ "remove": { "index": "products-v1", "alias": "products" } },
{ "add": { "index": "products-v2", "alias": "products" } }
]
}
Because the application only talks to the products alias, the switch causes no downtime.
Indexing Performance Optimization#
Bulk Indexing Checklist#
// 1. Disable Replicas
PUT /products/_settings
{ "number_of_replicas": 0 }
// 2. Disable Refresh
PUT /products/_settings
{ "refresh_interval": "-1" }
// 3. Perform Bulk indexing
POST /_bulk
...
// 4. Refresh
POST /products/_refresh
// 5. Restore settings
PUT /products/_settings
{
"number_of_replicas": 1,
"refresh_interval": "1s"
}
Optimal Bulk Size#
| Item | Recommended |
|---|---|
| Request size | 5-15 MB |
| Document count | 1,000-5,000 |
| Concurrent requests | 2-3 (per node) |
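The count and size limits in the table can be enforced on the client before sending. Below is a minimal sketch of one way to split documents into batches; `BulkBatcher` and its parameters are hypothetical, and it assumes each document is already serialized as single-line JSON and counts only the source lines (not the action lines) toward the byte limit.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Splits serialized documents into bulk-sized batches (hypothetical helper). */
class BulkBatcher {
    /** Caps each batch by document count and by total UTF-8 bytes of the source lines. */
    static List<List<String>> split(List<String> docs, int maxDocs, long maxBytes) {
        List<List<String>> batches = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentBytes = 0;
        for (String doc : docs) {
            long size = doc.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for the newline
            boolean full = current.size() >= maxDocs || currentBytes + size > maxBytes;
            if (full && !current.isEmpty()) {
                batches.add(current); // close the batch before it exceeds a limit
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(doc); // a single oversized document still gets a batch of its own
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }
}
```

For example, `BulkBatcher.split(docs, 1000, 10L * 1024 * 1024)` yields batches of at most 1,000 documents and about 10 MB each, which sits inside the recommended ranges.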
Indexing Slow Log#
PUT /products/_settings
{
"index": {
"indexing": {
"slowlog": {
"threshold": {
"index": {
"warn": "10s",
"info": "5s"
}
}
}
}
}
}
Next Steps#
| Goal | Recommended Document |
|---|---|
| Cluster configuration | Cluster Management |
| Search optimization | Performance Tuning |
| Failure response | High Availability |