Prerequisites

Before reading this document, understand these concepts first:

Learn Bulk indexing, Refresh, and Index Lifecycle Management for efficiently storing large volumes of data.

Elasticsearch indexing involves more than storing data. When a document is indexed, it passes through text analysis, inverted index creation, and segment management. Understanding this process explains why Bulk indexing can be roughly 10x faster than single-document indexing (see Performance Comparison), why documents are not immediately searchable after indexing, and why it helps to disable Refresh during large bulk loads. The right indexing strategy directly affects system performance and operational cost.

Indexing Basics#

Indexing Process#

flowchart LR
    A[Document Received] --> B[Analyze]
    B --> C[Create Inverted Index]
    C --> D[Memory Buffer]
    D --> E[Refresh]
    E --> F[Segment]
    F --> G[Flush]
    G --> H[Disk]

| Stage | Description |
| --- | --- |
| Analyze | Split text into tokens |
| Memory Buffer | Temporary storage in memory |
| Refresh | Make searchable (default 1 second) |
| Segment | Immutable index piece |
| Flush | Permanent storage to disk |
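
To make the Analyze and Create Inverted Index stages concrete, here is a minimal toy sketch in plain Java. It is not how Lucene implements these stages: the class name `InvertedIndexSketch` and its lowercase-and-split analyzer are assumptions chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model of the Analyze and Create Inverted Index stages (hypothetical, not Lucene).
class InvertedIndexSketch {
    // term -> sorted set of IDs of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new TreeMap<>();

    // "Analyze": lowercase the text and split it on non-alphanumeric characters.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // "Create Inverted Index": record the document ID under every token.
    void index(int docId, String text) {
        for (String token : analyze(text)) {
            postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
        }
    }

    // Single-term lookup; the query term is analyzed the same way as the documents.
    List<Integer> search(String term) {
        List<String> tokens = analyze(term);
        if (tokens.isEmpty()) {
            return List.of();
        }
        return new ArrayList<>(postings.getOrDefault(tokens.get(0), new TreeSet<>()));
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.index(1, "MacBook Pro");
        index.index(2, "MacBook Air");
        System.out.println(index.search("macbook")); // prints [1, 2]
        System.out.println(index.search("Air"));     // prints [2]
    }
}
```

Because the lookup goes from term to document IDs, a search does not need to scan every document, which is the reason the analysis work is done at index time.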

Single vs Bulk Indexing#

Single Document Indexing#

PUT /products/_doc/1
{
  "name": "MacBook Pro",
  "price": 2390000
}

Bulk Indexing#

Process multiple documents at once:

POST /_bulk
{"index": {"_index": "products", "_id": "1"}}
{"name": "MacBook Pro", "price": 2390000}
{"index": {"_index": "products", "_id": "2"}}
{"name": "MacBook Air", "price": 1390000}
{"index": {"_index": "products", "_id": "3"}}
{"name": "iPad", "price": 1499000}

NDJSON format: each action line and each document line must fit on a single line and end with a newline (\n), including the last line of the request.
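
As a rough illustration of the NDJSON layout, the following sketch builds a `_bulk` body from documents that are already serialized as single-line JSON. The class `BulkBodySketch` is hypothetical, and it assumes index names and IDs contain no characters that need JSON escaping.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Builds a _bulk request body (hypothetical helper, JDK only).
class BulkBodySketch {
    // idToJson maps a document ID to its single-line JSON source.
    // Assumption: index and IDs need no JSON escaping.
    static String build(String index, Map<String, String> idToJson) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> doc : idToJson.entrySet()) {
            // Action line: the operation, the target index, and the document ID.
            body.append("{\"index\":{\"_index\":\"").append(index)
                .append("\",\"_id\":\"").append(doc.getKey()).append("\"}}\n");
            // Source line: the document itself, terminated by a newline.
            body.append(doc.getValue()).append('\n');
        }
        return body.toString();
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("1", "{\"name\":\"MacBook Pro\",\"price\":2390000}");
        docs.put("2", "{\"name\":\"MacBook Air\",\"price\":1390000}");
        System.out.print(build("products", docs));
    }
}
```

Appending the newline after every line, rather than joining lines with a separator, is what guarantees the trailing newline on the last line.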

Performance Comparison#

| Method | Time for 10K docs | Network Requests |
| --- | --- | --- |
| Single | ~30 seconds | 10,000 |
| Bulk (1000 per batch) | ~3 seconds | 10 |

POST /_bulk
// Recommended size: 5-15MB per request
// Recommended doc count: 1,000-5,000
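
One way to respect both recommendations is to cut the input into batches capped by document count and by total bytes. The sketch below is an assumption about how that could look: `BulkBatcherSketch` and its parameters are illustrative names, and sizes are measured on the JSON source only, ignoring the action lines.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Splits documents into bulk-sized batches (hypothetical helper, JDK only).
class BulkBatcherSketch {
    // A batch is closed when adding the next document would exceed maxDocs or maxBytes.
    static List<List<String>> partition(List<String> jsonDocs, int maxDocs, long maxBytes) {
        List<List<String>> batches = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentBytes = 0;
        for (String doc : jsonDocs) {
            long docBytes = doc.getBytes(StandardCharsets.UTF_8).length;
            boolean full = current.size() >= maxDocs || currentBytes + docBytes > maxBytes;
            // A single oversized document still gets a batch of its own.
            if (full && !current.isEmpty()) {
                batches.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(doc);
            currentBytes += docBytes;
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            docs.add("{\"id\":" + i + "}");
        }
        // 1,000 documents or 10 MB per batch, whichever limit is reached first.
        List<List<String>> batches = partition(docs, 1000, 10L * 1024 * 1024);
        System.out.println(batches.size()); // prints 3
    }
}
```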

Bulk Indexing in Spring#

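import java.util.List;

import org.springframework.data.elasticsearch.core.ElasticsearchOperations;
import org.springframework.data.elasticsearch.core.query.IndexQuery;
import org.springframework.data.elasticsearch.core.query.IndexQueryBuilder;
import org.springframework.stereotype.Service;

@Service
public class ProductBulkService {

    private final ElasticsearchOperations operations;

    // Constructor injection: Spring supplies the configured ElasticsearchOperations bean.
    public ProductBulkService(ElasticsearchOperations operations) {
        this.operations = operations;
    }

    // Indexes all given products in a single bulk request.
    public void bulkIndex(List<Product> products) {
        List<IndexQuery> queries = products.stream()
            .map(product -> new IndexQueryBuilder()
                .withId(product.getId())
                .withObject(product)
                .build())
            .toList();

        operations.bulkIndex(queries, Product.class);
    }
}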

Refresh#

What is Refresh?#

Refresh is the operation that writes the contents of the Memory Buffer into a new segment and makes those documents searchable.

flowchart LR
    A[Memory Buffer] -->|Refresh| B[Segment<br>Searchable]
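
The toy model below sketches why a freshly indexed document is not searchable until a refresh happens. `RefreshSketch` is a hypothetical class that only mimics the buffer-to-segment handoff; it is not how Elasticsearch stores data.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of near-real-time search (hypothetical, JDK only).
class RefreshSketch {
    private final List<String> memoryBuffer = new ArrayList<>();   // indexed, not yet searchable
    private final List<List<String>> segments = new ArrayList<>(); // searchable and immutable

    void index(String doc) {
        memoryBuffer.add(doc);
    }

    // Refresh: turn the buffer contents into one new immutable segment, then clear the buffer.
    void refresh() {
        if (!memoryBuffer.isEmpty()) {
            segments.add(List.copyOf(memoryBuffer));
            memoryBuffer.clear();
        }
    }

    // Search reads segments only; it never looks into the memory buffer.
    boolean isSearchable(String doc) {
        return segments.stream().anyMatch(segment -> segment.contains(doc));
    }

    int segmentCount() {
        return segments.size();
    }

    public static void main(String[] args) {
        RefreshSketch shard = new RefreshSketch();
        shard.index("doc-1");
        System.out.println(shard.isSearchable("doc-1")); // prints false
        shard.refresh();
        System.out.println(shard.isSearchable("doc-1")); // prints true
    }
}
```

In this model every refresh that finds buffered documents creates another segment, which hints at why refreshing very frequently during a large load produces many small segments.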

Refresh Interval#

PUT /products/_settings
{
  "index": {
    "refresh_interval": "30s"    // Default: 1s
  }
}

| Setting | Meaning | Use Case |
| --- | --- | --- |
| 1s | Every 1 second (default) | Real-time search |
| 30s | Every 30 seconds | Typical service |
| -1 | Disabled | During bulk indexing |

Optimization for Bulk Indexing#

// 1. Disable Refresh
PUT /products/_settings
{ "refresh_interval": "-1" }

// 2. Perform Bulk indexing
POST /_bulk
...

// 3. Manual Refresh
POST /products/_refresh

// 4. Restore Refresh
PUT /products/_settings
{ "refresh_interval": "1s" }

Flush and Translog#

Translog#

The translog is Elasticsearch's write-ahead log. Every operation is recorded in it so that changes not yet committed to a Lucene segment on disk can be replayed after a crash. → Lucene Internals Details

flowchart LR
    A[Document] --> B[Translog]
    A --> C[Memory Buffer]
    B -->|Crash Recovery| D[Data Restore]
    C -->|Flush| E[Disk Segment]

Flush#

A flush commits the in-memory data to disk segments (a Lucene commit) and then clears the translog entries that are no longer needed for recovery:

POST /products/_flush

Note: Manual Flush is usually unnecessary. Elasticsearch manages it automatically.
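
The interplay between translog, flush, and crash recovery can be sketched with a small toy model. `TranslogSketch` is hypothetical and assumes every translog write is durable, which corresponds to the `request` durability setting.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of translog-based crash recovery (hypothetical, not Elasticsearch's implementation).
class TranslogSketch {
    private final List<String> translog = new ArrayList<>();  // durable operation log
    private final List<String> inMemory = new ArrayList<>();  // lost on crash
    private final List<String> committed = new ArrayList<>(); // durable segments on disk

    void index(String doc) {
        translog.add(doc); // logged so that the operation survives a crash
        inMemory.add(doc);
    }

    // Flush: commit in-memory data to disk, then truncate the translog.
    void flush() {
        committed.addAll(inMemory);
        inMemory.clear();
        translog.clear();
    }

    // Crash: everything that lived only in memory disappears.
    void crash() {
        inMemory.clear();
    }

    // Recovery: replay the operations logged since the last flush.
    void recover() {
        inMemory.addAll(translog);
    }

    List<String> allDocs() {
        List<String> all = new ArrayList<>(committed);
        all.addAll(inMemory);
        return all;
    }

    public static void main(String[] args) {
        TranslogSketch shard = new TranslogSketch();
        shard.index("a");
        shard.flush();
        shard.index("b");
        shard.crash();
        System.out.println(shard.allDocs()); // prints [a]
        shard.recover();
        System.out.println(shard.allDocs()); // prints [a, b]
    }
}
```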

Flush Settings#

PUT /products/_settings
{
  "index": {
    "translog": {
      "durability": "async",        // async: faster, but writes since the last sync can be lost on a crash; request (default): fsync before acknowledging each request
      "sync_interval": "5s",
      "flush_threshold_size": "512mb"
    }
  }
}

Index Template#

An index template defines settings and mappings that are applied automatically whenever a new index matching its patterns is created:

PUT /_index_template/products_template
{
  "index_patterns": ["products-*"],
  "priority": 1,
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "refresh_interval": "5s"
    },
    "mappings": {
      "properties": {
        "name": { "type": "text" },
        "price": { "type": "integer" },
        "created_at": { "type": "date" }
      }
    }
  }
}

The template is now applied automatically whenever an index such as products-2024 or products-2025 is created.
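
To illustrate which index names a pattern like products-* covers, here is a small wildcard matcher. `TemplateMatchSketch` is a hypothetical helper that handles only the `*` wildcard; it makes no claim about how Elasticsearch resolves templates internally or how priorities are applied.

```java
import java.util.regex.Pattern;

// Matches an index name against a simple wildcard pattern (hypothetical helper, JDK only).
class TemplateMatchSketch {
    static boolean matches(String indexPattern, String indexName) {
        // Split on '*', keep the literal parts verbatim, and let '*' match any character run.
        String[] parts = indexPattern.split("\\*", -1);
        for (int i = 0; i < parts.length; i++) {
            parts[i] = Pattern.quote(parts[i]);
        }
        return indexName.matches(String.join(".*", parts));
    }

    public static void main(String[] args) {
        System.out.println(matches("products-*", "products-2024")); // prints true
        System.out.println(matches("products-*", "products"));      // prints false
        System.out.println(matches("products-*", "orders-2024"));   // prints false
    }
}
```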


Index Lifecycle Management (ILM)#

Automatically manage the lifecycle of time-series data. Especially useful for managing log data. → ILM Practical Example

Lifecycle Phases#

flowchart LR
    A[Hot<br>Active write/read] --> B[Warm<br>Read-heavy]
    B --> C[Cold<br>Occasional reads]
    C --> D[Frozen<br>Rarely read]
    D --> E[Delete<br>Remove]

Creating ILM Policy#

PUT /_ilm/policy/logs_policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "7d"
          },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 },
          "set_priority": { "priority": 50 }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "set_priority": { "priority": 0 }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
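
The min_age values in this policy can be read as a simple threshold table. The sketch below maps the age of an index that has already rolled over (for such indices the age is counted from the rollover time) to the phase it would have entered. `IlmPhaseSketch` is a hypothetical helper with the thresholds from logs_policy hard-coded.

```java
import java.time.Duration;

// Maps the age of a rolled-over index to its logs_policy phase (hypothetical helper, JDK only).
class IlmPhaseSketch {
    static String phaseFor(Duration ageSinceRollover) {
        // Check the largest threshold first so the latest applicable phase wins.
        if (ageSinceRollover.compareTo(Duration.ofDays(90)) >= 0) return "delete";
        if (ageSinceRollover.compareTo(Duration.ofDays(30)) >= 0) return "cold";
        if (ageSinceRollover.compareTo(Duration.ofDays(7)) >= 0) return "warm";
        return "hot";
    }

    public static void main(String[] args) {
        System.out.println(phaseFor(Duration.ofDays(3)));   // prints hot
        System.out.println(phaseFor(Duration.ofDays(10)));  // prints warm
        System.out.println(phaseFor(Duration.ofDays(45)));  // prints cold
        System.out.println(phaseFor(Duration.ofDays(120))); // prints delete
    }
}
```

Note that ILM checks policies periodically, so real transitions happen some time after a threshold is crossed rather than at the exact moment.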

Applying ILM Policy#

PUT /_index_template/logs_template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "logs_policy",
      "index.lifecycle.rollover_alias": "logs"
    }
  }
}

Note: the rollover action only works once an initial index (for example logs-000001) exists with the logs alias configured as its write alias.

Reindex#

Reindex copies documents from an existing index into a new index, optionally filtering or transforming them along the way.

Basic Reindex#

POST /_reindex
{
  "source": { "index": "products-old" },
  "dest": { "index": "products-new" }
}

Filtered Reindex#

POST /_reindex
{
  "source": {
    "index": "products-old",
    "query": {
      "term": { "in_stock": true }
    }
  },
  "dest": { "index": "products-active" }
}

Field Transformation#

POST /_reindex
{
  "source": { "index": "products-old" },
  "dest": { "index": "products-new" },
  "script": {
    "source": "ctx._source.price_krw = ctx._source.price * 1000"
  }
}

Async Reindex#

POST /_reindex?wait_for_completion=false
{
  "source": { "index": "large-index" },
  "dest": { "index": "large-index-new" }
}

Check progress:

GET /_tasks?actions=*reindex&detailed
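
The task status returned for a reindex contains counters from which progress can be estimated: the sum of created, updated, and deleted approaches total as the task advances. The helper below is a minimal sketch of that arithmetic; `ReindexProgressSketch` is a hypothetical name, and parsing the JSON response is left out.

```java
// Estimates reindex progress from task status counters (hypothetical helper, JDK only).
class ReindexProgressSketch {
    // Arguments correspond to the total, created, updated, and deleted status fields.
    static double percentDone(long total, long created, long updated, long deleted) {
        if (total <= 0) {
            return 0.0; // nothing known yet, avoid dividing by zero
        }
        return 100.0 * (created + updated + deleted) / total;
    }

    public static void main(String[] args) {
        System.out.println(percentDone(1000, 250, 0, 0)); // prints 25.0
    }
}
```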

Alias#

An alias gives one or more indices an alternative name, so clients can keep using a stable name while the underlying index changes.

Create Alias#

POST /_aliases
{
  "actions": [
    { "add": { "index": "products-v1", "alias": "products" } }
  ]
}

Zero Downtime Reindexing#

// 1. Create new index and copy data
PUT /products-v2
POST /_reindex
{
  "source": { "index": "products-v1" },
  "dest": { "index": "products-v2" }
}

// 2. Switch Alias (atomic)
POST /_aliases
{
  "actions": [
    { "remove": { "index": "products-v1", "alias": "products" } },
    { "add": { "index": "products-v2", "alias": "products" } }
  ]
}

Because the application accesses the data only through the products alias, the switch to the new index needs no application change and no downtime.


Indexing Performance Optimization#

Bulk Indexing Checklist#

// 1. Disable Replicas
PUT /products/_settings
{ "number_of_replicas": 0 }

// 2. Disable Refresh
PUT /products/_settings
{ "refresh_interval": "-1" }

// 3. Perform Bulk indexing
POST /_bulk
...

// 4. Refresh
POST /products/_refresh

// 5. Restore settings
PUT /products/_settings
{
  "number_of_replicas": 1,
  "refresh_interval": "1s"
}
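
When this checklist is automated, it is worth making sure the original settings are restored even if the bulk load fails halfway. The sketch below shows one way to do that with try/finally. The `IndexAdmin` interface is a hypothetical abstraction over the settings and refresh calls, not a real client API, and the restored values (1 replica, 1s refresh) are taken from the checklist above.

```java
// Wraps a bulk load in temporary index settings (hypothetical sketch, JDK only).
class BulkLoadSketch {
    // Hypothetical abstraction; a real implementation would call the cluster.
    interface IndexAdmin {
        void updateSettings(String index, String settingsJson);
        void refresh(String index);
    }

    static void runWithBulkSettings(IndexAdmin admin, String index, Runnable bulkLoad) {
        // Steps 1 and 2: disable replicas and refresh.
        admin.updateSettings(index, "{\"number_of_replicas\":0,\"refresh_interval\":\"-1\"}");
        try {
            bulkLoad.run();       // Step 3: perform the bulk indexing
            admin.refresh(index); // Step 4: make the new documents searchable
        } finally {
            // Step 5: always restore the settings, even if the bulk load throws.
            admin.updateSettings(index, "{\"number_of_replicas\":1,\"refresh_interval\":\"1s\"}");
        }
    }

    public static void main(String[] args) {
        IndexAdmin printing = new IndexAdmin() {
            public void updateSettings(String index, String settingsJson) {
                System.out.println("PUT /" + index + "/_settings " + settingsJson);
            }
            public void refresh(String index) {
                System.out.println("POST /" + index + "/_refresh");
            }
        };
        runWithBulkSettings(printing, "products", () -> System.out.println("POST /_bulk ..."));
    }
}
```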

Optimal Bulk Size#

| Item | Recommended |
| --- | --- |
| Request size | 5-15 MB |
| Document count | 1,000-5,000 |
| Concurrent requests | 2-3 (per node) |

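Indexing Slow Log#

The settings below log indexing operations that take longer than the given thresholds, which helps identify slow documents: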

PUT /products/_settings
{
  "index": {
    "indexing": {
      "slowlog": {
        "threshold": {
          "index": {
            "warn": "10s",
            "info": "5s"
          }
        }
      }
    }
  }
}

Next Steps#

| Goal | Recommended Document |
| --- | --- |
| Cluster configuration | Cluster Management |
| Search optimization | Performance Tuning |
| Failure response | High Availability |