This guide walks you through safely scaling an Elasticsearch cluster.
Estimated time: About 30-60 minutes (additional time for node addition and shard rebalancing)
Scope of This Guide

Covers: Adding data nodes, role-based node separation, Hot-Warm-Cold architecture, and post-scaling validation
Does not cover: query-level performance optimization (see Slow Query Optimization) and memory issues (see Memory Troubleshooting).
TL;DR
- Horizontal scaling: Add data nodes and rebalance shards
- Role separation: Separate master, data, coordinating, and ingest nodes for stability
- Hot-Warm-Cold: Tier nodes by data access frequency for cost optimization
- Post-scaling validation: Always verify cluster status, shard distribution, and performance metrics
Before You Begin#
Verify the following prerequisites:
| Item | Requirement | How to Verify |
|---|---|---|
| Elasticsearch version | Same version on all nodes | curl -X GET "localhost:9200/_cat/nodes?v&h=name,version" |
| Cluster status | green (recommended) | curl -X GET "localhost:9200/_cluster/health" |
| Network | New node can communicate with existing nodes | ping or telnet test |
| Security settings | Identical security configuration (TLS, auth) | Check elasticsearch.yml |
# Comprehensive cluster status check
curl -X GET "localhost:9200/_cluster/health?pretty"
# Per-node status
curl -X GET "localhost:9200/_cat/nodes?v&h=name,role,heap.percent,cpu,disk.avail"
# Shard distribution per index
curl -X GET "localhost:9200/_cat/shards?v&s=index"

Warning
Always record the current state before scaling the cluster. This serves as a baseline for comparison if issues arise.
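A minimal sketch of recording that baseline (file names are examples; adjust the host to your cluster):

```shell
# Snapshot the current cluster state to timestamped files so you have
# a baseline to compare against after scaling
ts=$(date +%Y%m%d-%H%M%S)
curl -s "localhost:9200/_cluster/health?pretty" > "baseline-health-${ts}.json"
curl -s "localhost:9200/_cat/nodes?v&h=name,role,heap.percent,cpu,disk.avail" > "baseline-nodes-${ts}.txt"
curl -s "localhost:9200/_cat/shards?v&s=index" > "baseline-shards-${ts}.txt"
```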
Symptoms#
You need cluster scaling in the following situations:
Low disk space:
# Check disk usage
curl -X GET "localhost:9200/_cat/allocation?v&h=node,disk.percent,disk.avail"
# disk.percent > 85%: low watermark (no new shards allocated to this node)
# disk.percent > 90%: high watermark (shards relocated away from this node)
# disk.percent > 95%: flood stage (indices forced read-only)

Increased query latency:
# Check per-node search latency
curl -X GET "localhost:9200/_cat/nodes?v&h=name,search.query_total,search.query_time"

Decreased indexing speed:
# Check indexing performance
curl -X GET "localhost:9200/_cat/nodes?v&h=name,indexing.index_total,indexing.index_time"

Step 1: Analyze Current State#
1.1 Identify Bottlenecks#
# Per-node resource overview
curl -X GET "localhost:9200/_cat/nodes?v&h=name,role,heap.percent,cpu,disk.percent,disk.avail"
# Example output:
# name role heap.percent cpu disk.percent disk.avail
# node-1 dim 82 75 88 20gb
# node-2 dim 78 70 85 30gb

1.2 Check Shard Distribution#
# Shard count per node
curl -X GET "localhost:9200/_cat/allocation?v"
# Check for uneven shard placement
curl -X GET "localhost:9200/_cat/shards?v&s=node,index"

Step 2: Horizontal Scaling (Adding Data Nodes)#
2.1 Configure the New Node#
Configure elasticsearch.yml for the new node:
# Basic settings
cluster.name: my-cluster
node.name: node-3
# Network
network.host: 192.168.1.103
discovery.seed_hosts: ["192.168.1.101", "192.168.1.102"]
# Role settings (data node)
node.roles: ["data"]
# Path settings
path.data: /var/data/elasticsearch
path.logs: /var/log/elasticsearch

2.2 Start the Node and Verify#
# Start the new node
systemctl start elasticsearch
# Verify it joined the cluster
curl -X GET "localhost:9200/_cat/nodes?v"
# Check cluster status (verify the node count increased)
curl -X GET "localhost:9200/_cluster/health?pretty"

2.3 Shard Rebalancing#
When a new node joins, Elasticsearch automatically rebalances shards. You can also adjust manually:
# Check rebalancing progress
curl -X GET "localhost:9200/_cat/recovery?v&active_only=true"
# Adjust rebalancing speed (default: 40mb/s)
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
"persistent": {
"cluster.routing.allocation.node_concurrent_recoveries": 4,
"indices.recovery.max_bytes_per_sec": "100mb"
}
}'
# Move a specific shard of an index to a specific node
curl -X POST "localhost:9200/_cluster/reroute" -H 'Content-Type: application/json' -d'
{
"commands": [{
"move": {
"index": "products",
"shard": 0,
"from_node": "node-1",
"to_node": "node-3"
}
}]
}'

Step 3: Role-Based Node Separation#
For large clusters, separate node roles to improve stability.
Role Configuration#
| Role | Description | Recommended Specs | Minimum Count |
|---|---|---|---|
| master | Cluster state management | Low CPU/memory, stable disk | 3 (odd number) |
| data | Data storage and search | High CPU/memory/disk | 2+ |
| coordinating | Request routing and result merging | High CPU/memory | 2+ |
| ingest | Pipeline processing | High CPU | 1+ |
Configuration Examples#
# Dedicated master node
node.roles: ["master"]
# Dedicated data node
node.roles: ["data"]
# Dedicated coordinating node (no roles = coordinating)
node.roles: []
# Dedicated ingest node
node.roles: ["ingest"]

Warning
Always run an odd number of master nodes. A minimum of 3 is recommended to prevent split-brain.
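The odd-number rule follows from quorum arithmetic: electing a master requires floor(n/2) + 1 master-eligible votes, so an even count adds no fault tolerance over the odd count below it. A minimal sketch:

```shell
# Quorum size and tolerated failures for n master-eligible nodes.
# Elasticsearch 7+ manages voting configurations automatically, but the
# arithmetic still explains why 3 beats 2 and why 4 is no better than 3.
for n in 2 3 4 5; do
  quorum=$(( n / 2 + 1 ))
  echo "${n} masters: quorum ${quorum}, tolerates $(( n - quorum )) failure(s)"
done
# → 2 masters: quorum 2, tolerates 0 failure(s)
# → 3 masters: quorum 2, tolerates 1 failure(s)
```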
Step 4: Hot-Warm-Cold Architecture#
Tier nodes by data access frequency to optimize costs.
Architecture Overview#
flowchart TB
subgraph Hot["Hot Nodes<br>(SSD, High Performance)"]
H1["node-hot-1"]
H2["node-hot-2"]
end
subgraph Warm["Warm Nodes<br>(HDD, Medium Performance)"]
W1["node-warm-1"]
W2["node-warm-2"]
end
subgraph Cold["Cold Nodes<br>(Large HDD, Low Performance)"]
C1["node-cold-1"]
end
Client["Client Request"] --> Coord["Coordinating Node"]
Coord --> Hot
Coord --> Warm
Coord --> Cold
Hot -->|"After 7 days"| Warm
Warm -->|"After 30 days"| Cold

4.1 Node Role Settings#
# Hot node (latest data, frequent reads/writes)
node.roles: ["data_hot"]
node.attr.data: hot
# Warm node (older data, occasional reads)
node.roles: ["data_warm"]
node.attr.data: warm
# Cold node (archived data, rarely accessed)
node.roles: ["data_cold"]
node.attr.data: cold

4.2 Index Allocation Settings#
# Place new indices on hot nodes
curl -X PUT "localhost:9200/logs-2026.03.23" -H 'Content-Type: application/json' -d'
{
"settings": {
"index.routing.allocation.require.data": "hot",
"number_of_shards": 3,
"number_of_replicas": 1
}
}'

4.3 Automate with ILM (Index Lifecycle Management)#
# Create an ILM policy
curl -X PUT "localhost:9200/_ilm/policy/logs_policy" -H 'Content-Type: application/json' -d'
{
"policy": {
"phases": {
"hot": {
"min_age": "0ms",
"actions": {
"rollover": {
"max_size": "50gb",
"max_age": "7d"
}
}
},
"warm": {
"min_age": "7d",
"actions": {
"allocate": {
"require": { "data": "warm" }
},
"forcemerge": { "max_num_segments": 1 },
"shrink": { "number_of_shards": 1 }
}
},
"cold": {
"min_age": "30d",
"actions": {
"allocate": {
"require": { "data": "cold" }
}
}
},
"delete": {
"min_age": "90d",
"actions": {
"delete": {}
}
}
}
}
}'

4.4 Apply ILM to Index Templates#
curl -X PUT "localhost:9200/_index_template/logs_template" -H 'Content-Type: application/json' -d'
{
"index_patterns": ["logs-*"],
"template": {
"settings": {
"index.lifecycle.name": "logs_policy",
"index.lifecycle.rollover_alias": "logs",
"index.routing.allocation.require.data": "hot"
}
}
}'

When using a rollover alias as above, also create the initial index (for example logs-000001) with the alias set as its write index before indexing begins; otherwise rollover cannot start.

Step 5: Post-Scaling Validation#
5.1 Check Cluster Status#
# Cluster health
curl -X GET "localhost:9200/_cluster/health?pretty"
# Expected values:
# "status": "green"
# "number_of_nodes": increased node count
# "unassigned_shards": 0

5.2 Check Shard Distribution#
# Verify even shard distribution across nodes
curl -X GET "localhost:9200/_cat/allocation?v&h=node,shards,disk.percent,disk.avail"
# Example (even distribution):
# node shards disk.percent disk.avail
# node-1 50 60 80gb
# node-2 48 58 85gb
# node-3 52 55 90gb

5.3 Check Performance Metrics#
# Check search latency
curl -X GET "localhost:9200/_cat/nodes?v&h=name,search.query_total,search.query_time"
# Check indexing performance
curl -X GET "localhost:9200/_cat/nodes?v&h=name,indexing.index_total,indexing.index_time"
# Per-node resource usage
curl -X GET "localhost:9200/_cat/nodes?v&h=name,heap.percent,cpu,disk.percent"

Checklist#
Items to verify during cluster scaling:
- Are all nodes on the same ES version? (a version mismatch prevents joining)
- Is network connectivity working? (check firewall and ports 9200, 9300)
- Are security settings identical? (TLS certificates, authentication config)
- Is the cluster status green? (verify no unassigned shards)
- Is shard distribution even? (verify no node is overloaded)
- Is the ILM policy applied? (when using Hot-Warm-Cold)
Verifying Success#
Confirm the cluster scaling succeeded using the following methods:
Cluster status: Verify it is green and all nodes are healthy
curl -X GET "localhost:9200/_cluster/health?pretty"

Shard rebalancing complete: Verify no rebalancing is in progress
curl -X GET "localhost:9200/_cat/recovery?v&active_only=true"  # no output means rebalancing is complete

Disk usage: Verify each node's disk usage is below 70%
curl -X GET "localhost:9200/_cat/allocation?v&h=node,disk.percent"
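To turn the 70% rule into a quick pass/fail check, the allocation output can be filtered with awk. A sketch over sample output; in practice, pipe the curl command above into the same awk program (node names and values here are illustrative):

```shell
# Flag any node whose disk.percent exceeds the 70% target.
# The heredoc stands in for:
#   curl -s "localhost:9200/_cat/allocation?v&h=node,disk.percent"
awk 'NR > 1 { print $1, $2 "%", ($2 + 0 > 70 ? "OVER" : "ok") }' <<'EOF'
node disk.percent
node-1 60
node-2 58
node-3 72
EOF
# → node-1 60% ok
# → node-2 58% ok
# → node-3 72% OVER
```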
Success Criteria
- Cluster status is green, all nodes are healthy
- Shard rebalancing is complete with even distribution
- Disk usage is below 70%
- Search/indexing performance has improved or remained stable
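To make "performance improved or remained stable" measurable, average query latency per node can be derived as search.query_time divided by search.query_total from the _cat/nodes output used earlier. A sketch over sample values (illustrative; the cat APIs accept a time unit parameter such as time=ms if your version reports human-readable durations):

```shell
# Average latency (ms per query) = search.query_time / search.query_total.
# The heredoc stands in for:
#   curl -s "localhost:9200/_cat/nodes?h=name,search.query_total,search.query_time&time=ms"
awk '$2 > 0 { printf "%s %.2f ms/query\n", $1, $3 / $2 }' <<'EOF'
node-1 1000 2500
node-2 800 1600
EOF
# → node-1 2.50 ms/query
# → node-2 2.00 ms/query
```

Comparing this figure against the baseline recorded before scaling shows whether the added nodes actually helped.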
Common Errors#
New Node Does Not Join the Cluster#
Cause: cluster.name mismatch or network issues
Solution:
- Verify cluster.name is identical in elasticsearch.yml
- Verify firewall allows ports 9200 (HTTP) and 9300 (Transport)
- Verify discovery.seed_hosts includes the existing node addresses
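One way to test the transport port directly before digging through logs (run from the new node; the address is the example IP used earlier, and nc may need to be installed):

```shell
# From the new node: is the transport port of an existing node reachable?
nc -z -w 3 192.168.1.101 9300 && echo "transport port reachable" || echo "transport port blocked"
```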
# Check new node logs
tail -f /var/log/elasticsearch/my-cluster.log | grep "master not discovered"

Shards Are Not Rebalancing#
Cause: Allocation settings are restrictive or disk watermark is exceeded
Solution:
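Before changing settings, it can help to ask the cluster directly why a shard is not allocated or not moving, via the allocation explain API (index name and shard number below are examples):

```shell
# Explain the allocation decision for a specific shard
curl -X GET "localhost:9200/_cluster/allocation/explain?pretty" -H 'Content-Type: application/json' -d'
{
  "index": "products",
  "shard": 0,
  "primary": true
}'
```

The response names the deciders (disk watermark, allocation filtering, etc.) that blocked the move.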
# Check allocation settings
curl -X GET "localhost:9200/_cluster/settings?include_defaults=true&filter_path=**.allocation"
# Enable allocation
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
"persistent": {
"cluster.routing.allocation.enable": "all"
}
}'
# Check disk watermark
curl -X GET "localhost:9200/_cluster/settings?include_defaults=true&filter_path=**.watermark"

"master_not_discovered_exception"#
Cause: Unable to reach the master node
Solution:
- Verify the master node is running
- Check discovery.seed_hosts and cluster.initial_master_nodes settings
- Check network connectivity (port 9300)
# Check master node
curl -X GET "localhost:9200/_cat/master?v"

Related Documents#
- Mapping Migration - Zero-downtime mapping migration
- Index Rebuild - Large-scale index rebuild
- Memory Troubleshooting - Handling memory issues after scaling