Learn how to configure nodes, allocate shards, and monitor status in an Elasticsearch cluster.

Elasticsearch was designed from the ground up as a distributed system. Even when starting with a single node, it operates internally as a cluster. Thanks to this distributed architecture, you can scale horizontally by simply adding nodes as data grows, and the service can continue even when some nodes experience failures.

However, distributed systems introduce complexities that a single server does not. To operate a cluster stably, you need to understand how roles are distributed among nodes, where shards are physically placed, how master election works, and how the cluster behaves during network partitions. In particular, a misconfigured master node can bring down the entire cluster, and unbalanced shard placement can concentrate load on specific nodes. This document covers the core concepts and practical know-how for stable cluster operation.

Cluster Architecture#

Basic Architecture#

flowchart TB
    subgraph Cluster["Elasticsearch Cluster"]
        M[Master Node<br>Cluster State Management]
        D1[Data Node 1<br>Data Storage]
        D2[Data Node 2<br>Data Storage]
        D3[Data Node 3<br>Data Storage]
        C[Coordinating Node<br>Request Routing]
    end

    Client --> C
    C --> D1
    C --> D2
    C --> D3
    M -.State Management.-> D1
    M -.State Management.-> D2
    M -.State Management.-> D3

Node Roles#

Role Types#

| Role | Configuration | Function |
| --- | --- | --- |
| master | node.roles: [master] | Cluster state management, index creation/deletion |
| data | node.roles: [data] | Document storage, search/aggregation execution |
| data_content | node.roles: [data_content] | Content tier (non-time-series data) |
| data_hot | node.roles: [data_hot] | Active read/write data |
| data_warm | node.roles: [data_warm] | Read-heavy data |
| data_cold | node.roles: [data_cold] | Inactive data |
| ingest | node.roles: [ingest] | Pre-indexing pipeline processing |
| ml | node.roles: [ml] | Machine learning tasks |
| coordinating | node.roles: [] | Request routing only (empty array) |
# Master Node (3 recommended)
node.roles: [master]
node.name: master-1

# Data Node
node.roles: [data, ingest]
node.name: data-1

# Coordinating Node
node.roles: []
node.name: coord-1

Minimum Configuration#

| Cluster Size | Recommended Configuration |
| --- | --- |
| Development/Test | 1 node (all roles) |
| Small | 3 nodes (master + data combined) |
| Medium | 3 master + 3 data |
| Large | 3 master + N data + 2 coordinating |
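Three dedicated master-eligible nodes are recommended because master election requires a majority (a quorum) of master-eligible nodes. A quick sketch of the arithmetic:

```python
def quorum(master_eligible: int) -> int:
    """Votes needed to elect a master: a majority of master-eligible nodes."""
    return master_eligible // 2 + 1

def tolerated_failures(master_eligible: int) -> int:
    """Master-eligible nodes that can fail while a quorum remains."""
    return master_eligible - quorum(master_eligible)

# 3 nodes -> quorum of 2, survives 1 failure.
# 4 nodes -> quorum of 3, still survives only 1 failure,
# which is why odd counts are recommended.
```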

Cluster Status#

Status Check#

GET /_cluster/health
{
  "cluster_name": "my-cluster",
  "status": "green",
  "number_of_nodes": 5,
  "number_of_data_nodes": 3,
  "active_primary_shards": 50,
  "active_shards": 100,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0
}

Status Meanings#

| Status | Meaning | Action |
| --- | --- | --- |
| 🟢 green | All primary/replica shards assigned | Normal |
| 🟡 yellow | Primaries OK, some replicas unassigned | Check/add nodes |
| 🔴 red | Some primaries unassigned | Immediate action required |
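The table above can be expressed as a small lookup, handy in monitoring scripts (a sketch; the status strings are those returned by /_cluster/health):

```python
def health_action(status: str) -> str:
    """Map a _cluster/health status to the operational response above."""
    actions = {
        "green": "normal, no action needed",
        "yellow": "check or add nodes (replicas unassigned)",
        "red": "immediate action: primaries unassigned",
    }
    if status not in actions:
        raise ValueError(f"unknown cluster status: {status}")
    return actions[status]
```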

Detailed Diagnosis#

GET /_cluster/health?level=indices
GET /_cluster/health?level=shards
GET /_cluster/allocation/explain

Shard Allocation#

Shard Placement Rules#

  1. Primary and Replica are placed on different nodes
  2. Shards of the same index are evenly distributed
  3. No new shards are allocated to nodes above the low disk watermark (85% by default)
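The first two rules can be illustrated with a toy allocator (a simplified sketch, not Elasticsearch's actual algorithm, which also weighs disk usage and allocation filters):

```python
def allocate(num_shards: int, num_replicas: int, nodes: list[str]) -> dict:
    """Place primaries round-robin, then put each replica on the
    least-loaded node that does not already hold a copy of that shard."""
    assert len(nodes) >= num_replicas + 1, "need at least replicas + 1 nodes"
    assignment = {n: [] for n in nodes}
    for shard in range(num_shards):
        primary = nodes[shard % len(nodes)]
        assignment[primary].append(("P", shard))
        candidates = [n for n in nodes if n != primary]
        for _ in range(num_replicas):
            target = min(candidates, key=lambda n: len(assignment[n]))
            assignment[target].append(("R", shard))
            candidates.remove(target)  # rule 1: copies never share a node
    return assignment

layout = allocate(num_shards=3, num_replicas=1, nodes=["n1", "n2", "n3"])
# rule 2: the 6 shard copies spread evenly, 2 per node
```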

Allocation Settings#

PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "all",
    "cluster.routing.allocation.disk.threshold_enabled": true,
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "95%"
  }
}
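What crossing each watermark triggers, as a sketch of the decision (thresholds mirror the settings above, which match the defaults in recent versions):

```python
def watermark_effect(disk_used_pct: float,
                     low: float = 85.0,
                     high: float = 90.0,
                     flood: float = 95.0) -> str:
    """Summarize the allocation effect of crossing each disk watermark."""
    if disk_used_pct >= flood:
        return "flood_stage: index write block applied, writes rejected"
    if disk_used_pct >= high:
        return "high: shards relocated away from this node"
    if disk_used_pct >= low:
        return "low: no new shards allocated to this node"
    return "ok: node eligible for new shards"
```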

Shard Rebalancing#

// Manual shard move
POST /_cluster/reroute
{
  "commands": [
    {
      "move": {
        "index": "products",
        "shard": 0,
        "from_node": "node-1",
        "to_node": "node-2"
      }
    }
  ]
}

Unassigned Shard Diagnosis#

GET /_cluster/allocation/explain
{
  "index": "products",
  "shard": 0,
  "primary": true
}

Node Management#

Node List#

GET /_cat/nodes?v&h=name,ip,role,master,heap.percent,disk.used_percent
name     ip       role master heap.percent disk.used_percent
data-1   10.0.0.1 d    -      45           60
data-2   10.0.0.2 d    -      52           55
master-1 10.0.0.3 m    *      30           20
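_cat output is plain whitespace-separated text, so it is easy to post-process in scripts; a small parser (a sketch assuming the columns requested above):

```python
def parse_cat(text: str) -> list[dict]:
    """Turn _cat/...?v output (header row + data rows) into dicts."""
    lines = text.strip().splitlines()
    headers = lines[0].split()
    return [dict(zip(headers, row.split())) for row in lines[1:]]

sample = """\
name     ip       role master heap.percent disk.used_percent
data-1   10.0.0.1 d    -      45           60
data-2   10.0.0.2 d    -      52           55
master-1 10.0.0.3 m    *      30           20
"""
nodes = parse_cat(sample)
high_heap = [n["name"] for n in nodes if int(n["heap.percent"]) > 50]
```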

Adding a Node#

  1. Install Elasticsearch on new node
  2. Configure elasticsearch.yml:
    cluster.name: my-cluster
    node.name: data-4
    network.host: 10.0.0.4
    discovery.seed_hosts: ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
  3. Start Elasticsearch → Automatically joins cluster

Safe Node Removal#

// 1. Exclude from shard allocation
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._name": "data-4"
  }
}

// 2. Wait for shard migration to complete
GET /_cat/shards?v&h=index,shard,prirep,node

// 3. Shut down node
// 4. Remove exclusion setting
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._name": null
  }
}
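Step 2's "wait for migration" can be automated by polling _cat/shards until no shards remain on the excluded node; the check itself is simple (a sketch over the columns requested above):

```python
def shards_on_node(cat_shards: str, node: str) -> int:
    """Count shards still assigned to `node` in
    _cat/shards?v&h=index,shard,prirep,node output."""
    rows = cat_shards.strip().splitlines()[1:]  # skip the header row
    return sum(1 for row in rows if row.split()[-1] == node)

sample = """\
index    shard prirep node
products 0     p      data-1
products 0     r      data-4
products 1     p      data-2
"""
# safe to shut down once shards_on_node(sample, "data-4") reaches 0
```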

Rolling Restart#

Zero-downtime cluster restart:

Procedure#

// 1. Disable shard reallocation
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "primaries"
  }
}

// 2. Flush (optional; POST /_flush/synced was removed in 8.x)
POST /_flush

// 3. Restart nodes one by one
// - Stop node
// - Change settings/upgrade
// - Start node
// - Verify cluster status is green before next node

// 4. Re-enable shard reallocation
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "all"
  }
}
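The "verify cluster status is green" gate in step 3 can be scripted against the /_cluster/health response (a sketch; the field names match the response shown earlier):

```python
def safe_to_restart_next(health: dict) -> bool:
    """Only proceed to the next node when the cluster is green and
    no shards are still relocating or initializing."""
    return (health.get("status") == "green"
            and health.get("relocating_shards", 0) == 0
            and health.get("initializing_shards", 0) == 0)
```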

During Version Upgrade#

# 1. Create snapshot (backup)
PUT /_snapshot/my_backup/pre_upgrade

# 2. Perform rolling restart

# 3. Verify upgrade complete
GET /

Monitoring#

Key APIs#

# Cluster health
GET /_cluster/health

# Node statistics
GET /_nodes/stats

# Index statistics
GET /_stats

# Shard status
GET /_cat/shards?v

# Thread pool
GET /_cat/thread_pool?v

# Index overview
GET /_cat/indices?v&h=index,health,pri,rep,docs.count,store.size

Key Monitoring Metrics#

| Metric | Normal Range | How to Check |
| --- | --- | --- |
| Cluster Status | green | /_cluster/health |
| JVM Heap Usage | < 75% | /_nodes/stats/jvm |
| Disk Usage | < 80% | /_cat/allocation |
| Search Latency | < 100ms | /_nodes/stats/indices/search |
| Indexing Latency | < 50ms | /_nodes/stats/indices/indexing |
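The thresholds above are easy to encode in an alerting script (a sketch; the limits are taken from the table, while the metric keys are illustrative names, not API field names):

```python
# Limits from the table above; keys are illustrative, not API fields.
LIMITS = {
    "heap_percent": 75,
    "disk_used_percent": 80,
    "search_latency_ms": 100,
    "indexing_latency_ms": 50,
}

def violations(metrics: dict) -> list[str]:
    """Return the metrics at or above their normal-range limit."""
    return [name for name, limit in LIMITS.items()
            if metrics.get(name, 0) >= limit]
```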

Kibana Stack Monitoring#

  1. Kibana → Stack Monitoring menu
  2. View Cluster, Node, Index dashboards
  3. Configure alert rules

Cluster Settings#

Setting Types#

| Type | Persistence | Use Case |
| --- | --- | --- |
| transient | Reset on full cluster restart (deprecated since 7.16) | Temporary adjustments |
| persistent | Permanently stored | Operational settings |
| elasticsearch.yml | File-based | Per-node settings |
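For dynamic settings, Elasticsearch resolves these sources in priority order: transient over persistent over elasticsearch.yml over the built-in default. A sketch of that resolution:

```python
def effective_setting(key: str, transient: dict, persistent: dict,
                      yml: dict, default=None):
    """Resolve a dynamic setting: transient > persistent > yml > default."""
    for source in (transient, persistent, yml):
        if key in source:
            return source[key]
    return default

value = effective_setting(
    "cluster.routing.allocation.enable",
    transient={},
    persistent={"cluster.routing.allocation.enable": "all"},
    yml={},
)
```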

View/Change Settings#

// View current settings
GET /_cluster/settings?include_defaults=true

// Change settings
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "all"
  }
}

Troubleshooting#

Yellow Status#

Cause: Not enough nodes to allocate Replicas

Solution:

  1. Add nodes
  2. Or reduce Replica count:
    PUT /products/_settings
    { "number_of_replicas": 0 }
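The reasoning behind option 2: since a primary and each of its replicas must sit on different nodes, an index can only turn green when there are at least replicas + 1 data nodes. A sketch:

```python
def can_go_green(data_nodes: int, replicas: int) -> bool:
    """Every copy of a shard needs its own node."""
    return data_nodes >= replicas + 1

def max_green_replicas(data_nodes: int) -> int:
    """Largest replica count this many data nodes can fully allocate."""
    return max(data_nodes - 1, 0)

# A single-node cluster stays yellow with replicas=1;
# setting number_of_replicas to 0 turns it green.
```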

Red Status#

Cause: Primary shard unassigned

Solution:

// 1. Identify cause
GET /_cluster/allocation/explain

// 2. Force allocation (potential data loss)
POST /_cluster/reroute
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "products",
        "shard": 0,
        "node": "data-1",
        "accept_data_loss": true
      }
    }
  ]
}

Disk Full#

// Temporarily adjust watermark
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.flood_stage": "98%"
  }
}

// Delete old indices or add disk

Next Steps#

| Goal | Recommended Document |
| --- | --- |
| Search optimization | Performance Tuning |
| Failure response | High Availability |
| Practical implementation | Product Search System |