High Availability

Learn about Elasticsearch cluster Replica, Snapshot, and failure response strategies.

High Availability Concepts#

HA (High Availability) Goals#

Metric	Description	Target
Availability	Service uptime	99.9% (8.76 hours downtime/year)
Durability	Data loss prevention	99.999999% (9-nines)
Recovery Time	Failure to recovery	< 30 minutes

HA Components#

flowchart TB
    A[High Availability] --> B[Replica Shard]
    A --> C[Snapshot & Restore]
    A --> D[Cross-Cluster Replication]
    A --> E[Cluster Design]

This diagram shows the four core components of high availability (Replica Shard, Snapshot & Restore, CCR, and Cluster Design) and their relationship.

Replica Shard#

Why do we need Replica Shards? What happens when a node containing a Primary Shard goes down due to hardware failure? That shard’s data becomes inaccessible, and the cluster status turns Red. Replica Shards maintain copies of the Primary on different nodes, so when a failure occurs, the Replica is immediately promoted to Primary, continuing operations without service interruption.

Role#

flowchart LR
    subgraph Node1
        P0[Primary 0]
    end
    subgraph Node2
        R0[Replica 0]
    end
    subgraph Node3
        P1[Primary 1]
    end

    P0 -->|Replication| R0
    Client -->|Write| P0
    Client -->|Read| R0

This diagram shows how write requests are sent to the Primary Shard, replicated to the Replica Shard, and read requests are distributed across replicas.

Data Redundancy: Replica is promoted when Primary fails
Read Performance: Search requests distributed

Replica Configuration#

PUT /products
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}

Dynamic Change#

PUT /products/_settings
{
  "number_of_replicas": 2
}

Recommended Settings#

Environment	number_of_replicas
Development	0
Small Production	1
Large/Critical Data	2

Auto-Expand Replicas#

Automatically adjust based on node count:

PUT /products/_settings
{
  "index.auto_expand_replicas": "0-2"  // min 0, max 2
}

Snapshot & Restore#

Why aren’t Replicas enough? Replicas handle node failures, but they are powerless against operator mistakes like accidentally deleting an index or overwriting data with incorrect values. Replicas are deleted/modified in the same way. Snapshots store the data state at a specific point in time in a separate storage, enabling data recovery even from such logical failures.

What is a Snapshot?#

A backup that saves the state of indices at a specific point in time.

Repository Setup#

S3 Repository:

PUT /_snapshot/my_s3_backup
{
  "type": "s3",
  "settings": {
    "bucket": "my-elasticsearch-backups",
    "region": "ap-northeast-2",
    "base_path": "snapshots"
  }
}

File System:

PUT /_snapshot/my_fs_backup
{
  "type": "fs",
  "settings": {
    "location": "/mount/backups",
    "compress": true
  }
}

path.repo setting required in elasticsearch.yml

Create Snapshot#

// Entire cluster
PUT /_snapshot/my_backup/snapshot_2024_01_15
{
  "indices": "*",
  "include_global_state": true
}

// Specific indices only
PUT /_snapshot/my_backup/products_backup
{
  "indices": "products,orders",
  "include_global_state": false
}

Check Snapshot Status#

GET /_snapshot/my_backup/snapshot_2024_01_15/_status

List Snapshots#

GET /_snapshot/my_backup/_all

Restore#

// Full restore
POST /_snapshot/my_backup/snapshot_2024_01_15/_restore

// Specific indices with rename
POST /_snapshot/my_backup/snapshot_2024_01_15/_restore
{
  "indices": "products",
  "rename_pattern": "(.+)",
  "rename_replacement": "restored_$1"
}

SLM (Snapshot Lifecycle Management)#

Automated backup policy:

PUT /_slm/policy/daily_backup
{
  "schedule": "0 30 2 * * ?",     // Daily at 02:30
  "name": "<daily-snap-{now/d}>",
  "repository": "my_backup",
  "config": {
    "indices": "*",
    "include_global_state": true
  },
  "retention": {
    "expire_after": "30d",
    "min_count": 5,
    "max_count": 50
  }
}

Cross-Cluster Replication (CCR)#

Why do we need Cross-Cluster Replication? Replicas and Snapshots address failures within the same datacenter. But what happens when the datacenter itself is paralyzed by a disaster (fire, power outage, network disconnection)? CCR replicates data in real-time to a cluster in a different region, enabling service continuity even in datacenter-level disasters.

Concept#

Replicate data to a remote cluster in real-time.

flowchart LR
    subgraph Leader["Leader Cluster (Seoul)"]
        L[products]
    end
    subgraph Follower["Follower Cluster (Busan)"]
        F[products-replica]
    end

    L -->|Real-time Replication| F

This diagram shows the CCR configuration where data is replicated in real-time from the Leader Cluster in Seoul to the Follower Cluster in Busan.

Use Cases#

Disaster Recovery (DR): Maintain replica in different region
Regional Reads: Reduce latency
Data Centralization: Multiple clusters → Central aggregation

Configuration#

1. Remote Cluster Connection:

PUT /_cluster/settings
{
  "persistent": {
    "cluster": {
      "remote": {
        "leader_cluster": {
          "seeds": ["leader-node:9300"]
        }
      }
    }
  }
}

2. Create Follower Index:

PUT /products-replica/_ccr/follow
{
  "remote_cluster": "leader_cluster",
  "leader_index": "products"
}

Failure Scenarios and Response#

Scenario 1: Single Node Failure#

Situation: 1 Data Node down

Automatic Response:

Replica promoted to Primary (immediate)
New Replica allocated (on another node)
Cluster status: Remains Green (if Replica exists)

Verification:

GET /_cluster/health
GET /_cat/shards?v

Scenario 2: Master Node Failure#

Situation: Master Node down

Automatic Response:

Master election (another Master-eligible node)
New Master manages cluster state

Recommendation: Minimum 3 Master-eligible nodes (maintain quorum)

Scenario 3: Disk Failure#

Situation: Data disk corrupted

Response:

// 1. Exclude the node
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._name": "damaged-node"
  }
}

// 2. Replace disk and restart node

// 3. Remove exclusion
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._name": null
  }
}

Scenario 4: Complete Cluster Failure#

Situation: Datacenter failure

Response:

Activate DR cluster (if using CCR)
Or restore from snapshot

POST /_snapshot/my_backup/latest/_restore
{
  "indices": "*",
  "include_global_state": true
}

Cluster Design Patterns#

Pattern 1: Active-Passive#

flowchart LR
    subgraph Active["Active Cluster"]
        A1[Node 1]
        A2[Node 2]
        A3[Node 3]
    end
    subgraph Passive["Passive Cluster (DR)"]
        P1[Node 1]
        P2[Node 2]
        P3[Node 3]
    end

    Active -->|CCR| Passive
    Client --> Active

This diagram shows the Active-Passive pattern where only the Active cluster handles reads and writes, while CCR replicates data to the Passive cluster for failover.

Read/write on Active
Passive is standby (activated on failure)

Pattern 2: Active-Active#

flowchart TB
    subgraph Seoul["Seoul Cluster"]
        S[products]
    end
    subgraph Busan["Busan Cluster"]
        B[products]
    end

    SeoulClient --> Seoul
    BusanClient --> Busan
    Seoul <-->|Bidirectional CCR| Busan

This diagram shows the Active-Active pattern where each region (Seoul and Busan) independently handles reads and writes, with bidirectional CCR synchronizing data between them.

Read/write in each region
Bidirectional sync (conflict management required)

Pattern 3: Multi-Datacenter#

// Zone Awareness setting
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": "zone",
    "cluster.routing.allocation.awareness.force.zone.values": "zone1,zone2"
  }
}

# elasticsearch.yml (per node)
node.attr.zone: zone1  # or zone2

→ Primary and Replica placed in different Zones

Monitoring Alert Setup#

Key Alert Conditions#

Condition	Severity	Action
Cluster status Yellow	Warning	Check nodes
Cluster status Red	Critical	Immediate response
Node down	Critical	Recover node
Disk > 80%	Warning	Free up space
Disk > 90%	Critical	Emergency expansion
JVM Heap > 85%	Warning	Check memory

Watcher Alerts (Basic License+)#

PUT /_watcher/watch/cluster_health_watch
{
  "trigger": {
    "schedule": { "interval": "1m" }
  },
  "input": {
    "http": {
      "request": {
        "host": "localhost",
        "port": 9200,
        "path": "/_cluster/health"
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.status": { "eq": "red" }
    }
  },
  "actions": {
    "send_email": {
      "email": {
        "to": "admin@example.com",
        "subject": "Elasticsearch Cluster RED Status!",
        "body": "Cluster status is RED. Check immediately."
      }
    }
  }
}

Real-World Failure Cases and Lessons#

Case 1: Cluster Paralysis from Disk Full#

Situation:

Log indices grew faster than expected
Disk usage exceeded 95% → All indices switched to read-only
New logs couldn’t be ingested, service monitoring stopped

Response:

# 1. Emergency: Delete old indices
DELETE /logs-2024.01.*

# 2. Remove read-only block
PUT /_all/_settings
{ "index.blocks.read_only_allow_delete": null }

# 3. Prevention: Apply ILM policy

Lessons:

Alert on 80% disk usage is essential
ILM auto-delete policy is required
Plan for 2x capacity headroom

Case 2: Master Node Single Point of Failure#

Situation:

Only 1 Master-eligible node running (cost savings)
Master node failure → Entire cluster down
Adding new nodes didn’t help (quorum not met)

Response:

# elasticsearch.yml - Force master election (dangerous!)
cluster.initial_master_nodes: ["node-1"]

Lessons:

Minimum 3 Master-eligible nodes required
Maintain odd numbers (3 is safer than 2)
Configure discovery.seed_hosts correctly

Case 3: OOM During Bulk Indexing#

Situation:

Bulk indexing 100 million documents during migration
JVM Heap 100% → OOM → Node down
Cascading overload on other nodes

Response:

# 1. Adjust bulk size (5-15MB recommended)
# 2. Disable refresh
PUT /products/_settings
{ "refresh_interval": "-1" }

# 3. Temporarily disable replicas
PUT /products/_settings
{ "number_of_replicas": 0 }

# 4. Restore after indexing complete

Lessons:

Manage bulk size by bytes, not document count
Disable refresh_interval during bulk operations
Consider dedicated indexing nodes

Case 4: Hot Spot from Shard Imbalance#

Situation:

Shards concentrated on specific node
That node at 100% CPU, others at 10%
Search response time increased 10x

Response:

// Rebalance shards
POST /_cluster/reroute
{
  "commands": [{
    "move": {
      "index": "products",
      "shard": 0,
      "from_node": "hot-node",
      "to_node": "cold-node"
    }
  }]
}

Lessons:

Monitor /_cat/allocation regularly
Apply Hot-Warm architecture
Use zone awareness for even distribution

Case 5: Snapshot Restore Failure#

Situation:

Failure occurred → Attempted snapshot restore
Snapshot was corrupted, restore failed
Discovered late because backup verification wasn’t done

Response:

# Automate weekly restore tests
# Verify restore on test cluster
POST /_snapshot/my_backup/weekly_snapshot/_restore?wait_for_completion=true
{
  "indices": "products",
  "rename_pattern": "(.+)",
  "rename_replacement": "test_$1"
}

Lessons:

Backup without restore testing is not backup
Monthly restore drills required
Replicate snapshots to different regions

Checklist#

Daily Check#

Cluster status check (/_cluster/health)
Node status check (/_cat/nodes)
Disk usage check (/_cat/allocation)

Weekly Check#

Verify snapshot creation
JVM memory trend review
Slow query log review

Quarterly Check#

Snapshot restore test
DR failover drill
Capacity planning review

Next Steps#

Goal	Recommended Document
Cluster configuration	Cluster Management
Performance optimization	Performance Tuning
Practical implementation	Product Search System