Learn about Elasticsearch cluster Replica, Snapshot, and failure response strategies.

High Availability Concepts#

HA (High Availability) Goals#

| Metric | Description | Target |
|---|---|---|
| Availability | Service uptime | 99.9% (≤ 8.76 hours downtime/year) |
| Durability | Data loss prevention | 99.999999% (eight nines) |
| Recovery Time | Time from failure to full recovery | < 30 minutes |
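The availability target maps directly to an annual downtime budget. A quick sketch of that conversion (illustrative helper, names are my own):

```python
HOURS_PER_YEAR = 365 * 24  # 8760

def downtime_hours_per_year(availability_pct: float) -> float:
    """Hours of downtime permitted per year at the given availability."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)

print(downtime_hours_per_year(99.9))   # 8.76 hours/year
print(downtime_hours_per_year(99.99))  # ~0.88 hours (~53 minutes)/year
```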

HA Components#

flowchart TB
    A[High Availability] --> B[Replica Shard]
    A --> C[Snapshot & Restore]
    A --> D[Cross-Cluster Replication]
    A --> E[Cluster Design]

Replica Shard#

Role#

flowchart LR
    subgraph Node1
        P0[Primary 0]
    end
    subgraph Node2
        R0[Replica 0]
    end
    subgraph Node3
        P1[Primary 1]
    end

    P0 -->|Replication| R0
    Client -->|Write| P0
    Client -->|Read| R0

  1. Data Redundancy: a replica is promoted to primary when the primary fails
  2. Read Performance: search requests are distributed across primary and replica copies

Replica Configuration#

PUT /products
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}

Dynamic Change#

PUT /products/_settings
{
  "number_of_replicas": 2
}

| Environment | number_of_replicas |
|---|---|
| Development | 0 |
| Small Production | 1 |
| Large / Critical Data | 2 |
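Each replica multiplies storage: the total number of shard copies is number_of_shards × (1 + number_of_replicas). A quick sanity check (assumed helper, not an Elasticsearch API):

```python
def total_shard_copies(primaries: int, replicas: int) -> int:
    """Total shard copies (primaries plus all replica copies) for one index."""
    return primaries * (1 + replicas)

# The /products example: 3 primaries, 1 replica -> 6 shard copies on disk
print(total_shard_copies(3, 1))  # 6
print(total_shard_copies(3, 2))  # 9 -> raising replicas to 2 adds 3 more copies
```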

Auto-Expand Replicas#

Automatically adjust based on node count:

PUT /products/_settings
{
  "index.auto_expand_replicas": "0-2"  // min 0, max 2
}
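The effect of auto_expand_replicas can be read as a clamp: Elasticsearch tries to place a copy on every other data node, bounded by the configured min and max. A rough sketch of that rule (my reading of the setting, ignoring allocation filters; hypothetical helper):

```python
def auto_expanded_replicas(data_nodes: int, min_r: int, max_r: int) -> int:
    """Replica count chosen for auto_expand_replicas "min_r-max_r".
    Each replica needs a node other than the one holding the primary."""
    return max(min_r, min(max_r, data_nodes - 1))

# With "0-2":
print(auto_expanded_replicas(1, 0, 2))  # 0 -> single node, no replicas possible
print(auto_expanded_replicas(3, 0, 2))  # 2
print(auto_expanded_replicas(5, 0, 2))  # 2 -> capped at the max
```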

Snapshot & Restore#

What is a Snapshot?#

A backup that saves the state of indices at a specific point in time.

Repository Setup#

S3 Repository:

PUT /_snapshot/my_s3_backup
{
  "type": "s3",
  "settings": {
    "bucket": "my-elasticsearch-backups",
    "region": "ap-northeast-2",
    "base_path": "snapshots"
  }
}

File System:

PUT /_snapshot/my_fs_backup
{
  "type": "fs",
  "settings": {
    "location": "/mount/backups",
    "compress": true
  }
}

Note: the path.repo setting in elasticsearch.yml must include this location before the filesystem repository can be registered.

Create Snapshot#

// Entire cluster
PUT /_snapshot/my_backup/snapshot_2024_01_15
{
  "indices": "*",
  "include_global_state": true
}

// Specific indices only
PUT /_snapshot/my_backup/products_backup
{
  "indices": "products,orders",
  "include_global_state": false
}

Check Snapshot Status#

GET /_snapshot/my_backup/snapshot_2024_01_15/_status

List Snapshots#

GET /_snapshot/my_backup/_all

Restore#

// Full restore
POST /_snapshot/my_backup/snapshot_2024_01_15/_restore

// Specific indices with rename
POST /_snapshot/my_backup/snapshot_2024_01_15/_restore
{
  "indices": "products",
  "rename_pattern": "(.+)",
  "rename_replacement": "restored_$1"
}

SLM (Snapshot Lifecycle Management)#

Automated backup policy:

PUT /_slm/policy/daily_backup
{
  "schedule": "0 30 2 * * ?",     // Daily at 02:30
  "name": "<daily-snap-{now/d}>",
  "repository": "my_backup",
  "config": {
    "indices": "*",
    "include_global_state": true
  },
  "retention": {
    "expire_after": "30d",
    "min_count": 5,
    "max_count": 50
  }
}
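The retention block above can be read as a pruning rule: snapshots older than expire_after are deleted, but never below min_count, and at most max_count are kept. A rough sketch of that logic (my reading of the policy, not the actual SLM implementation):

```python
def snapshots_to_keep(ages_days, expire_after=30, min_count=5, max_count=50):
    """Given snapshot ages in days, return how many survive retention."""
    fresh = sum(1 for age in ages_days if age <= expire_after)
    keep = max(fresh, min_count)                  # never drop below min_count
    keep = min(keep, max_count, len(ages_days))   # cap at max_count
    return keep

# 60 daily snapshots (0..59 days old): the 31 within 30 days survive
print(snapshots_to_keep(list(range(60))))  # 31
# Only 3 snapshots exist, all expired: min_count still keeps all 3
print(snapshots_to_keep([40, 50, 60]))     # 3
```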

Cross-Cluster Replication (CCR)#

Concept#

Replicate data to a remote cluster in real-time.

flowchart LR
    subgraph Leader["Leader Cluster (Seoul)"]
        L[products]
    end
    subgraph Follower["Follower Cluster (Busan)"]
        F[products-replica]
    end

    L -->|Real-time Replication| F

Use Cases#

  • Disaster Recovery (DR): Maintain replica in different region
  • Regional Reads: Reduce latency
  • Data Centralization: Multiple clusters → Central aggregation

Configuration#

1. Remote Cluster Connection:

PUT /_cluster/settings
{
  "persistent": {
    "cluster": {
      "remote": {
        "leader_cluster": {
          "seeds": ["leader-node:9300"]
        }
      }
    }
  }
}

2. Create Follower Index:

PUT /products-replica/_ccr/follow
{
  "remote_cluster": "leader_cluster",
  "leader_index": "products"
}

Failure Scenarios and Response#

Scenario 1: Single Node Failure#

Situation: 1 Data Node down

Automatic Response:

  1. A replica is promoted to primary (immediate)
  2. A new replica is allocated on another node
  3. Cluster status: Yellow while the replacement replica rebuilds, then back to Green

Verification:

GET /_cluster/health
GET /_cat/shards?v

Scenario 2: Master Node Failure#

Situation: Master Node down

Automatic Response:

  1. Master election (another Master-eligible node)
  2. New Master manages cluster state

Recommendation: Minimum 3 Master-eligible nodes (maintain quorum)
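The quorum arithmetic explains the three-node recommendation: a master election needs a strict majority of master-eligible nodes, so a cluster tolerates the loss of N minus quorum(N) of them. A sketch of the voting math (illustrative helpers):

```python
def quorum(master_eligible: int) -> int:
    """Votes needed for a master election: a strict majority."""
    return master_eligible // 2 + 1

def tolerated_failures(master_eligible: int) -> int:
    """Master-eligible nodes that can fail while a quorum survives."""
    return master_eligible - quorum(master_eligible)

for n in (1, 2, 3, 5):
    print(n, quorum(n), tolerated_failures(n))
# 1 -> quorum 1, tolerates 0
# 2 -> quorum 2, tolerates 0  (no better than 1 node!)
# 3 -> quorum 2, tolerates 1
# 5 -> quorum 3, tolerates 2
```

This is why 2 master-eligible nodes buy nothing over 1, and odd counts are preferred.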

Scenario 3: Disk Failure#

Situation: Data disk corrupted

Response:

// 1. Exclude the node
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._name": "damaged-node"
  }
}

// 2. Replace disk and restart node

// 3. Remove exclusion
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._name": null
  }
}

Scenario 4: Complete Cluster Failure#

Situation: Datacenter failure

Response:

  1. Activate DR cluster (if using CCR)
  2. Or restore from snapshot
POST /_snapshot/my_backup/latest/_restore
{
  "indices": "*",
  "include_global_state": true
}

Cluster Design Patterns#

Pattern 1: Active-Passive#

flowchart LR
    subgraph Active["Active Cluster"]
        A1[Node 1]
        A2[Node 2]
        A3[Node 3]
    end
    subgraph Passive["Passive Cluster (DR)"]
        P1[Node 1]
        P2[Node 2]
        P3[Node 3]
    end

    Active -->|CCR| Passive
    Client --> Active
  • Read/write on Active
  • Passive is standby (activated on failure)

Pattern 2: Active-Active#

flowchart TB
    subgraph Seoul["Seoul Cluster"]
        S[products]
    end
    subgraph Busan["Busan Cluster"]
        B[products]
    end

    SeoulClient --> Seoul
    BusanClient --> Busan
    Seoul <-->|Bidirectional CCR| Busan
  • Read/write in each region
  • Bidirectional sync (conflict management required)

Pattern 3: Multi-Datacenter#

// Zone Awareness setting
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": "zone",
    "cluster.routing.allocation.awareness.force.zone.values": "zone1,zone2"
  }
}
# elasticsearch.yml (per node)
node.attr.zone: zone1  # or zone2

→ Primary and Replica placed in different Zones
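The invariant zone awareness enforces can be checked with a small sketch: no two copies of the same shard may land in the same zone (hypothetical helper for illustration):

```python
def zone_aware_ok(assignments) -> bool:
    """Check that no shard has two copies in the same zone.
    assignments: {shard_id: [zone of primary, zone of each replica, ...]}"""
    return all(len(zones) == len(set(zones)) for zones in assignments.values())

print(zone_aware_ok({0: ["zone1", "zone2"], 1: ["zone2", "zone1"]}))  # True
print(zone_aware_ok({0: ["zone1", "zone1"]}))  # False -> zone1 outage loses shard 0
```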


Monitoring Alert Setup#

Key Alert Conditions#

| Condition | Severity | Action |
|---|---|---|
| Cluster status Yellow | Warning | Check nodes |
| Cluster status Red | Critical | Immediate response |
| Node down | Critical | Recover the node |
| Disk > 80% | Warning | Free up space |
| Disk > 90% | Critical | Emergency expansion |
| JVM Heap > 85% | Warning | Check memory |
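The disk rows of the table amount to a simple threshold ladder; a minimal sketch of how a monitoring script might classify them (assumed function, not part of Elasticsearch):

```python
def disk_severity(used_pct: float) -> str:
    """Map disk usage to the alert severity from the table above."""
    if used_pct > 90:
        return "critical"
    if used_pct > 80:
        return "warning"
    return "ok"

print(disk_severity(85))  # warning
print(disk_severity(95))  # critical
```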

Watcher Alerts (Gold License+)#

PUT /_watcher/watch/cluster_health_watch
{
  "trigger": {
    "schedule": { "interval": "1m" }
  },
  "input": {
    "http": {
      "request": {
        "host": "localhost",
        "port": 9200,
        "path": "/_cluster/health"
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.status": { "eq": "red" }
    }
  },
  "actions": {
    "send_email": {
      "email": {
        "to": "admin@example.com",
        "subject": "Elasticsearch Cluster RED Status!",
        "body": "Cluster status is RED. Check immediately."
      }
    }
  }
}

Real-World Failure Cases and Lessons#

Case 1: Cluster Paralysis from Disk Full#

Situation:

  • Log indices grew faster than expected
  • Disk usage exceeded 95% → All indices switched to read-only
  • New logs couldn’t be ingested, service monitoring stopped

Response:

# 1. Emergency: Delete old indices
DELETE /logs-2024.01.*

# 2. Remove read-only block
PUT /_all/_settings
{ "index.blocks.read_only_allow_delete": null }

# 3. Prevention: Apply ILM policy

Lessons:

  • Alert on 80% disk usage is essential
  • ILM auto-delete policy is required
  • Plan for 2x capacity headroom

Case 2: Master Node Single Point of Failure#

Situation:

  • Only 1 Master-eligible node running (cost savings)
  • Master node failure → Entire cluster down
  • Adding new nodes didn’t help (quorum not met)

Response:

# elasticsearch.yml - Force master election (dangerous!)
cluster.initial_master_nodes: ["node-1"]

Lessons:

  • Minimum 3 Master-eligible nodes required
  • Maintain odd numbers (3 is safer than 2)
  • Configure discovery.seed_hosts correctly

Case 3: OOM During Bulk Indexing#

Situation:

  • Bulk indexing 100 million documents during migration
  • JVM Heap 100% → OOM → Node down
  • Cascading overload on other nodes

Response:

# 1. Adjust bulk size (5-15MB recommended)
# 2. Disable refresh
PUT /products/_settings
{ "refresh_interval": "-1" }

# 3. Temporarily disable replicas
PUT /products/_settings
{ "number_of_replicas": 0 }

# 4. Restore after indexing complete

Lessons:

  • Manage bulk size by bytes, not document count
  • Disable refresh_interval during bulk operations
  • Consider dedicated indexing nodes
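
"Manage bulk size by bytes, not document count" can be sketched as a batcher that flushes whenever the accumulated payload would exceed a byte budget (illustrative helper, names are my own; the 5-15 MB guidance above would set max_bytes):

```python
def batch_by_bytes(docs, max_bytes=10 * 1024 * 1024):
    """Yield lists of JSON lines whose combined UTF-8 size stays under max_bytes."""
    batch, size = [], 0
    for doc in docs:
        doc_bytes = len(doc.encode("utf-8"))
        if batch and size + doc_bytes > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(doc)
        size += doc_bytes
    if batch:
        yield batch

# 5 docs of 7 bytes each with a 16-byte budget -> batches of 2, 2, 1
docs = ['{"a":1}', '{"a":2}', '{"a":3}', '{"a":4}', '{"a":5}']
print([len(b) for b in batch_by_bytes(docs, max_bytes=16)])  # [2, 2, 1]
```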

Case 4: Hot Spot from Shard Imbalance#

Situation:

  • Shards concentrated on specific node
  • That node at 100% CPU, others at 10%
  • Search response time increased 10x

Response:

// Rebalance shards
POST /_cluster/reroute
{
  "commands": [{
    "move": {
      "index": "products",
      "shard": 0,
      "from_node": "hot-node",
      "to_node": "cold-node"
    }
  }]
}

Lessons:

  • Monitor /_cat/allocation regularly
  • Apply Hot-Warm architecture
  • Use zone awareness for even distribution

Case 5: Snapshot Restore Failure#

Situation:

  • Failure occurred → Attempted snapshot restore
  • Snapshot was corrupted, restore failed
  • Discovered late because backup verification wasn’t done

Response:

# Automate weekly restore tests
# Verify restore on test cluster
POST /_snapshot/my_backup/weekly_snapshot/_restore?wait_for_completion=true
{
  "indices": "products",
  "rename_pattern": "(.+)",
  "rename_replacement": "test_$1"
}

Lessons:

  • Backup without restore testing is not backup
  • Monthly restore drills required
  • Replicate snapshots to different regions

Checklist#

Daily Check#

  • Cluster status check (/_cluster/health)
  • Node status check (/_cat/nodes)
  • Disk usage check (/_cat/allocation)

Weekly Check#

  • Verify snapshot creation
  • JVM memory trend review
  • Slow query log review

Quarterly Check#

  • Snapshot restore test
  • DR failover drill
  • Capacity planning review

Next Steps#

| Goal | Recommended Document |
|---|---|
| Cluster configuration | Cluster Management |
| Performance optimization | Performance Tuning |
| Practical implementation | Product Search System |