Learn about Elasticsearch cluster Replica, Snapshot, and failure response strategies.
High Availability Concepts#
HA (High Availability) Goals#
| Metric | Description | Target |
|---|---|---|
| Availability | Service uptime | 99.9% (8.76 hours downtime/year) |
| Durability | Data loss prevention | 99.999999% (8 nines) |
| Recovery Time | Failure to recovery | < 30 minutes |
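The downtime figure in the table follows directly from the availability percentage. A minimal sketch of that arithmetic, assuming a 365-day year; the function names are illustrative and not part of any Elasticsearch API:

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, leap years ignored


def downtime_hours_per_year(availability_percent: float) -> float:
    """Yearly downtime budget implied by an availability target in percent."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


def meets_recovery_target(outage_minutes: float, target_minutes: float = 30) -> bool:
    """True when an outage was resolved within the recovery time target."""
    return outage_minutes < target_minutes


for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {downtime_hours_per_year(target):.2f} hours/year")
```

Each additional nine cuts the yearly downtime budget by a factor of ten, which is why the target should be chosen before the cluster is sized.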
HA Components#
flowchart TB
A[High Availability] --> B[Replica Shard]
A --> C[Snapshot & Restore]
A --> D[Cross-Cluster Replication]
A --> E[Cluster Design]
This diagram shows the four core components of high availability (Replica Shard, Snapshot & Restore, CCR, and Cluster Design) and their relationship.
Replica Shard#
Why do we need Replica Shards? What happens when a node containing a Primary Shard goes down due to hardware failure? Without a Replica, that shard’s data becomes inaccessible and the cluster status turns Red. Replica Shards maintain copies of the Primary on different nodes, so when a failure occurs, a Replica is promoted to Primary immediately and the index keeps serving requests.
Role#
flowchart LR
subgraph Node1
P0[Primary 0]
end
subgraph Node2
R0[Replica 0]
end
subgraph Node3
P1[Primary 1]
end
P0 -->|Replication| R0
Client -->|Write| P0
Client -->|Read| R0
This diagram shows how write requests are sent to the Primary Shard, replicated to the Replica Shard, and read requests are distributed across replicas.
- Data Redundancy: Replica is promoted when Primary fails
- Read Performance: Search requests distributed
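The effect of redundancy on cluster status can be illustrated with a toy model. This is a sketch under simplifying assumptions (it only looks at which copies survive and ignores re-allocation); `health_after_node_loss` and the `layout` structure are hypothetical names, not Elasticsearch APIs. The layout mirrors the diagram above, where shard 1 has no Replica.

```python
def health_after_node_loss(shards: dict, failed_node: str) -> str:
    """Return the health colour right after failed_node disappears.

    shards maps a shard id to {"primary": node, "replicas": [nodes]}.
    No replacement replica has been allocated yet in this model.
    """
    status = "green"
    for copies in shards.values():
        all_copies = [copies["primary"], *copies["replicas"]]
        surviving = [node for node in all_copies if node != failed_node]
        if not surviving:
            return "red"  # every copy of this shard is gone
        if len(surviving) < len(all_copies):
            status = "yellow"  # a surviving copy serves as primary, one copy is missing
    return status


layout = {
    0: {"primary": "Node1", "replicas": ["Node2"]},
    1: {"primary": "Node3", "replicas": []},
}

for node in ("Node1", "Node2", "Node3"):
    print(node, "->", health_after_node_loss(layout, node))
```

Losing Node1 or Node2 leaves shard 0 with one copy, while losing Node3 removes the only copy of shard 1.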
Replica Configuration#
PUT /products
{
"settings": {
"number_of_shards": 3,
"number_of_replicas": 1
}
}
Dynamic Change#
PUT /products/_settings
{
"number_of_replicas": 2
}
Recommended Settings#
| Environment | number_of_replicas |
|---|---|
| Development | 0 |
| Small Production | 1 |
| Large/Critical Data | 2 |
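The cost and benefit of each replica count can be estimated with simple arithmetic. The sketch below relies on one fact, that copies of the same shard are never allocated to the same node; the helper names are illustrative assumptions.

```python
def total_shard_copies(primaries: int, replicas: int) -> int:
    """Number of shard copies the cluster must store for one index."""
    return primaries * (1 + replicas)


def min_nodes_for_green(replicas: int) -> int:
    """Data nodes needed so every copy of a shard can sit on its own node."""
    return replicas + 1


def node_losses_without_data_loss(replicas: int) -> int:
    """Simultaneous node losses that still leave one copy of every shard."""
    return replicas


for env, replicas in (("Development", 0), ("Small Production", 1), ("Large/Critical Data", 2)):
    print(env, total_shard_copies(3, replicas), min_nodes_for_green(replicas))
```

For the `products` index above (3 primaries, 1 replica) this gives 6 shard copies and a minimum of 2 data nodes; with fewer nodes the replicas stay unassigned and the cluster reports Yellow.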
Auto-Expand Replicas#
Automatically adjust based on node count:
PUT /products/_settings
{
"index.auto_expand_replicas": "0-2" // min 0, max 2
}
Snapshot & Restore#
Why aren’t Replicas enough? Replicas handle node failures, but they are powerless against operator mistakes such as accidentally deleting an index or overwriting data with incorrect values, because the same deletion or modification is applied to every Replica. Snapshots store the data state at a specific point in time in separate storage, enabling data recovery even from such logical failures.
What is a Snapshot?#
A backup that saves the state of indices at a specific point in time.
Repository Setup#
S3 Repository:
PUT /_snapshot/my_s3_backup
{
"type": "s3",
"settings": {
"bucket": "my-elasticsearch-backups",
"region": "ap-northeast-2",
"base_path": "snapshots"
}
}
File System:
PUT /_snapshot/my_fs_backup
{
"type": "fs",
"settings": {
"location": "/mount/backups",
"compress": true
}
}
Note: the path.repo setting is required in elasticsearch.yml.
Create Snapshot#
// Entire cluster
PUT /_snapshot/my_backup/snapshot_2024_01_15
{
"indices": "*",
"include_global_state": true
}
// Specific indices only
PUT /_snapshot/my_backup/products_backup
{
"indices": "products,orders",
"include_global_state": false
}
Check Snapshot Status#
GET /_snapshot/my_backup/snapshot_2024_01_15/_status
List Snapshots#
GET /_snapshot/my_backup/_all
Restore#
// Full restore
POST /_snapshot/my_backup/snapshot_2024_01_15/_restore
// Specific indices with rename
POST /_snapshot/my_backup/snapshot_2024_01_15/_restore
{
"indices": "products",
"rename_pattern": "(.+)",
"rename_replacement": "restored_$1"
}
SLM (Snapshot Lifecycle Management)#
Automated backup policy:
PUT /_slm/policy/daily_backup
{
"schedule": "0 30 2 * * ?", // Daily at 02:30
"name": "<daily-snap-{now/d}>",
"repository": "my_backup",
"config": {
"indices": "*",
"include_global_state": true
},
"retention": {
"expire_after": "30d",
"min_count": 5,
"max_count": 50
}
}
Cross-Cluster Replication (CCR)#
Why do we need Cross-Cluster Replication? Replicas and Snapshots address failures within the same datacenter. But what happens when the datacenter itself is paralyzed by a disaster (fire, power outage, network disconnection)? CCR asynchronously replicates data in near real time to a cluster in a different region, so service can continue even after a datacenter-level disaster.
Concept#
Replicate data to a remote cluster in near real time (replication is asynchronous).
flowchart LR
subgraph Leader["Leader Cluster (Seoul)"]
L[products]
end
subgraph Follower["Follower Cluster (Busan)"]
F[products-replica]
end
L -->|Real-time Replication| F
This diagram shows the CCR configuration where data is replicated in real-time from the Leader Cluster in Seoul to the Follower Cluster in Busan.
Use Cases#
- Disaster Recovery (DR): Maintain replica in different region
- Regional Reads: Reduce latency
- Data Centralization: Multiple clusters → Central aggregation
Configuration#
1. Remote Cluster Connection:
PUT /_cluster/settings
{
"persistent": {
"cluster": {
"remote": {
"leader_cluster": {
"seeds": ["leader-node:9300"]
}
}
}
}
}
2. Create Follower Index:
PUT /products-replica/_ccr/follow
{
"remote_cluster": "leader_cluster",
"leader_index": "products"
}
Failure Scenarios and Response#
Scenario 1: Single Node Failure#
Situation: 1 Data Node down
Automatic Response:
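- Replica promoted to Primary (immediate)
- New Replica allocated on another node (after the delayed-allocation timeout, 1 minute by default)
- Cluster status: Yellow while the replacement Replica is unassigned, then back to Green (Red only if a lost Primary had no Replica)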
Verification:
GET /_cluster/health
GET /_cat/shards?v
Scenario 2: Master Node Failure#
Situation: Master Node down
Automatic Response:
- Master election (another Master-eligible node)
- New Master manages cluster state
Recommendation: Minimum 3 Master-eligible nodes (maintain quorum)
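The recommendation of three master-eligible nodes follows from majority voting. A minimal sketch of the arithmetic, assuming a simple majority quorum; the function names are illustrative:

```python
def quorum(master_eligible: int) -> int:
    """Votes needed for a majority of the master-eligible nodes."""
    return master_eligible // 2 + 1


def tolerated_master_failures(master_eligible: int) -> int:
    """Master-eligible nodes that can fail while a majority remains."""
    return master_eligible - quorum(master_eligible)


for n in (1, 2, 3, 4, 5):
    print(f"{n} master-eligible: quorum={quorum(n)}, tolerates={tolerated_master_failures(n)}")
```

One or two master-eligible nodes tolerate no failure at all, while three tolerate one. Four tolerate no more than three do, which is why odd counts are preferred.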
Scenario 3: Disk Failure#
Situation: Data disk corrupted
Response:
// 1. Exclude the node
PUT /_cluster/settings
{
"transient": {
"cluster.routing.allocation.exclude._name": "damaged-node"
}
}
// 2. Replace disk and restart node
// 3. Remove exclusion
PUT /_cluster/settings
{
"transient": {
"cluster.routing.allocation.exclude._name": null
}
}
Scenario 4: Complete Cluster Failure#
Situation: Datacenter failure
Response:
- Activate DR cluster (if using CCR)
- Or restore from snapshot
// "latest" is a placeholder: substitute the name of the most recent successful snapshot
POST /_snapshot/my_backup/latest/_restore
{
"indices": "*",
"include_global_state": true
}
Cluster Design Patterns#
Pattern 1: Active-Passive#
flowchart LR
subgraph Active["Active Cluster"]
A1[Node 1]
A2[Node 2]
A3[Node 3]
end
subgraph Passive["Passive Cluster (DR)"]
P1[Node 1]
P2[Node 2]
P3[Node 3]
end
Active -->|CCR| Passive
Client --> Active
This diagram shows the Active-Passive pattern where only the Active cluster handles reads and writes, while CCR replicates data to the Passive cluster for failover.
- Read/write on Active
- Passive is standby (activated on failure)
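Activating the Passive cluster means turning a read-only follower index into a regular index. The sketch below only builds the ordered list of REST calls commonly used for that (pause following, close, unfollow, reopen); it sends nothing. The function name is a hypothetical helper, and `products-replica` is the follower index from the CCR example.

```python
def promote_follower_requests(follower_index: str) -> list:
    """Ordered (method, path) pairs that convert a follower into a regular index."""
    return [
        ("POST", f"/{follower_index}/_ccr/pause_follow"),  # stop pulling from the leader
        ("POST", f"/{follower_index}/_close"),             # unfollow requires a closed index
        ("POST", f"/{follower_index}/_ccr/unfollow"),      # drop the follower metadata
        ("POST", f"/{follower_index}/_open"),              # reopen as a writable index
    ]


for method, path in promote_follower_requests("products-replica"):
    print(method, path)
```

After these steps clients can be redirected to the former Passive cluster. Failing back later requires setting up replication in the opposite direction.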
Pattern 2: Active-Active#
flowchart TB
subgraph Seoul["Seoul Cluster"]
S[products]
end
subgraph Busan["Busan Cluster"]
B[products]
end
SeoulClient --> Seoul
BusanClient --> Busan
Seoul <-->|Bidirectional CCR| Busan
This diagram shows the Active-Active pattern where each region (Seoul and Busan) independently handles reads and writes, with bidirectional CCR synchronizing data between them.
- Read/write in each region
- Bidirectional sync: a follower index is read-only, so each cluster writes to its own leader index and follows the other cluster's leader index (conflict management required in the application)
Pattern 3: Multi-Datacenter#
// Zone Awareness setting
PUT /_cluster/settings
{
"persistent": {
"cluster.routing.allocation.awareness.attributes": "zone",
"cluster.routing.allocation.awareness.force.zone.values": "zone1,zone2"
}
}
# elasticsearch.yml (per node)
node.attr.zone: zone1 # or zone2
→ Primary and Replica placed in different Zones
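Whether the placement actually spans zones can be checked from a shard listing. A minimal sketch, assuming the shard-to-node and node-to-zone mappings have already been collected (for example from `/_cat/shards` and the node attributes); the function and variable names are hypothetical.

```python
def shards_confined_to_one_zone(copies_by_shard: dict, zone_of_node: dict) -> list:
    """Return ids of shards whose copies all sit in a single zone.

    A shard without replicas is always reported, since one copy
    cannot span two zones.
    """
    confined = []
    for shard_id, nodes in copies_by_shard.items():
        zones = {zone_of_node[node] for node in nodes}
        if len(zones) < 2:
            confined.append(shard_id)
    return confined


zone_of_node = {"node-a": "zone1", "node-b": "zone1", "node-c": "zone2"}
copies_by_shard = {
    "products[0]": ["node-a", "node-c"],  # spans zone1 and zone2
    "products[1]": ["node-a", "node-b"],  # both copies in zone1
}
print(shards_confined_to_one_zone(copies_by_shard, zone_of_node))
```

Any shard in the result would become unavailable if its zone failed, so the list should be empty once zone awareness is in effect.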
Monitoring Alert Setup#
Key Alert Conditions#
| Condition | Severity | Action |
|---|---|---|
| Cluster status Yellow | Warning | Check nodes |
| Cluster status Red | Critical | Immediate response |
| Node down | Critical | Recover node |
| Disk > 80% | Warning | Free up space |
| Disk > 90% | Critical | Emergency expansion |
| JVM Heap > 85% | Warning | Check memory |
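The table translates directly into threshold checks. A minimal sketch, assuming the metrics were already gathered from the health and node stats APIs; `evaluate_alerts` and its parameters are illustrative names.

```python
def evaluate_alerts(status: str, nodes_down: int, disk_pct: float, heap_pct: float) -> list:
    """Return (severity, condition) pairs for the thresholds in the table."""
    alerts = []
    if status == "red":
        alerts.append(("Critical", "Cluster status Red"))
    elif status == "yellow":
        alerts.append(("Warning", "Cluster status Yellow"))
    if nodes_down > 0:
        alerts.append(("Critical", "Node down"))
    if disk_pct > 90:
        alerts.append(("Critical", "Disk > 90%"))
    elif disk_pct > 80:
        alerts.append(("Warning", "Disk > 80%"))
    if heap_pct > 85:
        alerts.append(("Warning", "JVM Heap > 85%"))
    return alerts


print(evaluate_alerts(status="yellow", nodes_down=1, disk_pct=83.0, heap_pct=70.0))
```

Only the most severe disk condition is reported, so a disk at 92% raises one Critical alert rather than a Critical and a Warning.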
Watcher Alerts (Paid Subscription)#
PUT /_watcher/watch/cluster_health_watch
{
"trigger": {
"schedule": { "interval": "1m" }
},
"input": {
"http": {
"request": {
"host": "localhost",
"port": 9200,
"path": "/_cluster/health"
}
}
},
"condition": {
"compare": {
"ctx.payload.status": { "eq": "red" }
}
},
"actions": {
"send_email": {
"email": {
"to": "admin@example.com",
"subject": "Elasticsearch Cluster RED Status!",
"body": "Cluster status is RED. Check immediately."
}
}
}
}
Real-World Failure Cases and Lessons#
Case 1: Cluster Paralysis from Disk Full#
Situation:
- Log indices grew faster than expected
- Disk usage exceeded 95% (the default flood-stage watermark) → affected indices switched to read-only
- New logs couldn’t be ingested, service monitoring stopped
Response:
# 1. Emergency: Delete old indices
DELETE /logs-2024.01.*
# 2. Remove read-only block
PUT /_all/_settings
{ "index.blocks.read_only_allow_delete": null }
# 3. Prevention: Apply ILM policy
Lessons:
- Alert on 80% disk usage is essential
- ILM auto-delete policy is required
- Plan for 2x capacity headroom
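The 95% block is the last of three disk watermarks. The sketch below classifies disk usage against the Elasticsearch default values (85% low, 90% high, 95% flood stage) and applies the 2x headroom rule; the function names are illustrative, and clusters that override the watermark settings need different constants.

```python
# Default disk watermarks, in percent of disk used
LOW_WATERMARK = 85
HIGH_WATERMARK = 90
FLOOD_STAGE = 95


def disk_stage(used_pct: float) -> str:
    """Name the watermark stage a node is in for a given disk usage."""
    if used_pct > FLOOD_STAGE:
        return "flood_stage"  # indices on the node get a read-only block
    if used_pct > HIGH_WATERMARK:
        return "high"         # shards are relocated away from the node
    if used_pct > LOW_WATERMARK:
        return "low"          # no new shards are allocated to the node
    return "ok"


def planned_capacity_gb(expected_data_gb: float, headroom: float = 2.0) -> float:
    """Disk capacity to provision when planning with a headroom factor."""
    return expected_data_gb * headroom


for pct in (70, 87, 92, 96):
    print(pct, disk_stage(pct))
```

An alert at 80% fires before the low watermark is reached, which leaves time to delete or move data while allocation still works normally.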
Case 2: Master Node Single Point of Failure#
Situation:
- Only 1 Master-eligible node running (cost savings)
- Master node failure → Entire cluster down
- Adding new nodes didn’t help (quorum not met)
Response:
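# elasticsearch.yml - Bootstrap a new cluster with a single master-eligible node (dangerous!)
# This setting is only read when a brand-new cluster forms; it does not recover the old cluster state.
cluster.initial_master_nodes: ["node-1"]
Lessons: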
- Minimum 3 Master-eligible nodes required
- Maintain odd numbers (3 is safer than 2)
- Configure discovery.seed_hosts correctly
Case 3: OOM During Bulk Indexing#
Situation:
- Bulk indexing 100 million documents during migration
- JVM Heap 100% → OOM → Node down
- Cascading overload on other nodes
Response:
# 1. Adjust bulk size (5-15MB recommended)
# 2. Disable refresh
PUT /products/_settings
{ "refresh_interval": "-1" }
# 3. Temporarily disable replicas
PUT /products/_settings
{ "number_of_replicas": 0 }
# 4. Restore after indexing complete
Lessons:
- Manage bulk size by bytes, not document count
- Set refresh_interval to -1 during bulk operations and restore it afterwards
- Consider dedicated indexing nodes
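Managing bulk size by bytes means cutting a new request whenever the accumulated payload would exceed a byte budget. A minimal sketch that builds NDJSON bodies for the `_bulk` endpoint without sending them; `bulk_bodies` is a hypothetical helper, and the 10 MB default is an assumption taken from the 5-15MB range above.

```python
import json


def bulk_bodies(index: str, docs, max_bytes: int = 10 * 1024 * 1024):
    """Yield NDJSON _bulk bodies whose encoded size stays within max_bytes.

    A single document larger than max_bytes is still sent alone.
    """
    lines, size = [], 0
    for doc in docs:
        action = json.dumps({"index": {"_index": index}})
        pair = f"{action}\n{json.dumps(doc)}\n"
        pair_size = len(pair.encode("utf-8"))
        if lines and size + pair_size > max_bytes:
            yield "".join(lines)  # flush before the budget is exceeded
            lines, size = [], 0
        lines.append(pair)
        size += pair_size
    if lines:
        yield "".join(lines)


docs = [{"id": i, "name": "x" * 50} for i in range(100)]
bodies = list(bulk_bodies("products", docs, max_bytes=1000))
print(len(bodies), "requests")
```

Because the limit is measured on the encoded payload, a batch of large documents is split into more requests than a batch of small ones, which keeps heap pressure per request roughly constant.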
Case 4: Hot Spot from Shard Imbalance#
Situation:
- Shards concentrated on specific node
- That node at 100% CPU, others at 10%
- Search response time increased 10x
Response:
// Rebalance shards
POST /_cluster/reroute
{
"commands": [{
"move": {
"index": "products",
"shard": 0,
"from_node": "hot-node",
"to_node": "cold-node"
}
}]
}
Lessons:
- Monitor /_cat/allocation regularly
- Apply Hot-Warm architecture
- Use zone awareness for even distribution
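Regular monitoring can be reduced to a simple check on shard counts per node. A minimal sketch, assuming the counts were read from `/_cat/allocation`; the function name and the 1.5x tolerance are assumptions. Shard count is only a proxy for load, since one busy shard can still overload a node.

```python
def overloaded_nodes(shards_per_node: dict, tolerance: float = 1.5) -> list:
    """Return nodes holding more than tolerance times the mean shard count."""
    if not shards_per_node:
        return []
    mean = sum(shards_per_node.values()) / len(shards_per_node)
    return sorted(node for node, count in shards_per_node.items() if count > mean * tolerance)


counts = {"hot-node": 12, "cold-node": 2, "node-3": 4}
print(overloaded_nodes(counts))
```

Here the mean is 6 shards per node, so only `hot-node` with 12 shards exceeds the 9-shard threshold.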
Case 5: Snapshot Restore Failure#
Situation:
- Failure occurred → Attempted snapshot restore
- Snapshot was corrupted, restore failed
- Discovered late because backup verification wasn’t done
Response:
# Automate weekly restore tests
# Verify restore on test cluster
POST /_snapshot/my_backup/weekly_snapshot/_restore?wait_for_completion=true
{
"indices": "products",
"rename_pattern": "(.+)",
"rename_replacement": "test_$1"
}
Lessons:
- Backup without restore testing is not backup
- Monthly restore drills required
- Replicate snapshots to different regions
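A restore test is only useful if its result is checked. One simple check, sketched below, compares document counts recorded when the snapshot was taken with the counts of the restored `test_` indices. The function name and both input dictionaries are assumptions; the counts would come from a count or `_cat/indices` call on each cluster.

```python
def verify_restore(source_counts: dict, restored_counts: dict, prefix: str = "test_") -> dict:
    """Return indices whose restored copy is missing or has a different doc count."""
    problems = {}
    for index, expected in source_counts.items():
        actual = restored_counts.get(prefix + index)  # None when the copy is missing
        if actual != expected:
            problems[index] = {"expected": expected, "actual": actual}
    return problems


at_snapshot_time = {"products": 120000, "orders": 45000}
after_restore = {"test_products": 120000}
print(verify_restore(at_snapshot_time, after_restore))
```

An empty result means every index came back with the expected number of documents. Anything else should fail the drill and raise an alert.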
Checklist#
Daily Check#
- Cluster status check (/_cluster/health)
- Node status check (/_cat/nodes)
- Disk usage check (/_cat/allocation)
Weekly Check#
- Verify snapshot creation
- JVM memory trend review
- Slow query log review
Quarterly Check#
- Snapshot restore test
- DR failover drill
- Capacity planning review
Next Steps#
| Goal | Recommended Document |
|---|---|
| Cluster configuration | Cluster Management |
| Performance optimization | Performance Tuning |
| Practical implementation | Product Search System |