Problem:
The client implemented a 4-node OpenSearch cluster to ensure high availability for their application. When all four nodes were operational, both indexing and searching worked seamlessly. However, during a high availability test where two nodes were intentionally turned off, the indexing process stalled, and no documents were processed. Indexing resumed only after the two nodes were brought back online. The client used the OpenSearch REST Client Java library to connect to the cluster.
Process:
Upon receiving the client’s request, the expert identified the root causes of the issue and proposed a structured approach to resolve it.
Step 1: Assessing Cluster Configuration and Health
The expert reviewed the configuration files provided by the client to confirm the cluster settings. All four nodes were identified as master-eligible, meaning each could potentially act as the cluster’s master.
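Node roles can also be confirmed directly over the REST API rather than from configuration files. The command below is a minimal sketch of that check ({ANY_HOST} stands for any reachable node, as in the health check later in this report); on OpenSearch 2.x the "master" column may appear as "cluster_manager" instead. A node whose role string includes "m" is typically master-eligible, and the currently elected master is marked with an asterisk.

curl -s "http://{ANY_HOST}:9200/_cat/nodes?v&h=name,node.role,master"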
The expert clarified that OpenSearch requires a strict majority of master-eligible nodes (50% + 1) to elect a master and keep the cluster operational. With four master-eligible nodes, the quorum is three. When two of the four nodes were turned off, only two remained, exactly 50% and one short of a quorum, so the cluster lost its elected master and indexing was suspended.
Using the following command, the expert demonstrated how the cluster's status can be monitored from any reachable node:

curl -s http://{ANY_HOST}:9200/_cluster/health?pretty

In this case, the cluster would most likely have reported a "red" status with unassigned shards, or the request would have failed outright because no master could be discovered.
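As a quick complement to the health check, the cat API can also show whether any master has been elected at all. The following is a sketch; on OpenSearch 2.x the same information is exposed as _cat/cluster_manager, and while the cluster has no elected master the call will return an error or an empty entry rather than a node name.

curl -s "http://{ANY_HOST}:9200/_cat/master?v"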
Step 2: Addressing the Master Node Quorum Issue
The expert recommended reconfiguring the cluster to use an odd number of master-eligible nodes. For small clusters, this gives the best fault tolerance for the node count: three master-eligible nodes tolerate the loss of one, and five tolerate the loss of two, without the cluster losing its quorum.
For the client’s 4-node setup, the expert suggested either reducing the number of master-eligible nodes to three (leaving the fourth as a data-only node) or adding a fifth node to the cluster. The client was also advised to verify the master election settings and to ensure that a majority of the master-eligible nodes remains online at all times during high availability testing.
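As an illustration of the first option, the snippet below sketches the relevant opensearch.yml settings, assuming the cluster runs OpenSearch 2.x, where the master role is named cluster_manager (1.x releases use the legacy node.master: true/false setting instead). Each node must be restarted for the role change to take effect.

# opensearch.yml on the three master-eligible nodes (illustrative)
node.roles: [ cluster_manager, data, ingest ]

# opensearch.yml on the fourth, data-only node (illustrative)
node.roles: [ data, ingest ]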
Step 3: Verifying Replica Placement and Shard Assignment
The expert highlighted that replica shard placement can also affect indexing during node outages:
- If an index is configured with two replicas and only two nodes remain operational, the cluster cannot allocate all of the replica copies and the index stays in a "yellow" state. Indexing should still work in that state, but further node outages risk leaving shards with no live copy at all. The index's replica setting can be inspected, and if necessary adjusted, through the settings API, as sketched below.
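The replica count of an index can be checked, and if necessary changed, through the index settings API. The commands below are a sketch using {INDEX} as a hypothetical placeholder for the index name; lowering number_of_replicas is only a temporary measure for a degraded cluster and should be reverted once all nodes are back online.

curl -s "http://{ANY_HOST}:9200/{INDEX}/_settings?pretty"

curl -s -X PUT "http://{ANY_HOST}:9200/{INDEX}/_settings" -H 'Content-Type: application/json' -d '{"index": {"number_of_replicas": 1}}'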
The expert recommended monitoring the “unassigned_shards” and “initializing_shards” fields in the cluster health output to ensure shards are reassigned or rebuilt properly after node recovery.
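To see exactly which shards are unassigned, the cat shards API provides a per-shard view; the command below is a sketch, and its UNASSIGNED entries should disappear as recovered nodes rejoin and replicas are rebuilt. The _cluster/allocation/explain API can additionally report why a particular shard remains unassigned.

curl -s "http://{ANY_HOST}:9200/_cat/shards?v&h=index,shard,prirep,state,node"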
Step 4: Simulating Failover with Best Practices
The expert outlined failover testing best practices:
- Always keep a majority (50% + 1) of the master-eligible nodes operational.
- For clusters with replicas, ensure that simultaneous node outages do not exceed the number of replicas per shard, so that at least one copy of every shard survives. For example, with two replicas (three copies of each shard), up to two data nodes can be offline at once, subject to the master quorum rule above; the cluster will report "yellow" health until the missing replicas are rebuilt.
- Monitor the cluster health status closely and allow shard replication and recovery to finish before taking down additional nodes; the wait_for_status parameter shown below can automate that pause.
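One way to build that pause into a failover test script is to block until the cluster reports the desired health. The command below is a sketch using the cluster health API's wait_for_status and timeout parameters; it returns as soon as the requested status is reached or the timeout expires, and a "timed_out": false field in the response confirms the target status was actually reached.

curl -s "http://{ANY_HOST}:9200/_cluster/health?wait_for_status=green&timeout=120s&pretty"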
Solution:
The expert resolved the client’s issue by reconfiguring the cluster with three master-eligible nodes instead of four. This ensured that the cluster could maintain a quorum when two nodes were taken offline, provided at least two of the three master-eligible nodes remained online. Additionally, the client implemented proper replica management and monitored cluster health metrics during failover testing.
Conclusion:
This case underscores the importance of careful cluster configuration and planning for high availability. By following the expert’s recommendations, the client was able to conduct failover testing successfully without encountering indexing failures. For small clusters, maintaining an odd number of master-eligible nodes and adhering to replica management best practices are crucial for ensuring stability and performance during node outages.