Problem:
In the production environment of a multi-node OpenSearch cluster, the nodes frequently crashed due to Out-of-Memory (OOM) errors. Initially, the heap size was increased from 16 GB to 30 GB based on IBM’s recommendations, but the problem persisted. IBM further suggested increasing the number of shards from 16 to 64 to mitigate memory overload. However, the client sought clarification on whether this change could be applied without the need for reindexing, and whether further shard increases could reduce OOM errors.
Process:
Step 1 – Initial Assessment:
Upon reviewing the logs and cluster setup, the expert confirmed that the primary shard count of an existing index cannot be changed in place, so increasing the number of shards would require reindexing the existing data into a new index. While more shards would distribute the workload more evenly across the nodes, simply raising the shard count would not resolve the OOM issue without redistributing the data. The team recommended using OpenSearch's _reindex API to perform this migration.
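A minimal sketch of such a reindex call, using Python's requests library against the standard _reindex endpoint; the cluster URL, credentials, and index names (logs-old, logs-64-shards) are placeholders rather than the client's actual configuration:

```python
import requests

OPENSEARCH_URL = "https://localhost:9200"   # placeholder cluster endpoint
AUTH = ("admin", "admin")                   # placeholder credentials

# Copy all documents from the existing index into a new index that was
# created beforehand with the higher primary-shard count.
reindex_body = {
    "source": {"index": "logs-old"},
    "dest": {"index": "logs-64-shards"},
}

resp = requests.post(
    f"{OPENSEARCH_URL}/_reindex",
    params={"wait_for_completion": "false"},  # run as a background task
    json=reindex_body,
    auth=AUTH,
    verify=False,  # self-signed certs in test setups; use a proper CA in production
)
print(resp.json())  # returns a task ID that can be polled via the _tasks API
```

Running the reindex as a background task keeps the request from timing out on large indexes and lets progress be checked during the off-peak window.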
Key Findings:
- OpenSearch fixes an index's primary shard count at creation time and does not automatically redistribute existing data across a higher shard count.
- Reindexing was therefore necessary to spread the data over smaller shards, lowering per-shard memory pressure and reducing the frequent garbage-collection cycles that were contributing to the OOM errors.
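One way to confirm how the data is currently spread across the existing primaries is the _cat/shards endpoint; the sketch below assumes the same placeholder endpoint, credentials, and index name as above:

```python
import requests

OPENSEARCH_URL = "https://localhost:9200"   # placeholder cluster endpoint
AUTH = ("admin", "admin")                   # placeholder credentials

# List the shards of the existing index (name is a placeholder) to see how
# documents and storage are distributed across the current 16 primaries.
resp = requests.get(
    f"{OPENSEARCH_URL}/_cat/shards/logs-old",
    params={"format": "json"},
    auth=AUTH,
    verify=False,
)
resp.raise_for_status()

for shard in resp.json():
    print(shard["index"], shard["shard"], shard["prirep"],
          shard["docs"], shard["store"], shard["node"])
```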
Step 2 – Recommendations and Client Communication:
The expert proposed reindexing the data during off-peak hours to minimize disruption. This would let the cluster take advantage of the expanded count of 64 shards and improve memory management across the nodes.
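Because the primary-shard count is fixed at index creation, the destination index would be created with 64 shards before the off-peak reindex run. A rough illustration, with the endpoint, credentials, index name, and replica count as assumptions:

```python
import requests

OPENSEARCH_URL = "https://localhost:9200"   # placeholder cluster endpoint
AUTH = ("admin", "admin")                   # placeholder credentials

# Create the destination index with the target primary-shard count; the
# replica count of 1 is an assumption, not the client's actual setting.
index_settings = {
    "settings": {
        "index": {
            "number_of_shards": 64,
            "number_of_replicas": 1,
        }
    }
}

resp = requests.put(
    f"{OPENSEARCH_URL}/logs-64-shards",
    json=index_settings,
    auth=AUTH,
    verify=False,
)
resp.raise_for_status()
print(resp.json())  # {"acknowledged": true, ...} on success
```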
However, the client raised further questions about shard-to-heap ratios, citing an earlier IBM recommendation of 20 shards per GB of heap memory. With the current 30 GB heap, that ratio works out to roughly 600 shards (20 × 30), so the client asked whether increasing the shard count to 600 rather than 64 would better resolve the OOM errors, and what the impact on CPU usage and overall performance would be.
Step 3 – Expert Analysis and Response:
The expert explained that while a higher shard count allows better parallelization of indexing and search, each shard is a separate Lucene index with its own CPU and memory overhead. A significant jump in shards (e.g., to 600) without sufficient CPU and I/O headroom could therefore create bottlenecks or slowdowns rather than relieving them.
The expert suggested increasing shards in gradual increments (e.g., 100, 200, 300) while monitoring CPU and memory usage after each step, rather than jumping straight to 600 shards. The client was also advised to look for inefficient queries and large data payloads that could keep causing OOM errors regardless of the shard count.
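A simple way to track that baseline between increments is to poll the nodes stats API for per-node heap and CPU figures; the sketch below assumes the same placeholder endpoint and credentials and is not the client's actual monitoring tooling:

```python
import requests

OPENSEARCH_URL = "https://localhost:9200"   # placeholder cluster endpoint
AUTH = ("admin", "admin")                   # placeholder credentials

# Pull per-node JVM heap and OS/CPU statistics so each shard increment can be
# compared against the previous baseline before scaling further.
resp = requests.get(
    f"{OPENSEARCH_URL}/_nodes/stats/jvm,os",
    auth=AUTH,
    verify=False,
)
resp.raise_for_status()

for node_id, node in resp.json()["nodes"].items():
    heap_pct = node["jvm"]["mem"]["heap_used_percent"]
    cpu_pct = node["os"]["cpu"]["percent"]
    print(f"{node['name']}: heap {heap_pct}% used, CPU {cpu_pct}%")
```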
Solution:
- Reindexing Strategy: The team guided the client through reindexing their data to fully utilize the newly expanded shard count of 64, optimizing memory management.
- Shard Scaling: A conservative shard scaling plan was advised, starting with 100 shards and increasing as necessary while monitoring system resource usage.
- Performance Monitoring: The team emphasized monitoring CPU and I/O load to catch bottlenecks introduced by the higher shard count, and highlighted the value of analyzing aggregation-heavy queries and disabling unnecessary field data caching to reduce heap pressure (a quick check of the fielddata cache is sketched below).
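For the field data point specifically, per-node fielddata memory can be inspected with _cat/fielddata, and an index's fielddata cache can be cleared explicitly; endpoint, credentials, and index name are again placeholders:

```python
import requests

OPENSEARCH_URL = "https://localhost:9200"   # placeholder cluster endpoint
AUTH = ("admin", "admin")                   # placeholder credentials

# Show how much heap each node's fielddata cache is holding; large values here
# often point at aggregations or sorts running on analyzed text fields.
cat = requests.get(
    f"{OPENSEARCH_URL}/_cat/fielddata",
    params={"format": "json"},
    auth=AUTH,
    verify=False,
)
print(cat.json())

# If a particular index has accumulated unnecessary fielddata, its cache can
# be cleared explicitly (index name is a placeholder).
clear = requests.post(
    f"{OPENSEARCH_URL}/logs-64-shards/_cache/clear",
    params={"fielddata": "true"},
    auth=AUTH,
    verify=False,
)
print(clear.json())
```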
Results:
After implementing the proposed changes, the system exhibited a more stable memory footprint, reducing the frequency of OOM errors. The gradual increase in shards, coupled with optimizations to queries and field data management, helped balance the load on the CPU while improving the system’s overall stability.
Conclusion:
Managing OpenSearch requires balancing memory allocation, shard configuration, and system resource usage. Increasing shards without considering CPU and I/O resources can result in performance bottlenecks. A gradual approach to scaling, along with ongoing performance monitoring and query optimization, is essential for mitigating OOM errors and maintaining long-term stability in the production environment.