Problem:

The client faced significant delays executing Elasticsearch queries in their production environment. One query, a simple lookup on a numeric account identifier, took an alarming 68 seconds to execute despite returning only six hits and a total output of 583KB; the Elasticsearch profiler showed that 67 of those seconds were spent in the “HighlightPhase.”
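For context, a profiled run can be reproduced along these lines. This is only a sketch: the endpoint, index name, and field name are assumptions rather than the client's actual values, and the exact shape of the profile response varies slightly by version (Elasticsearch 7.16+ adds the fetch-phase breakdown, which is where HighlightPhase appears).

```python
import requests

ES = "http://localhost:9200"   # assumed endpoint
INDEX = "alerts"               # hypothetical index name

# "profile": true adds per-shard timing breakdowns to the search response;
# on 7.16+ this includes the fetch phase, where HighlightPhase shows up.
body = {
    "profile": True,
    "query": {"term": {"account_id": "0127390193"}},   # hypothetical field/value
    "highlight": {"fields": {"*": {}}},
}

resp = requests.post(f"{ES}/{INDEX}/_search", json=body).json()
print("took (ms):", resp["took"])

# Walk the per-shard fetch breakdown to see where the time went.
for shard in resp["profile"]["shards"]:
    fetch = shard.get("fetch")
    if fetch:
        for child in fetch.get("children", []):
            print(shard["id"], child["type"], child["time_in_nanos"])
```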

One of the documents returned by the query, identified as Alerts.CC0127390193, was notably large, with its full JSON source totaling 191MB. Yet the query neither searched nor explicitly highlighted the large field responsible for most of that size. The client was running Elasticsearch 7.17 on Linux.

The client also provided various logs, outputs, and configuration files, which revealed no meaningful differences between node configurations and confirmed that no monitoring was active on the cluster.

Process:

Upon reviewing the issue, the expert team initiated a series of investigative steps and recommendations to identify the root cause and potential solutions:

Initial Query Analysis:

The expert team first asked the client to run the same query without highlighting and compare execution times. This step was crucial for determining whether highlighting was indeed the primary bottleneck.
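A minimal way to run that comparison from a script is sketched below; the endpoint, index name, and query values are placeholders, not the client's actual query.

```python
import time

import requests

ES = "http://localhost:9200"   # assumed endpoint
INDEX = "alerts"               # hypothetical index name

base = {
    "query": {"term": {"account_id": "0127390193"}},   # hypothetical field/value
    "size": 10,
}

def timed_search(body):
    """Run one search and return (server-side 'took' in ms, wall-clock seconds)."""
    start = time.perf_counter()
    resp = requests.post(f"{ES}/{INDEX}/_search", json=body).json()
    return resp["took"], time.perf_counter() - start

# Identical query, run once with highlighting and once without.
with_highlight = dict(base, highlight={"fields": {"*": {}}})
print("with highlight:    took=%dms wall=%.1fs" % timed_search(with_highlight))
print("without highlight: took=%dms wall=%.1fs" % timed_search(base))
```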

Evaluation of Environment and Configuration:

The expert team reviewed the client’s Elasticsearch setup, including the node configurations and document mappings. They identified that the cluster consisted of five nodes with varying CPU allocations, which was suboptimal. The cluster also used a maximum heap size of 32GB per node, exceeding the roughly 30GB threshold beyond which the JVM can no longer use compressed ordinary object pointers (oops).
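One quick check here is whether the JVMs are still running with compressed oops, which the nodes-info API reports. The endpoint below is an assumption, and the exact fields available can vary by version.

```python
import requests

ES = "http://localhost:9200"   # assumed endpoint

# The nodes-info JVM section reports heap limits and whether compressed
# ordinary object pointers (oops) are still in use.
nodes = requests.get(f"{ES}/_nodes/jvm").json()["nodes"]
for node in nodes.values():
    jvm = node["jvm"]
    print(
        node["name"],
        "heap_max_bytes:", jvm["mem"]["heap_max_in_bytes"],
        "compressed_oops:", jvm["using_compressed_ordinary_object_pointers"],
    )
```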

Engagement with Client’s Architecture:

The expert team held meetings with the client’s architects to discuss potential optimization strategies. During these discussions, the team emphasized the importance of monitoring tools such as APM or profiling to collect data on disk usage, network traffic, and other relevant metrics.
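Even without a full APM deployment, basic figures can be pulled from the cluster on a schedule. The sketch below uses the nodes-stats API; the endpoint and the choice of metrics are illustrative, and in practice the output would feed a monitoring backend rather than stdout.

```python
import requests

ES = "http://localhost:9200"   # assumed endpoint

# Pull per-node OS, JVM, and filesystem stats.
stats = requests.get(f"{ES}/_nodes/stats/os,jvm,fs").json()
for node in stats["nodes"].values():
    print(
        node["name"],
        "cpu%:", node["os"]["cpu"]["percent"],
        "heap_used%:", node["jvm"]["mem"]["heap_used_percent"],
        "disk_free_bytes:", node["fs"]["total"]["free_in_bytes"],
    )
```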

Hypotheses Testing:

The client was asked to test the query without the highlighting phase and report the results. Additionally, the expert team suggested increasing the CPU and RAM allocations on the Elasticsearch nodes, though they recognized that such changes might not be immediate due to the production environment’s constraints.

Client Feedback and Follow-up:

The client provided feedback on two specific points:

  • Highlighting Disabled: When the highlighting was disabled, the query execution time dropped significantly to just 1-2 seconds, depending on the node from which it was run.
  • Document Retrieval: Fetching the entire contents of the large document (Alerts.CC0127390193) took only 3 seconds (a timing sketch follows this list), indicating that raw data retrieval and network transmission were not the bottlenecks.
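
That second measurement can be reproduced with a simple timed fetch. The endpoint and index name below are assumptions, and the document identifier from the case is used directly as the document ID, which may not match the client's actual naming scheme.

```python
import time

import requests

ES = "http://localhost:9200"   # assumed endpoint
INDEX = "alerts"               # hypothetical index name

start = time.perf_counter()
doc = requests.get(f"{ES}/{INDEX}/_doc/Alerts.CC0127390193").json()
elapsed = time.perf_counter() - start

# Rough size estimate of the returned _source, just to confirm the payload scale.
approx_mb = len(str(doc.get("_source", ""))) / 1e6
print(f"fetched ~{approx_mb:.0f}MB of _source in {elapsed:.1f}s")
```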

Deep Dive into Shard Configuration:

The team discovered that the client’s Elasticsearch index had been configured with 75 primary shards for a 350GB index, putting the average primary shard below 5GB, well under the commonly recommended 10-50GB range. The expert team questioned this configuration and recommended reducing the number of shards to optimize performance.
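One route to a lower shard count, assuming the index can temporarily block writes, is the shrink API. The target count must be a factor of the original (75 allows 25, 15, 5, 3, or 1); 15 primaries would put a 350GB index at roughly 23GB per shard. The index and node names below are illustrative.

```python
import requests

ES = "http://localhost:9200"   # assumed endpoint
SRC = "alerts"                 # illustrative source index with 75 primaries
DST = "alerts-shrunk"          # illustrative target index

# 1. Block writes and relocate a copy of every shard onto a single node,
#    both of which the shrink API requires. In practice, wait for the
#    relocation to finish before shrinking.
requests.put(f"{ES}/{SRC}/_settings", json={
    "settings": {
        "index.blocks.write": True,
        "index.routing.allocation.require._name": "node-1",   # hypothetical node name
    }
})

# 2. Shrink to 15 primary shards and clear the temporary settings on the new index.
requests.post(f"{ES}/{SRC}/_shrink/{DST}", json={
    "settings": {
        "index.number_of_shards": 15,
        "index.blocks.write": None,
        "index.routing.allocation.require._name": None,
    }
})
```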

Examination of Document Size and Highlighting:

The large attachment field, which was around 200MB, was identified as a critical factor in the slow query performance. The team suggested chunking this field into smaller pieces or considering other storage solutions, although they recognized that this would involve substantial changes to the application.
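A sketch of what chunking might look like at the application layer, assuming the attachment is text that can be split at index time and reassembled on read; the index name, field names, and chunk size are all hypothetical.

```python
import requests

ES = "http://localhost:9200"         # assumed endpoint
CHUNK_INDEX = "alert-attachments"    # hypothetical index for attachment chunks
CHUNK_SIZE = 1_000_000               # ~1MB of text per chunk (tunable)

def index_attachment(alert_id: str, attachment: str) -> None:
    """Split a large attachment into fixed-size chunks stored as separate documents."""
    for n, start in enumerate(range(0, len(attachment), CHUNK_SIZE)):
        chunk = {
            "alert_id": alert_id,    # join key back to the parent alert
            "chunk_no": n,
            "content": attachment[start:start + CHUNK_SIZE],
        }
        requests.put(f"{ES}/{CHUNK_INDEX}/_doc/{alert_id}-{n}", json=chunk)

def fetch_attachment(alert_id: str) -> str:
    """Reassemble an attachment by pulling its chunks back in order."""
    resp = requests.post(f"{ES}/{CHUNK_INDEX}/_search", json={
        "query": {"term": {"alert_id": alert_id}},   # assumes alert_id mapped as keyword
        "sort": [{"chunk_no": "asc"}],
        "size": 1000,                                # upper bound on chunks per attachment
    }).json()
    return "".join(hit["_source"]["content"] for hit in resp["hits"]["hits"])
```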

Solution:

After thorough analysis and testing, the expert team provided the following solutions:

  • Heap Size Adjustment: The team recommended reducing the heap size to 30GB per node so that the JVM retains compressed ordinary object pointers (oops), ensuring proper memory management and avoiding the performance degradation caused by an oversized heap.
  • Node Scaling and Resource Allocation: The expert team advised doubling the CPU and RAM resources for the Elasticsearch nodes to handle the large documents more efficiently. However, they acknowledged the client’s concerns about the time required to implement these changes in a production environment.
  • Optimizing Shard Configuration: The team suggested reducing the number of primary shards from 75 to a more optimal number to balance the load and improve query performance.
  • Disabling Highlighting: The team confirmed that disabling highlighting significantly reduced query execution times, and they recommended continuing this practice for queries that did not require highlighting.
  • Document Chunking: For long-term optimization, the team recommended chunking large documents into smaller pieces, although this solution was recognized as potentially complex and requiring significant application modifications.
  • Implementation of Monitoring Tools: The expert team emphasized the importance of implementing monitoring tools such as APM to gain insights into disk usage, network traffic, and other performance metrics, which would help in proactively identifying and resolving issues.

Conclusion:

The client’s Elasticsearch query performance issues were primarily attributed to the large size of certain document fields and the inefficient configuration of the Elasticsearch cluster. By disabling highlighting, adjusting the heap size, optimizing the shard configuration, and considering future infrastructure scaling, the client was able to significantly improve query performance. Moving forward, the implementation of monitoring tools and potential application refactoring were recommended to maintain and further enhance system performance.

The expert team’s involvement provided the client with actionable insights and a clear path to resolving their performance challenges, ensuring that their Elasticsearch environment could better handle large datasets and complex queries in the future.