Problem:
Certain Elasticsearch queries timed out after 30 seconds.
Details:
The customer used Elasticsearch (version 7.17.0 or slightly newer) to query documents created by the Actimize application.
The Elastic index contained approximately 80 million documents, amounting to several terabytes.
Most queries completed within a few seconds, but some consistently ran for 30 seconds or more and then timed out.
The issue was reproducible with specific query inputs, such as searching for a particular person’s name.
Even with Elastic profiling enabled, the root cause of the delay was unclear. Profiling indicated that the “HighlightPhase” was consuming most of the time.
Due to data sensitivity, only redacted queries and results were available for analysis.
Most documents in the index were small (a few hundred characters). The exception was a field named “attachment,” which could hold up to 10 MB of text extracted from files such as PDF, Word, and Excel documents.
Disabling highlighting for the “attachment” field did not resolve the slowness.
Process:
Initial Data Gathering:
Requested outputs for _cat/nodes?v, _cat/indices?v, and index mappings for the relevant indices.
Gathered details about the size of the returned documents for problematic queries.
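The data-gathering requests above could be scripted along these lines. This is a minimal sketch, not the procedure actually used: the base URL, the index name, and the helper names diagnostic_urls and fetch_text are assumptions, and a secured cluster would also need authentication.

```python
import urllib.request


def diagnostic_urls(base_url, index):
    """Build the three diagnostic URLs named in the data-gathering step."""
    base = base_url.rstrip("/")
    return {
        "nodes": f"{base}/_cat/nodes?v",
        "indices": f"{base}/_cat/indices?v",
        "mapping": f"{base}/{index}/_mapping",
    }


def fetch_text(url, timeout=30):
    """Fetch one URL and return the body as text (no authentication assumed)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Example usage (hypothetical host and index name):
#   for name, url in diagnostic_urls("http://localhost:9200", "my-index").items():
#       print(name, fetch_text(url))
```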
Expert Analysis:
Determined that large document sizes could be the cause of the slow queries, due to the time required to load, decompress, and transmit large JSON responses.
Recommended keeping document sizes below 50 KB in Elasticsearch/OpenSearch and storing larger attachments separately (e.g., in S3 or HDFS) while keeping a reference pointer in the index.
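One way to picture the "reference pointer" recommendation is the sketch below. It is illustrative only: blob_store is a plain dict standing in for S3 or HDFS, the helper name split_document and the "attachment_ref" and "attachment_bytes" field names are hypothetical, and the 50 KB threshold is applied to the single field rather than to the whole document for simplicity.

```python
import hashlib

# Assumption: 50 KB taken from the recommendation, counted here as 50 * 1024 bytes.
MAX_INLINE_BYTES = 50 * 1024


def split_document(doc, blob_store, field="attachment"):
    """Return a copy of doc in which an oversized field is replaced by a reference.

    blob_store is any dict-like object; in practice it would be S3, HDFS, or similar.
    """
    text = doc.get(field)
    if text is None:
        return dict(doc)
    data = text.encode("utf-8")
    if len(data) <= MAX_INLINE_BYTES:
        return dict(doc)  # small enough to stay in the index
    key = hashlib.sha256(data).hexdigest()  # content hash used as the storage key
    blob_store[key] = data
    slim = {k: v for k, v in doc.items() if k != field}
    slim[field + "_ref"] = key          # pointer to the external copy
    slim[field + "_bytes"] = len(data)  # original size, kept for reference
    return slim
```

Text that is moved out of the index this way is no longer searchable through that field, so this trade-off would need to be weighed against the search requirements.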
Further Client Feedback:
The total size of the JSON response for the slow query was 1.8 MB.
A large portion of the response size was attributed to profiling information.
The 19 hits returned by the query accounted for 983,664 bytes (roughly 51,800 bytes per hit on average, close to the recommended 50 KB limit), with individual document sizes ranging from 5 KB to 220 KB.
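A size breakdown like the one reported by the client can be reproduced from a saved search response. The sketch below is an assumption about how such a measurement might be done, not the client's actual method; it measures compact re-serialized JSON, so the numbers will differ slightly from the bytes on the wire. The helper names hit_sizes and summarize are hypothetical.

```python
import json


def _json_bytes(value):
    """Size in bytes of value when serialized as compact UTF-8 JSON."""
    text = json.dumps(value, ensure_ascii=False, separators=(",", ":"))
    return len(text.encode("utf-8"))


def hit_sizes(response):
    """Return (doc_id, size_in_bytes) for each hit, largest first."""
    hits = response.get("hits", {}).get("hits", [])
    sizes = [(hit.get("_id"), _json_bytes(hit)) for hit in hits]
    return sorted(sizes, key=lambda item: item[1], reverse=True)


def summarize(response):
    """Split the response size into hits, profiling data, and the total."""
    sizes = [size for _, size in hit_sizes(response)]
    profile = response.get("profile")  # present only when profiling was enabled
    return {
        "hits": len(sizes),
        "hits_bytes": sum(sizes),
        "profile_bytes": _json_bytes(profile) if profile is not None else 0,
        "response_bytes": _json_bytes(response),
    }


# Example usage with a response saved to disk (hypothetical file name):
#   with open("slow_query_response.json", encoding="utf-8") as f:
#       print(summarize(json.load(f)))
```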
Consolidated Expert Advice:
Confirmed that the issue was likely due to the size of the responses.
Suggested verifying the response size by running the query from the command line and checking the Content-Length response header.
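The same check can be done from a short script instead of a shell. The following is a sketch under assumptions: the host, index name, and query body are placeholders, the helper names are hypothetical, and no authentication is configured. Because a Content-Length header may be absent (for example with chunked responses), the body length is measured as well.

```python
import json
import time
import urllib.request


def build_search_request(base_url, index, query_body):
    """Prepare a POST <index>/_search request carrying a JSON query body."""
    base = base_url.rstrip("/")
    return urllib.request.Request(
        f"{base}/{index}/_search",
        data=json.dumps(query_body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def response_size(request, timeout=60):
    """Send the request and report the header value, body size, and elapsed time."""
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = resp.read()
        header = resp.headers.get("Content-Length")  # may be None
    return {
        "content_length_header": header,
        "body_bytes": len(body),
        "seconds": time.perf_counter() - start,
    }


# Example usage (hypothetical host, index, and query):
#   req = build_search_request("http://localhost:9200", "my-index",
#                              {"query": {"match_all": {}}})
#   print(response_size(req))
```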
Solution:
The following recommendations were provided:
Document Size Management: Keep documents below 50 KB and store larger attachments externally.
Query Optimization: Exclude large fields such as “attachment” from the returned source and from highlighting.
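The query-optimization recommendation could translate into a search body like the one built below. This is a sketch, not the customer's query: the "name" field and the match query are hypothetical, and build_search_body is an illustrative helper. It relies on the "_source" "excludes" option and on listing highlight fields explicitly, both of which are part of the Elasticsearch search request body.

```python
def build_search_body(name, highlight_fields=("name",),
                      exclude_fields=("attachment",), size=20):
    """Build a search body that neither returns nor highlights the excluded fields."""
    excluded = set(exclude_fields)
    return {
        "size": size,
        # Hypothetical query: the real field names and query type are not known.
        "query": {"match": {"name": name}},
        # Drop large fields from the returned source.
        "_source": {"excludes": list(exclude_fields)},
        # Highlight only explicitly listed fields, never an excluded one.
        "highlight": {
            "fields": {f: {} for f in highlight_fields if f not in excluded}
        },
    }
```

Source filtering trims what is transmitted to the client; the stored source is still loaded on the data node, so this complements the document-size recommendation rather than replacing it.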
Conclusion:
The query timeouts were attributed primarily to the large size of the documents returned by Elasticsearch. The expert team recommended keeping documents small and excluding large fields from query responses and highlighting. Implementing these recommendations was considered necessary to resolve the timeouts and improve the overall performance of Elasticsearch queries.