Problem:
The client reported a request timeout error when querying the PLDT Cassandra database in the production environment. The failing query selects records from the jesi.service_monitoring table; the client attached the query together with a screenshot for context.
Process:
The support team began the investigation by asking for the size of the jesi.service_monitoring table and the overall health of the five-node Cassandra cluster. They also requested the Cassandra version and the configuration files.
The client reported that the table consists of 5 SSTables occupying approximately 2.15 GB, and confirmed that the cluster runs Cassandra 3.11.8. Based on this information, the team formulated recommendations around the likely causes: compaction issues, node or network problems, and resource constraints.
Solution:
- Increase read_request_timeout_in_ms to 1 minute (60000 ms) and range_request_timeout_in_ms to 2 minutes (120000 ms) in cassandra.yaml. This allows longer-running queries to complete instead of timing out.
- Monitor the cluster while the query runs to see how the system behaves under load and to identify bottlenecks.
- Review and optimize Cassandra queries for better performance, possibly involving query rewrites or improved data access patterns.
- Add secondary indexes on non-primary-key columns that are used in filters, to speed up the queries that filter on those columns.
- Test query performance with a 30-second timeout in a non-production environment to ensure the changes are effective without risking production stability.
- If the 1-minute timeout is still insufficient, consider raising it to 5 minutes and, as a last resort, to 10 minutes, but only after careful monitoring and testing.
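
The timeout values above are given in minutes, while the Cassandra settings take milliseconds. The following is a minimal sketch of that conversion and of the escalation order described in the list. It only prints candidate cassandra.yaml lines and does not connect to or modify any cluster. The helper names (to_millis, render_yaml_lines, next_step) are hypothetical and used for illustration only; the setting names and durations come from the recommendations above.

```python
# Recommended timeouts from the list above, in seconds.
RECOMMENDED_TIMEOUTS_S = {
    "read_request_timeout_in_ms": 60,    # 1 minute
    "range_request_timeout_in_ms": 120,  # 2 minutes
}

# Escalation order for the timeout if 1 minute is not enough:
# 1 minute, then 5 minutes, then 10 minutes as a last resort.
ESCALATION_STEPS_S = [60, 300, 600]


def to_millis(seconds):
    """Convert whole seconds to milliseconds."""
    return seconds * 1000


def render_yaml_lines(timeouts_s):
    """Render each setting as a 'key: value' line with the value in ms."""
    return [f"{name}: {to_millis(secs)}" for name, secs in timeouts_s.items()]


def next_step(current_s):
    """Return the next larger timeout in the escalation order, or None."""
    for step in ESCALATION_STEPS_S:
        if step > current_s:
            return step
    return None


if __name__ == "__main__":
    for line in render_yaml_lines(RECOMMENDED_TIMEOUTS_S):
        print(line)
    # Prints:
    # read_request_timeout_in_ms: 60000
    # range_request_timeout_in_ms: 120000
```

The printed lines are meant to be reviewed and applied by hand, first in the non-production environment, so that each escalation step is tested before it reaches production.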
Conclusion:
The proposed solution addresses both the immediate timeout and the underlying performance issues in the Cassandra cluster. Increasing the timeout settings is a short-term fix, while the recommended query optimization, indexing, and monitoring provide long-term stability and performance improvements. Keeping the database well configured and balanced will reduce the likelihood of future timeouts and give the client a more reliable environment.