Problem:
The client encountered a Cassandra exception: “WriteTimeoutException: Cassandra timeout during SIMPLE write query at consistency QUORUM.” The exception first appeared on March 9, 2023, and was raised by an INSERT INTO query. Because it had never occurred before that date, the client requested troubleshooting assistance.
Process:
Step 1 – Initial Investigation and Troubleshooting:
The expert team began the investigation by requesting technical details about the Cassandra cluster and the exception report. The client provided the following information:
Cloud Instances:
Compactions:
- prod-0: 0.93, 1.26, 1.36
- prod-1: 0.82, 0.91, 0.91
- prod-2: 1.43, 1.02, 1.17
- prod-dr-0: 0.02, 0.04, 0.05
- prod-dr-1: 0.00, 0.02, 0.05
- prod-dr-2: 0.21, 0.10, 0.07
Metric:
Cluster Nodes:
Keyspace:
Tables:
Replication:
Solution:
After investigating the issue, the expert team concluded that a replication factor of 3 was excessive for the 3-node cluster. They recommended changing the keyspace replication settings to keep 2 replicas in each location.
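The recommended change can be sketched as follows. This is a minimal illustration, not the statement the team actually ran: it assumes the keyspace uses NetworkTopologyStrategy with one datacenter per location, and the datacenter names dc_prod and dc_dr are hypothetical placeholders. The quorum helper applies the standard Cassandra formula, floor(total replicas / 2) + 1.

```python
def quorum(total_replicas: int) -> int:
    """Replicas that must acknowledge a QUORUM request: floor(n / 2) + 1."""
    return total_replicas // 2 + 1


def alter_keyspace_cql(keyspace: str, replicas_per_dc: dict[str, int]) -> str:
    """Build an ALTER KEYSPACE statement that uses NetworkTopologyStrategy."""
    options = ", ".join(f"'{dc}': {rf}" for dc, rf in replicas_per_dc.items())
    return (
        f"ALTER KEYSPACE {keyspace} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', {options}}};"
    )


# Hypothetical datacenter names (assumption); the real names are not in the report.
new_settings = {"dc_prod": 2, "dc_dr": 2}

print(alter_keyspace_cql("prod1_amil", new_settings))
print("QUORUM acknowledgements needed:", quorum(sum(new_settings.values())))
```

Under these assumptions, a QUORUM write against 2 replicas in each of two locations needs acknowledgements from 3 of the 4 replicas.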
The issue pertained solely to write timeouts. The investigation also found that some queries read data from the service_monitoring and service_monitoring_payload tables in the prod1_amil keyspace without using primary keys or indexed columns. The SMTool loaded this data only when users performed searches in the UI, and the problem was triggered while data was being inserted into the service_monitoring tables.
As a preventive measure, the application team was instructed to use the tool with filter criteria based on purchase order (PO) and date range.
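A filtered search of this kind could look like the sketch below. The report does not give the table schema, so the column names po_number and created_at are hypothetical, as is the helper build_search_query. The filter is only efficient if the PO column is part of the partition key and the date column is a clustering column, which is an assumption here. The %s placeholders follow the positional style accepted by the DataStax Python driver for non-prepared statements.

```python
from datetime import datetime


def build_search_query(po_number: str, start: datetime, end: datetime):
    """Return a parameterized CQL query restricted by PO and date range.

    Column names are hypothetical (assumption); adapt them to the real schema.
    """
    if start >= end:
        raise ValueError("start must be earlier than end")
    cql = (
        "SELECT * FROM prod1_amil.service_monitoring "
        "WHERE po_number = %s AND created_at >= %s AND created_at < %s"
    )
    return cql, (po_number, start, end)


# Example: one purchase order over a one-week window.
cql, params = build_search_query(
    "PO-1001", datetime(2023, 3, 1), datetime(2023, 3, 8)
)
print(cql)
print(params)
```

Rejecting an empty or inverted date range up front keeps an unbounded search from reaching the cluster.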
Conclusion:
The experts attributed the WriteTimeoutException to replication settings that were excessive for a 3-node cluster and recommended changing the keyspace to keep 2 replicas per location. The issue was also linked to inefficient queries that did not use primary keys or indexed columns on the service_monitoring and service_monitoring_payload tables in the prod1_amil keyspace. To prevent future incidents, the application team was advised to use the SMTool with filter criteria based on purchase order (PO) and date range.