Problem: The client faced a critical issue with their Grafana setup: alerts were failing to trigger when configured thresholds were breached, and the “TEST ALERT” feature consistently returned a “NO DATA” message. Process: Step 1: Initial Investigation. To address this issue, a multi-step approach was taken. In the initial investigation, a meeting was […]
Case Studies 15 Nov 2023 Resolving Memory Consumption Issues in PostgreSQL Cluster
Problem: The client reported significantly elevated memory consumption on both the leader (Node 1) and replica (Node 2) nodes of a PostgreSQL 13 cluster. On Node 1, high memory usage was associated with PostgreSQL processes such as checkpoint and background writer operations, while Node 2 was undergoing […]
Database 13 Nov 2023 Resolving Cassandra CorruptSSTableException Issue
Problem: The client reported that Cassandra 2.2.5, running in a single-node configuration, crashed and failed to start. The error logs pointed to a “CorruptSSTableException” in the “system_traces.events” table, indicating corruption in an SSTable file. Since this was a critical system table, the client needed a way to bypass the corrupted data and restart the Cassandra […]
Database 11 Nov 2023 Resolution of Cassandra Cluster Pending Tasks Issue
Problem: The client reported a critical increase in pending tasks on one of the nodes in their Cassandra cluster and sought assistance in understanding the root cause and implementing a resolution. Process: The client initially executed the nodetool compactionstats -H command on the affected node and restarted it, […]
Database 8 Nov 2023 Stabilizing Applications Amidst DNS Challenges: Tackling Intermittent Connection Issues in Kubernetes
Problem: The customer is facing intermittent connection issues in their application, resulting in “connection refused” errors in the logs. These errors are linked to DNS resolution failures and connection timeouts. As a temporary fix, the customer is restarting the CoreDNS pod every hour, […]
Developer Tools 6 Nov 2023 Resolving Cassandra Node Crashes Due to Heavy Server Activity and Interface Instability
Problem: The client reported recurring crashes of a Cassandra node with “too many open files” errors. Despite increasing the maximum open files limit, the issue persisted. The problem was observed primarily during high server load, with regular crashes around 01:15 AM. The client suspected that network instability or heavy operations, such as running […]
Database 1 Jan 2023 Pgpool switchover due to idle session switching off
Problem: The client runs a master and standby configuration, with connections managed by pgpool (version 4.1.5), in a PostgreSQL (version 11.5) production environment. Recently the client experienced a switchover/failover when an “idle_in_transaction” session was killed using “pg_terminate_backend” after connecting via pgpoolvip. Process: Step 1 – Initial investigation: The expert team gathered the information and started […]
Database