Problem: A telecommunications customer operating a production Kubernetes cluster deployed via Kubespray encountered an infrastructure challenge. One of their original control-plane nodes (kz-bss-k8om01) had previously failed and was replaced with a new node named kz-bss-k8om04. Later, the client asked to rename this node back to its original FQDN (kz-bss-k8om01) and, ideally, retain the original IP […]
Developer Tools 26 May 2025 Resolving Cassandra Backup Failures Due to Priam Incompatibility
Problem: The client encountered a failure while attempting to run a Cassandra backup using Commvault on their QAT cluster. The backup process failed with a 500 HTTP error originating from the local Priam REST endpoint: HTTP ERROR 500 Problem accessing /REST/v1/cassadmin/info. Reason: Commvault support traced the issue to the Priam service and advised the client […]
Database 23 May 2025 Optimizing Nodetool Cleanup Performance in a Large-Scale Apache Cassandra 4.1.5 Cluster During Node Addition
Problem: The client faced performance challenges while running nodetool cleanup on an Apache Cassandra 4.1.5 cluster during a node addition activity in a production environment. Specifically, the cleanup process was taking an unexpectedly long time on nodes with over 600GiB of data load, raising concerns about the overall timeline and impact on production workflows. The […]
Database 23 May 2025 Optimizing Apache Cassandra Repair: Reducing CPU Utilization from 90% to 30%
Problem: The client reported high CPU utilization (up to 90%) across all nodes in their 3-node Apache Cassandra 4.1.3 cluster during full or incremental repair operations initiated from any single node. Despite relatively low data volumes (~25 GB per node), the CPU spike raised concerns about system performance, stability, and potential downtime during repairs. Process: […]
Database 19 May 2025 Patch for Jenkins Active Choices Plugin: Fixing Multi-Level Parameter Rendering
Problem: After upgrading the Jenkins Active Choices Plugin from version 2.6.1 to 2.8.1, the client’s Jenkins instance began exhibiting critical malfunctions in jobs that utilized multi-level reactive reference parameters. These parameters, implemented via Active Choices Reactive Reference Parameter fields, rely on Groovy scripts to dynamically populate choices based on the values of one or more […]
Developer Tools 16 May 2025 Cassandra Timeouts Traced to Host Oversubscription
Problem: The client reported a sudden and significant drop in Apache Cassandra performance on a 4-node cluster. The issue appeared without any recent configuration or infrastructure changes. The application started experiencing frequent timeouts, and restarting Cassandra services on all nodes did not resolve the problem. The client provided various monitoring graphs, system logs, and other […]
Database 14 May 2025 Seamless Cassandra Cluster Scaling Without Downtime
Problem: The client needed to scale their production Cassandra cluster from 6 nodes to 12 nodes (3 to 6 nodes per data center) without any downtime. Their existing setup includes Cassandra version 4.1.6, with two data centers (PROD and DR), each containing 3 nodes, forming a 6-node cluster with a replication factor of 3 and […]
Database 9 May 2025 Apache Cassandra: Migration Connectivity Failure During Production Deployment
Problem: The client encountered a critical issue while starting one of their production pods during a Cassandra migration. Although the PostgreSQL migration completed successfully, the Cassandra migration failed with a com.datastax.oss.driver.api.core.AllNodesFailedException, indicating that the driver could not connect to any Cassandra nodes. This blocked the production deployment. Process: Step 1 – Initial Analysis The logs […]
Database 7 May 2025 Diagnosing and Resolving ETCD Cluster Sync Issues
Problem: The client encountered an issue with desynchronization of nodes in an ETCD cluster running on RHEL 8.8. Error logs indicated significant disk write delays (slow fdatasync), which caused the Patroni cluster to fail and become unavailable. Process: Step 1 – Initial Analysis The expert asked the client to check the health of the cluster […]
Developer Tools