Problem: The client reported a request timeout error when querying the PLDT Cassandra database in a production environment. The query, which selected records from the jesi.service_monitoring table, was attached along with a screenshot for further context. Process: Upon receiving the issue, the support team initiated an investigation. They first inquired about the […]
Database 29 Nov 2024 Rolling Upgrade of ETCD and Patroni Nodes in a Multi-Node PostgreSQL Cluster
Problem: The client wanted to perform a rolling upgrade of the underlying operating system from RHEL 7 to RHEL 9 for their ETCD nodes in a Patroni-managed PostgreSQL cluster. The cluster contained three ETCD nodes and three Patroni-managed PostgreSQL instances (one primary and two standby). With a Recovery Point Objective (RPO) and Recovery Time Objective […]
Database 27 Nov 2024 Mitigating Frequent Docker Swarm Re-elections: Adjusting Election Timeout for Improved Stability
Problem: The customer is facing frequent Docker Swarm re-elections, triggered even by brief server issues lasting just a few seconds. They are seeking guidance on how to modify the Swarm election timeout and whether adjusting this value will have any impact on the system. Process: Step 1: Initial Investigation The customer reported frequent leader re-elections […]
Developer Tools 25 Nov 2024 Resolving PostgreSQL Filesystem Bloat and Replication Slot Stuck Issue
Problem: The client encountered a significant issue with their PostgreSQL database (PGDB). They reported that the filesystem (FS) utilization suddenly increased from 74% to 94% without any new objects being created. Despite their efforts to recreate the replication slot and restart PGPool, the filesystem remained at 94%. Logs revealed a termination error related to another […]
Database 22 Nov 2024 Managing Out-of-Memory (OOM) Errors and Optimizing Shard Configuration in OpenSearch Production Environment
Problem: In the production environment of a multi-node OpenSearch cluster, the nodes frequently crashed due to Out-of-Memory (OOM) errors. Initially, the heap size was increased from 16 GB to 30 GB based on IBM’s recommendations, but the problem persisted. IBM further suggested increasing the number of shards from 16 to 64 to mitigate memory overload. […]
Data Analytics 20 Nov 2024 PostgreSQL: Replication Failure in Patroni Cluster
Problem: The client reported a replication issue in their A1 BG Production environment, consisting of a Patroni cluster with two PostgreSQL instances (Leader and Replica). Replication stopped, causing the leader’s /pgcluster file system to fill up with pg_wal files, leading to a full disk. The client requested help to identify the root cause of the […]
Database 18 Nov 2024 Apache Cassandra: Addressing High CPU Utilization After Upgrade
Problem: Following an upgrade from Cassandra 4.0.9 to 4.1.3, the client reported a noticeable increase in CPU utilization. The average CPU usage on their systems jumped from around 20% to approximately 37%. This escalation in CPU usage adversely impacted system performance and stability. The issue was notably more severe on servers running Red Hat Enterprise […]
Database 15 Nov 2024 Resolving HBase Region Transition and Hadoop File System Permission Issues in a PROD Environment
Problem: The client encountered a critical issue in their production environment involving HBase regions stuck in a transition state. This problem resulted in service disruptions within their Hadoop cluster. The issue was exacerbated by file system permission changes following a cold restart of the cluster, leading to difficulties in accessing data and managing HBase operations. […]
Data Analytics 13 Nov 2024 Resolving PostgreSQL Failover and Transaction File Access Issue
Problem: After performing a manual failover in PostgreSQL, the client encountered the following error when running a query on a partitioned table ‘ac1_control’: ERROR: could not access the status of transaction 613182547; DETAIL: Could not open file ‘pg_xact/0248’: No such file or directory. Despite restarting the PostgreSQL instance, the issue persisted. The client was operating […]
Database