Problem The client encountered an issue with desynchronization of nodes in an ETCD cluster running on RHEL 8.8. Error logs indicated significant disk write delays (slow fdatasync), which caused the Patroni cluster to fail and become unavailable.
Process Step 1 – Initial Analysis The expert asked the client to check the health of the cluster […]
Developer Tools 21 Apr 2025 Stabilizing Docker Swarm Elections: Overcoming Raft Configuration Limitations in Version 1.13.1
Problem The client encountered frequent master (manager) re-elections in their production Docker Swarm cluster, despite having the dispatcher-heartbeat value set to 2 minutes. These re-elections were happening within fractions of a second, causing concerns around Swarm stability and service availability. The client’s Docker environment was based on version 1.13.1 running on RHEL 7.9. Key symptoms […]
Case Studies DevOps Developer Tools 26 Mar 2025 Seamless Jenkins-Keycloak Integration: Overcoming API Authentication Challenges
Problem The client faced an issue integrating Jenkins with Keycloak for authentication. While the Jenkins UI successfully authenticated users via Keycloak, API calls from backend services were failing. According to Jenkins’ documentation, API requests should be authenticated using an API token, but despite following the recommended steps, the client encountered authentication failures (403 Forbidden & […]
Developer Tools 19 Mar 2025 Docker Swarm Configuration and Container Recovery Issues
Problem: The client experienced issues with Docker Swarm configuration in production. Specifically, when a container restarted, the application failed to recover properly. The client requested a review of the configuration to identify the root cause and potential improvements to enhance the cluster’s functionality.
Process: Step 1: Initial Investigation The client provided details of the Docker […]
Developer Tools 21 Feb 2025 Resolving Nexus Image Deletion Issue
Problem: The client experienced a problem where one of the images in their Nexus Repository was deleted unexpectedly without any trace. The client needed assistance in answering the following questions: How was the image deleted and is it possible to recover it? How can future abrupt deletions of images be prevented? How can Nexus logging […]
Developer Tools 15 Jan 2025 Resolving Kubelet Certificate Expiration Issues in Rancher Clusters
Problem: The client’s production environment includes Rancher installed on two clusters: a Rancher cluster and an application cluster. During the cluster setup, the kubelet certificate was generated with a validity of one year, which recently expired. According to the Rancher RKE documentation, additional configuration is needed to manage certificate validity. The client observed inconsistencies: Some […]
Developer Tools 10 Jan 2025 Mitigating Frequent Docker Swarm Re-Elections by Adjusting Timeout Parameters
Problem: The client faced issues with frequent re-elections in a Docker Swarm cluster whenever there were brief server-level disruptions. They sought guidance on modifying the swarm election timeout to stabilize the cluster and prevent unnecessary re-elections. Additionally, they wanted to understand the relationship between election timeout, heartbeat, and dispatcher-heartbeat settings.
Process: Step 1: Initial Investigation […]
Developer Tools 4 Dec 2024 Kubernetes Upgrade and Node Restoration for Customer’s Onsite Environment
Problem: The client reported two main issues: One of the Kubernetes master nodes was in a “not ready” state. They needed to upgrade their Kubernetes version from 1.26 to 1.29. The client requested support to address these concerns. The client had already shut down the master node and was awaiting further instructions for troubleshooting.
Process: […]
Developer Tools 27 Nov 2024 Mitigating Frequent Docker Swarm Re-elections: Adjusting Election Timeout for Improved Stability
Problem: The customer is facing frequent Docker Swarm re-elections, triggered even by brief server issues lasting just a few seconds. They are seeking guidance on how to modify the Swarm election timeout and whether adjusting this value will have any impact on the system.
Process: Step 1: Initial Investigation The customer reported frequent leader re-elections […]