Problem: A telecommunications customer operating a production Kubernetes cluster deployed via Kubespray encountered an infrastructure challenge. One of their original control-plane nodes (kz-bss-k8om01) had previously failed and was replaced with a new node named kz-bss-k8om04. Later, the client asked to rename this node back to its original FQDN (kz-bss-k8om01) and, ideally, retain the original IP […]
Developer Tools 19 May 2025 Patch for Jenkins Active Choices Plugin: Fixing Multi-Level Parameter Rendering
Problem: After upgrading the Jenkins Active Choices Plugin from version 2.6.1 to 2.8.1, the client’s Jenkins instance began exhibiting critical malfunctions in jobs that utilized multi-level reactive reference parameters. These parameters, implemented via Active Choices Reactive Reference Parameter fields, rely on Groovy scripts to dynamically populate choices based on the values of one or more […]
Developer Tools 7 May 2025 Diagnosing and Resolving ETCD Cluster Sync Issues
Problem: The client encountered an issue with desynchronization of nodes in an ETCD cluster running on RHEL 8.8. Error logs indicated significant disk write delays (slow fdatasync), which caused the Patroni cluster to fail and become unavailable. Process: Step 1 – Initial Analysis The expert asked the client to check the health of the cluster […]
Developer Tools 21 Apr 2025 Stabilizing Docker Swarm Elections: Overcoming Raft Configuration Limitations in Version 1.13.1
Problem: The client encountered frequent master (manager) re-elections in their production Docker Swarm cluster, despite having the dispatcher-heartbeat value set to 2 minutes. These re-elections were happening within fractions of a second, causing concerns around Swarm stability and service availability. The client’s Docker environment was based on version 1.13.1 running on RHEL 7.9. Key symptoms […]
Case Studies DevOps Developer Tools 26 Mar 2025 Seamless Jenkins-Keycloak Integration: Overcoming API Authentication Challenges
Problem: The client faced an issue integrating Jenkins with Keycloak for authentication. While the Jenkins UI successfully authenticated users via Keycloak, API calls from backend services were failing. According to Jenkins’ documentation, API requests should be authenticated using an API token, but despite following the recommended steps, the client encountered authentication failures (403 Forbidden & […]
Developer Tools 19 Mar 2025 Docker Swarm Configuration and Container Recovery Issues
Problem: The client experienced issues with Docker Swarm configuration in production. Specifically, when a container restarted, the application failed to recover properly. The client requested a review of the configuration to identify the root cause and potential improvements to enhance the cluster’s functionality. Process: Step 1: Initial Investigation The client provided details of the Docker […]
Developer Tools 21 Feb 2025 Resolving Nexus Image Deletion Issue
Problem: The client experienced a problem where one of the images in their Nexus Repository was deleted unexpectedly without any trace. The client needed assistance in answering the following questions: How was the image deleted and is it possible to recover it? How can future abrupt deletions of images be prevented? How can Nexus logging […]
Developer Tools 15 Jan 2025 Resolving Kubelet Certificate Expiration Issues in Rancher Clusters
Problem: The client’s production environment includes Rancher installed on two clusters: a Rancher cluster and an application cluster. During the cluster setup, the kubelet certificate was generated with a validity of one year, which recently expired. According to the Rancher RKE documentation, additional configuration is needed to manage certificate validity. The client observed inconsistencies: Some […]
Developer Tools 10 Jan 2025 Mitigating Frequent Docker Swarm Re-Elections by Adjusting Timeout Parameters
Problem: The client faced issues with frequent re-elections in a Docker Swarm cluster whenever there were brief server-level disruptions. They sought guidance on modifying the swarm election timeout to stabilize the cluster and prevent unnecessary re-elections. Additionally, they wanted to understand the relationship between election timeout, heartbeat, and dispatcher-heartbeat settings. Process: Step 1: Initial Investigation […]
Developer Tools