Problem:
The kube-api pod was restarting frequently with the error message: apiserver received an error that is not a metav1.Status: rpctypes.EtcdError{code:0xe, desc:"etcdserver: request timed out"}. The issue began after an etcd data migration from a single-node Kubernetes cluster to a multi-node cluster using an unconventional method. Despite multiple attempts to resolve it, the problem persisted and severely impacted the customer's automations: the restarts disrupted automation processes and caused crashes and delays, particularly during nighttime hours.
Process:
1) Issue Occurrence:
The issue was first noted when customer automation processes failed consistently every night due to the kube-api pod restarts.
2) Impact During Nighttime:
The nightly crashes necessitated prompt investigation, which was complicated by the timing.
3) Initial Suspicions and Mitigations:
- Etcd Snapshot Issue: It was suspected that the etcd snapshot process was causing the cluster to hang. To mitigate this, the liveness probe thresholds were increased and the snapshot count was reduced (see the sketch after this list), but these changes did not resolve the issue.
- Resource Crunch: There was a suspicion that a resource crunch on the control plane node was causing the problem. The resources requested for the node were increased, but the issue persisted.
- Script Processing: Analysis revealed scripts processing large numbers of files, potentially degrading performance.
- Scheduled Task: It was suspected that a scheduled task was causing issues at a specific time each day, but investigations did not yield conclusive evidence.
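As an illustration of the snapshot and liveness mitigation, the relevant settings live in the etcd static-pod manifest on a kubeadm-managed control plane; the path and values below are assumptions, not the customer's actual configuration:

    # Check the current snapshot threshold (path assumes a kubeadm layout)
    grep -- '--snapshot-count' /etc/kubernetes/manifests/etcd.yaml
    # Lowering it makes etcd write snapshots more often and keep fewer WAL entries in memory, e.g.:
    #   - --snapshot-count=5000
    # Relaxing the liveness probe in the same manifest keeps a briefly slow etcd from being killed, e.g.:
    #   failureThreshold: 8
    #   timeoutSeconds: 15
    # The kubelet restarts the etcd static pod automatically once the manifest is saved.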
4) Metrics Monitoring:
- Proposed the installation of an exporter on the master node to monitor system metrics and identify issues before they caused crashes.
- Set up I/O monitoring tools such as IOPM to obtain statistics and a comprehensive view of system performance.
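A minimal sketch of the kind of I/O and metrics visibility described above; the exporter port and tool availability are assumptions:

    # Device-level I/O statistics every 5 seconds (await and %util are the key columns):
    iostat -x 5
    # Per-process I/O, useful for spotting which scripts generate the load:
    sudo iotop -obt -d 5
    # If Prometheus node_exporter is installed on the master node, disk and memory metrics
    # (e.g. node_disk_io_time_seconds_total, node_memory_MemAvailable_bytes) are exposed at port 9100:
    curl -s http://localhost:9100/metrics | grep -E 'node_disk_io_time|MemAvailable'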
5) Logs and Data Collection:
- Requested and collected extensive logs and system data from the control plane node, including sar logs, vmstat 1 output, system logs, and system metrics (example collection commands below).
- Analyzed the collected data to determine patterns and possible causes for the crashes.
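The example collection commands referenced above are sketched here; intervals, durations, and file names are illustrative only:

    # Per-second CPU, memory, and swap activity, written to a file for later correlation:
    vmstat 1 > vmstat.log &
    # CPU, memory, and per-device I/O once per second for ten minutes:
    sar -u -r -d -p 1 600 > sar.log
    # Kubelet and kube-apiserver logs from the control plane node
    # (use docker logs instead of crictl if the node still runs dockershim):
    journalctl -u kubelet --since "2 hours ago" > kubelet.log
    crictl logs "$(crictl ps --name kube-apiserver -q)" > kube-apiserver.log 2>&1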
6) Analysis of Existing Setup:
- Configuration Review: The unconventional etcd migration method and a recent flannel version upgrade were identified as potential contributing factors to the kube-apiserver pod restart issue.
- System Update: Found that the node was running Kubernetes version 1.19, which had been unsupported for two years, potentially leading to compatibility and performance issues.
7) Client Constraints:
Given the client's constraint that live services could not be disrupted, a replica environment was suggested for further testing, allowing safe exploration of various troubleshooting options.
8) Monitoring and Investigation:
- Implemented thorough monitoring to check for delays in input/output operations.
- Accessed etcd from the console and made lightweight requests to check for network issues.
- Conducted health checks and analyzed logs for API, etcd, and Flannel.
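For reference, lightweight etcd requests and health checks of this kind can be run from the control plane console; the certificate paths assume a kubeadm layout and the key used in the timed read is only an example:

    export ETCDCTL_API=3
    CERTS="--cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key"
    etcdctl --endpoints=https://127.0.0.1:2379 $CERTS endpoint health
    etcdctl --endpoints=https://127.0.0.1:2379 $CERTS endpoint status -w table
    # A timed, trivial read is a cheap proxy for the "request timed out" symptom:
    time etcdctl --endpoints=https://127.0.0.1:2379 $CERTS get /registry/namespaces/default
    # API server and flannel health from the Kubernetes side:
    kubectl get --raw='/readyz?verbose'
    kubectl -n kube-system logs -l app=flannel --tail=50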
9) Additional Analysis:
- Investigated IPs to determine if they were causing errors.
- Proposed installing monitoring to identify system issues before they cause crashes.
- Analyzed metrics to identify patterns and potential causes of the high I/O and system crashes.
- Collected and analyzed container-level and system logs to identify when the issue started and check for missing logs during the issue period.
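One way to bracket the incident window in container-level and system logs; the time window and pod name below are placeholders:

    START="01:30"; END="03:00"                      # placeholder nightly incident window
    journalctl -u kubelet --since "$START" --until "$END" --no-pager
    journalctl -k --since "$START" --until "$END" --no-pager   # a gap here suggests the node itself stalled
    # Pre-restart logs of the API server container (pod name is node-specific):
    kubectl -n kube-system logs kube-apiserver-<node-name> --previous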
10) Detailed Diagnostic Analysis:
- Memory Usage: Monitoring showed less than 100 MB of available memory, causing system degradation and timeouts; this highlighted the need to track available memory, not just used memory (see the commands after this list).
- CPU Usage: Although monitoring reported a constant CPU usage of around 40%, further analysis revealed that actual utilization was closer to 90%.
- Unsupported Software Version: The current Kubernetes version (1.19) had been unsupported for two years. Recommendations included upgrading Kubernetes, the OS (EOL in 2024), containerd, and migrating workers from dockershim to containerd.
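The memory and CPU observations above can be reproduced with standard tools; the note on I/O wait is a common explanation for this kind of discrepancy rather than a confirmed finding here:

    # "available" (reclaimable) memory is the figure that matters, not "free":
    free -m
    grep -E 'MemAvailable|MemFree' /proc/meminfo
    # Per-CPU breakdown; a high %iowait can make a node far busier than a plain
    # "CPU usage" graph suggests:
    mpstat -P ALL 5 3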
Solution:
The root cause was identified as performance issues related to the initial host’s I/O throughput. The solution involved migrating the VM to a host with better I/O performance.
Resolution Steps:
- Error Identification: The error message during kube-api pod restarts pointed to an etcdserver request timeout.
- Diagnostic Analysis: Diagnostic analysis confirmed that the problem was due to slow I/O throughput on the initial host (see the benchmark sketch after this list).
- Host Migration: Moved the VM to a different host known for its significantly improved I/O performance.
- Issue Resolution: Following the migration, the kube-api pod no longer experienced restarts, resolving the etcdserver request timeout error.
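A sketch of how the I/O finding can be quantified and compared between hosts; the scratch directory is an assumption, and the etcd metrics port applies only where it is exposed (kubeadm exposes it on 127.0.0.1:2381 by default):

    # fio write test modeled on etcd's WAL pattern; 99th-percentile fdatasync latency
    # should stay below roughly 10 ms on a disk suitable for etcd:
    fio --name=etcd-fsync-test --directory=/var/lib/etcd-fio-test --size=22m --bs=2300 \
        --rw=write --ioengine=sync --fdatasync=1
    # etcd also reports WAL fsync latency directly via its metrics endpoint:
    curl -s http://127.0.0.1:2381/metrics | grep wal_fsync_duration_seconds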
Future Recommendations:
- Regular Updates: Keep all components, including Kubernetes, the OS, and container runtimes, up to date to avoid compatibility and performance issues.
- Enhanced Monitoring: Implement comprehensive monitoring to detect and address system issues proactively.
- Resource Allocation: Re-evaluate and adjust resource allocations regularly to meet the demands of the system and prevent resource crunches.
Conclusion:
By relocating the VM to a host with enhanced I/O throughput, the frequent kube-api pod restarts and associated etcdserver request timeout errors were effectively resolved. This action stabilized the kube-api pod, ensuring the reliability and proper functioning of the Kubernetes cluster. The systematic approach of detailed monitoring, log analysis, and strategic migration proved successful in addressing the root cause of the issue.