Title: Stabilizing Applications Amidst DNS Challenges: Tackling Intermittent Connection Issues in Kubernetes

Problem:

The customer is experiencing intermittent connection failures in their application, which surface as “connection refused” errors in the logs and are tied to DNS resolution failures and connection timeouts. As a stop-gap, the customer has been restarting the CoreDNS pod every hour, which resolves the issue only temporarily. The cluster runs on a single node with the Flannel CNI, and no service mesh is in use. The customer is looking for help identifying and resolving the root cause of these intermittent failures.

Process:

Step 1: Initial Investigation

The customer experienced intermittent connection issues in their application flows and found errors in the logs pointing to DNS resolution failures and connection timeouts. They took the following steps:

  • Connectivity Checks: Verified connectivity to the remote hosts using telnet, tnsping, and nslookup from within the pods; these checks succeeded at the time they were run (representative commands are sketched after this list).
  • CoreDNS Restart: After suspecting a potential DNS issue, they restarted the CoreDNS pods. The issue was temporarily resolved for approximately 60 minutes before recurring.
  • Logging and Monitoring: Continuously checked DNS resolution from within the pods and monitored logs, but found no obvious resolution failures at the time of the checks.
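
The exact commands were not captured in the case notes; a minimal sketch of this kind of in-pod check (pod names, the remote hostname, and the port below are placeholders, and the commands assume the pod image ships nslookup and telnet, as the customer's checks imply) looks like:

    # Resolve an in-cluster name and the remote host from inside an application pod
    kubectl exec -it <app-pod> -- nslookup kubernetes.default.svc.cluster.local
    kubectl exec -it <app-pod> -- nslookup <remote-db-host>

    # Confirm the pod points at the cluster DNS service, then test raw TCP reachability
    kubectl exec -it <app-pod> -- cat /etc/resolv.conf
    kubectl exec -it <app-pod> -- telnet <remote-db-host> <port>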

Step 2: Meeting and Technical Discussions

During a subsequent meeting with support experts, the following actions were taken:

  • Cluster and Network Configuration Review: Discussed the cluster configuration, including the single-node setup and the Flannel CNI (configured via 10-flannel.conflist).
  • Service Mesh Verification: Verified that no service mesh (e.g., Istio) was being used.
  • CoreDNS Behavior Analysis: Discussed CoreDNS behavior and the potential causes of the intermittent connectivity problem. The team continued to monitor CoreDNS and restart it as needed to maintain stability (typical commands are sketched after this list).
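
For reference, the kind of restart and configuration review performed during the session can be expressed as follows (a sketch; the file path follows Flannel's defaults and the ConfigMap name follows standard CoreDNS deployments, so both may differ on the customer's node):

    # Restart CoreDNS, the stop-gap the team repeated to keep DNS stable
    kubectl -n kube-system rollout restart deployment coredns

    # Review the Flannel CNI configuration on the node and the CoreDNS Corefile
    cat /etc/cni/net.d/10-flannel.conflist
    kubectl -n kube-system get configmap coredns -o yaml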

Step 3: Further Troubleshooting

Despite initial troubleshooting, the issue persisted. The following additional steps were taken:

  • Detailed Logs Examination: Investigated logs from the API server, Flannel, and the application pods to identify any abnormalities (representative commands follow this list).
  • Exploration of Alternative Solutions: Considered other avenues, such as checking FQDN caches and examining external mount points for potential issues.
  • Function-Specific Debugging: Identified a specific function within the application that was causing errors and suggested testing it in a local environment.
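
A sketch of the log collection and continuous DNS monitoring involved (the label selectors and API server pod name assume a standard kubeadm/Flannel single-node deployment and may differ in the customer's cluster; pod and host names are placeholders):

    # CoreDNS, Flannel, and API server logs from kube-system
    kubectl -n kube-system logs -l k8s-app=kube-dns --tail=200
    kubectl -n kube-system logs -l app=flannel --tail=200
    kubectl -n kube-system logs kube-apiserver-<node-name> --tail=200

    # Watch DNS behavior from an application pod while the errors occur
    kubectl exec -it <app-pod> -- sh -c 'while true; do date; nslookup <remote-db-host>; sleep 10; done'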

Solution:

As a temporary workaround for the intermittent connection issues, the customer set up a cron job that automatically restarts the CoreDNS pod every 30 minutes, which keeps the issue from recurring (a sketch of one possible setup follows below). In addition, a detailed investigation of the problematic function within the application was recommended, with testing in a controlled environment to identify the root cause, and follow-up meetings were planned to discuss progress and explore potential solutions. This approach allows the customer to maintain application stability while continuing to monitor CoreDNS behavior and debug the failing function.
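
The exact mechanism behind the cron job was not recorded; on a single-node cluster, one plausible form is a cron entry on the node itself (a sketch, assuming kubectl and the kubeadm admin kubeconfig are available on the node):

    # /etc/cron.d entry: restart the CoreDNS deployment every 30 minutes as a stop-gap
    */30 * * * * root /usr/bin/kubectl --kubeconfig /etc/kubernetes/admin.conf -n kube-system rollout restart deployment coredns

This keeps the workaround running without manual intervention, but it only masks the underlying problem rather than fixing it.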

Conclusion:

The customer faced intermittent connection issues related to DNS resolution failures and connection timeouts in their single-node cluster running the Flannel CNI. Despite initial troubleshooting, which included verifying connectivity, restarting CoreDNS, and examining logs, the issue persisted. As a temporary measure, the customer implemented a cron job to restart the CoreDNS pod every 30 minutes, providing short-term stability. Moving forward, a more detailed investigation into the application’s problematic function was recommended, with further technical discussions planned to identify and resolve the root cause of the intermittent failures.