Problem:
The client used Istio to manage service communication in a distributed microservices architecture. Centralized services, including GitLab, Keycloak, Vault, and others, were hosted in an Amazon EKS cluster and accessed via a WireGuard-based VPN mesh (Netbird) from 10 external Kubernetes clusters.
Despite all services being exposed through Istio ingress gateways, the external clusters experienced frequent timeouts when accessing them. Connectivity worked only when pods were configured with hostNetwork: true, revealing a potential networking configuration issue.
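For illustration, the symptom can be reproduced by launching two throwaway pods, one on the default pod network and one with hostNetwork: true, and curling a central service. The endpoint below is a hypothetical placeholder for one of the centralized services.

# Hypothetical central-cluster endpoint; substitute the real service URL.
SVC="https://gitlab.internal.example.com"

# Pod on the default pod network: times out in the failure scenario.
kubectl run nettest --rm -it --restart=Never \
  --image=curlimages/curl -- curl -sv --max-time 10 "$SVC"

# Same test with hostNetwork: true: succeeds, confirming the workaround.
kubectl run nettest-host --rm -it --restart=Never \
  --image=curlimages/curl \
  --overrides='{"apiVersion":"v1","spec":{"hostNetwork":true}}' \
  -- curl -sv --max-time 10 "$SVC"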
Process:
Step 1: Initial Investigation
- Network Setup Review: The client shared details about their use of Netbird for VPN connectivity, where nodes from external clusters acted as peers in the mesh network. Traffic from pods was expected to route through the Netbird interface to the central cluster.
- Observation: Pods without hostNetwork: true failed to connect, indicating a problem with routing or SNAT in the default pod networking configuration (see the diagnostic sketch below).
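The following is a hedged sketch of the checks that narrow this down; wt0 is Netbird's default interface name, and the bracketed values are placeholders.

# On an external-cluster node: confirm the Netbird peer is connected
# and that routes toward the central cluster go via the mesh interface.
netbird status
ip route show | grep wt0

# On the node, watch the WireGuard interface while a failing pod retries;
# if nothing appears, pod traffic never reaches the mesh interface.
tcpdump -ni wt0 host <central-service-ip>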
Step 2: Identifying the Root Cause
- EKS CNI Behavior: The default configuration of the Amazon VPC CNI plugin in EKS applied Source Network Address Translation (SNAT) to pod traffic leaving the VPC, rewriting pod source addresses to the node's primary IP and preventing proper routing through the Netbird WireGuard interface.
- Routing Mismatch: Because the Netbird CIDR lies outside the VPC, traffic originating from pods no longer carried a source IP address that the VPN mesh could route, leading to connection failures. The rules involved can be inspected directly on a node, as sketched below.
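To observe this directly, the NAT rules installed by the VPC CNI can be listed on an EKS node. The AWS-SNAT-CHAIN-* chain names match what the plugin creates, but treat the exact rule set as illustrative.

# On an EKS node: list the SNAT rules installed by the VPC CNI.
sudo iptables -t nat -S | grep -i snat

# These chains rewrite pod source IPs to the node's primary ENI address
# for destinations outside the VPC, which covers the Netbird CIDR
# unless it is explicitly excluded.
sudo iptables -t nat -L AWS-SNAT-CHAIN-0 -n -v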
Step 3: Solution Implementation
Update EKS VPC CNI Configuration:
cluster_addons = {
  vpc-cni = {
    configuration_values = jsonencode({
      env = {
        AWS_VPC_K8S_CNI_EXCLUDE_SNAT_CIDRS = "<netbird-cidr-block>"
      }
    })
  }
}
This ensured that pod traffic destined for the Netbird CIDR was excluded from SNAT, preserving the original source IP for proper routing.
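Outside of Terraform, the same setting can be verified (or applied imperatively for quick testing) on the aws-node DaemonSet; the CIDR value is a placeholder, and the Terraform addon configuration remains the source of truth.

# Verify the exclusion landed on the CNI DaemonSet after the addon update.
kubectl -n kube-system describe daemonset aws-node \
  | grep AWS_VPC_K8S_CNI_EXCLUDE_SNAT_CIDRS

# Equivalent imperative change for quick testing; note that the next
# managed-addon update would overwrite it.
kubectl -n kube-system set env daemonset/aws-node \
  AWS_VPC_K8S_CNI_EXCLUDE_SNAT_CIDRS="<netbird-cidr-block>"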
Add iptables Rule for Pod Traffic:
iptables -t nat -I POSTROUTING -s ${pod_network_cidr} -o wt0 -j MASQUERADE
This masqueraded pod traffic leaving through the WireGuard interface (wt0) behind the node's mesh address, so replies could be routed back via the existing routing setup on the Netbird server in the central cluster.
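Because iptables rules do not survive a reboot, a guarded form of the rule (using -C to test for its presence before inserting) is a reasonable pattern for a node bootstrap script; this is a sketch, not the client's exact provisioning.

# Insert the MASQUERADE rule only if it is not already present.
iptables -t nat -C POSTROUTING -s "${pod_network_cidr}" -o wt0 -j MASQUERADE 2>/dev/null \
  || iptables -t nat -I POSTROUTING -s "${pod_network_cidr}" -o wt0 -j MASQUERADE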
Step 4: Testing and Validation
- Connectivity Verification: After the changes, pods without hostNetwork: true successfully connected to the central cluster's services, and traffic routing through the WireGuard interface was confirmed to work (a verification sketch follows below).
- Performance Check: Access times to centralized services improved significantly, and no further timeouts were observed.
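A minimal validation pass, reusing the hypothetical endpoint from the problem description:

# From a pod on the normal pod network (no hostNetwork): expect a 200
# and a low total time instead of a timeout.
kubectl run verify --rm -it --restart=Never \
  --image=curlimages/curl -- \
  curl -s -o /dev/null -w '%{http_code} %{time_total}s\n' \
  --max-time 10 "https://gitlab.internal.example.com"

# On the node, confirm the traffic now traverses the mesh interface.
tcpdump -ni wt0 -c 5 'port 443'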
Solution:
The connectivity issue was caused by the default EKS VPC CNI configuration, which applied SNAT to pod traffic and broke the VPN mesh routing. Excluding the Netbird CIDR block from SNAT and masquerading pod traffic onto the WireGuard interface restored proper routing and connectivity.
Conclusion:
This case study highlights the importance of understanding network routing, especially when using a VPN mesh and custom CNI configurations in Kubernetes clusters. Properly configuring VPC settings and network routing can resolve complex connectivity issues between clusters. Regular monitoring and adjustment of network configurations can help prevent such problems in the future.