Problem:

The client reported that ingress virtual services were slow to become ready and that services were hard to reach by their DNS names. The environment used Istio for service-to-service communication and ran centralized services such as Keycloak, GitLab, Vault, and others, yet setup was taking too long, particularly when resolving the DNS names of these services. The delay initially appeared to stem from the Crossplane configuration, which was slowing down updates of OIDC configurations on GitLab, Vault, ArgoCD, and the EKS clusters.

Process:

Step 1: Initial Investigation

DNS Resolution Check: We began with dig and nslookup tests to determine whether both internal and external domains were resolving correctly against AWS Route 53. This confirmed that the issue was not with the AWS DNS setup but most likely within the Kubernetes cluster.
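
For reference, the checks looked roughly like the following; keycloak.example.com stands in for the actual service hostname:

# Resolve the public hostname against Route 53 from outside the cluster
dig +short keycloak.example.com

# Cross-check with nslookup
nslookup keycloak.example.com

# Repeat the lookup from inside the cluster so it goes through CoreDNS
kubectl run dnsutils --rm -it --restart=Never \
  --image=registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3 -- nslookup keycloak.example.com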

Step 2: Analyzing Crossplane and ArgoCD Execution Times

Crossplane Workspace Monitoring: We analyzed the Crossplane workspace execution times in ArgoCD, specifically the time spent applying the OIDC configurations. These configurations were taking more than 10 minutes on average to complete.
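
A rough way to reproduce this measurement, assuming the OIDC setup is driven by Crossplane's Terraform provider (Workspace resources) and the argocd CLI is available; the resource name oidc-config is a placeholder:

# Readiness and age of the Crossplane Terraform workspaces
kubectl get workspaces.tf.upbound.io -o wide

# Events on a single workspace show how long each reconcile took
kubectl describe workspaces.tf.upbound.io oidc-config

# Sync and health status of the corresponding ArgoCD application
argocd app get oidc-config --refresh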

CoreDNS Investigation: Further analysis showed that CoreDNS was taking too long to resolve the virtual service DNS names, which directly inflated the OIDC configuration times. CoreDNS was attempting to resolve these hostnames through the cluster's internal DNS configuration rather than through the records pointing at the Istio ingress gateway.
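
To make this visible, query logging can be enabled in CoreDNS by adding the log plugin to the Corefile and tailing the pod logs; the label selector below assumes a standard EKS CoreDNS deployment:

# Add the `log` plugin to the Corefile to log every query
kubectl -n kube-system edit configmap coredns

# Tail CoreDNS logs and watch for slow or repeatedly retried lookups
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=200 -f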

Step 3: Identifying the Root Cause

The delay in resolving DNS names was traced to how Kubernetes was managing external DNS and CoreDNS, especially for virtual services in Istio. Crossplane, which was responsible for automating the creation of OIDC configurations, relied on CoreDNS to resolve these virtual service hostnames, but CoreDNS resolution times were excessive, creating a bottleneck.
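A quick way to confirm which records ExternalDNS had (or had not) published was to list the hosted zone directly; the zone ID below is a placeholder:

# List the A records in the hosted zone to see what ExternalDNS has created
aws route53 list-resource-record-sets \
  --hosted-zone-id Z0123456789EXAMPLE \
  --query "ResourceRecordSets[?Type=='A']"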

Step 4: Solution Implementation

Configuration Adjustment for ExternalDNS: The immediate fix was to reconfigure the ExternalDNS controller to source DNS records from Istio Gateways instead of relying on virtual service DNS names. With the public and private DNS domains pointing directly at Istio's ingress gateway, DNS resolution became significantly faster.

Below are the Helm values updated to use the Istio Gateway source for ExternalDNS (domain filters and the TXT owner ID are redacted):

domainFilters:
  - 
  - 
txtOwnerId: 
policy: sync
dryRun: false
interval: 1m
triggerLoopOnEvent: true
txtPrefix: extdns
sources:
  - service
  - ingress
  - istio-gateway
logLevel: debug
provider: aws
...
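
With the istio-gateway source enabled, ExternalDNS derives DNS records from the hosts declared on Istio Gateway resources. A minimal sketch of such a Gateway is shown below; the hostnames, certificate secret, and resource names are placeholders rather than the client's actual configuration:

apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: public-gateway
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway            # route through the default ingress gateway deployment
  servers:
    - port:
        number: 443
        name: https
        protocol: HTTPS
      tls:
        mode: SIMPLE
        credentialName: wildcard-tls # TLS secret in istio-system
      hosts:                         # ExternalDNS publishes records for these hosts
        - "keycloak.example.com"
        - "gitlab.example.com"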

Step 5: Testing and Monitoring

After the reconfiguration, we tested the DNS resolution using dig and confirmed that the issue with delayed readiness was resolved. DNS queries were now resolving faster, and the OIDC configurations were being applied without delay.
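
Verification amounted to re-running the lookups and watching ExternalDNS reconcile the records; the namespace and label selector below assume a standard Helm installation of ExternalDNS:

# Confirm the hostname now resolves to the Istio ingress gateway load balancer
dig +short keycloak.example.com

# Watch ExternalDNS pick up the Gateway hosts and upsert the Route 53 records
kubectl -n external-dns logs -l app.kubernetes.io/name=external-dns -f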

Solution:

The root cause was the delay in CoreDNS resolving virtual service DNS names within Kubernetes. Configuring ExternalDNS to publish records that point directly at Istio's ingress gateway eliminated the delays. This reconfiguration ensured faster DNS lookups and resolved the readiness issues for ingress virtual services, especially for services like Keycloak and GitLab.

Conclusion:

This case study highlights the importance of optimizing DNS resolution in a microservices ecosystem, especially when using Istio for service communication and Crossplane for infrastructure management. The solution not only resolved the immediate issue but also improved the overall performance of DNS resolution within the Kubernetes clusters. Regular monitoring of DNS configurations and external dependencies like Crossplane is recommended to prevent such delays in the future.