Problem:

Argo CD v2.10.4, deployed in a high-load environment with approximately 480 applications, encountered severe performance issues. Refresh operations were inconsistent, sometimes taking up to 16 minutes, which slowed deployments, and the controller pod consumed significant CPU. Adjusting configuration parameters according to the Argo CD documentation did not resolve the issue, which became a critical barrier to adoption and operational efficiency.

Process:

After receiving reports of slowness and high CPU usage, the support team analyzed the provided logs, configuration files, and Grafana dashboards. The analysis showed that the single-replica deployment of the Argo CD application controller was insufficient for the workload of 480 applications. The controller consistently used a high percentage of the available CPU, even during low-traffic periods, which pointed to inefficient resource management and configuration.

Solution:

Scaling Deployment:
Increased replicas of both the argocd-server and argocd-application-controller components to distribute workload and improve responsiveness.
Set replicas: 3 on the argocd-server Deployment and replicas: 2 on the argocd-application-controller StatefulSet.
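
The replica changes above can be sketched as patch fragments like the following. This is a minimal illustration, not the client's actual manifests: the argocd namespace is assumed (it is the default install namespace), and only the fields relevant to scaling are shown. Argo CD's documentation also asks for the ARGOCD_CONTROLLER_REPLICAS environment variable to match the StatefulSet replica count when the controller runs with more than one replica. The case notes do not record whether it was set, so it is included here as an assumption.

```yaml
# Partial patch for the API server (a Deployment).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-server
  namespace: argocd        # assumed namespace
spec:
  replicas: 3
---
# Partial patch for the application controller (a StatefulSet).
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
  namespace: argocd        # assumed namespace
spec:
  replicas: 2
  template:
    spec:
      containers:
        - name: argocd-application-controller
          env:
            # Assumption: not mentioned in the case notes. Should equal
            # spec.replicas so each controller knows the shard count.
            - name: ARGOCD_CONTROLLER_REPLICAS
              value: "2"
```

Fragments like these are typically applied as Kustomize patches or merged into the installation manifests rather than applied on their own.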

Adjusting Configuration Parameters:
Reduced the values for controller.status.processors and controller.operation.processors to 50 and 25 respectively, from their initial values of 100 and 50.
Lowered controller.kubectl.parallelism.limit to 20 to optimize resource utilization and minimize CPU overhead.
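
A minimal sketch of how these three values might be set, assuming they are managed through the argocd-cmd-params-cm ConfigMap in the argocd namespace. Both are assumptions, since the case notes do not say where the parameters were configured.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd        # assumed namespace
data:
  # Reduced from the previous values of 100 and 50.
  controller.status.processors: "50"
  controller.operation.processors: "25"
  # Caps the number of concurrent kubectl fork/execs in the controller.
  controller.kubectl.parallelism.limit: "20"
```

ConfigMap values must be strings, hence the quotes. The controller reads these settings at startup, so the argocd-application-controller pods need a restart after the change.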

Enabling Caching:
Set the cache expiration parameters controller.app.state.cache.expiration and controller.metrics.cache.expiration to 60m to reduce repetitive data fetching and further lower the load on the controller.
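
The cache settings could look like the fragment below, again assuming the argocd-cmd-params-cm ConfigMap in the argocd namespace is where controller parameters are managed. The values are Go-style duration strings.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd        # assumed namespace
data:
  # How long cached application state is kept before it is recomputed.
  controller.app.state.cache.expiration: "60m"
  # How long cached Prometheus metrics are kept before they are refreshed.
  controller.metrics.cache.expiration: "60m"
```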

After these adjustments, Argo CD performance improved significantly. Refresh operations stabilized, with consistent execution times within acceptable limits. CPU utilization dropped notably, even during peak operational periods, demonstrating the effectiveness of scaling and tuning configuration parameters as recommended by Argo CD’s best practices. The client was able to proceed with their deployment plans without further hindrance.

Conclusion:

By systematically addressing the underlying performance bottlenecks and leveraging scalable deployment strategies, the support team successfully optimized Argo CD for high-load scenarios, meeting the client’s operational requirements and enhancing overall system stability and efficiency.