Problem:

There are two problems with the Prometheus monitoring setup: the retention setting is not functioning correctly, leading to excessive data storage, and the Prometheus database is frequently becoming corrupted, despite varying levels of workload and resource allocation across the different environments.
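The retention Prometheus actually applies is determined by the flags its server process is started with. As a minimal sketch, assuming an operator-managed deployment (a Prometheus CRD is referenced later in this case), the container spec rendered into the StatefulSet carries flags like those below; the flag names are standard Prometheus flags, but the values are illustrative and not taken from the client’s environment:

```yaml
# Hypothetical excerpt of the Prometheus container spec (illustrative values only).
# Comparing these args against the intended settings shows whether the retention
# configuration is actually reaching the running server.
containers:
  - name: prometheus
    args:
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=15d   # time-based retention; Prometheus defaults to 15d
      - --storage.tsdb.retention.size=50GB  # size-based cap; if unset, disk usage is not bounded
```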

Process:

Case Details:

  • Environment 1: Approximately 23GB of data is generated in Prometheus over a 3-hour period under load with 1300 pods (roughly 7.7GB per hour). Memory limit: 200GB.
  • Environment 2: Approximately 900MB of data is generated in Prometheus over a 1-hour period in another environment (LAB2) with a much lighter load of approximately 648 pods (roughly 0.9GB per hour). Memory limit: 110GB.
Upon investigating the issue, the following data was requested from the client:

  • Configuration files
  • Prometheus logs
  • values.yaml
  • Kubernetes cluster metrics during periods of high data generation
  • Prometheus alerting rules configured for the setup
  • CPU metrics
  • Prometheus version
  • Configuration of Prometheus Alertmanager
  • Any Prometheus exporters or related components deployed in the environment that might contribute to data generation or database issues (see the values.yaml sketch below)
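To put several of the requested items in context, the memory limits, storage, and exporters typically live in the Helm values. The excerpt below is a hypothetical values.yaml fragment, assuming the kube-prometheus-stack chart is in use (not confirmed by the case); the field paths follow that chart and the values are illustrative:

```yaml
# Hypothetical kube-prometheus-stack values.yaml excerpt (illustrative values only).
prometheus:
  prometheusSpec:
    resources:
      limits:
        memory: 200Gi          # corresponds to the Environment 1 memory limit above
    storageSpec:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 100Gi   # persistent volume backing the Prometheus TSDB
nodeExporter:
  enabled: true                # exporters such as these add to the volume of scraped data
kubeStateMetrics:
  enabled: true
```

Repeated TSDB corruption can be related to the backing volume or to unclean shutdowns (for example OOM kills), which is why the storage and resource configuration is part of the same picture as the exporter inventory.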
Solution:

The client was advised to:

1. Check the Prometheus custom resource (CRD) for the configured data retention settings (see the sketch after this list).
2. Review the running Prometheus version and consider upgrading it.
3. Schedule a call to discuss the details if the provided information did not resolve the issue.
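As a minimal sketch of where steps 1 and 2 are checked, assuming the Prometheus Operator CRD (apiVersion monitoring.coreos.com/v1, kind Prometheus); the resource name, namespace, and values below are illustrative rather than taken from the client’s cluster:

```yaml
# Illustrative Prometheus custom resource showing the fields behind steps 1 and 2.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  version: v2.48.0       # step 2: the Prometheus server version managed by the operator
  retention: 24h         # step 1: time-based retention the operator passes to the server
  retentionSize: 50GiB   # optional size cap so the TSDB cannot outgrow its volume
```

Comparing these fields on the live object against what values.yaml was meant to set shows whether the intended retention is actually being applied.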

Conclusion:

The case was resolved during a period of client inactivity. Unfortunately, details regarding the specific actions taken to resolve the issues are not provided in the case study.