Problem:

There are two problems with the Prometheus monitoring setup: the retention setting is not functioning correctly, leading to excessive data storage, and the Prometheus database is frequently becoming corrupted, despite varying levels of workload and resource allocation across the different environments.
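The retention Prometheus actually applies is determined by the flags its server process is started with. As a minimal sketch, assuming an operator-managed deployment (a Prometheus CRD is referenced later in this case), the container spec rendered into the StatefulSet carries flags like those below; the flag names are standard Prometheus flags, but the values are illustrative and not taken from the client’s environment:

```yaml
# Hypothetical excerpt of the Prometheus container spec (illustrative values only).
# Comparing these args against the intended settings shows whether the retention
# configuration is actually reaching the running server.
containers:
  - name: prometheus
    args:
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=15d   # time-based retention; Prometheus defaults to 15d
      - --storage.tsdb.retention.size=50GB  # size-based cap; if unset, disk usage is not bounded
```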

Process:

Case Details:

  • Environment 1: Approximately 23GB of data is generated in Prometheus over a 3-hour period under load with 1300 pods (roughly 7.7GB per hour). Memory limit: 200GB.
  • Environment 2: Approximately 900MB of data is generated in Prometheus over a 1-hour period in another environment (LAB2) with a much lighter load of approximately 648 pods (roughly 0.9GB per hour). Memory limit: 110GB.
Upon investigating the issue, the following data was requested from the client:

  • Configuration files
  • Prometheus logs
  • values.yaml
  • Kubernetes cluster metrics during periods of high data generation
  • Prometheus alerting rules configured for the setup
  • CPU metrics
  • Prometheus version
  • Configuration of Prometheus Alertmanager
  • Any Prometheus exporters or related components deployed in the environment that might contribute to data generation or database issues (see the values.yaml sketch below)
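To put several of the requested items in context, the memory limits, storage, and exporters typically live in the Helm values. The excerpt below is a hypothetical values.yaml fragment, assuming the kube-prometheus-stack chart is in use (not confirmed by the case); the field paths follow that chart and the values are illustrative:

```yaml
# Hypothetical kube-prometheus-stack values.yaml excerpt (illustrative values only).
prometheus:
  prometheusSpec:
    resources:
      limits:
        memory: 200Gi          # corresponds to the Environment 1 memory limit above
    storageSpec:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 100Gi   # persistent volume backing the Prometheus TSDB
nodeExporter:
  enabled: true                # exporters such as these add to the volume of scraped data
kubeStateMetrics:
  enabled: true
```

Repeated TSDB corruption can be related to the backing volume or to unclean shutdowns (for example OOM kills), which is why the storage and resource configuration is part of the same picture as the exporter inventory.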
Solution:

The client was advised to:

1. Check the Prometheus custom resource (CRD) for the configured data retention settings (see the sketch after this list).
2. Review the running Prometheus version and consider upgrading it.
3. Schedule a call to discuss the details if the provided information did not resolve the issue.
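As a minimal sketch of where steps 1 and 2 are checked, assuming the Prometheus Operator CRD (apiVersion monitoring.coreos.com/v1, kind Prometheus); the resource name, namespace, and values below are illustrative rather than taken from the client’s cluster:

```yaml
# Illustrative Prometheus custom resource showing the fields behind steps 1 and 2.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  version: v2.48.0       # step 2: the Prometheus server version managed by the operator
  retention: 24h         # step 1: time-based retention the operator passes to the server
  retentionSize: 50GiB   # optional size cap so the TSDB cannot outgrow its volume
```

Comparing these fields on the live object against what values.yaml was meant to set shows whether the intended retention is actually being applied.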

Conclusion:

The case was resolved during a period of client inactivity. Unfortunately, details regarding the specific actions taken to resolve the issues are not provided in the case study.