Problem
The client faced a critical issue with their Grafana setup: Grafana alerts were failing to trigger when configured thresholds were breached, and the “TEST ALERT” feature consistently resulted in a “NO DATA” message.
Process
Step 1: Initial Investigation
The issue was approached in several stages.
The initial investigation began with a meeting with the client, during which several diagnostic steps were taken, including enabling DEBUG logs and testing the alert features. These steps did not reveal the root cause, so additional files were requested from the client for further analysis.
Step 2: Deeper Investigation
A follow-up meeting was then held. The team examined discrepancies between the values shown in the Query Inspector and the actual data, and tested pausing and resuming alerts. These checks confirmed that the alerting system was not working correctly. Several fixes were attempted, including modifying query values, offsets, and minimum doc counts, but none succeeded. A new approach was therefore proposed: downgrading Grafana to version 8.5.1 and recreating a similar test environment.
Step 3: Further Investigation
The next phase went further. The SMTP configuration was verified, the option of creating a separate dashboard so that alerts could be tested cautiously in the production environment was considered, and the way Grafana retrieves metrics through Elasticsearch was examined. In addition, the Grafana system settings related to alerting were reviewed, different alert queries and conditions were tested, and the data sources used by the alerts were explored. The Grafana logs were also checked for patterns or anomalies related to alerting.
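The log review described above can be partly automated. The following is a minimal sketch, assuming the Grafana log is written in a logfmt-style key=value format and that alerting entries carry a logger field containing the word "alert"; both the field names and the helper functions are illustrative assumptions, not something confirmed for the client's setup.

```python
import re
from typing import Dict, Iterable, List

# Matches key=value or key="quoted value" pairs (logfmt-style; assumed format).
_PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_logfmt(line: str) -> Dict[str, str]:
    """Split one log line into a dict of its key=value fields."""
    fields: Dict[str, str] = {}
    for key, value in _PAIR.findall(line):
        # Strip the surrounding quotes from quoted values.
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]
        fields[key] = value
    return fields


def alerting_entries(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Return parsed entries whose logger name mentions alerting.

    Assumption: alerting-related loggers contain the substring "alert".
    """
    hits = []
    for line in lines:
        fields = parse_logfmt(line)
        if "alert" in fields.get("logger", "").lower():
            hits.append(fields)
    return hits
```

With DEBUG logging enabled, the lines of the log file can be passed to alerting_entries to narrow the review to rule evaluation and notification messages.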
Based on these actions, a recommendation was made to install a new instance of Grafana on the same server to identify if the issue was specific to the current instance or system-wide.
Step 4: Preparation For Additional Testing
To prepare for additional testing, the client received detailed instructions for downloading and installing a second instance of Grafana in a test environment. The steps were: download the same Grafana version from the Grafana download page, unzip the archive to a different directory, modify the grafana.ini configuration file for the new instance, and start the new Grafana instance.
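The configuration change for the second instance can be sketched as follows. This is a minimal example, assuming the archive was unzipped into its own directory and that overrides are placed in conf/custom.ini (the override file used by standalone Grafana installs; a packaged install would edit grafana.ini instead). The function name, the port 3001, and the directory layout are illustrative assumptions.

```python
import configparser
from pathlib import Path


def write_second_instance_config(install_dir: str, http_port: int = 3001) -> Path:
    """Write conf/custom.ini for a second Grafana instance and return its path."""
    root = Path(install_dir)
    conf_dir = root / "conf"
    conf_dir.mkdir(parents=True, exist_ok=True)

    # Interpolation is disabled so paths containing '%' are written verbatim.
    config = configparser.ConfigParser(interpolation=None)
    # [server] http_port: port for the web UI; must differ from the first instance.
    config["server"] = {"http_port": str(http_port)}
    # [paths]: keep the database and logs separate from the first instance.
    config["paths"] = {
        "data": str(root / "data"),
        "logs": str(root / "data" / "log"),
    }
    # [log] level: debug output helps when investigating alerting.
    config["log"] = {"level": "debug"}

    target = conf_dir / "custom.ini"
    with target.open("w", encoding="utf-8") as handle:
        config.write(handle)
    return target
```

Keeping the data path separate means the second instance starts with its own empty database, so nothing configured in it can affect the existing instance.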
Step 5: Testing In The Client’s Test Environment
Once the test environment had been set up, it was checked to confirm that it functioned properly. Following successful testing, recommendations for implementing the changes in the production environment were provided.
The recommendations were:
1. Install the same Grafana version on PROD, in a distinct directory, as was done in the test lab.
2. Adjust the ini configuration, specifically to select a different port if 3000 is already in use.
3. After the installation, confirm that Grafana is accessible via a web browser.
4. Once the new Grafana instance is running, set up an OS_Perf data source.
5. Create visuals on the new data source and test its alerts feature.
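Recommendations 2 and 3 can be checked from a script before opening a browser. The sketch below, which assumes the instance runs on the local machine, looks for a free port starting at 3000 and then queries Grafana's /api/health endpoint, which reports the database status as JSON. The function names and default values are illustrative, not part of the original instructions.

```python
import json
import socket
import urllib.error
import urllib.request


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0


def first_free_port(start: int = 3000, attempts: int = 20, host: str = "127.0.0.1") -> int:
    """Return the first port at or after `start` that nothing is listening on."""
    for port in range(start, start + attempts):
        if not port_in_use(port, host):
            return port
    raise RuntimeError(f"no free port found in {start}..{start + attempts - 1}")


def grafana_is_healthy(base_url: str, timeout: float = 3.0) -> bool:
    """Return True if GET <base_url>/api/health reports database "ok"."""
    url = base_url.rstrip("/") + "/api/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = json.load(response)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON reply.
        return False
    return isinstance(body, dict) and body.get("database") == "ok"
```

For example, the value returned by first_free_port() can be used as the port in the ini configuration, and grafana_is_healthy("http://localhost:3001") can be called after the new instance is started, with the port adjusted to whatever was configured.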
Conclusion
In conclusion, the investigation combined diagnostic steps, testing, and practical proposals, and it led to the successful resolution of the Grafana alerting issue, restoring the system’s reliability and functionality.