Problem 

The client faced a critical issue with their Grafana setup: Grafana alerts were failing to trigger when configured thresholds were breached, and the “TEST ALERT” feature consistently resulted in a “NO DATA” message. 

Process 

Step 1: Initial Investigation 

The investigation proceeded in several steps. It began with a meeting with the client, during which DEBUG logging was enabled and the alert test feature was exercised. These steps did not reveal the root cause, so additional files were requested from the client for further analysis.

Step 2: Deeper Investigation 

In a follow-up meeting, the team compared the values shown in the Query Inspector against the actual data and tested pausing and resuming alerts. These checks confirmed that the alerting system was not working correctly. Several fixes were attempted, including modifying query values, offsets, and minimum doc counts, but none succeeded. A new approach was therefore proposed: downgrading Grafana to version 8.5.1 and recreating a similar test environment.

Step 3: Further Investigation 

The next phase covered verifying the SMTP configuration, considering a new dashboard so that alerts could be tested cautiously in the production environment, and examining how Grafana retrieved metrics through Elasticsearch. The team also reviewed the alerting-related system settings in Grafana, tested different alert queries and conditions, explored the data sources used by the alerts, and scrutinized the Grafana logs for patterns or anomalies related to alerting.

Based on these actions, a recommendation was made to install a new instance of Grafana on the same server to determine whether the issue was specific to the current instance or system-wide.
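The log review described in this step can be partly automated. The following is a minimal sketch, not the procedure actually used in the investigation. It assumes the log lines use a key=value layout with `logger`, `msg`, and `lvl` (or `level`) fields; the exact layout depends on the Grafana version and logging configuration, so the field names and the `alert` keyword are assumptions to adjust.

```python
import re
from collections import Counter

# Matches key=value pairs where the value is a quoted string or a bare token.
PAIR_RE = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_logfmt(line):
    """Parse one key=value log line into a dict, stripping surrounding quotes."""
    fields = {}
    for key, value in PAIR_RE.findall(line):
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]
        fields[key] = value
    return fields


def alerting_entries(lines, keyword="alert"):
    """Return parsed entries whose logger or message mentions the keyword."""
    entries = []
    for line in lines:
        fields = parse_logfmt(line)
        haystack = (fields.get("logger", "") + " " + fields.get("msg", "")).lower()
        if keyword in haystack:
            entries.append(fields)
    return entries


def count_by_level(entries):
    """Count entries per log level; the key is assumed to be 'lvl' or 'level'."""
    return Counter(e.get("lvl") or e.get("level") or "unknown" for e in entries)
```

To use it, read the Grafana log file into a list of lines, pass the list to `alerting_entries`, and then pass the result to `count_by_level` to see how many alert-related entries appear at each level. A burst of error-level entries around the time a threshold was breached is a natural place to start reading.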

Step 4: Preparation For Additional Testing 

To prepare for additional testing, the client received detailed instructions for downloading and installing a second instance of Grafana in a test environment. The steps included downloading the same Grafana version from the Grafana download page, unzipping the archive to a different directory, modifying the grafana.ini configuration file for the new instance, and starting the new Grafana instance.
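The configuration change for the second instance can be illustrated with a short sketch. The exact values used for the client are not recorded here, so the port and directory names in the usage note are hypothetical. The sketch relies on the standard Grafana settings `http_port` under `[server]`, `data` and `logs` under `[paths]`, and `level` under `[log]`, and it builds a small overrides-only file rather than editing a full grafana.ini.

```python
import configparser
import io


def build_override_ini(http_port, data_dir, logs_dir, log_level="debug"):
    """Return ini text holding only the settings that differ for a second instance."""
    # Interpolation is disabled so that a '%' in a path is written literally.
    config = configparser.ConfigParser(interpolation=None)
    config["server"] = {"http_port": str(http_port)}
    # Separate data and log directories keep the two instances from sharing state.
    config["paths"] = {"data": data_dir, "logs": logs_dir}
    config["log"] = {"level": log_level}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()
```

For example, `build_override_ini(3001, "data-test", "logs-test")` produces a fragment with a non-default port and its own data and log directories. The text can be saved as `conf/custom.ini` inside the unzipped directory of the second instance, or merged into whichever configuration file that instance is started with.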

Step 5: Testing In The Client’s Test Environment 

After the test environment was set up, it was verified to be functioning properly. Following successful testing, recommendations for implementing the changes in the production environment were provided.

The recommendations were: 

1. Install the same Grafana version on PROD in a distinct directory, as was done in the test lab.

2. Adjust the ini configuration, in particular selecting a different port if 3000 is already in use.

3. After the installation, confirm that Grafana is accessible via a web browser.

4. Once the new Grafana instance is running, set up an OS_Perf data source.

5. Create a new data source with visuals and test its alerts feature.
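Two of the checks in this list, finding out whether port 3000 is already taken and confirming that the new instance responds, can also be scripted. The sketch below is an illustration under stated assumptions rather than part of the original recommendations: it assumes Grafana is reachable from the machine running the script and uses the Grafana `/api/health` endpoint, which reports `"database": "ok"` when the instance is healthy. The function names are hypothetical.

```python
import json
import socket
import urllib.request


def port_in_use(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


def first_free_port(start=3000, attempts=10):
    """Return the first port at or above start with no listener, else None."""
    for port in range(start, start + attempts):
        if not port_in_use(port):
            return port
    return None


def database_ok(health_body):
    """Interpret the JSON body returned by the health endpoint."""
    payload = json.loads(health_body)
    return payload.get("database") == "ok"


def check_grafana(base_url, timeout=5.0):
    """Query <base_url>/api/health and report whether the instance is healthy."""
    url = base_url.rstrip("/") + "/api/health"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status == 200 and database_ok(resp.read().decode("utf-8"))
```

A typical sequence would be to call `first_free_port(3000)` before editing the ini file, start the new instance on the returned port, and then call `check_grafana` with the matching base URL. A browser check is still worthwhile, since the health endpoint does not exercise the login page or dashboards.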

Conclusion 

In conclusion, the investigation combined diagnostic steps, testing, and practical proposed solutions, and it led to the successful resolution of the Grafana alerting issue, ultimately ensuring the system’s reliability and functionality.