Problem: The client faced significant delays in executing Elasticsearch queries within their production environment. A particular query, which involved a simple numeric account identifier, took an alarming 68 seconds to execute, despite returning only six hits. The total size of the query output was 583KB, yet the Elasticsearch profiler indicated that 67 seconds of this […]
Data Analytics 21 Aug 2024 Resolving Prometheus Pod Crashing Issue in Production Environment
Problem: The client reported an issue where the Prometheus pod was crashing in the production environment. The error logs indicated a variety of issues including “Terminated Reason: Error” and messages about unhealthy blocks and existing lock files. The specific error message highlighted was: Last State: Terminated Reason: Error Message: und healthy block” mint=1680069600000 maxt=1680091200000 ulid=01GWPY8332RWJAPSFZ8KNAJQ9H […]
Data Analytics 31 Jul 2024 Resolving Prometheus Directory Growth Issue
Problem: The client reported that the Prometheus directory inside /var/lib had grown to 23GB, leading to high disk utilization on /var and potentially impacting other services. The /var directory has a total capacity of 200GB, which is shared by other service libraries and log files. Currently, the utilization on /var is at 80%, and the […]
Data Analytics 29 Jul 2024 Kafka Streams Application: Efficient Management of Changelog Topics
Problem: The client, running a Kafka Streams application on version 3.3.1, encountered issues with managing changelog topics. Although the application was configured to clean up records in these topics automatically, the topics kept growing unchecked, posing potential risks to system performance and stability. Process: Initial Assessment: The client reported the issue, highlighting the design of their Kafka Streams application […]
Data Analytics 20 Jul 2024 Grid Connector Stuck and Failing: Offsets Not Committed, Leading to Increasing Lag
Problem: The client requested assistance with an issue affecting the `connect-eoc-data-summary-to-grid-sink-httpfile-connector`. The connector is not reading any records and its offsets are not being committed, so the lag keeps increasing. The client indicated that the grid connector appears to be stuck and has failed. The following […]
Data Analytics 19 Jul 2024 Resolving Elasticsearch Query Timeouts Problem
Problem: Certain Elasticsearch queries timed out after 30 seconds. Details: The customer used Elasticsearch (version 7.17.0 or slightly newer) to query documents created by the Actimize application. The Elastic index contained approximately 80 million documents, amounting to several terabytes. Typically, queries were executed within a few seconds, but some queries consistently took 30 seconds or […]
Data Analytics 12 Jul 2024 Prometheus’ node exporter failing on ARM64 machines
Problem: The customer is experiencing an “exec format error” when using Prometheus node exporter versions 1.5.0 and 1.6.0 on ARM64 machines, particularly Graviton-type instances in an AWS environment. The error is observed in the node exporter pods running as a DaemonSet in a Kubernetes cluster whose nodes use the ARM64 architecture. Process: The experts requested […]
Data Analytics 6 Jul 2024 Risks in Airflow Version 2.5.2 – Unauthenticated Page Vulnerability
Problem: The user was unable to reach the application page and received the error ‘Unauthenticated Page’. Process: Step 1: Initial Investigation The security issue pertains to an unauthenticated page within the Airflow version 2.5.2 instance. Because this page can be accessed without proper authentication, it poses a security risk, potentially exposing sensitive information […]
Data Analytics 28 Jun 2024 Enhancing Security Measures for Prometheus Operator Cluster Roles
Problem: The client, deploying the Prometheus operator using a community helm chart, raised a security concern about the permissions granted to the operator. On closer examination, the community helm chart was found to grant overly permissive access rights, particularly ‘*’ permissions for secrets and configmaps, as well as delete permissions for default […]
Data Analytics