Problem:

The client reported repeated network timeouts and instability in their Apache Kafka environment, resulting in missing Call Detail Records (CDRs) for specific time periods. The issue manifested as persistent connectivity errors between Kafka brokers and 100% CPU utilization on the node hosting critical services (UMS, App, and ODF pods).

Process:

Step 1: Initial Identification

The client observed repeated timeout errors between Kafka brokers, specifically between broker 1 and broker 2. The logs contained messages such as:

Disconnecting from node 2 due to request timeout

This indicated potential network instability or a performance bottleneck on broker 2. To mitigate the issue, the client’s team restarted the node hosting ODF broker 2. Following the restart, broker 2 successfully reconnected to its replicas, and partition 13 was re-added to the ISR, restoring replication stability.
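To see which broker the timeouts concentrate on, the log lines can be tallied per node. The following is a minimal sketch, assuming the broker logs are available as plain text lines; the function name and the sample lines are illustrative, and only the message text itself comes from the logs quoted above.

```python
import re
from collections import Counter

# Matches the log message quoted above and captures the node id.
TIMEOUT_RE = re.compile(r"Disconnecting from node (\d+) due to request timeout")

def count_timeouts_by_node(lines):
    """Return a Counter mapping node id -> number of request-timeout disconnects."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts

# Illustrative input; real lines would carry timestamps and logger names.
sample = [
    "Disconnecting from node 2 due to request timeout",
    "some unrelated log line",
    "Disconnecting from node 2 due to request timeout",
]
for node, n in count_timeouts_by_node(sample).most_common():
    print(f"node {node}: {n} timeout disconnects")
```

A tally dominated by a single node id is consistent with a problem on that broker or its host rather than a cluster-wide network fault.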

Step 2: Extended Observations

Despite the temporary resolution, the client reported continued service instability. Investigation revealed that the physical host running the virtual machines (worker nodes) experienced 100% CPU utilization, leading to degraded performance of UMS, App, and ODF pods. The client addressed this by migrating the affected VMs to a different physical host, reducing resource contention and improving system performance.

Step 3: Expert Analysis

The expert reviewed the logs, system performance data, and client responses. The analysis focused on confirming whether the broker restart and CPU utilization were linked to message loss. The expert requested clarification on several points, including:

  • Whether the issue was resolved after restarting broker 2.
  • Whether the affected node was indeed experiencing 100% CPU usage.
  • Whether the missing messages were consecutive or appeared in random order.

The client confirmed that:

  • The issue was temporarily fixed after restarting broker 2.
  • The node hosting UMS, App, and ODF pods was running at 100% CPU.
  • The missing messages occurred consecutively in time.
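The consecutive-versus-random question can be checked against the timestamps of the records that did arrive. The sketch below assumes that records normally arrive no further apart than some threshold; the threshold, the function names, and the sample timestamps are all hypothetical.

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, max_gap):
    """Return (start, end) pairs where neighbouring received records
    are more than max_gap apart."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_gap]

def describe_loss(gaps):
    """Summarise whether the loss looks like one window or several."""
    if not gaps:
        return "no gaps"
    if len(gaps) == 1:
        return "single contiguous window"
    return "multiple separate windows"

# Hypothetical receive times with one long silence between 10:02 and 10:30.
day = datetime(2024, 1, 1)
received = [day.replace(hour=10, minute=m) for m in (0, 1, 2, 30, 31)]
gaps = find_gaps(received, max_gap=timedelta(minutes=5))
print(describe_loss(gaps), gaps)
```

A single contiguous window points to an outage period, while many scattered gaps would suggest intermittent loss of individual messages.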

The client also inquired why message loss occurred even though two brokers in the three-node cluster remained operational. Additionally, they observed Kafka Connect pods timing out while committing offsets during the incident period.

Step 4: Root Cause Analysis and Solution Proposal

Based on the analysis, the expert identified the following root causes and contributing factors:

  • Resource Saturation: The host running Kafka brokers was overutilized, leading to severe timeouts between brokers. This is a typical “noisy neighbor” scenario in virtualized environments.
  • Broker Failure and Cluster Overload: One Kafka node went down due to high load or was terminated by an external process. The failure triggered additional synchronization efforts on the remaining nodes, further increasing system load.
  • Message Acknowledgment Failures: Under extreme CPU pressure and network lag, Kafka acknowledgment messages were not delivered reliably, causing incomplete message consumption and potential checkpoint loss.
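The resource-saturation finding suggests checking whether several brokers share one physical host. The following is a minimal sketch, assuming a broker-to-host inventory can be exported from the virtualization layer; the broker and host names are hypothetical.

```python
from collections import defaultdict

def colocated_brokers(placement):
    """Given a dict of broker name -> physical host, return the hosts
    that carry more than one broker, with the brokers they carry."""
    by_host = defaultdict(list)
    for broker, host in placement.items():
        by_host[host].append(broker)
    return {host: sorted(brokers) for host, brokers in by_host.items()
            if len(brokers) > 1}

# Hypothetical inventory: two of three brokers share host-a.
placement = {"broker-1": "host-a", "broker-2": "host-a", "broker-3": "host-b"}
print(colocated_brokers(placement))
```

An empty result means each broker sits on its own host, so one saturated host can affect at most one broker.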

Solution:

To resolve the issue and prevent recurrence, the expert recommended the following actions:

  • Re-publish the missing messages to the Kafka topic to restore the lost data and allow consumers to process it again.
  • Distribute Kafka brokers across separate physical hosts to eliminate single points of resource contention.
  • Review VM placement and resource allocation policies to ensure balanced load distribution and prevent CPU saturation on any single host.
  • Adjust the Kafka client’s checkpoint and re-consume configurations so that retries are managed and duplicate records are handled correctly.
  • Implement real-time monitoring and alerting for CPU, memory, and Kafka broker performance to detect and mitigate issues before they lead to service disruption.
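The checkpoint and duplicate-handling recommendation can be illustrated with a small simulation. This is a sketch under stated assumptions, not the client's consumer code and not a Kafka API: a list stands in for a partition, `commit` is a hypothetical callable that returns False when a commit times out, and the record ids (`cdr-N`) and in-memory dedupe set are invented for illustration.

```python
def consume(log, committed_offset, seen_ids, process, commit):
    """Process records from committed_offset onward, committing after each one.

    log: list of (record_id, payload) tuples standing in for a partition.
    seen_ids: set of record ids already processed (the dedupe store).
    commit: callable(next_offset) -> bool; False simulates a timed-out commit.
    Returns the last offset that was successfully committed.
    """
    for offset in range(committed_offset, len(log)):
        record_id, payload = log[offset]
        if record_id not in seen_ids:   # skip records re-delivered after a restart
            process(payload)
            seen_ids.add(record_id)
        if commit(offset + 1):          # commit only after processing
            committed_offset = offset + 1
    return committed_offset

log = [(f"cdr-{i}", f"payload-{i}") for i in range(5)]
processed, seen = [], set()

# First run: commits stop succeeding once offset 2 has been committed.
resume_at = consume(log, 0, seen, processed.append, lambda nxt: nxt <= 2)

# Restart: resume from the last committed offset with commits working again.
final = consume(log, resume_at, seen, processed.append, lambda nxt: True)
print(resume_at, final, processed)
```

Because offsets are committed only after processing, a lost commit causes re-delivery rather than loss, and the dedupe check keeps re-delivered or re-published records from being processed twice. In a real deployment the dedupe store would need to be durable rather than an in-memory set.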

Conclusion:

The incident was caused by excessive CPU utilization on the host running Kafka brokers, leading to broker timeouts, synchronization failures, and message loss. While restarting the affected broker restored short-term stability, the underlying resource contention required deeper infrastructure-level adjustments. By redistributing the Kafka brokers across different hosts, optimizing consumer checkpoint handling, and implementing proactive monitoring, the client was able to guard against similar message loss incidents and make CDR processing more resilient.