Problem:

The client faced recurring Kafka sink connector failures (e.g., chf-cdr-sftp-sink-connector) in a Kubernetes environment (Kafka 3.2.0 with three brokers and ZooKeeper). The failures were caused by corrupt messages at specific offsets, leading to task crashes. Despite skipping corrupt offsets and restarting connectors, the issue persisted, requiring a more permanent solution.
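
The account above does not record exactly how the corrupt offsets were skipped. One common approach for a sink connector is to stop it and then advance its consumer group's committed offset past the bad record with the Kafka AdminClient; the sketch below assumes that approach. The group name follows Kafka Connect's connect-&lt;connector-name&gt; convention, and the broker address, topic, partition, and offset are placeholders.

    // Hypothetical sketch: advance a stopped sink connector's consumer group past a corrupt record.
    // The group must have no active members (i.e., the connector is stopped or paused).
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;

    import java.util.Map;
    import java.util.Properties;

    public class SkipCorruptOffset {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-broker:9092"); // placeholder address

            try (AdminClient admin = AdminClient.create(props)) {
                TopicPartition partition = new TopicPartition("chf-cdr-topic", 0); // hypothetical topic/partition
                long nextGoodOffset = 123_457L;                                    // corrupt offset + 1 (placeholder)

                // Commit the new position for the connector's consumer group.
                admin.alterConsumerGroupOffsets(
                        "connect-chf-cdr-sftp-sink-connector",
                        Map.of(partition, new OffsetAndMetadata(nextGoodOffset))
                ).all().get();
            }
        }
    }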

Process:

Step 1: Environment Review

The expert reviewed the environment configuration, including Kafka version (3.2.0) and Kubernetes setup, to establish a baseline understanding of the issue.

Step 2: Error Analysis

Connector logs showed CRC checksum failures and task crashes during message deserialization, indicating corruption at the Kafka broker (log segment) level rather than a problem with the connector configuration.

Step 3: Root Cause Investigation

The expert identified potential causes of corruption:

  • Unsupported Java versions.
  • Disk-related issues during high I/O.
  • Forceful Kafka broker shutdowns (e.g., SIGKILL), which could corrupt segment files.

Step 4: Misunderstanding of auto.offset.reset

The client had attempted to resolve the issue by changing the auto.offset.reset setting. The expert clarified that this setting only applies when the consumer group has no committed offset for a partition. If an offset is already committed and the record at that position is corrupt, the consumer still resumes from the committed offset and fails again.
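
A minimal consumer configuration illustrates the distinction. The broker address, group id, and topic below are placeholders; the commented property shows the only case in which auto.offset.reset is consulted.

    // Minimal sketch of where auto.offset.reset applies (placeholder names throughout).
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    public class OffsetResetExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-broker:9092"); // placeholder address
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group");              // hypothetical group
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            // Consulted ONLY when the group has no committed offset for a partition.
            // Once an offset is committed, the consumer resumes from it, even if the
            // record stored at that position is corrupt.
            props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("example-topic")); // hypothetical topic
                consumer.poll(Duration.ofSeconds(1));
            }
        }
    }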

Step 5: Recovery Plan

The expert recommended the following recovery steps:

  • Pause data production to the affected topic.
  • Consume whatever messages are still readable so that recoverable data is preserved.
  • Delete and recreate the corrupted topic (see the sketch after this list).
  • Use ephemeral topics for critical workloads to simplify future recovery.
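
The delete-and-recreate step could look roughly like the following AdminClient sketch. The topic name is hypothetical and the sizing simply mirrors the three-broker cluster described above; the exact commands the client ran are not recorded here.

    // Hypothetical recovery sketch: delete the corrupted topic and recreate it.
    // Run only after producers are paused and readable data has been drained.
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    import java.util.List;
    import java.util.Properties;

    public class RecreateTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-broker:9092"); // placeholder address

            try (AdminClient admin = AdminClient.create(props)) {
                // Remove the topic containing the corrupt segments.
                admin.deleteTopics(List.of("chf-cdr-topic")).all().get(); // hypothetical topic name

                // Deletion completes asynchronously on the brokers; in practice, wait until
                // the topic is fully gone before recreating it with matching settings.
                admin.createTopics(List.of(new NewTopic("chf-cdr-topic", 3, (short) 3))).all().get();
            }
        }
    }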

Step 6: Secondary Failure

A similar corruption later surfaced in a Kafka Streams changelog topic, confirming that the problem was environmental rather than specific to one connector and reinforcing the need for improved Kafka operational practices and infrastructure reliability.

Step 7: Retention Policy Fix

The expert discovered that the corrupt segments weren’t being cleaned up due to retention settings. To fix this:

  • Temporarily reduce the topic's retention time (e.g., to 10 minutes) so the broker deletes the old, corrupt segments; see the sketch after this list.
  • Restart the brokers to clear the corrupted logs, then restore the original retention setting.
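
A hedged sketch of the retention change, again via the AdminClient and with a hypothetical topic name; 600000 ms corresponds to the 10-minute value suggested above, and the original retention should be restored once the corrupt segments are gone.

    // Hypothetical sketch: temporarily lower retention.ms so the broker deletes old
    // (corrupt) segments, then restore the original value afterwards.
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    import java.util.Collection;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    public class LowerRetention {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-broker:9092"); // placeholder address

            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "chf-cdr-topic"); // hypothetical

            try (AdminClient admin = AdminClient.create(props)) {
                // 600000 ms = 10 minutes, the temporary value suggested above.
                AlterConfigOp lower = new AlterConfigOp(
                        new ConfigEntry("retention.ms", "600000"), AlterConfigOp.OpType.SET);

                Map<ConfigResource, Collection<AlterConfigOp>> updates = Map.of(topic, List.of(lower));
                admin.incrementalAlterConfigs(updates).all().get();

                // After the corrupt segments are deleted, revert with another SET, e.g.
                // new ConfigEntry("retention.ms", "604800000") for 7 days.
            }
        }
    }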

Step 8: Topic Migration

To preserve continuity:

  • Create a new topic and redirect the producer to it (a sketch follows this list).
  • Let the consumer drain the old topic.
  • Delete the old topic after draining.
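
The migration itself could look roughly like this; the new topic name, sizing, and serializers are assumptions for illustration, not details taken from the engagement.

    // Hypothetical migration sketch: create a replacement topic, then point the producer
    // at it while the consumer drains the old topic.
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    import java.util.List;
    import java.util.Properties;

    public class MigrateTopic {
        public static void main(String[] args) throws Exception {
            // 1. Create the replacement topic (name and sizing are illustrative).
            Properties adminProps = new Properties();
            adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-broker:9092"); // placeholder

            try (AdminClient admin = AdminClient.create(adminProps)) {
                admin.createTopics(List.of(new NewTopic("chf-cdr-topic-v2", 3, (short) 3))).all().get();
            }

            // 2. Redirect the producer to the new topic; the old topic receives no new data
            //    and can be deleted once the consumer has drained it.
            Properties producerProps = new Properties();
            producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-broker:9092");
            producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
                producer.send(new ProducerRecord<>("chf-cdr-topic-v2", "key", "value")).get();
            }
        }
    }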

Step 9: Safe Broker Shutdown

The expert emphasized graceful shutdown procedures, since unclean broker shutdowns (e.g., SIGKILL) were a likely source of the segment corruption:

  • Use systemctl stop kafka or kafka-server-stop.sh so brokers can flush and close their log segments cleanly instead of being killed mid-write.

Solution:

With the expert’s guidance, the client took several corrective actions:

  • Manual Purge of Corrupt Logs: Temporarily adjusted retention settings so the brokers removed the corrupt log segments.
  • Topic Migration: Moved producers and consumers to new topics where needed to avoid the corrupted ones.
  • Safe Broker Shutdown Practices: Adopted graceful shutdown procedures to prevent future corruption.
  • Health Checks: Introduced routine checks of message integrity to catch corruption early (one possible form is sketched below).
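
The write-up does not describe these checks in detail. One possible form, assumed here for illustration, is a small probe that reads through a topic with a throwaway consumer group and reports any fetch or CRC error it hits:

    // Assumed example of an integrity probe (not the client's actual tooling):
    // read through a topic and report any exception raised while fetching records.
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.KafkaException;
    import org.apache.kafka.common.serialization.ByteArrayDeserializer;

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    public class TopicIntegrityProbe {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-broker:9092"); // placeholder address
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "integrity-probe");            // hypothetical group
            props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

            long total = 0;
            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("chf-cdr-topic")); // hypothetical topic
                for (int i = 0; i < 10; i++) {                // bounded probe, not a long-running daemon
                    ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(2));
                    total += records.count();
                }
                System.out.println("Probe read " + total + " records without errors.");
            } catch (KafkaException e) {
                System.err.println("Integrity probe failed after " + total + " records: " + e);
            }
        }
    }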

These actions restored connector stability and improved Kafka message handling.

Conclusion:

The issue was caused by unsafe Kafka broker shutdowns and improper log cleanup. By following the expert’s structured recovery process and implementing best practices, the client:

  • Resolved connector crashes caused by corrupt messages.
  • Mitigated future risks of corruption.
  • Gained operational clarity on Kafka retention and offset management.

The expert’s approach ensured long-term stability and resilience for Kafka data pipelines in high-throughput environments.