Problem:
The client requested assistance with the `connect-eoc-data-summary-to-grid-sink-httpfile-connector`. The connector is not reading any records and its offsets are not being committed, so the consumer lag keeps increasing.
The client indicated that the grid connector appears to be stuck and has failed. The following exception was observed in the Connect pods:
2023-11-30 21:41:58,896 ERROR [eoc-data-summary-to-grid-sink-httpfile-connector|task-0] WorkerSinkTask{id=eoc-data-summary-to-grid-sink-httpfile-connector-0} Commit of offsets threw an unexpected exception for sequence number 1: {com.att.cgf.prod.ccs-billcycle-sorted-0=OffsetAndMetadata{offset=3224545, leaderEpoch=null, metadata=''}} (org.apache.kafka.connect.runtime.WorkerSinkTask) [task-thread-eoc-data-summary-to-grid-sink-httpfile-connector-0]
org.apache.kafka.clients.consumer.CommitFailedException: Offset commit cannot be completed since the consumer is not part of an active group for auto partition assignment; it is likely that the consumer was kicked out of the group.
    at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.sendOffsetCommitRequest(ConsumerCoordinator.java:1231)
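When many such entries have to be triaged, the connector, task, topic partition, and offset can be pulled out of the error line quoted above. The following is a minimal sketch in Python; the regular expression and the function name `parse_commit_failure` are illustrative assumptions based only on the single line shown here, not a general Kafka Connect log parser.

```python
import re

# Pattern written against the one WorkerSinkTask error line quoted in this
# report; other log layouts may need a different expression (assumption).
COMMIT_FAILURE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) ERROR "
    r"\[(?P<connector>[^|\]]+)\|task-(?P<task>\d+)\].*?"
    r"\{(?P<topic>[^{}=]+)-(?P<partition>\d+)"
    r"=OffsetAndMetadata\{offset=(?P<offset>\d+)"
)

SAMPLE = (
    "2023-11-30 21:41:58,896 ERROR "
    "[eoc-data-summary-to-grid-sink-httpfile-connector|task-0] "
    "WorkerSinkTask{id=eoc-data-summary-to-grid-sink-httpfile-connector-0} "
    "Commit of offsets threw an unexpected exception for sequence number 1: "
    "{com.att.cgf.prod.ccs-billcycle-sorted-0=OffsetAndMetadata{offset=3224545, "
    "leaderEpoch=null, metadata=''}} "
    "(org.apache.kafka.connect.runtime.WorkerSinkTask)"
)


def parse_commit_failure(line):
    """Return the fields of a commit-failure line, or None if it does not match."""
    match = COMMIT_FAILURE.search(line)
    if match is None:
        return None
    return {
        "timestamp": match.group("timestamp"),
        "connector": match.group("connector"),
        "task": int(match.group("task")),
        "topic": match.group("topic"),
        "partition": int(match.group("partition")),
        "offset": int(match.group("offset")),
    }


if __name__ == "__main__":
    print(parse_commit_failure(SAMPLE))
```

Only the first offset entry on a line is extracted; a commit covering several partitions would need the pattern applied repeatedly.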
The client has provided a screenshot and shared the latest logs for further analysis.
Process:
Step 1 – Initial investigation:
After initial investigation and troubleshooting of the client’s request, the team provided the following analysis:
- The team recommended reducing the poll size as an initial remediation. They suggested examining the consumer code for bugs that could impact processing time.
- An analysis of SM-19885-connect.log revealed that the `eoc-data-summary-to-grid-sink-httpfile-connector` encountered errors while committing its offsets because its consumer was no longer a member of the consumer group.
- A consumer loses group membership if the group coordinator does not receive a heartbeat from it within the session timeout interval.
- To mitigate this issue, they proposed:
- 5. For long-term functionality, the team suggested examining the client’s code for bugs that could cause significantly longer processing times for certain messages read from the queue.
Parameter names: max.poll.records, session.timeout.ms
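As an illustration of how these two parameters might be set for a single sink connector, the sketch below builds connector-level consumer overrides. It assumes the Connect worker permits per-connector client overrides (the worker's `connector.client.config.override.policy` setting), in which case consumer properties take the `consumer.override.` prefix. The starting poll size of 50 comes from the Solution section; the session timeout of 60000 ms and the function name are hypothetical and are not values recommended by the team.

```python
import json


def build_consumer_overrides(current_max_poll_records, session_timeout_ms):
    """Return connector-level consumer overrides with the poll size halved."""
    if current_max_poll_records < 2:
        raise ValueError("poll size is already at its minimum")
    return {
        # Halve the batch size as a first remediation step.
        "consumer.override.max.poll.records": str(current_max_poll_records // 2),
        # Hypothetical value; choose one the brokers will accept.
        "consumer.override.session.timeout.ms": str(session_timeout_ms),
    }


if __name__ == "__main__":
    overrides = build_consumer_overrides(50, 60000)
    print(json.dumps(overrides, indent=2))
```

The resulting keys would be merged into the connector's existing configuration rather than replacing it.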
Step 2 – Meeting with the client:
Our expert team requested a meeting with the client and discussed the following topics:
- Issue Identification: The client described the grid connector’s continued failure to read records despite the parameter changes and asked the team for help. The team asked the client to share their screen.
- Technical Issues and Troubleshooting: An expert showed logs revealing frequent consumer group disconnections and timeout errors due to slow message processing. Recommendations included adjusting configuration settings, monitoring resources and network throughput, and enabling verbose logging. The team agreed to enable debug logs before changing parameters, with follow-up steps to depend on the findings, either by sharing reports or by scheduling another call.
- Consumer Behavior Analysis and Resource Monitoring: The client presented data on inconsistent consumer presence without actual consumption. The expert suggested possible causes such as slow consumer processing, resource constraints, or network issues.
- Azure Metrics: The client shared Azure metrics showing no critical resource issues and offered to provide debug logs for further investigation.
Step 3 – Further investigation:
In the next step, the client provided the requested log files: connect3.log, connect3a.log, connect3b.log, and connect3d.log. The logs showed that heartbeats for the eoc-data-summary-to-grid-sink-httpfile-connector were sent every 3 seconds and acknowledged, as indicated by multiple ‘Sending Heartbeat request…’ and ‘Received successful Heartbeat response…’ entries.
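The heartbeat cadence described above can be checked mechanically rather than by eye. The sketch below is one possible approach in Python: it assumes every log line starts with a timestamp in the same format as the error line in the Problem section (`2023-11-30 21:41:58,896`) and that heartbeat entries contain the text `Sending Heartbeat request`. The function name is hypothetical.

```python
from datetime import datetime

# Timestamp layout assumed from the error line quoted in the Problem section.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
TIMESTAMP_LENGTH = 23  # len("2023-11-30 21:41:58,896")


def heartbeat_gaps(lines, marker="Sending Heartbeat request"):
    """Return the seconds elapsed between consecutive heartbeat log lines."""
    stamps = [
        datetime.strptime(line[:TIMESTAMP_LENGTH], TIMESTAMP_FORMAT)
        for line in lines
        if marker in line
    ]
    return [
        (later - earlier).total_seconds()
        for earlier, later in zip(stamps, stamps[1:])
    ]
```

A gap approaching the configured session.timeout.ms would point to missed heartbeats; uniform gaps of about 3 seconds, as reported here, would suggest the heartbeat itself is not the cause.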
Solution:
To resolve the issue with the connect-eoc-data-summary-to-grid-sink-httpfile-connector not reading records and failing to commit offsets, the team recommended reducing the maximum poll size (max.poll.records) from 50, initially by half. They also suggested increasing the session timeout (session.timeout.ms) and examining the consumer code for bugs that cause slow processing, which might disrupt the heartbeat signal. Additionally, they advised enabling verbose logging and monitoring the system’s performance to identify and address the underlying causes, with the aim of both immediate resolution and long-term stability.
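One way to reason about why a smaller poll size can help: a Kafka consumer must call poll again within `max.poll.interval.ms` (default 300000 ms) or it leaves the group, so a whole batch has to be processed inside that window. This setting is not named in the case notes and is brought in here as an assumption about the likely mechanism. The sketch below checks whether a batch fits the window; the per-record time of 8000 ms is a made-up figure for illustration, not a measurement from this case.

```python
# Kafka consumer default for max.poll.interval.ms.
DEFAULT_MAX_POLL_INTERVAL_MS = 300_000


def batch_fits_poll_interval(
    max_poll_records,
    per_record_ms,
    max_poll_interval_ms=DEFAULT_MAX_POLL_INTERVAL_MS,
):
    """Return True if a full batch can be processed before the next poll is due."""
    return max_poll_records * per_record_ms < max_poll_interval_ms


def largest_safe_batch(
    per_record_ms,
    max_poll_interval_ms=DEFAULT_MAX_POLL_INTERVAL_MS,
    headroom=0.5,
):
    """Return the largest batch size that uses at most `headroom` of the window."""
    if per_record_ms <= 0:
        raise ValueError("per_record_ms must be positive")
    return max(1, int(max_poll_interval_ms * headroom // per_record_ms))


if __name__ == "__main__":
    # Hypothetical worst case: 8 seconds to deliver one record downstream.
    print(batch_fits_poll_interval(50, 8000))
    print(batch_fits_poll_interval(25, 8000))
    print(largest_safe_batch(8000))
```

With these illustrative numbers, a batch of 50 would overrun the window while a batch of 25 would not, which is consistent with halving the poll size as a first step.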
Conclusion:
The client’s issue with the connect-eoc-data-summary-to-grid-sink-httpfile-connector presented as a consumer that stopped reading records and committing offsets, which led to increasing lag and a CommitFailedException after the consumer lost its group membership. The team’s initial recommendations included reducing the poll size, adjusting the session timeout, and examining the consumer code for potential bugs. Although heartbeats were sent correctly every 3 seconds, the consumer was still disconnected from the group, suggesting that slow processing or other factors might be causing the issue. Further analysis and adjustments to parameters and code were proposed for both immediate resolution and long-term stability.