Application didn't checkpoint at end of shard #211

posac · 2017-09-15T12:01:44Z

Hi,

I have issue with checkpoints and ending shards.
In my processor, after each processRecords call I am calling checkpoint with sequence number. Sequence number is also stored in processor field.

On shutdown I am also calling :
if (shutdownInput.getShutdownReason() == ShutdownReason.TERMINATE) {
checkpoint(shutdownInput.getCheckpointer());
}

But after "Reached end of shard " I got exception:
java.lang.IllegalArgumentException: Application didn't checkpoint at end of shard shardId-00000001505383995695-aa7bbe9c at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownTask.call(ShutdownTask.java:107) ~[amazon-kinesis-client-1.8.1.jar:?]

I have debug your code and for me there is bug in : RecordProcessorCheckpointer.java in method advancePosition.

I believe that in condition :
if (extendedSequenceNumber != null && !extendedSequenceNumber.equals(lastCheckpointValue)) {

should be :
if (checkpointToRecord != null && !checkpointToRecord.equals(lastCheckpointValue)) {

Otherwise I am not able to store SHARD_END - cause when processRecord calls checkpoint in checkpoint there is no sequenceNumberAtShardEnd yet, It is when it is called from shutdown but then mentioned condition is not satisfy -> lastCheckpointValue is equal to extendedSequenceNumber

The text was updated successfully, but these errors were encountered:

lskrajny · 2017-10-24T18:31:56Z

+1

versemonger · 2018-04-06T00:20:29Z

+1

klesniewski · 2018-07-21T22:04:49Z

I have also encountered this problem. It is annoying, as when it happens the worker cannot progress causing service downtime (it doesn't crash, but it stops working as expected). It may happen when instance of IRecordProcessor checkpoints specific records (sequence numbers) instead of using IRecordProcessorCheckpointer.checkpoint().

Here is how it is exhibited in logs:

2018-07-20 22:00:20,469 DEBUG      StreamsRecordProcessor> Processing 1 records
2018-07-20 22:00:20,480  INFO KafkaStreamsRecordProcessor> KafkaSendCompleted to 'devices-ingress' at 14:a4
2018-07-20 22:00:20,486  INFO     SequenceNumberValidator> Validated sequence number 1434700000000005227153817 with shard id shardId-00000001532110040782-92c71e5b
2018-07-20 22:00:25,464  INFO                      Worker> Current stream shard assignments: shardId-00000001532110040782-92c71e5b
2018-07-20 22:00:25,464  INFO                      Worker> Sleeping ...
2018-07-20 22:00:35,466  INFO                 ProcessTask> Reached end of shard shardId-00000001532110040782-92c71e5b
2018-07-20 22:00:36,466  INFO      StreamsRecordProcessor> Shutting down with status TERMINATE
2018-07-20 22:00:36,471  INFO     SequenceNumberValidator> Validated sequence number 1434700000000005227153817 with shard id shardId-00000001532110040782-92c71e5b
2018-07-20 22:00:36,471 ERROR                ShutdownTask> Application exception.
java.lang.IllegalArgumentException: Application didn't checkpoint at end of shard shardId-00000001532110040782-92c71e5b
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownTask.call(ShutdownTask.java:113) [service.jar:?]
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:49) [service.jar:?]
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:24) [service.jar:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:844) [?:?]

Here is what happened:

In 20th second new record was picked up from the stream and processed. The record was checkpointed. The shard is still active and so RecordProcessorCheckpointer.sequenceNumberAtShardEnd is null.
The next 14 runs (polling every second) don't return any new records. The shard is still active.
On 15th run the library finds out the shard was closed and that we are at shard end. It calls IRecordProcessor.shutdown(ShutdownInput) with ShutdownReason.TERMINATE. I flush processing and commit the last record I know about. This is the record that was processed in 20th second. This is the last record in the shard. The RecordProcessorCheckpointer.advancePosition(ExtendedSequenceNumber) sets checkpointToRecord to SHARD_END, but mentioned by @posac condition is false (this sequence number was already checkpointed) and the SHARD_END is not checkpointed. The KCL throws IllegalArgumentException and retries in next second. It gets repeated every second and application doesn't progress.

It can be worked around by using IRecordProcessorCheckpointer.checkpoint() in IRecordProcessor.shutdown() , which will checkpoint RecordProcessorCheckpointer.largestPermittedCheckpointValue (it is set to SHARD_END once we are at shard end). If you are processing your records asynchronously, make sure that all records were processed before you commit using any of above workarounds.

Correction suggested by @posac will solve the problem.

versemonger mentioned this issue Jun 6, 2018

Application didn't checkpoint at end of shard #345

Closed

etspaceman mentioned this issue Apr 1, 2019

Scaling Kinesis Shards Yields Errors WW-Digital/reactive-kinesis#37

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Application didn't checkpoint at end of shard #211

Application didn't checkpoint at end of shard #211

posac commented Sep 15, 2017 •

edited

Loading

lskrajny commented Oct 24, 2017

versemonger commented Apr 6, 2018

klesniewski commented Jul 21, 2018 •

edited

Loading

Application didn't checkpoint at end of shard #211

Application didn't checkpoint at end of shard #211

Comments

posac commented Sep 15, 2017 • edited Loading

lskrajny commented Oct 24, 2017

versemonger commented Apr 6, 2018

klesniewski commented Jul 21, 2018 • edited Loading

posac commented Sep 15, 2017 •

edited

Loading

klesniewski commented Jul 21, 2018 •

edited

Loading