Worker goes idle forever #20

Closed
klesniewski opened this issue Oct 12, 2018 · 5 comments
klesniewski commented Oct 12, 2018

In one of our applications, we have observed that DynamoDB Streams processing sometimes stops until the application is restarted. The first time this happened it caused quite a headache, as we discovered it more than 24 hours later (some data was no longer available in the stream). Now, with monitoring in place, we can see it happens every few days (four times so far). We have observed the following:

  • It starts idling after reaching SHARD_END (though not every time). The RecordProcessor is shut down with status TERMINATE and no new RecordProcessor is created. ShutdownTask does not report CreateLeases metrics, which it usually does.
  • When idling, there is no RecordProcessor thread and the worker repeatedly logs that it has "No activities assigned". We can see in the lease table that there is only one shard, with its checkpoint at SHARD_END. When refreshing the table, we can see that leaseCounter gets incremented. The TakeLeases and RenewAllLeases operations keep running successfully (by successfully I mean they report success in metrics). LeaseTaker sees no new shards to take.
  • After a restart, new shards are added to the lease table with checkpoint at TRIM_HORIZON; one is a child of the shard with checkpoint at SHARD_END and the parent of the other shard with the TRIM_HORIZON checkpoint. The application resumes processing where it left off (or at the oldest available data).

Checking the KCL library implementation, we noticed that LeaseTaker will take new leases only if they are available in the lease table. Discovering new shards and inserting their leases into the lease table happens on only two occasions: on worker initialization and on reaching a shard end. We suspect that sometimes, when a shard end is reached and the shards are listed, information about the new shards is not yet available. Because of that, no new shards are inserted into the lease table, so LeaseTaker never sees them. As no shard is being consumed, no shard end is ever reached, no shards are ever inserted into the lease table, and the worker stays idle forever. With more than one worker instance the problem is probably less visible, since shards will be synced again when another worker finishes its shard, unlocking the idle worker. Nevertheless, there will be a period where a worker is idle because shards are not in sync in the lease table.
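
To make the suspected sequence concrete, here is a minimal, self-contained sketch (hypothetical names and simplified types, not the actual KCL or adapter code):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the suspected race: leases are created only from whatever the shard listing
// returns at the moment a shard ends; if the child shards are not visible yet, nothing new
// is inserted, LeaseTaker finds nothing to take, and the sync is never triggered again.
public class ShardSyncSketch {

    // Stand-in for the lease table: shard IDs that already have a lease.
    private final Set<String> leaseTable = new HashSet<>(Set.of("shardId-0001"));

    // Stand-in for DescribeStream: eventually consistent, may omit brand-new child shards.
    private List<String> listShards(boolean childVisibleYet) {
        return childVisibleYet
                ? List.of("shardId-0001", "shardId-0002")
                : List.of("shardId-0001");
    }

    // Shard/lease sync runs only on worker initialization and on reaching SHARD_END.
    private void syncLeases(boolean childVisibleYet) {
        for (String shardId : listShards(childVisibleYet)) {
            leaseTable.add(shardId); // no-op for shards that already have a lease
        }
    }

    public static void main(String[] args) {
        ShardSyncSketch worker = new ShardSyncSketch();
        worker.syncLeases(false); // shardId-0001 hits SHARD_END, child not listed yet
        // Only the finished shard is in the lease table; with no shard left to consume,
        // no further SHARD_END (and hence no further sync) will ever happen on this worker.
        System.out.println("Leases after sync: " + worker.leaseTable);
    }
}
```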

I am not sure whether this issue belongs to the KCL library or to the DynamoDB Streams Kinesis Adapter. It seems KCL works under the assumption that information about new shards is always available before a shard end is reached. I don't know whether this assumption is intentional and violated by the Adapter, or whether the assumption is wrong and has to be fixed in KCL. Therefore I created this issue in both projects. The same issue in the other project: awslabs/amazon-kinesis-client#442

Libraries used:

  • com.amazonaws:dynamodb-streams-kinesis-adapter:1.4.0
  • com.amazonaws:amazon-kinesis-client:1.9.0
@parijatsinha (Contributor) commented

We are aware of this issue (leases not created due to a delay in shards appearing in the Streams metadata) and released a fix in v1.4.0. Are you initializing your worker using the recommended factory method mentioned in the Readme?
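
For reference, worker initialization via the factory looks roughly like the sketch below. The client and configuration construction are unchanged from a typical KCL setup, and the exact overload and record-processor-factory interface should be taken from the Readme:

```java
import com.amazonaws.services.cloudwatch.AmazonCloudWatch;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.streamsadapter.AmazonDynamoDBStreamsAdapterClient;
import com.amazonaws.services.dynamodbv2.streamsadapter.StreamsWorkerFactory;
import com.amazonaws.services.kinesis.clientlibrary.interfaces.v2.IRecordProcessorFactory;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker;

public class WorkerSetup {
    // Build the worker via the adapter's factory instead of constructing Worker directly,
    // so the adapter can plug in its own shard-sync handling. Clients and config are
    // assumed to be created the same way as in the existing application.
    public static Worker createWorker(IRecordProcessorFactory recordProcessorFactory,
                                      KinesisClientLibConfiguration workerConfig,
                                      AmazonDynamoDBStreamsAdapterClient adapterClient,
                                      AmazonDynamoDB dynamoDBClient,
                                      AmazonCloudWatch cloudWatchClient) {
        return StreamsWorkerFactory.createDynamoDbStreamsWorker(
                recordProcessorFactory,
                workerConfig,
                adapterClient,
                dynamoDBClient,
                cloudWatchClient);
    }
}
```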

@klesniewski (Author) commented

Thank you for your fast response! Great to know you already have a fix for it. We are not using the factory mentioned in the Readme, but I will give it a try now. If I understand correctly, when the Proxy is used, it will detect the case where some new shards are not returned and will retry a few more times before returning, so that the new shards are included. Is that more or less correct?
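
To check my understanding, the behavior I expect from the proxy is roughly this (purely illustrative, not the adapter's actual code or API):

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.function.Supplier;

public class ConsistentShardListing {
    // Retry the shard listing with a small backoff until the shard graph looks complete
    // (e.g. no closed shard is missing its children), instead of trusting the first,
    // possibly stale, response.
    static List<String> listUntilConsistent(Supplier<List<String>> listShards,
                                            Predicate<List<String>> looksComplete,
                                            int maxRetries,
                                            long backoffMillis) throws InterruptedException {
        List<String> shards = listShards.get();
        for (int attempt = 0; attempt < maxRetries && !looksComplete.test(shards); attempt++) {
            Thread.sleep(backoffMillis); // back off before re-listing
            shards = listShards.get();
        }
        return shards; // best effort after maxRetries attempts
    }
}
```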

Could you please update the documentation? I was following the Walkthrough there, but it does not use the added and recommended worker factory. People may run into the same problem in the future.

@klesniewski (Author) commented

The fix has been in production for nearly a week now. Since then, the issue has not appeared. We can see in the logs that, over the last 3 days, the added proxy spotted and resolved inconsistencies roughly once a day.

2018-10-16 03:35:26,939 DEBUG: Building shard graph snapshot; total shard count: 8
2018-10-16 03:35:26,939  INFO: Inconsistency resolution retry attempt: 0. Backing off for 934 millis.
2018-10-16 03:35:27,873  WARN: Inconsistent shard graph state detected. Fetched: 8 shards. Closed leaves: 1 shards
2018-10-16 03:35:27,873 DEBUG: Following leaf node shards are closed: shardId-********************-c685d878
2018-10-16 03:35:27,883 DEBUG: Attempting to resolve inconsistencies in the graph with the following shards:
 shardId-********************-611e6f95
2018-10-16 03:35:27,883 DEBUG: Resolving inconsistencies in shard graph; total shard count: 9
2018-10-16 03:35:27,883  INFO: An intermediate page in DescribeStream response resolved inconsistencies. Total retry attempts taken to resolve inconsistencies: 1
2018-10-16 03:35:27,883 DEBUG: Num shards: 9

I think we can consider the problem resolved. Thank you guys for fixing it! I will leave the issue open as a reminder to update the documentation.

@parijatsinha (Contributor) commented

I have requested that the documentation/walkthrough be updated.
