Worker goes idle forever #20

Closed
klesniewski opened this issue Oct 12, 2018 · 5 comments
klesniewski commented Oct 12, 2018

In one of our applications, we have observed that DynamoDB Streams processing sometimes stops until the application is restarted. The first time this happened it caused quite a headache, as we discovered it more than 24 hours later (some data was no longer available in the stream). Now, with monitoring in place, we can see it happens every few days (four times so far). We have observed the following:

  • It starts idling after reaching SHARD_END (though not every time). The RecordProcessor is shut down with status TERMINATE and no new RecordProcessor is created. ShutdownTask does not report CreateLeases metrics, which it usually does.
  • When idling, there is no RecordProcessor thread and the worker repeatedly logs that it has "No activities assigned". We can see in the lease table that there is only one shard, with its checkpoint at SHARD_END. When refreshing the table, we can see that leaseCounter gets incremented. The TakeLeases and RenewAllLeases operations keep running successfully (by successfully I mean they report success in metrics). LeaseTaker sees no new shards to take.
  • After a restart, new shards are added to the lease table with checkpoint at TRIM_HORIZON; one is a child of the shard with checkpoint at SHARD_END and the parent of the other shard with the TRIM_HORIZON checkpoint. The application resumes processing where it left off (or at the oldest available data).

Checking the KCL library implementation, we noticed that LeaseTaker will take new leases only if they are available in the lease table. Discovering new shards and inserting their leases into the lease table happens on only two occasions: on worker initialization and on reaching a shard end. We suspect that sometimes, when a shard end is reached and the shards are listed, information about the new shards is not yet available. Because of that, no new shards are inserted into the lease table, so LeaseTaker never sees them. As no shard is being consumed, no shard end is ever reached, no shards are ever inserted into the lease table, and the worker stays idle forever. With more than one worker instance the problem is probably less visible, since shards will be synced again when another worker finishes its shard, unlocking the idle worker. Nevertheless, there will be a period where a worker is idle because shards are not in sync in the lease table.
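
To make the suspected sequence concrete, here is a minimal, self-contained sketch (hypothetical names and simplified types, not the actual KCL or adapter code):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the suspected race: leases are created only from whatever the shard listing
// returns at the moment a shard ends; if the child shards are not visible yet, nothing new
// is inserted, LeaseTaker finds nothing to take, and the sync is never triggered again.
public class ShardSyncSketch {

    // Stand-in for the lease table: shard IDs that already have a lease.
    private final Set<String> leaseTable = new HashSet<>(Set.of("shardId-0001"));

    // Stand-in for DescribeStream: eventually consistent, may omit brand-new child shards.
    private List<String> listShards(boolean childVisibleYet) {
        return childVisibleYet
                ? List.of("shardId-0001", "shardId-0002")
                : List.of("shardId-0001");
    }

    // Shard/lease sync runs only on worker initialization and on reaching SHARD_END.
    private void syncLeases(boolean childVisibleYet) {
        for (String shardId : listShards(childVisibleYet)) {
            leaseTable.add(shardId); // no-op for shards that already have a lease
        }
    }

    public static void main(String[] args) {
        ShardSyncSketch worker = new ShardSyncSketch();
        worker.syncLeases(false); // shardId-0001 hits SHARD_END, child not listed yet
        // Only the finished shard is in the lease table; with no shard left to consume,
        // no further SHARD_END (and hence no further sync) will ever happen on this worker.
        System.out.println("Leases after sync: " + worker.leaseTable);
    }
}
```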

I am not sure whether this issue belongs to the KCL library or to the DynamoDB Streams Kinesis Adapter. It seems KCL works under the assumption that information about new shards is always available before a shard end is reached. I don't know whether this assumption is intentional and violated by the Adapter, or whether the assumption is wrong and has to be fixed in KCL. Therefore I created this issue in both projects. The same issue in the other project: awslabs/amazon-kinesis-client#442

Libraries used:

  • com.amazonaws:dynamodb-streams-kinesis-adapter:1.4.0
  • com.amazonaws:amazon-kinesis-client:1.9.0
@parijatsinha (Contributor) commented

We are aware of this issue (leases not created due to a delay in shards appearing in the Streams metadata) and released a fix in v1.4.0. Are you initializing your worker using the recommended factory method mentioned in the Readme?
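
For reference, worker initialization via the factory looks roughly like the sketch below. The client and configuration construction are unchanged from a typical KCL setup, and the exact overload and record-processor-factory interface should be taken from the Readme:

```java
import com.amazonaws.services.cloudwatch.AmazonCloudWatch;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.streamsadapter.AmazonDynamoDBStreamsAdapterClient;
import com.amazonaws.services.dynamodbv2.streamsadapter.StreamsWorkerFactory;
import com.amazonaws.services.kinesis.clientlibrary.interfaces.v2.IRecordProcessorFactory;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker;

public class WorkerSetup {
    // Build the worker via the adapter's factory instead of constructing Worker directly,
    // so the adapter can plug in its own shard-sync handling. Clients and config are
    // assumed to be created the same way as in the existing application.
    public static Worker createWorker(IRecordProcessorFactory recordProcessorFactory,
                                      KinesisClientLibConfiguration workerConfig,
                                      AmazonDynamoDBStreamsAdapterClient adapterClient,
                                      AmazonDynamoDB dynamoDBClient,
                                      AmazonCloudWatch cloudWatchClient) {
        return StreamsWorkerFactory.createDynamoDbStreamsWorker(
                recordProcessorFactory,
                workerConfig,
                adapterClient,
                dynamoDBClient,
                cloudWatchClient);
    }
}
```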

@klesniewski (Author) commented

Thank you for your fast response! Great to know you already have a fix for it. We are not using the factory mentioned in the Readme, but I will give it a try now. If I understand correctly, when the Proxy is used, it will detect the case where some new shards are not returned and will retry a few more times before returning, so that the new shards are included. Is that more or less correct?
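
To check my understanding, the behavior I expect from the proxy is roughly this (purely illustrative, not the adapter's actual code or API):

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.function.Supplier;

public class ConsistentShardListing {
    // Retry the shard listing with a small backoff until the shard graph looks complete
    // (e.g. no closed shard is missing its children), instead of trusting the first,
    // possibly stale, response.
    static List<String> listUntilConsistent(Supplier<List<String>> listShards,
                                            Predicate<List<String>> looksComplete,
                                            int maxRetries,
                                            long backoffMillis) throws InterruptedException {
        List<String> shards = listShards.get();
        for (int attempt = 0; attempt < maxRetries && !looksComplete.test(shards); attempt++) {
            Thread.sleep(backoffMillis); // back off before re-listing
            shards = listShards.get();
        }
        return shards; // best effort after maxRetries attempts
    }
}
```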

Could you please update the documentation? I was following the Walkthrough there, but it does not use the added and recommended worker factory. People may run into the same problem in the future.

@klesniewski (Author) commented

The fix has been in production for nearly a week now. Since then, the issue has not appeared. We can see in the logs that, over the last 3 days, the added proxy spotted and resolved inconsistencies roughly once a day.

2018-10-16 03:35:26,939 DEBUG: Building shard graph snapshot; total shard count: 8
2018-10-16 03:35:26,939  INFO: Inconsistency resolution retry attempt: 0. Backing off for 934 millis.
2018-10-16 03:35:27,873  WARN: Inconsistent shard graph state detected. Fetched: 8 shards. Closed leaves: 1 shards
2018-10-16 03:35:27,873 DEBUG: Following leaf node shards are closed: shardId-********************-c685d878
2018-10-16 03:35:27,883 DEBUG: Attempting to resolve inconsistencies in the graph with the following shards:
 shardId-********************-611e6f95
2018-10-16 03:35:27,883 DEBUG: Resolving inconsistencies in shard graph; total shard count: 9
2018-10-16 03:35:27,883  INFO: An intermediate page in DescribeStream response resolved inconsistencies. Total retry attempts taken to resolve inconsistencies: 1
2018-10-16 03:35:27,883 DEBUG: Num shards: 9

I think we can consider the problem resolved. Thank you guys for fixing it! I will leave the issue open as a reminder to update the documentation.

@parijatsinha (Contributor) commented

I have requested that the documentation/walkthrough be updated.
