Missing shards after Kinesis stream resharding #339

Open
Jackyjjc opened this issue May 16, 2018 · 7 comments

Comments

@Jackyjjc

We scaled up our Kinesis stream in us-east-1 from 340 shards to 435 shards yesterday. We kept our KCL service running throughout the resharding. After the previous open shards were closed, we found that the KCL service was not processing 5 of the new shards.

In the logs we can see a lot of messages like:
"Cannot get the shard for this ProcessTask, so duplicate KPL user records in the event of resharding will not be dropped during deaggregation of Amazon Kinesis records"
as well as
"Cannot find the shard given the shardId shardId-000000003769"

The shards it reports it cannot find are some of the new shards, not the old, closed ones. I've redeployed the KCL service and now it can find all the shards.

The version of the KCL we are running is 1.9.1.
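For anyone trying to confirm a gap like this, here is a rough sketch that diffs the live shard list against the KCL lease table. The stream name and lease-table name are placeholders for your own values, and it assumes the default lease schema where the hash key is `leaseKey`:

```java
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.ScanRequest;
import com.amazonaws.services.dynamodbv2.model.ScanResult;
import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
import com.amazonaws.services.kinesis.model.ListShardsRequest;
import com.amazonaws.services.kinesis.model.ListShardsResult;
import com.amazonaws.services.kinesis.model.Shard;

import java.util.HashSet;
import java.util.Set;

public class LeaseGapCheck {
    public static void main(String[] args) {
        String streamName = "my-stream";  // placeholder: your stream name
        String leaseTable = "my-kcl-app"; // placeholder: lease table = KCL application name

        AmazonKinesis kinesis = AmazonKinesisClientBuilder.defaultClient();
        AmazonDynamoDB dynamo = AmazonDynamoDBClientBuilder.defaultClient();

        // Collect every shard id Kinesis reports for the stream.
        Set<String> shardIds = new HashSet<>();
        ListShardsRequest req = new ListShardsRequest().withStreamName(streamName);
        ListShardsResult res;
        do {
            res = kinesis.listShards(req);
            for (Shard s : res.getShards()) {
                shardIds.add(s.getShardId());
            }
            // Subsequent pages must use only the next token, not the stream name.
            req = new ListShardsRequest().withNextToken(res.getNextToken());
        } while (res.getNextToken() != null);

        // Collect every leaseKey (shard id) present in the KCL lease table.
        Set<String> leasedShardIds = new HashSet<>();
        ScanRequest scan = new ScanRequest()
                .withTableName(leaseTable)
                .withProjectionExpression("leaseKey");
        ScanResult scanResult;
        do {
            scanResult = dynamo.scan(scan);
            scanResult.getItems().forEach(item ->
                    leasedShardIds.add(item.get("leaseKey").getS()));
            scan.setExclusiveStartKey(scanResult.getLastEvaluatedKey());
        } while (scanResult.getLastEvaluatedKey() != null
                && !scanResult.getLastEvaluatedKey().isEmpty());

        // Anything left over is a shard Kinesis knows about but no worker has leased.
        shardIds.removeAll(leasedShardIds);
        System.out.println("Shards with no lease: " + shardIds);
    }
}
```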

@Jackyjjc
Author

The missing shards are not the ones reported by the KCL in the log messages.

@ghost

ghost commented May 16, 2018

This sounds very similar to what we see from time to time. We've seen it correlate with resharding, but also a random instance will stop processing all of the shards it holds leases for until a restart. Check out this open issue for more: #185

@sahilpalvia
Contributor

Are you using the Java KCL or the Multilang KCL?

The messages in the logs that you are seeing are warning messages coming from ProcessTask and KinesisProxy. They wouldn't cause your ShardConsumer to be blocked.

@Jackyjjc
Author

@sahilpalvia We are using the Java KCL. We don't see any other messages that indicate the ShardConsumer is stuck, other than those two messages flooding the log files.

@pfifer
Contributor

pfifer commented May 18, 2018

The message you're seeing is from the code that handles KPL messages. When you redeployed, the KCL ran ListShards again, which fixed the KPL messages. The fact that the KPL messages were not clearing up is somewhat worrying; they should resolve once the KCL gets a full shard map.

To add to #185, there is one thing to remember: the lease renewer doesn't check that the record processor is working or making progress. This is partly because the lease renewer doesn't know how long the record processor could block.
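Since the lease renewer won't flag a stuck processor for you, one option is to track progress yourself. A minimal sketch against the v2 IRecordProcessor interface; the watchdog thread that polls millisSinceLastProgress() and alarms (or restarts the worker) is left out:

```java
import com.amazonaws.services.kinesis.clientlibrary.interfaces.v2.IRecordProcessor;
import com.amazonaws.services.kinesis.clientlibrary.types.InitializationInput;
import com.amazonaws.services.kinesis.clientlibrary.types.ProcessRecordsInput;
import com.amazonaws.services.kinesis.clientlibrary.types.ShutdownInput;

import java.util.concurrent.atomic.AtomicLong;

// Wraps a real processor and stamps a timestamp on every batch, so an
// external watchdog thread can alert when a shard stops making progress.
public class ProgressTrackingProcessor implements IRecordProcessor {
    private final IRecordProcessor delegate;
    private final AtomicLong lastProgressMillis = new AtomicLong(System.currentTimeMillis());

    public ProgressTrackingProcessor(IRecordProcessor delegate) {
        this.delegate = delegate;
    }

    @Override
    public void initialize(InitializationInput initializationInput) {
        delegate.initialize(initializationInput);
    }

    @Override
    public void processRecords(ProcessRecordsInput processRecordsInput) {
        delegate.processRecords(processRecordsInput);
        lastProgressMillis.set(System.currentTimeMillis());
    }

    @Override
    public void shutdown(ShutdownInput shutdownInput) {
        delegate.shutdown(shutdownInput);
    }

    public long millisSinceLastProgress() {
        return System.currentTimeMillis() - lastProgressMillis.get();
    }
}
```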

@BobbyJohansen

We see the same effect here. Some of the shards will stop processing at what feels like random times, but also when we reshard. The KCL seems not to want to process certain shards after it mistakenly loses a lease. We have a 64-shard stream.

@toadzky

toadzky commented Nov 12, 2019

I'm seeing this with the KCL on DynamoDB Streams as well, version 1.13.0. The logs I have come in 3 flavors:

Cannot find the shard given the shardId shardId-00000001573573543850-3fa5da13
Cannot get the shard for this ProcessTask, so duplicate KPL user records in the event of resharding will not be dropped during deaggregation of Amazon Kinesis records.
Inconsistent shard graph state detected. Fetched: 25 shards. Closed leaves: 1 shards

When it happens, we have to restart the services using KCL on the dynamo stream to make the errors stop. Since it's generating a ton of log traffic, it's kind of an expensive error.
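If a full service restart is too blunt, one workaround is to bounce only the Worker in-process once the warnings start flooding. A rough sketch, assuming KCL 1.7.1+ (where Worker.startGracefulShutdown() is available) and a hypothetical workerFactory standing in for however you construct your Worker:

```java
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker;

import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class WorkerRestarter {
    private final Supplier<Worker> workerFactory; // hypothetical: however you build your Worker today
    private Worker worker;

    public WorkerRestarter(Supplier<Worker> workerFactory) {
        this.workerFactory = workerFactory;
    }

    public void start() {
        worker = workerFactory.get();
        new Thread(worker, "kcl-worker").start(); // Worker implements Runnable
    }

    // Bounce the Worker without killing the JVM. The replacement Worker
    // rebuilds its shard map from scratch, the same way a redeploy does.
    public void restart() throws Exception {
        Future<Boolean> shutdown = worker.startGracefulShutdown();
        shutdown.get(60, TimeUnit.SECONDS); // give in-flight processors time to checkpoint
        start();
    }
}
```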
