Missing shards after Kinesis stream resharding #339

Open
Jackyjjc opened this issue May 16, 2018 · 7 comments

Comments

@Jackyjjc

We scaled up our Kinesis stream in us-east-1 from 340 shards to 435 shards yesterday. We kept our KCL service running throughout the resharding. After the previous open shards were closed, we found that the KCL service was not processing 5 of the new shards.

In the logs we can see a lot of messages like:
"Cannot get the shard for this ProcessTask, so duplicate KPL user records in the event of resharding will not be dropped during deaggregation of Amazon Kinesis records"
as well as
"Cannot find the shard given the shardId shardId-000000003769"

The shards it reports it cannot find are some of the new shards, not the old, closed ones. I've redeployed the KCL service and now it can find all the shards.

The version of the KCL we are running is 1.9.1.
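For anyone trying to confirm a gap like this, here is a rough sketch that diffs the live shard list against the KCL lease table. The stream name and lease-table name are placeholders for your own values, and it assumes the default lease schema where the hash key is `leaseKey`:

```java
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.ScanRequest;
import com.amazonaws.services.dynamodbv2.model.ScanResult;
import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
import com.amazonaws.services.kinesis.model.ListShardsRequest;
import com.amazonaws.services.kinesis.model.ListShardsResult;
import com.amazonaws.services.kinesis.model.Shard;

import java.util.HashSet;
import java.util.Set;

public class LeaseGapCheck {
    public static void main(String[] args) {
        String streamName = "my-stream";  // placeholder: your stream name
        String leaseTable = "my-kcl-app"; // placeholder: lease table = KCL application name

        AmazonKinesis kinesis = AmazonKinesisClientBuilder.defaultClient();
        AmazonDynamoDB dynamo = AmazonDynamoDBClientBuilder.defaultClient();

        // Collect every shard id Kinesis reports for the stream.
        Set<String> shardIds = new HashSet<>();
        ListShardsRequest req = new ListShardsRequest().withStreamName(streamName);
        ListShardsResult res;
        do {
            res = kinesis.listShards(req);
            for (Shard s : res.getShards()) {
                shardIds.add(s.getShardId());
            }
            // Subsequent pages must use only the next token, not the stream name.
            req = new ListShardsRequest().withNextToken(res.getNextToken());
        } while (res.getNextToken() != null);

        // Collect every leaseKey (shard id) present in the KCL lease table.
        Set<String> leasedShardIds = new HashSet<>();
        ScanRequest scan = new ScanRequest()
                .withTableName(leaseTable)
                .withProjectionExpression("leaseKey");
        ScanResult scanResult;
        do {
            scanResult = dynamo.scan(scan);
            scanResult.getItems().forEach(item ->
                    leasedShardIds.add(item.get("leaseKey").getS()));
            scan.setExclusiveStartKey(scanResult.getLastEvaluatedKey());
        } while (scanResult.getLastEvaluatedKey() != null
                && !scanResult.getLastEvaluatedKey().isEmpty());

        // Anything left over is a shard Kinesis knows about but no worker has leased.
        shardIds.removeAll(leasedShardIds);
        System.out.println("Shards with no lease: " + shardIds);
    }
}
```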

@Jackyjjc
Author

The missing shards are not the ones reported by the KCL in the log messages.

@ghost

ghost commented May 16, 2018

This sounds very similar to what we see from time to time. We've seen it correlate with resharding, but also a random instance will stop processing all of the shards it holds leases for until a restart. Check out this open issue for more: #185

@sahilpalvia
Contributor

Are you using the Java KCL or the Multilang KCL?

The messages in the logs that you are seeing are warning messages coming from ProcessTask and KinesisProxy. They wouldn't cause your ShardConsumer to be blocked.

@Jackyjjc
Author

@sahilpalvia We are using the Java KCL. We don't see any other messages that indicate the ShardConsumer is stuck, other than those two messages flooding the log files.

@pfifer
Contributor

pfifer commented May 18, 2018

The message you're seeing is from the code that handles KPL messages. When you redeployed, the KCL ran ListShards again, which fixed the KPL messages. The fact that the KPL messages were not clearing up is somewhat worrying; they should resolve once the KCL gets a full shard map.

To add to #185, there is one thing to remember: the lease renewer doesn't check that the record processor is working or making progress. This is partly because the lease renewer doesn't know how long the record processor could block.
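Since the lease renewer won't flag a stuck processor for you, one option is to track progress yourself. A minimal sketch against the v2 IRecordProcessor interface; the watchdog thread that polls millisSinceLastProgress() and alarms (or restarts the worker) is left out:

```java
import com.amazonaws.services.kinesis.clientlibrary.interfaces.v2.IRecordProcessor;
import com.amazonaws.services.kinesis.clientlibrary.types.InitializationInput;
import com.amazonaws.services.kinesis.clientlibrary.types.ProcessRecordsInput;
import com.amazonaws.services.kinesis.clientlibrary.types.ShutdownInput;

import java.util.concurrent.atomic.AtomicLong;

// Wraps a real processor and stamps a timestamp on every batch, so an
// external watchdog thread can alert when a shard stops making progress.
public class ProgressTrackingProcessor implements IRecordProcessor {
    private final IRecordProcessor delegate;
    private final AtomicLong lastProgressMillis = new AtomicLong(System.currentTimeMillis());

    public ProgressTrackingProcessor(IRecordProcessor delegate) {
        this.delegate = delegate;
    }

    @Override
    public void initialize(InitializationInput initializationInput) {
        delegate.initialize(initializationInput);
    }

    @Override
    public void processRecords(ProcessRecordsInput processRecordsInput) {
        delegate.processRecords(processRecordsInput);
        lastProgressMillis.set(System.currentTimeMillis());
    }

    @Override
    public void shutdown(ShutdownInput shutdownInput) {
        delegate.shutdown(shutdownInput);
    }

    public long millisSinceLastProgress() {
        return System.currentTimeMillis() - lastProgressMillis.get();
    }
}
```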

@BobbyJohansen

We see the same effect here. Some of the shards will stop processing at what feels like random times, but also when we reshard. The KCL seems not to want to process certain shards after it mistakenly loses a lease. We have a 64-shard stream.

@toadzky

toadzky commented Nov 12, 2019

I'm seeing this with the KCL on DynamoDB Streams as well, version 1.13.0. The logs I have come in 3 flavors:

Cannot find the shard given the shardId shardId-00000001573573543850-3fa5da13
Cannot get the shard for this ProcessTask, so duplicate KPL user records in the event of resharding will not be dropped during deaggregation of Amazon Kinesis records.
Inconsistent shard graph state detected. Fetched: 25 shards. Closed leaves: 1 shards

When it happens, we have to restart the services using KCL on the dynamo stream to make the errors stop. Since it's generating a ton of log traffic, it's kind of an expensive error.
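If a full service restart is too blunt, one workaround is to bounce only the Worker in-process once the warnings start flooding. A rough sketch, assuming KCL 1.7.1+ (where Worker.startGracefulShutdown() is available) and a hypothetical workerFactory standing in for however you construct your Worker:

```java
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker;

import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class WorkerRestarter {
    private final Supplier<Worker> workerFactory; // hypothetical: however you build your Worker today
    private Worker worker;

    public WorkerRestarter(Supplier<Worker> workerFactory) {
        this.workerFactory = workerFactory;
    }

    public void start() {
        worker = workerFactory.get();
        new Thread(worker, "kcl-worker").start(); // Worker implements Runnable
    }

    // Bounce the Worker without killing the JVM. The replacement Worker
    // rebuilds its shard map from scratch, the same way a redeploy does.
    public void restart() throws Exception {
        Future<Boolean> shutdown = worker.startGracefulShutdown();
        shutdown.get(60, TimeUnit.SECONDS); // give in-flight processors time to checkpoint
        start();
    }
}
```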
