
Faster acquisition of leases from workers that have been gracefully shut down? #845

Open
rtyley opened this issue Aug 6, 2021 · 4 comments


@rtyley
Contributor

rtyley commented Aug 6, 2021

New instances seem to take a relatively long time to acquire leases after old instances have stopped renewing them (seen in KCL 2.3.6, probably other versions too). Apart from incremental lease stealing, the new instances seem to have to wait for the full failoverTimeMillis (referred to as leaseDurationMillis within DynamoDBLeaseCoordinator and leaseDurationNanos within DynamoDBLeaseTaker) before a lease is considered expired and they can take it. With failoverTimeMillis set to just 30s, I've seen new instances take ~3 minutes to take all leases from old instances in a deployment (the old instances were fully terminated by 16:57:40 in the graph below, but the new record processors weren't all initialised until 17:00:35).

[graph: record-processor lease handover during the deployment described above]

Although the failover timeout is obviously good for handling ShardRecordProcessors that become unresponsive, if an instance is shutting down smoothly (e.g. scheduler.startGracefulShutdown() has been called), couldn't it clearly indicate that the old scheduler is no longer responsible for its leases by invoking evictLease() (setting the lease owner to null) on the leases it still holds during graceful shutdown? This would happen after shutdownRequested() is called on the ShardRecordProcessor and before leaseLost(), possibly in s.a.k.lifecycle.ShutdownTask.call()?
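
As a rough sketch of what I mean (the method and type names below are just my reading of the 2.x lease classes - `LeaseCoordinator.getAssignments()` and `LeaseRefresher.evictLease()` - so please treat this as illustrative rather than a concrete patch):

```java
// Hypothetical hook in the graceful-shutdown path (perhaps ShutdownTask.call()):
// after shutdownRequested() and before leaseLost(), give up ownership of any
// leases this worker still holds, so other workers don't have to wait
// failoverTimeMillis for them to expire.
static void releaseLeasesOnGracefulShutdown(LeaseCoordinator leaseCoordinator,
                                            LeaseRefresher leaseRefresher) {
    for (Lease lease : leaseCoordinator.getAssignments()) {
        try {
            leaseRefresher.evictLease(lease); // sets leaseOwner to null in the lease table
        } catch (Exception e) {
            // best effort: if eviction fails, the lease simply expires after
            // failoverTimeMillis, exactly as it does today
        }
    }
}
```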

Having done this, DynamoDBLeaseTaker.takeLeases() could be more acquisitive: as well as taking expired leases, it could take unassigned ones too (see #848) - so when old processors shut down gracefully, the handover could be much quicker than waiting for failoverTimeMillis.
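
In `DynamoDBLeaseTaker` terms, the check I'm imagining is roughly this (a sketch assuming the `leaseOwner()` accessor and `Lease.isExpired(leaseDurationNanos, asOfNanos)`; the helper name is mine):

```java
// A lease is available to this worker if it has expired (not renewed within
// failoverTimeMillis) OR it has no owner at all (e.g. because the previous
// owner evicted it during a graceful shutdown).
private boolean isAvailableForTaking(Lease lease, long nowNanos, long leaseDurationNanos) {
    boolean unassigned = lease.leaseOwner() == null;
    boolean expired = lease.isExpired(leaseDurationNanos, nowNanos);
    return unassigned || expired;
}
```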

Does this make sense?! Or could it be plagued by all sorts of race conditions or complexity that you're probably very careful to avoid?! Just from an educational point of view, I'd be interested to learn what problems the approach has.

rtyley added a commit to rtyley/amazon-kinesis-client that referenced this issue Aug 8, 2021
…taken

If a lease is 'unassigned' (it has no lease owner) then it should be
considered available for taking in `DynamoDBLeaseTaker`. Prior to this
change, the only ways `DynamoDBLeaseTaker` could take leases for a
scheduler were incremental lease stealing, or waiting for the lease to
expire by not having been updated within `failoverTimeMillis` - which
could be slow if `failoverTimeMillis` was set reasonably high (with it
set to just 30s, I've seen new instances take over 3 minutes to take
all leases from old instances in a deployment).

This would be one half of a fix for
awslabs#845 - the other
half of the fix is invoking `evictLease()` (setting the lease owner to
null) on graceful shutdown of a scheduler.
@bjrara

bjrara commented Jan 18, 2022

Since failoverTimeMillis is set to 30s, all the leases should be treated as expired after that duration if the previous owner fails to renew them.
In the case described here, the old instances stopped at 16:57:40, meaning their leases were available to be picked up at 16:58:10. However, according to the graph, the new instances actually started taking more leases at 16:58:30, and for one instance it happened as late as 16:59:30 - a gap of 50 to 110 seconds.

So I think there are other reasons causing the slow acquisition.

One thing I noticed is how often the lease taker is scheduled to run. IIUC, the lease-taking interval is by default 2x failoverTimeMillis - in your case, 1 minute.

If all the new instances happened to run LeaseTaker just before the leases expired (16:58:10), we would have to wait until 16:59:10 to observe the first lease acquisition event.

That may help explain why 2 of the new instances started taking leases around 16:58:30.
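
To put rough numbers on it (assuming the interval really is 2x failoverTimeMillis): the worst case for a single lease is about failoverTimeMillis + takerInterval = 30s + 60s = 90s after the old owner's last renewal, which is in the same ballpark as the 50 to 110 second gaps above.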

However, it seems to me that the new instances didn't consider all the leases expired at 16:58:30, and the leases were not balanced among the 3 workers. This is not what I would expect.

It would be very helpful if you could post more logs from LeaseTaker to help us understand how the cluster made its decisions at that time.

@rtyley
Contributor Author

rtyley commented Jan 18, 2022

Thanks for your response @bjrara! If you have time, I'd be very grateful if you could review PR #848, as I believe it could be a good solution to the failover problem for the common case where current lease-holders shut down gracefully, while retaining the current failover-timeout behaviour for cases where lease-holders do not shut down gracefully.

It would be very helpful if you could post more logs from LeaseTaker to help us understand how the cluster made its decisions at that time.

I don't have the logs from the deploy mentioned in the original description for this issue (where failoverTimeMillis was 30s) but I have grabbed current production logs from a fresh deploy today (where failoverTimeMillis has been reduced to the default value of 10s to ameliorate this issue).

[graph: lease handover during today's deploy, with failoverTimeMillis reduced to 10s]

  • 10:27:04 - Last of the old workers (i-0b7f310fbf75d6d58) issues its last log message of any kind (GracefulShutdownCallable has completed)
  • 10:27:31 - New worker i-07ad2e8efa0b53d5d finally picks up the last of the remaining leases

The issue here is less pronounced because failoverTimeMillis is lower, but it's still present: 27s elapses between the last old EC2 instance terminating (at 10:27:04) and all shards being processed by the new EC2 instances (10:27:31). The delay is roughly 3x what a failoverTimeMillis of 10s might suggest, though given the expression you point out for calculating takerIntervalMillis, that behaviour would be expected - the expression has been around for 9 years, since KCL v1.0.0, but I haven't seen any explanation of why it's a good idea.
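
For reference, the expression I mean (as I read DynamoDBLeaseCoordinator - quoting from memory, so treat the exact form as approximate) is something like:

```java
// epsilonMillis is a small constant, so the lease taker effectively runs
// about every 2 x failoverTimeMillis
takerIntervalMillis = (leaseDurationMillis + epsilonMillis) * 2;
```

With failoverTimeMillis = 10s that gives a taker interval of roughly 20s, so the worst case for a single lease is around 10s + 20s = 30s - consistent with the 27s observed above.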

The thing is, I don't believe there's really any reason we have to wait for failoverTimeMillis to elapse when an old worker has shut down gracefully - why can't the old worker just mark the lease as no-longer-owned (i.e. unset leaseOwner in the lease table) during its graceful shutdown, and newer workers consider leases without owners as available to take, as in #848?

@bjrara

bjrara commented Jan 18, 2022

@rtyley I agree that lease eviction can definitely help expedite the process of lease transfer. Just to let you know, I'm not a maintainer of this repository, but I'd be happy to add my vote to your PR if that helps when the KCL team evaluates the changes.

@mathieufortin01

Hey there. Any updates on your side regarding this situation? We suffer from this in production, and I'm leaning toward integrating your PR, as well as completing the work described in your original comment, and seeing where that gets us.
