
Faster acquisition of leases from workers that have been gracefully shut down? #845

Open
rtyley opened this issue Aug 6, 2021 · 4 comments


@rtyley
Contributor

rtyley commented Aug 6, 2021

New instances seem to take a relatively long time to acquire leases after old instances have stopped renewing them (seen in KCL 2.3.6, probably other versions too). Apart from incremental lease stealing, the new instances seem to have to wait for the full failoverTimeMillis (referred to as leaseDurationMillis within DynamoDBLeaseCoordinator and leaseDurationNanos within DynamoDBLeaseTaker) before a lease is considered expired and they can take it. With failoverTimeMillis set to just 30s, I've seen new instances take ~3 minutes to take all leases from old instances in a deployment (the old instances were fully terminated by 16:57:40 in the graph below, but the new record processors weren't all initialised until 17:00:35).

[graph: record-processor lease handover during the deployment described above]

Although the failover timeout is obviously good for handling ShardRecordProcessors that become unresponsive, if an instance is shutting down smoothly (e.g. scheduler.startGracefulShutdown() has been called), couldn't it clearly indicate that the old scheduler is no longer responsible for its leases by invoking evictLease() (setting the lease owner to null) on the leases it still holds during graceful shutdown? This would happen after shutdownRequested() is called on the ShardRecordProcessor and before leaseLost(), possibly in s.a.k.lifecycle.ShutdownTask.call()?
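
As a rough sketch of what I mean (the method and type names below are just my reading of the 2.x lease classes - `LeaseCoordinator.getAssignments()` and `LeaseRefresher.evictLease()` - so please treat this as illustrative rather than a concrete patch):

```java
// Hypothetical hook in the graceful-shutdown path (perhaps ShutdownTask.call()):
// after shutdownRequested() and before leaseLost(), give up ownership of any
// leases this worker still holds, so other workers don't have to wait
// failoverTimeMillis for them to expire.
static void releaseLeasesOnGracefulShutdown(LeaseCoordinator leaseCoordinator,
                                            LeaseRefresher leaseRefresher) {
    for (Lease lease : leaseCoordinator.getAssignments()) {
        try {
            leaseRefresher.evictLease(lease); // sets leaseOwner to null in the lease table
        } catch (Exception e) {
            // best effort: if eviction fails, the lease simply expires after
            // failoverTimeMillis, exactly as it does today
        }
    }
}
```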

Having done this, DynamoDBLeaseTaker.takeLeases() could be more acquisitive: as well as taking expired leases, it could take unassigned ones too (see #848) - so when old processors shut down gracefully, the handover could be much quicker than waiting for failoverTimeMillis.
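
In `DynamoDBLeaseTaker` terms, the check I'm imagining is roughly this (a sketch assuming the `leaseOwner()` accessor and `Lease.isExpired(leaseDurationNanos, asOfNanos)`; the helper name is mine):

```java
// A lease is available to this worker if it has expired (not renewed within
// failoverTimeMillis) OR it has no owner at all (e.g. because the previous
// owner evicted it during a graceful shutdown).
private boolean isAvailableForTaking(Lease lease, long nowNanos, long leaseDurationNanos) {
    boolean unassigned = lease.leaseOwner() == null;
    boolean expired = lease.isExpired(leaseDurationNanos, nowNanos);
    return unassigned || expired;
}
```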

Does this make sense?! Or could it be plagued by all sorts of race conditions or complexity that you're probably very careful to avoid?! Just from an educational point of view, I'd be interested to learn what problems the approach has.

rtyley added a commit to rtyley/amazon-kinesis-client that referenced this issue Aug 8, 2021
…taken

If a lease is 'unassigned' (it has no lease owner) then it should be
considered available for taking in `DynamoDBLeaseTaker`. Prior to this
change, the only ways `DynamoDBLeaseTaker` could take leases for a
scheduler were incremental lease stealing, or waiting for the lease to
expire by not having been updated within `failoverTimeMillis` - which
could be slow if `failoverTimeMillis` was set reasonably high (with it
set to just 30s, I've seen new instances take over 3 minutes to take
all leases from old instances in a deployment).

This would be one half of a fix for
awslabs#845 - the other
half of the fix is invoking `evictLease()` (setting the lease owner to
null) on graceful shutdown of a scheduler.
@bjrara

bjrara commented Jan 18, 2022

Since failoverTimeMillis is set to 30s, all the leases should be treated as expired after that duration if the previous owner fails to renew them.
In the case described here, the old instances stopped at 16:57:40, meaning their leases were available to be picked up at 16:58:10. However, according to the graph, the new instances actually started taking more leases at 16:58:30, and for one instance it happened as late as 16:59:30 - a gap of 50 to 110 seconds.

So I think there are other reasons causing the slow acquisition.

One thing I noticed is how often the lease taker is scheduled to run. IIUC, the lease-taking interval is by default 2x failoverTimeMillis - in your case, 1 minute.

If all the new instances happened to run LeaseTaker just before the leases expired (16:58:10), we would have to wait until 16:59:10 to observe the first lease acquisition event.

That may help explain why 2 of the new instances started taking leases around 16:58:30.
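
To put rough numbers on it (assuming the interval really is 2x failoverTimeMillis): the worst case for a single lease is about failoverTimeMillis + takerInterval = 30s + 60s = 90s after the old owner's last renewal, which is in the same ballpark as the 50 to 110 second gaps above.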

However, it seems to me that the new instances didn't consider all the leases expired at 16:58:30, and the leases were not balanced among the 3 workers. This is not what I would expect.

It would be very helpful if you could post more logs from LeaseTaker to help us understand how the cluster made its decisions at that time.

@rtyley
Contributor Author

rtyley commented Jan 18, 2022

Thanks for your response @bjrara! If you have time, I'd be very grateful if you could review PR #848, as I believe it could be a good solution to the failover problem for the common case where current lease-holders shut down gracefully, while retaining the current failover-timeout behaviour for cases where lease-holders do not shut down gracefully.

It would be very helpful if you could post more logs from LeaseTaker to help us understand how the cluster made its decisions at that time.

I don't have the logs from the deploy mentioned in the original description for this issue (where failoverTimeMillis was 30s) but I have grabbed current production logs from a fresh deploy today (where failoverTimeMillis has been reduced to the default value of 10s to ameliorate this issue).

[graph: lease handover during today's deploy, with failoverTimeMillis reduced to 10s]

  • 10:27:04 - Last of the old workers (i-0b7f310fbf75d6d58) issues its last log message of any kind (GracefulShutdownCallable has completed)
  • 10:27:31 - New worker i-07ad2e8efa0b53d5d finally picks up the last of the remaining leases

The issue here is less pronounced because failoverTimeMillis is lower, but it's still present: 27s elapses between the last old EC2 instance terminating (at 10:27:04) and all shards being processed by the new EC2 instances (10:27:31). The delay is roughly 3x what a failoverTimeMillis of 10s might suggest, though given the expression you point out for calculating takerIntervalMillis, that behaviour would be expected - the expression has been around for 9 years, since KCL v1.0.0, but I haven't seen any explanation of why it's a good idea.
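
For reference, the expression I mean (as I read DynamoDBLeaseCoordinator - quoting from memory, so treat the exact form as approximate) is something like:

```java
// epsilonMillis is a small constant, so the lease taker effectively runs
// about every 2 x failoverTimeMillis
takerIntervalMillis = (leaseDurationMillis + epsilonMillis) * 2;
```

With failoverTimeMillis = 10s that gives a taker interval of roughly 20s, so the worst case for a single lease is around 10s + 20s = 30s - consistent with the 27s observed above.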

The thing is, I don't believe there's really any reason we have to wait for failoverTimeMillis to elapse when an old worker has shut down gracefully - why can't the old worker just mark the lease as no-longer-owned (i.e. unset leaseOwner in the lease table) during its graceful shutdown, and newer workers consider leases without owners as available to take, as in #848?

@bjrara

bjrara commented Jan 18, 2022

@rtyley I agree that lease eviction can definitely help expedite the process of lease transfer. Just to let you know, I'm not a maintainer of this repository, but I'd be happy to add my vote to your PR if that helps when the KCL team evaluates the changes.

@mathieufortin01

Hey there. Any updates on your side regarding this situation? We suffer from this in production, and I'm leaning toward integrating your PR, as well as completing the work described in your original comment, and seeing where that gets us.
