Faster acquisition of leases from workers that have been gracefully shutdown? #845
Comments
…taken If a lease is 'unassigned' (it has no lease owner) then it should be considered available for taking in `DynamoDBLeaseTaker`. Prior to this change, the only ways `DynamoDBLeaseTaker` could take leases for a scheduler were either incremental lease stealing, or waiting for the lease to expire by not having been updated within `failoverTimeMillis` - which could be slow if `failoverTimeMillis` was set reasonably high (with it set to just 30s, I've seen new instances take over 3 minutes to take all leases from old instances in a deployment). This is one half of a fix for awslabs#845 - the other half is invoking `evictLease()` (setting the lease owner to null) on graceful shutdown of a scheduler.
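The "expired or unassigned" check the commit message describes can be illustrated with a minimal sketch. This is not the actual KCL code: `LeaseInfo` and `isAvailable` are hypothetical stand-ins for the `Lease` model and the availability check inside `DynamoDBLeaseTaker`, shown only to contrast "expired only" with "expired or unassigned":

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical, simplified model of a lease record; the field names are
// illustrative and do not match the real KCL Lease class.
final class LeaseInfo {
    final String owner;        // null means the lease is unassigned
    final Instant lastRenewal; // last time the owner renewed the lease

    LeaseInfo(String owner, Instant lastRenewal) {
        this.owner = owner;
        this.lastRenewal = lastRenewal;
    }
}

public class LeaseAvailability {
    // A lease is available if it has expired (not renewed within
    // failoverTime) OR -- the change proposed here -- if it is
    // unassigned, e.g. because a graceful shutdown evicted it.
    static boolean isAvailable(LeaseInfo lease, Instant now, Duration failoverTime) {
        boolean unassigned = lease.owner == null;
        boolean expired = lease.lastRenewal.plus(failoverTime).isBefore(now);
        return unassigned || expired;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2022-01-01T17:00:00Z");
        Duration failover = Duration.ofSeconds(30);

        // Evicted on graceful shutdown: owner cleared, immediately takeable.
        System.out.println(isAvailable(new LeaseInfo(null, now.minusSeconds(1)), now, failover));
        // Healthy worker renewed 5s ago: not takeable.
        System.out.println(isAvailable(new LeaseInfo("worker-1", now.minusSeconds(5)), now, failover));
        // Crashed worker, last renewal 60s ago: expired, takeable.
        System.out.println(isAvailable(new LeaseInfo("worker-1", now.minusSeconds(60)), now, failover));
    }
}
```

Without the `unassigned` clause, the first case would wait the full `failoverTimeMillis` even though no worker owns the lease.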
So I think there are other reasons causing the slow acquisition. One thing I noticed is how long the lease taker waits between executions. IIUC, the interval of lease taking is by default 2x `failoverTimeMillis` (see Line 161 in a3e51d5).
If all the new instances had executed the LeaseTaker just before the leases expired (16:58:10), we would have to wait until 16:59:10 to observe the first lease acquisition event. That may help explain why 2 of the new instances started taking leases around 16:58:30. However, it seems to me that the new instances didn't think all the leases were expired at 16:58:30, and the leases were not balanced among the 3 workers - which is beyond my expectation. It would be very helpful if you could post more logs from LeaseTaker to help us understand how the cluster made the decision at that time.
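A back-of-envelope model of the timing described above. This assumes, per the comment, a taker interval of 2x `failoverTimeMillis`; the real KCL interval calculation also includes an epsilon term, so this is an approximation of the worst case, not the library's exact scheduling logic:

```java
public class WorstCaseHandover {
    // Rough worst-case wait before a new worker first attempts to take a
    // lease from a stopped worker, assuming the taker runs every
    // 2 * failoverTimeMillis (a simplifying assumption, see lead-in).
    static long worstCaseWaitMillis(long failoverTimeMillis) {
        long takerIntervalMillis = 2 * failoverTimeMillis;
        // The old worker's last renewal can land just after a taker pass:
        // the lease then needs a full failoverTimeMillis to expire, and the
        // new worker may wait nearly a whole taker interval on top of that
        // before it even scans for expired leases.
        return failoverTimeMillis + takerIntervalMillis;
    }

    public static void main(String[] args) {
        // With failoverTimeMillis = 30s, a single taking round can take
        // ~90s in the worst case -- and rebalancing across several workers
        // takes multiple rounds, consistent with the multi-minute handover
        // described in this issue.
        System.out.println(worstCaseWaitMillis(30_000)); // 90000
    }
}
```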
Thanks for your response @bjrara! If you have time, I'd be very grateful if you could review PR #848, as I believe it could be a good solution to the failover problem for the common case where current lease-holders shut down gracefully, while retaining the current failover-timeout behaviour for cases where lease-holders do not shut down gracefully.
I don't have the logs from the deploy mentioned in the original description for this issue (where …). The issue here is less exaggerated because the … The thing is, I don't believe there's really any reason why we have to wait for …
@rtyley I agree that lease eviction can definitely help expedite the process of lease transfer. Just to let you know, I'm not a maintainer of this repository, but I'd like to add my vote on your PR if that could help when the KCL team evaluates the changes.
Hey there. Any updates on your side regarding this situation? We suffer from this in production, and I'm leaning toward integrating your PR, as well as completing the work described in your original comment, and seeing where that takes us.
New instances seem to take a relatively long time to acquire leases after old instances have stopped renewing them (seen in KCL 2.3.6, probably other versions too) - apart from incremental lease stealing, the new instances seem to have to wait for the full `failoverTimeMillis` (which is referred to as `leaseDurationMillis` within `DynamoDBLeaseCoordinator` & `leaseDurationNanos` within `DynamoDBLeaseTaker`) before a lease is considered expired and they can take it. With `failoverTimeMillis` set to just 30s, I've seen new instances take ~3 minutes to take all leases from old instances in a deployment (old instances were fully terminated by 16:57:40 in the graph below, but the new record processors weren't all initialised until 17:00:35).

Although failover timeout is obviously good for handling `ShardRecordProcessor`s that become unresponsive, if an instance is smoothly shutting down (eg `scheduler.startGracefulShutdown()` has been called), couldn't it be clearly indicated that the old scheduler is no longer responsible for the lease by invoking `evictLease()` (setting the lease owner to `null`) on the leases it still holds during graceful shutdown? This would be after `shutdownRequested()` is called on the `ShardRecordProcessor` & before `leaseLost()`. It could possibly be in `s.a.k.lifecycle.ShutdownTask.call()`?

Having done this, `DynamoDBLeaseTaker.takeLeases()` could be more acquisitive: as well as taking expired leases, it could take unassigned ones too (see #848) - so in the case of a graceful shutdown of old processors, the handover could be much quicker than waiting `failoverTimeMillis`.

Does this make sense?! Or could it be plagued by all sorts of race-conditions or complexity that you're probably very careful to avoid?! Just from an education point of view, I'd be interested to learn what problems the approach has.
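The graceful-shutdown half of the proposal can be sketched as follows. `LeaseStore` and `InMemoryLeaseStore` are hypothetical stand-ins, not KCL classes; in the real library the eviction would go through the lease refresher's `evictLease()` from the shutdown path discussed above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the KCL's lease store; evictLease mirrors the
// idea of clearing the lease owner so the lease becomes unassigned.
interface LeaseStore {
    List<String> leasesOwnedBy(String workerId);
    void evictLease(String leaseKey); // clears the lease owner
}

final class InMemoryLeaseStore implements LeaseStore {
    final Map<String, String> ownerByLease = new ConcurrentHashMap<>();

    @Override
    public List<String> leasesOwnedBy(String workerId) {
        List<String> owned = new ArrayList<>();
        for (Map.Entry<String, String> e : ownerByLease.entrySet()) {
            if (workerId.equals(e.getValue())) {
                owned.add(e.getKey());
            }
        }
        return owned;
    }

    @Override
    public void evictLease(String leaseKey) {
        ownerByLease.remove(leaseKey); // absent entry == unassigned lease
    }
}

public class GracefulShutdown {
    // On graceful shutdown, clear ownership of every lease this worker
    // still holds -- after shutdownRequested() and before leaseLost(), per
    // the issue -- so other workers can take them without waiting for expiry.
    static void releaseAllLeases(LeaseStore store, String workerId) {
        for (String leaseKey : store.leasesOwnedBy(workerId)) {
            store.evictLease(leaseKey);
        }
    }

    public static void main(String[] args) {
        InMemoryLeaseStore store = new InMemoryLeaseStore();
        store.ownerByLease.put("shardId-000", "old-worker");
        store.ownerByLease.put("shardId-001", "old-worker");
        store.ownerByLease.put("shardId-002", "other-worker");

        releaseAllLeases(store, "old-worker");

        // Only the other worker's lease remains assigned.
        System.out.println(store.ownerByLease);
    }
}
```

Combined with a lease taker that also considers unassigned leases available, this would let handover complete on the taker's next pass instead of after the failover timeout.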