Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Treat unassigned leases (as well as expired ones) as available-to-be-taken #848

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

rtyley
Copy link
Contributor

@rtyley rtyley commented Aug 8, 2021

If a lease is 'unassigned' (it has no lease owner) then it should be considered available for taking in DynamoDBLeaseTaker. Prior to this change, the only ways DynamoDBLeaseTaker could take leases for a scheduler was either by incremental lease stealing, or waiting for the lease to expire by not having been updated in failoverTimeMillis - which could be slow if failoverTimeMillis was set reasonably high (with it set to just 30s, I've seen new instances take over 3 minutes to take all leases from old instances in a deployment).

This would be one half of a fix for #845 - the other half is invoking evictLease() (setting the lease owner to null) on graceful shutdown of a scheduler.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@rtyley rtyley force-pushed the treat-unassigned-leases-as-available-to-be-taken branch 2 times, most recently from 613a96f to d0a2b95 Compare October 4, 2021 11:21
@rtyley rtyley force-pushed the treat-unassigned-leases-as-available-to-be-taken branch from d0a2b95 to db34266 Compare October 4, 2021 13:41
@rtyley rtyley force-pushed the treat-unassigned-leases-as-available-to-be-taken branch 4 times, most recently from 99822ea to 5cdc9b5 Compare November 25, 2021 23:00
@rtyley
Copy link
Contributor Author

rtyley commented Nov 26, 2021

I've rebased this PR on top of the latest master commit, resolving the merge conflict with #861, the PR that reverted #847 - and also fixed the embarrassing problem with null leaseOwners in the Java 8 code that I introduced with that PR (sincere apologies for that mistake).

The tests that @shanmsac introduced with #861 still pass with this updated PR, though I have tweaked one assertion to cope with the fact that the updated computeActiveLeaseCountsByWorker() method has changed with this PR to not return an entry for unassigned leases (so that a good active-worker count is simply given by the size of the map).

I hope this PR is still worth looking at, though obviously, having already managed to break things once with #847, I can understand why you might not trust this subsequent PR!

Comment on lines +546 to +585
.filter(lease -> !lease.isUnassigned() && !availableLeasesSet.contains(lease))
.collect(groupingBy(Lease::leaseOwner, summingInt(lease -> 1)));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here filtering for !lease.isUnassigned() is what now stops the groupingBy Lease::leaseOwner from failing with the NullPointerException: element cannot be mapped to a null key error reported in #861 - given that unassigned leases are filtered out, there is no attempt to insert a null key entry in the map, so the code doesn't crash.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still want to keep this change given that #861 has been merged?

…taken

If a lease is 'unassigned' (it has no lease owner) then it should be
considered available for taking in `DynamoDBLeaseTaker`. Prior to this
change, the only ways `DynamoDBLeaseTaker` could take leases for a
scheduler was either by incremental lease stealing, or waiting for the
lease to expire by not having been updated in `failoverTimeMillis` - which
could be slow if `failoverTimeMillis` was set reasonably high (with it set
to just 30s, I've seen new instances take over 3 minutes to take all leases
from old instances in a deployment).

This would be one half of the a fix for
awslabs#845 - the other
half of the fix is invoking `evictLease()` (setting the lease owner to
null) on graceful shutdown of a scheduler.
@rtyley rtyley force-pushed the treat-unassigned-leases-as-available-to-be-taken branch from 5cdc9b5 to f9d5077 Compare January 18, 2022 15:37
Copy link

@bjrara bjrara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Evicting leases during graceful shutdown would break the assumption here:

if (lease.leaseOwner() == null) {
// if this new lease is unowned, it's never been renewed.
lease.lastCounterIncrementNanos(0L);

I'm also wondering if we can make use of resetting lastCounterIncrementNanos so that we don't need extra checks on leaseOwner?

Comment on lines +546 to +585
.filter(lease -> !lease.isUnassigned() && !availableLeasesSet.contains(lease))
.collect(groupingBy(Lease::leaseOwner, summingInt(lease -> 1)));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still want to keep this change given that #861 has been merged?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
v2.x Issues related to the 2.x version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants