Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lock command performs very differently with one/multiple max lock holders #1006

Closed
justincampbell opened this issue Jun 5, 2015 · 4 comments

Comments

@justincampbell
Copy link

When running multiple consul lock commands at the same time with a command that completes almost instantly, there can be up to a 5 second delay until the next client obtains the lock (except the 2nd one, which appears to always happen immediately after the first). It takes ~30 seconds to run date 8 times.

consul lock test date

Fri Jun  5 11:04:15 EDT 2015
Fri Jun  5 11:04:15 EDT 2015
Fri Jun  5 11:04:20 EDT 2015
Fri Jun  5 11:04:25 EDT 2015
Fri Jun  5 11:04:30 EDT 2015
Fri Jun  5 11:04:35 EDT 2015
Fri Jun  5 11:04:40 EDT 2015
Fri Jun  5 11:04:45 EDT 2015

With a 1 second sleep added, it takes ~17 seconds to run the command 8 times:

consul lock test 'sleep 1 && date'

Fri Jun  5 11:02:53 EDT 2015
Fri Jun  5 11:02:54 EDT 2015
Fri Jun  5 11:02:58 EDT 2015
Fri Jun  5 11:02:59 EDT 2015
Fri Jun  5 11:03:03 EDT 2015
Fri Jun  5 11:03:04 EDT 2015
Fri Jun  5 11:03:05 EDT 2015
Fri Jun  5 11:03:10 EDT 2015

If the command take ~5 seconds instead of finishing instantly, the lock appears to transfer instantly as expected.

consul lock test 'sleep 5 && date'

Fri Jun  5 11:09:38 EDT 2015
Fri Jun  5 11:09:43 EDT 2015
Fri Jun  5 11:09:48 EDT 2015
Fri Jun  5 11:09:53 EDT 2015
Fri Jun  5 11:09:58 EDT 2015
Fri Jun  5 11:10:03 EDT 2015
Fri Jun  5 11:10:08 EDT 2015
Fri Jun  5 11:10:13 EDT 2015

However, when running with -n 2, the next client obtains the lock as soon as another command finishes, with no delays.

consul lock -n 2 test date

Fri Jun  5 11:04:58 EDT 2015
Fri Jun  5 11:04:58 EDT 2015
Fri Jun  5 11:04:58 EDT 2015
Fri Jun  5 11:04:58 EDT 2015
Fri Jun  5 11:04:58 EDT 2015
Fri Jun  5 11:04:58 EDT 2015
Fri Jun  5 11:04:58 EDT 2015
Fri Jun  5 11:04:58 EDT 2015

I'm assuming it's related to the algorithm differences mentioned in https://www.consul.io/docs/commands/lock.html:

The number of lock holders is configurable with the -n flag. By default, a single holder is allowed, and a lock is used for mutual exclusion. This uses the leader election algorithm.

If the lock holder count is more than one, then a semaphore is used instead. A semaphore allows more than a single holder, but the is less efficient than a simple lock. This follows the semaphore algorithm.

Is this behavior intentional/expected?

@justincampbell
Copy link
Author

@highlyunavailable
Copy link
Contributor

This is actually expected - a semaphore uses the KV store's long polling to determine when it needs to try to acquire the lock after release, but the lock code uses a timer.

By default, each contender for a lock will wait 5 seconds after attempting to acquire a lock and failing to retry due to DefaultLockRetryTime in lock.go.

Now that I read through the code again, the comment up top says it all: DefaultLockRetryTime is how long we wait after a failed lock acquisition before attempting to do the lock again. This is so that once a lock-delay is in affect, we do not hot loop retrying the acquisition.

The lock-delay doesn't apply to a semaphore's keys since in the implementation they get deleted but a lock-delay DOES apply to a normal lock, and coming out of lock-delay won't increment the index from what I can tell so a long poll can't be used.

However, no effort is made to distinguish between a lock acquisition failure due to a session already locking the key, and a lock delay (both return a false value from kv.Acquire()) so in every case it just waits 5 seconds.

@justincampbell
Copy link
Author

Thanks for the info! My concern is that it doesn't seem possible to have "one and only one" of something running without up to 5 seconds of downtime if the command fails.

@armon
Copy link
Member

armon commented Jun 9, 2015

So @highlyunavailable hit this exactly dead on. The semaphore implementation does not have the lock-delay while the lock does (by design). The simplest approach was to just wait 5 seconds but I agree this is a bit heavy handed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants