Losing Lock leaves client in an inconsistent state #4935
Conversation
Hi, thanks for this. It sounds like a bug to me. I'll consider this proposed change.
Yes, this has come up before, and it seems like the expectation (sadly not documented) is that a lock instance is abandoned and a new one created when the lock is lost. I want to confirm with the original authors of the Lock API what the intent was, as there are always subtle issues around these cases.
I actually think that both issues you described here are symptoms of the same thing: the code reads as though it assumes that `isHeld` is reset when the lock is lost. For example, this PR would not be needed because of the existing deferred cleanup here:

Lines 156 to 162 in e4f93aa

That would work assuming we actually reset `isHeld` there. The same change would also make it possible to re-use a `Lock` instance after losing it.

The only potential downside of that change I can see is that, whether it was the intended usage or not, currently you can do something like this:

```go
func doAThingWithLock() {
	l, _ := client.LockOpts(nil) // client is an *api.Client
	lostCh, _ := l.Lock(nil)

	// Idiomatic lock usage in Go
	defer l.Unlock()

	stopCh := make(chan struct{})
	go doThing(stopCh)

	for {
		select {
		case <-lostCh:
			return
		case <-stopCh:
			return
		}
	}
}
```

This usage works now because the deferred `Unlock` still cleans up the session even after the lock is lost. Which means that if we actually set `isHeld = false` when the monitor channel closes, the deferred `Unlock` would no longer perform that cleanup.

Now in this specific example we are swallowing the error from `Unlock`, so the change would be silent, but the point stands.

I think the root of the problem here is that when the monitor goroutine gets a connection failure, we actually don't know if the lock is still held or not. That means the only safe option is to assume it's not. But if, for example, the lock session was created with no serf health check, then just because we disconnected from the server doesn't mean the lock was actually released, so setting `isHeld = false` could be wrong.

I'll think some more about this because the API is not clear and apparently buggy in this case currently.
Thank you for the quick reply!
At first I went down this path, but then there could be real problems: if you actually lose the lock and try to re-use the same instance, the Consul server will return an error. So I was thinking that the intent here was not to change `isHeld`. Agree 100%, clarification here would be great as it is not clear.
Hi @banks, did you have a chance to think about this? After an extra pass over the code and re-reading your comments, it seems to me that the change in this PR should work, if we want to keep the current semantics and keep this case safe.
Sadly not yet 😞 I don't think you are wrong here, but I haven't yet found time to set aside to think through the possible failure cases and/or API behaviour changes that might break existing clients.
Hi @banks! Do you have an ETA on when you might be able to look into this issue?
@rafael we actually finally got a few minutes of Armon's time yesterday to discuss what the original intent here was. The tl;dr is that there is an assumption (not clear in the docs) that any time you are notified that you may have lost the lock by the monitor channel being closed, you are expected to exit whatever you are doing and call `Unlock` to clean up. So the canonical correct usage would be the one I showed before, where the deferred `Unlock` will clean up the session. That said, this has confused enough people that I see these options:
Which makes more sense to you? The unlock error race is the only thing that makes option 2 feel gross to me. This PR does half of option 2, but I think we should be clearer one way or the other.
Actually, after reflecting on this some more, I think the current behaviour is the most correct; it just needs to be documented better. The reason is that the explicit `Unlock` is then the only thing that releases the session. This PR (or option 2 above) would break that: if we notice the connection is gone, close the lockLost chan, and immediately shut down the session, then some other instance might become leader even though this process hasn't noticed yet (say it was doing some heavy CPU work). By keeping `Unlock` explicit, the lock is only given up once this process has actually stopped its work. So I think the right answer is not to change the code, but to make the docs on Lock and Unlock much clearer and add some examples. Does that sound reasonable?
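The safety argument above can be illustrated with a tiny model: losing the monitor channel does not destroy the session, only an explicit `Unlock` does, so no other contender can become leader while the current holder is still working. All names below (`lock`, `session`) are illustrative stand-ins, not the real `consul/api` types.

```go
package main

import "fmt"

// session models a Consul session: it stays alive until destroyed.
type session struct{ destroyed bool }

// lock models the behaviour being discussed: the lost channel can
// close without the session being destroyed.
type lock struct {
	s    *session
	lost chan struct{}
}

func (l *lock) Lock() <-chan struct{} {
	l.s = &session{}
	l.lost = make(chan struct{})
	return l.lost
}

// Unlock is the only call that destroys the session, so a second
// contender cannot acquire until the current holder has stopped.
func (l *lock) Unlock() { l.s.destroyed = true }

func main() {
	holder := &lock{}
	lostCh := holder.Lock()
	close(holder.lost) // the monitor noticed a connection failure
	<-lostCh           // holder is told it *may* have lost the lock
	fmt.Println(holder.s.destroyed) // false: session alive, nobody else can take over
	holder.Unlock()                 // holder finished its critical section
	fmt.Println(holder.s.destroyed) // true: now another instance may become leader
}
```

This is why closing the monitor channel and immediately destroying the session (option 2) would remove the window in which the old holder can wind down safely.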
@banks thanks for checking with Armon about this. This explanation makes sense, and I think documentation addresses this problem. Seen from this perspective it is really clear.
Closing this. |
Description
We ran into an interesting corner case with a lock, and we were wondering whether we are using the Consul Go lock API correctly or there is a bug in the Consul client API.
The steps to reproduce are as follows:
- ... `monitorLock` ...
- ... but the lock is not actually lost.

Is this a bug or are we using the locks API incorrectly?
Side note question
When we lose the lock and try to acquire it again with the same lock instance, we get `ErrLockHeld`. Is the preferred way to create a new lock instance every time you lose it? Here is a PR where you can get more context: Do not reuse a consul lock vitessio/vitess#4353
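The failure mode being asked about can be modelled without a Consul server. In this sketch, `errLockHeld` and the `lock` type are hypothetical stand-ins for `api.ErrLockHeld` and `api.Lock`: because the internal held flag is never reset when the lock is lost, a second `Lock` call on the same instance errors out, while a fresh instance succeeds.

```go
package main

import (
	"errors"
	"fmt"
)

// errLockHeld stands in for the real client's ErrLockHeld, returned
// when Lock() is called on an instance that still thinks it holds
// the lock (the held flag was never reset on loss).
var errLockHeld = errors.New("lock already held")

type lock struct{ isHeld bool }

func (l *lock) Lock() (<-chan struct{}, error) {
	if l.isHeld {
		return nil, errLockHeld
	}
	l.isHeld = true
	ch := make(chan struct{})
	close(ch) // simulate immediately losing the lock
	return ch, nil
}

func main() {
	l := &lock{}
	lost, _ := l.Lock()
	<-lost // lock lost, but l.isHeld is still true
	_, err := l.Lock()
	fmt.Println(err) // re-using the same instance fails
	_, err = (&lock{}).Lock()
	fmt.Println(err == nil) // a fresh instance acquires fine
}
```

This matches the thread's conclusion: abandon the instance and create a new lock after every loss.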