Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Losing Lock leaves client in a inconsistent state #4935

Closed

Conversation

rafael
Copy link

@rafael rafael commented Nov 9, 2018

Description

We ran into an interesting corner case with a lock and we were wondering if we are using consul golang lock API in a correct way or there is a bug in the consul client API.

The steps to reproduce are as follows:

  1. Acquire a lock using golang API
  2. Get a small outage in the consul cluster that is below the renew session threshold.
  3. The client would fail monitorLock but the lock is not actually lost.
  4. All the processes that are contending for this lock get stuck because the go routine that is renewing the session keeps running.

Is this a bug or are we using locks API incorrectly?

Side note question

  • Another kind of related issue we ran into it is that once the application loses the lock during this condition, we were trying to reuse the same lock instance to acquire again, but you get an error because of ErrLockHeld. Is it the preferred way to create a new lock instance every time you lose it? Here a PR where you can get more context: Do not reuse a consul lock vitessio/vitess#4353

@banks
Copy link
Member

banks commented Nov 9, 2018

Hi thanks for this.

It sounds like a bug to me. I'll consider this proposal to change.

Is it the preferred way to create a new lock instance every time you lose it?

Yes this has come up before and it seems like the expectation (not documented sadly) is that a lock instance is abandoned and a new one created when lock is lost.

I want to confirm with the original authors of the Lock API what the intent was as there are always subtle issues around these cases.

@banks
Copy link
Member

banks commented Nov 9, 2018

I actually think that both issues you described here are symptoms of the same thing: the code reads as though it assumes that l.IsHeld is false when the lock is lost.

For example, this PR would not be needed because the existing defferred cleanup here:

consul/api/lock.go

Lines 156 to 162 in e4f93aa

// If we fail to acquire the lock, cleanup the session
defer func() {
if !l.isHeld {
close(l.sessionRenew)
l.sessionRenew = nil
}
}()

Would work assuming that we actually reset isHeld when our monitoring failed.

The same change would also make it possible to re-use a Lock object to re-attempt to lock it as expected.

The only potential downside of that change I can see is that, whether it was the intended usage or not, currently you can do something like this:

func doAThingWithLock() {
  l, _ := api.LockOpts(nil)
  lostCh := l.Lock()
  // Idiomatic lock usage in Golang
  defer l.Unlock()

  stopCh := make(chan struct{})
  go doThing(stopCh)

  for {
    select {
    case <- lostCh:
      return
    case <- stopCh:
       return
    }
  }
}

This usage works now because the Unlock in defer will always see isHeld = true even if we aborted due to "loosing the lock". The docs for Unlock though say:

// Unlock released the lock. It is an error to call this
// if the lock is not currently held.

Which means that if we actually set isHeld false to allow re-entrance when monitoring the lock fails, then You'd end up unlocking a lock that is not held and cause a error.

Now in this specific example, we are swallowing the error from Unlock so it's kinda moot. But it makes the API assumptions around this case a bit confusing.

I think the root of the problem here is that when the monitor goroutine gets connection failure, we actually don't know if the lock is still held or not. That means the only safe option is to assume it's not, but if for example the lock session was created with no serf health check, then just because we disconnected from server doesn't mean the lock was actually released so setting isHeld to false is arguably wrong: it's more like unknown.

I'll think some more about this because the API is not clear and apparently buggy in this case currently.

@banks banks added this to the Upcoming milestone Nov 9, 2018
@banks banks added the needs-discussion Topic needs discussion with the larger Consul maintainers before committing to for a release label Nov 9, 2018
@rafael
Copy link
Author

rafael commented Nov 9, 2018

Thank you for the quick reply!

The same change would also make it possible to re-use a Lock object to re-attempt to lock it as expected.

At first I went down this path, but then there could be real problems if you actually lose the Lock and try to re-use the same instance, consul server will return an error. So I was thinking that the intent here was not to change isHeld.

Agree 100%, clarification here would be great as is not clear.

@rafael
Copy link
Author

rafael commented Nov 16, 2018

Hi @banks,

Did you have some chance to think about this?

After an extra pass to the code and re-reading your comments, it seems to me that the change in this PR should work.

If we want to keep what the current semantics and keep it safe, in this case of unknown state, stopping the renewing of the session seems like a sensible approach.

@banks
Copy link
Member

banks commented Nov 16, 2018

Sadly not yet 😞

I don't think you are wrong here but not yet found time to set aside to think through the possible failure cases and/or API behaviour changes that might break existing clients from changing this.

@rafael
Copy link
Author

rafael commented Dec 5, 2018

Hi @banks! Do you have an ETA on when you might be able to look into this issue?

@banks
Copy link
Member

banks commented Dec 7, 2018

@rafael we actually finally got a few minutes of Armon's time to discuss what the original intent here was yesterday.

The tl;dr is that there is a (not clear in docs) assumption that any time you are notified that you may have lost the lock by the monitor channel being closed, it's expected that you exit what you are doing and call Unlock(). The idea is that just like sync.Mutex it's always a bug if your code doesn't Unlock() after Lock(). In that case, calling Unlock() would kill the session for you so nothing would hang.

So the canonical correct usage would be the one I showed before where the defered Unlock will clean up the session.

That said, this has confused enough people that I see these options:

  1. Leave the code alone but document MUCH better on Lock and Unlock what those expectations are and maybe add an example correct usage as a doc example

  2. Given that it's not intuitive to many that you can be told you've "(probably) lost the lock" on the channel, but still have to call Unlock. We could just change it so that the monitor goroutine actually calls Unlock in this case usages like I showed would silently error on the defered unlock but that's not a big deal - we'd need to document that Unlock might error though since it would be racy even if you didn't observe the monitor lost.

Which makes more sense to you? The unlock error race thing is the only thing that makes 2 feel gross to me.

This PR does half of 2 but I think we should be clearer one way or the other.

@banks
Copy link
Member

banks commented Dec 7, 2018

Actually, after reflecting on this some more I think actually the current behaviour is most correct. It just needs to be documented better.

The reason is that then explicit Unlock() acts an acknowledgement that the process knows it is no longer the holder currently.

This PR (or option 2 above) would break that - if we notice connection is gone and close the lockLost chan and immediately shutdown the session, then some other instance might become leader even though this process hasn't yet noticed yet (say it was doing some heavy CPU work).

By making the Unlock explicit even when lockLost chan is closed, it lets the holder "acknowledge" that it is not longer trying to be leader and will not perform any more leader duties until it can acquire again which makes it safer to only after that allow other peers to potentially grab the lock.

So I think the right answer is to not change the code, but to make the docs on Lock, Unlock much clearer and add some examples.

Does that sounds reasonable?

@banks banks added type/docs Documentation needs to be created/updated/clarified and removed needs-discussion Topic needs discussion with the larger Consul maintainers before committing to for a release labels Dec 7, 2018
@rafael
Copy link
Author

rafael commented Dec 7, 2018

@banks thanks for checking Armon about this. This explanation makes sense. I think documentation addresses this problem. If we see it from this perspective it is really clear:

The idea is that just like sync.Mutex it's always a bug if your code doesn't Unlock() after Lock(). 
In that case, calling Unlock() would kill the session for you so nothing would hang.

Closing this.

@freddygv
Copy link
Contributor

Thanks for your original report @rafael.

Re-closing this PR. I made a new issue for the documentation update: #5686

@freddygv freddygv closed this Apr 18, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/docs Documentation needs to be created/updated/clarified
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants