Losing Lock leaves client in a inconsistent state #4935

rafael · 2018-11-09T14:30:50Z

Description

We ran into an interesting corner case with a lock and we were wondering if we are using consul golang lock API in a correct way or there is a bug in the consul client API.

The steps to reproduce are as follows:

Acquire a lock using golang API
Get a small outage in the consul cluster that is below the renew session threshold.
The client would fail monitorLock but the lock is not actually lost.
All the processes that are contending for this lock get stuck because the go routine that is renewing the session keeps running.

Is this a bug or are we using locks API incorrectly?

Side note question

Another kind of related issue we ran into it is that once the application loses the lock during this condition, we were trying to reuse the same lock instance to acquire again, but you get an error because of ErrLockHeld. Is it the preferred way to create a new lock instance every time you lose it? Here a PR where you can get more context: Do not reuse a consul lock vitessio/vitess#4353

banks · 2018-11-09T14:36:35Z

Hi thanks for this.

It sounds like a bug to me. I'll consider this proposal to change.

Is it the preferred way to create a new lock instance every time you lose it?

Yes this has come up before and it seems like the expectation (not documented sadly) is that a lock instance is abandoned and a new one created when lock is lost.

I want to confirm with the original authors of the Lock API what the intent was as there are always subtle issues around these cases.

banks · 2018-11-09T14:54:57Z

I actually think that both issues you described here are symptoms of the same thing: the code reads as though it assumes that l.IsHeld is false when the lock is lost.

For example, this PR would not be needed because the existing defferred cleanup here:

consul/api/lock.go

Lines 156 to 162 in e4f93aa

    
           // If we fail to acquire the lock, cleanup the session 
        
           defer func() { 
        
           	if !l.isHeld { 
        
           		close(l.sessionRenew) 
        
           		l.sessionRenew = nil 
        
           	} 
        
           }()

Would work assuming that we actually reset isHeld when our monitoring failed.

The same change would also make it possible to re-use a Lock object to re-attempt to lock it as expected.

The only potential downside of that change I can see is that, whether it was the intended usage or not, currently you can do something like this:

func doAThingWithLock() {
  l, _ := api.LockOpts(nil)
  lostCh := l.Lock()
  // Idiomatic lock usage in Golang
  defer l.Unlock()

  stopCh := make(chan struct{})
  go doThing(stopCh)

  for {
    select {
    case <- lostCh:
      return
    case <- stopCh:
       return
    }
  }
}

This usage works now because the Unlock in defer will always see isHeld = true even if we aborted due to "loosing the lock". The docs for Unlock though say:

// Unlock released the lock. It is an error to call this
// if the lock is not currently held.

Which means that if we actually set isHeld false to allow re-entrance when monitoring the lock fails, then You'd end up unlocking a lock that is not held and cause a error.

Now in this specific example, we are swallowing the error from Unlock so it's kinda moot. But it makes the API assumptions around this case a bit confusing.

I think the root of the problem here is that when the monitor goroutine gets connection failure, we actually don't know if the lock is still held or not. That means the only safe option is to assume it's not, but if for example the lock session was created with no serf health check, then just because we disconnected from server doesn't mean the lock was actually released so setting isHeld to false is arguably wrong: it's more like unknown.

I'll think some more about this because the API is not clear and apparently buggy in this case currently.

rafael · 2018-11-09T15:08:22Z

Thank you for the quick reply!

The same change would also make it possible to re-use a Lock object to re-attempt to lock it as expected.

At first I went down this path, but then there could be real problems if you actually lose the Lock and try to re-use the same instance, consul server will return an error. So I was thinking that the intent here was not to change isHeld.

Agree 100%, clarification here would be great as is not clear.

rafael · 2018-11-16T09:58:42Z

Hi @banks,

Did you have some chance to think about this?

After an extra pass to the code and re-reading your comments, it seems to me that the change in this PR should work.

If we want to keep what the current semantics and keep it safe, in this case of unknown state, stopping the renewing of the session seems like a sensible approach.

banks · 2018-11-16T10:59:15Z

Sadly not yet 😞

I don't think you are wrong here but not yet found time to set aside to think through the possible failure cases and/or API behaviour changes that might break existing clients from changing this.

rafael · 2018-12-05T23:40:54Z

Hi @banks! Do you have an ETA on when you might be able to look into this issue?

banks · 2018-12-07T14:56:39Z

@rafael we actually finally got a few minutes of Armon's time to discuss what the original intent here was yesterday.

The tl;dr is that there is a (not clear in docs) assumption that any time you are notified that you may have lost the lock by the monitor channel being closed, it's expected that you exit what you are doing and call Unlock(). The idea is that just like sync.Mutex it's always a bug if your code doesn't Unlock() after Lock(). In that case, calling Unlock() would kill the session for you so nothing would hang.

So the canonical correct usage would be the one I showed before where the defered Unlock will clean up the session.

That said, this has confused enough people that I see these options:

Leave the code alone but document MUCH better on Lock and Unlock what those expectations are and maybe add an example correct usage as a doc example
Given that it's not intuitive to many that you can be told you've "(probably) lost the lock" on the channel, but still have to call Unlock. We could just change it so that the monitor goroutine actually calls Unlock in this case usages like I showed would silently error on the defered unlock but that's not a big deal - we'd need to document that Unlock might error though since it would be racy even if you didn't observe the monitor lost.

Which makes more sense to you? The unlock error race thing is the only thing that makes 2 feel gross to me.

This PR does half of 2 but I think we should be clearer one way or the other.

banks · 2018-12-07T16:32:42Z

Actually, after reflecting on this some more I think actually the current behaviour is most correct. It just needs to be documented better.

The reason is that then explicit Unlock() acts an acknowledgement that the process knows it is no longer the holder currently.

This PR (or option 2 above) would break that - if we notice connection is gone and close the lockLost chan and immediately shutdown the session, then some other instance might become leader even though this process hasn't yet noticed yet (say it was doing some heavy CPU work).

By making the Unlock explicit even when lockLost chan is closed, it lets the holder "acknowledge" that it is not longer trying to be leader and will not perform any more leader duties until it can acquire again which makes it safer to only after that allow other peers to potentially grab the lock.

So I think the right answer is to not change the code, but to make the docs on Lock, Unlock much clearer and add some examples.

Does that sounds reasonable?

rafael · 2018-12-07T18:59:19Z

@banks thanks for checking Armon about this. This explanation makes sense. I think documentation addresses this problem. If we see it from this perspective it is really clear:

The idea is that just like sync.Mutex it's always a bug if your code doesn't Unlock() after Lock(). 
In that case, calling Unlock() would kill the session for you so nothing would hang.

Closing this.

freddygv · 2019-04-18T19:55:53Z

Thanks for your original report @rafael.

Re-closing this PR. I made a new issue for the documentation update: #5686

Session should stop renewing after losing the lock

e4f93aa

banks added this to the Upcoming milestone Nov 9, 2018

banks added the needs-discussion Topic needs discussion with the larger Consul maintainers before committing to for a release label Nov 9, 2018

banks mentioned this pull request Nov 9, 2018

Locks behavior after falling consul cluster #4902

Closed

banks added type/docs Documentation needs to be created/updated/clarified and removed needs-discussion Topic needs discussion with the larger Consul maintainers before committing to for a release labels Dec 7, 2018

rafael closed this Dec 7, 2018

rafael reopened this Dec 7, 2018

freddygv mentioned this pull request Apr 18, 2019

[docs] Update api godoc for Lock() and Unlock() #5686

Open

freddygv closed this Apr 18, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Losing Lock leaves client in a inconsistent state #4935

Losing Lock leaves client in a inconsistent state #4935

rafael commented Nov 9, 2018

banks commented Nov 9, 2018

banks commented Nov 9, 2018

rafael commented Nov 9, 2018

rafael commented Nov 16, 2018

banks commented Nov 16, 2018

rafael commented Dec 5, 2018

banks commented Dec 7, 2018

banks commented Dec 7, 2018

rafael commented Dec 7, 2018 •

edited

Loading

freddygv commented Apr 18, 2019

Losing Lock leaves client in a inconsistent state #4935

Losing Lock leaves client in a inconsistent state #4935

Conversation

rafael commented Nov 9, 2018

Description

Side note question

banks commented Nov 9, 2018

banks commented Nov 9, 2018

rafael commented Nov 9, 2018

rafael commented Nov 16, 2018

banks commented Nov 16, 2018

rafael commented Dec 5, 2018

banks commented Dec 7, 2018

banks commented Dec 7, 2018

rafael commented Dec 7, 2018 • edited Loading

freddygv commented Apr 18, 2019

rafael commented Dec 7, 2018 •

edited

Loading