-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
defer setting the state before returning to avoid stuck in INITIALIZING
state
#10630
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code being changed here was introduced in d9d4c49, and looking at the changelog entries for that commit it seems to suggest it was attempting to fix this very bug.
Maybe the user experiencing this problem was not running v1.10 yet (where this fix was first introduced) ?
I think this PR may fix the issue, but I'm not confident I understand all the repercussions of this change. It's possible we are reverting a fix for some other bug.
I think it might be valuable to create a state diagram. I suspect setState
could be changed to better model valid states which would help make the correct behaviour easier to reason about.
Thank you for looking at this @dnephin My understanding is we should never leave
|
[edit] posted this before I saw @dhiaayachi's response which perfectly sums it up... @dnephin yeah I think it was the fixes for those other issues that introduced this. The user was running 1.9.7 which contains the same lines consul/agent/consul/leader_connect_ca.go Lines 310 to 318 in 01555fb
I think this specific issue was not the same as any of the ones Matt found and was trying to fix there although it's a similar class of problem. In general, I think his fixes made this code much more robust and worked well, I think this one issue of returning before the defer has introduced this new broken state since we are breaking the invariant that we always exit |
I added some thoughts and a diagram in #10634. I forgot that the commit would not show if the cherry-pick was needed for a backport, good call looking at the actual release branch. I think on the surface this change does fix the problem, but is it the right fix? Looking at the model I put into #10634, it seems like this is one of the "INVALID" state transitions. Maybe |
I agree, we need to understand more what got us to call |
Hmm yeah, I think cleaning up the state machine is a great idea. One thing I'd say is that even if
The thing that causes it is the background retrying of Initialize due to a misconfiguration which will only be resolved by the operator reconfiguring the CA. If we fix this by changing the state machine rules to be more restrictive then we will need to be very careful that the background Initializer is modified too. Right now if you only change I'd still also be fine with merging the PR as it is and continuing state machine cleanup as a separate task though as I think the defer is meant to make it easier to reason about the exit state regardless of code path and this ordering violates that unnecessarily but see what you think @dhiaayachi. |
Why would we allow an invalid state transition, just to check the state, then transition back? That's really unnecessarily confusing. The way this should work (without having to do all the cleanup proposed by #10634) is that err := c.setState(caStateInitializing, true)
var errState stateError
switch {
case errors.As(err, &errState) && errState.Current == caStateInitialized:
return nil
case err != nil:
return err
}
defer func() { ... } This way we never perform unnecessary and invalid state transitions. And the logic is in the right places. The state transition logic is in a single place ( |
@banks @dnephin
I propose to:
|
That plan sounds good to me Dhia.
I think to be clear the disable validation flag is intended to be used in
places like Defer callbacks where we can't actually do anything useful with
the error anyway and we'd rather be sure that the state ends up back at
something reasonable even if something strange has happened. It would be
great if there is a cleaner pattern for that though.
…On Mon, Jul 19, 2021 at 8:36 PM Dhia Ayachi ***@***.***> wrote:
@banks <https://github.com/banks> @dnephin <https://github.com/dnephin>
I went through the code again today taking into account the fact that the
issue happen when the retry loop is running and we perform a reconfigure at
the same time and I have 2 observations:
- I think the root issue is that we let a retry loop run while we do
reconfigure concurrently, this can lead to multiple race conditions other
then the one we caught (Could explain the fact that the key was not updated
in raft in our case). In my opinion reconfigure should stop the retry loop
before starting and restart it after if the reconfigure fail.
- Fixing the state machine is a bit risky for this PR specially with
the fact that we have state == caStateInitialized || at the validation
check and removing it impact other use cases. also most of the calls use validateState
= false which bypass the state machine check and remove any guaranty
on the transitions
I propose to:
- keep this PR as is to fix that specific case while limiting the risk
- investigate the possible issues around having a retry loop run while
we do reconfigure
- start a series of PRs to guarantee the consistency of the state
machine as proposed by @dnephin <https://github.com/dnephin> in #10634
<#10634> and remove
completely the option to disable the validation check.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#10630 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAA5QU5JOB6FH2FUXPXIK4TTYR5CTANCNFSM5APYRH6A>
.
|
My thinking is that we can have a state machine with a clear transition path combined with an init that return the state machine to an initial state if something bad happen and we want to get back to a known state. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @dnephin for changing this. I think this way of fixing the bug bring us closer to fixing the state machine and is less error prone.
I have a comment inline about the signature of SetState
that became a bit weird.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Daniel, that does look cleaner!
One minor nit inline and I think @dhiaayachi has a good point provided no other callers use that previous state argument that it's now unnecessary. I didn't check other callers though so there may be a reason to leave it?
There is another caller that uses it to restore the previous state on error, so I think we still need it in both places until we make a larger change.
That comment is pretty misleading because it doesn't account for all the calls with validateState=false, as we can see from the diagram in #10634, the actual state transitions are more involved. Maybe it would be better to delete the comment instead of having it be misleading. |
What about doing something like this? func (c *CAManager) setState(newState caState, validateState bool) error {
c.stateLock.Lock()
defer c.stateLock.Unlock()
state := c.state
if !validateState ||
(state == caStateInitialized && newState != caStateInitializing) ||
(state == caStateUninitialized && newState == caStateInitializing) ||
(state == caStateUninitialized && newState == caStateReconfig) {
c.previousState = c.state
c.state = newState
} else {
return &caStateError{Current: state}
}
return nil
}
func (c *CAManager) revertState(state caState) error {
c.stateLock.Lock()
defer c.stateLock.Unlock()
if c.state != state {
return &caStateError{Current: c.state}
}
previousState := c.previousState
c.previousState = c.state
c.state = previousState
return nil
} I think we should have the state managed internally through clear interfaces (set,reset,revert) and not let the state user meddle with it, and this would be a step toward it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
I think the interface proposed at the bottom of #10634 does a better job of demonstrating the operations, and lets us keep more of the logic in a centralized place. With this approach the caller doesn't need to know anything about the states, or what transitions are allowed. I think there are definitely similarities between the two proposals.
Definitely agree with this, but I think Instead of |
🍒 If backport labels were added before merging, cherry-picking will start automatically. To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/422659. |
🍒 If backport labels were added before merging, cherry-picking will start automatically. To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/422703. |
🍒✅ Cherry pick of commit b495036 onto |
…ING` state (#10630) * defer setting the state before returning to avoid being stuck in `INITIALIZING` state * add changelog * move comment with the right if statement * ca: report state transition error from setSTate * update comment to reflect state transition Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
🍒✅ Cherry pick of commit b495036 onto |
…ING` state (#10630) * defer setting the state before returning to avoid being stuck in `INITIALIZING` state * add changelog * move comment with the right if statement * ca: report state transition error from setSTate * update comment to reflect state transition Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
🍒✅ Cherry pick of commit b495036 onto |
…ING` state (#10630) * defer setting the state before returning to avoid being stuck in `INITIALIZING` state * add changelog * move comment with the right if statement * ca: report state transition error from setSTate * update comment to reflect state transition Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
fix #10625