defer setting the state before returning to avoid stuck in `INITIALIZING` state #10630

dhiaayachi · 2021-07-16T15:16:28Z

…TIALIZING` state

dnephin

The code being changed here was introduced in d9d4c49, and looking at the changelog entries for that commit it seems to suggest it was attempting to fix this very bug.

Maybe the user experiencing this problem was not running v1.10 yet (where this fix was first introduced) ?

I think this PR may fix the issue, but I'm not confident I understand all the repercussions of this change. It's possible we are reverting a fix for some other bug.

I think it might be valuable to create a state diagram. I suspect setState could be changed to better model valid states which would help make the correct behaviour easier to reason about.

agent/consul/leader_connect_ca.go

dhiaayachi · 2021-07-16T19:57:02Z

The code being changed here was introduced in d9d4c49, and looking at the changelog entries for that commit it seems to suggest it was attempting to fix this very bug.

Maybe the user experiencing this problem was not running v1.10 yet (where this fix was first introduced) ?

I think this PR may fix the issue, but I'm not confident I understand all the repercussions of this change. It's possible we are reverting a fix for some other bug.

I think it might be valuable to create a state diagram. I suspect setState could be changed to better model valid states which would help make the correct behaviour easier to reason about.

Thank you for looking at this @dnephin

My understanding is we should never leave InitializeCA method with the provider state other then INITIALIZED or UNINITIALIZED which make sense as InitializeCA is supposed to do all the ca initializing and set the state to INITIALIZED or fail and set the state to UNINITIALIZED.
The current implementation have a possibility to leave the InitializeCA method with the state INITIALIZING which will break the state machine and keep it stuck in INITIALIZING, this happen in the following conditions:

we start InitializeCA with state INITIALIZED
set the state to INITIALIZING which return oldState = INITIALIZED
leave InitializeCA in the if statement

banks · 2021-07-16T20:03:00Z

[edit] posted this before I saw @dhiaayachi's response which perfectly sums it up...

@dnephin yeah I think it was the fixes for those other issues that introduced this.

The user was running 1.9.7 which contains the same lines

consul/agent/consul/leader_connect_ca.go

Lines 310 to 318 in 01555fb

    
           // Update the state before doing anything else. 
        
           oldState, err := c.setState(caStateInitializing, true) 
        
           // if we were already in the initialized state then there is nothing to be done. 
        
           if oldState == caStateInitialized { 
        
           	return nil 
        
           } 
        
           if err != nil { 
        
           	return err 
        
           }

I think this specific issue was not the same as any of the ones Matt found and was trying to fix there although it's a similar class of problem. In general, I think his fixes made this code much more robust and worked well, I think this one issue of returning before the defer has introduced this new broken state since we are breaking the invariant that we always exit InitializeCA in a valid state for further changes (i.e. either it worked and we should now be Initialized or it failed and we should be back to uninitialized.

dnephin · 2021-07-16T20:20:05Z

I added some thoughts and a diagram in #10634.

I forgot that the commit would not show if the cherry-pick was needed for a backport, good call looking at the actual release branch.

I think on the surface this change does fix the problem, but is it the right fix? Looking at the model I put into #10634, it seems like this is one of the "INVALID" state transitions. Maybe setState should actually reject this state transition, in which case the code in InitializeCA would already be correct.

dhiaayachi · 2021-07-16T20:41:22Z

I added some thoughts and a diagram in #10634.

I forgot that the commit would not show if the cherry-pick was needed for a backport, good call looking at the actual release branch.

I think on the surface this change does fix the problem, but is it the right fix? Looking at the model I put into #10634, it seems like this is one of the "INVALID" state transitions. Maybe setState should actually reject this state transition, in which case the code in InitializeCA would already be correct.

I agree, we need to understand more what got us to call InitializeCA while we are at INITIALIZED state first and then we can decide if it's better to forbid that transition in setState and remove the if statement (because it should never happen) or just move the if statement

banks · 2021-07-19T16:43:54Z

I think on the surface this change does fix the problem, but is it the right fix? Looking at the model I put into #10634, it seems like this is one of the "INVALID" state transitions. Maybe setState should actually reject this state transition, in which case the code in InitializeCA would already be correct.

Hmm yeah, I think cleaning up the state machine is a great idea. One thing I'd say is that even if setState disallowed this transition, I'd still prefer to move the return as in this PR - having it be "correct" to exit early before the defer only because of the implementation of another method somewhere else feels unnecessarily fragile when there are no downsides to moving it afterwards.

we need to understand more what got us to call InitializeCA while we are at INITIALIZED state

The thing that causes it is the background retrying of Initialize due to a misconfiguration which will only be resolved by the operator reconfiguring the CA. If we fix this by changing the state machine rules to be more restrictive then we will need to be very careful that the background Initializer is modified too. Right now if you only change setState then I think the background init routine will get into an infinite retry backoff and keep trying to initialize forever but fail because it's not allowed to initialize when the state is already "initialized". Provided that case is handled specially then it seems fine to me.

I'd still also be fine with merging the PR as it is and continuing state machine cleanup as a separate task though as I think the defer is meant to make it easier to reason about the exit state regardless of code path and this ordering violates that unnecessarily but see what you think @dhiaayachi.

dnephin · 2021-07-19T17:02:11Z

One thing I'd say is that even if setState disallowed this transition, I'd still prefer to move the return as in this PR - having it be "correct" to exit early before the defer only because of the implementation of another method somewhere else feels unnecessarily fragile

Why would we allow an invalid state transition, just to check the state, then transition back? That's really unnecessarily confusing.

The way this should work (without having to do all the cleanup proposed by #10634) is that setState should return an error. The error should indicate the current state. In InitializeCA we should handle the error and check the previous state to determine if we return nil or err.

err := c.setState(caStateInitializing, true)
var errState stateError
switch {
case errors.As(err, &errState) && errState.Current == caStateInitialized:
    return nil
case err != nil:
    return err
}
defer func() { ... }

This way we never perform unnecessary and invalid state transitions. And the logic is in the right places. The state transition logic is in a single place (setState) and the "stop looping when initialized" logic is in InitializeCA. defer would only be used when there is a nil error, which is consistent with how defer is generally used in most Go code.

dhiaayachi · 2021-07-19T19:35:57Z

@banks @dnephin
I went through the code again today taking into account the fact that the issue happen when the retry loop is running and we perform a reconfigure at the same time and I have 2 observations:

I think the root issue is that we let a retry loop run while we do reconfigure concurrently, this can lead to multiple race conditions other then the one we caught (Could explain the fact that the key was not updated in raft in our case). In my opinion reconfigure should stop the retry loop before starting and restart it after if the reconfigure fail.
Fixing the state machine is a bit risky for this PR specially with the fact that we have state == caStateInitialized || at the validation check and removing it impact other use cases. also most of the calls use validateState = false which bypass the state machine check and remove any guaranty on the transitions

I propose to:

keep this PR as is to fix that specific case while limiting the risk
investigate the possible issues around having a retry loop run while we do reconfigure
start a series of PRs to guarantee the consistency of the state machine as proposed by @dnephin in ca: general improvements to the CAMananger state machine #10634 and remove completely the option to disable the validation check.

banks · 2021-07-19T19:41:22Z

That plan sounds good to me Dhia. I think to be clear the disable validation flag is intended to be used in places like Defer callbacks where we can't actually do anything useful with the error anyway and we'd rather be sure that the state ends up back at something reasonable even if something strange has happened. It would be great if there is a cleaner pattern for that though.

…

On Mon, Jul 19, 2021 at 8:36 PM Dhia Ayachi ***@***.***> wrote: @banks <https://github.com/banks> @dnephin <https://github.com/dnephin> I went through the code again today taking into account the fact that the issue happen when the retry loop is running and we perform a reconfigure at the same time and I have 2 observations: - I think the root issue is that we let a retry loop run while we do reconfigure concurrently, this can lead to multiple race conditions other then the one we caught (Could explain the fact that the key was not updated in raft in our case). In my opinion reconfigure should stop the retry loop before starting and restart it after if the reconfigure fail. - Fixing the state machine is a bit risky for this PR specially with the fact that we have state == caStateInitialized || at the validation check and removing it impact other use cases. also most of the calls use validateState = false which bypass the state machine check and remove any guaranty on the transitions I propose to: - keep this PR as is to fix that specific case while limiting the risk - investigate the possible issues around having a retry loop run while we do reconfigure - start a series of PRs to guarantee the consistency of the state machine as proposed by @dnephin <https://github.com/dnephin> in #10634 <#10634> and remove completely the option to disable the validation check. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#10630 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAA5QU5JOB6FH2FUXPXIK4TTYR5CTANCNFSM5APYRH6A> .

dhiaayachi · 2021-07-19T20:02:30Z

My thinking is that we can have a state machine with a clear transition path combined with an init that return the state machine to an initial state if something bad happen and we want to get back to a known state.

dhiaayachi

Thanks @dnephin for changing this. I think this way of fixing the bug bring us closer to fixing the state machine and is less error prone.
I have a comment inline about the signature of SetState that became a bit weird.

agent/consul/leader_connect_ca.go

banks

Thanks Daniel, that does look cleaner!

One minor nit inline and I think @dhiaayachi has a good point provided no other callers use that previous state argument that it's now unnecessary. I didn't check other callers though so there may be a reason to leave it?

agent/consul/leader_connect_ca.go

dnephin · 2021-07-26T16:16:42Z

provided no other callers use that previous state argument that it's now unnecessary. I didn't check other callers though so there may be a reason to leave it?

There is another caller that uses it to restore the previous state on error, so I think we still need it in both places until we make a larger change.

we should probably update the comment above that lists the allowed state transitions

That comment is pretty misleading because it doesn't account for all the calls with validateState=false, as we can see from the diagram in #10634, the actual state transitions are more involved. Maybe it would be better to delete the comment instead of having it be misleading.

dhiaayachi · 2021-08-05T13:47:28Z

There is another caller that uses it to restore the previous state on error, so I think we still need it in both places until we make a larger change.

What about doing something like this?

func (c *CAManager) setState(newState caState, validateState bool) error {
	c.stateLock.Lock()
	defer c.stateLock.Unlock()
	state := c.state

	if !validateState ||
		(state == caStateInitialized && newState != caStateInitializing) ||
		(state == caStateUninitialized && newState == caStateInitializing) ||
		(state == caStateUninitialized && newState == caStateReconfig) {
		c.previousState = c.state
		c.state = newState
	} else {
		return &caStateError{Current: state}
	}
	return nil
}
func (c *CAManager) revertState(state caState) error {
	c.stateLock.Lock()
	defer c.stateLock.Unlock()
	if c.state != state {
		return &caStateError{Current: c.state}
	}
	previousState := c.previousState
	c.previousState = c.state
	c.state = previousState
	return nil
}

I think we should have the state managed internally through clear interfaces (set,reset,revert) and not let the state user meddle with it, and this would be a step toward it.
That said, I agree to keep this as a future step in enhancing this code.

dnephin

LGTM

dnephin · 2021-08-05T15:59:47Z

What about doing something like this?

I think the interface proposed at the bottom of #10634 does a better job of demonstrating the operations, and lets us keep more of the logic in a centralized place. With this approach the caller doesn't need to know anything about the states, or what transitions are allowed.

I think there are definitely similarities between the two proposals.

I think we should have the state managed internally through clear interfaces

Definitely agree with this, but I think set, revert is too low level. It forces the caller to be much more aware of the states.

Instead of validateState and revertState we should provide the caller a function to restore to the appropriate state. That is called Complete in the proposal in #10634

hc-github-team-consul-core · 2021-08-05T18:51:54Z

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/422659.

hc-github-team-consul-core · 2021-08-05T18:54:36Z

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/422703.

hc-github-team-consul-core · 2021-08-05T18:54:41Z

🍒✅ Cherry pick of commit b495036 onto release/1.10.x succeeded!

…ING` state (#10630) * defer setting the state before returning to avoid being stuck in `INITIALIZING` state * add changelog * move comment with the right if statement * ca: report state transition error from setSTate * update comment to reflect state transition Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>

hc-github-team-consul-core · 2021-08-05T18:54:44Z

🍒✅ Cherry pick of commit b495036 onto release/1.9.x succeeded!

…ING` state (#10630) * defer setting the state before returning to avoid being stuck in `INITIALIZING` state * add changelog * move comment with the right if statement * ca: report state transition error from setSTate * update comment to reflect state transition Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>

hc-github-team-consul-core · 2021-08-05T18:54:47Z

🍒✅ Cherry pick of commit b495036 onto release/1.8.x succeeded!

…ING` state (#10630) * defer setting the state before returning to avoid being stuck in `INITIALIZING` state * add changelog * move comment with the right if statement * ca: report state transition error from setSTate * update comment to reflect state transition Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>

dhiaayachi added 2 commits July 16, 2021 11:14

defer setting the state before returning to avoid being stuck in `INI…

bb4d148

…TIALIZING` state

add changelog

5879ce9

vercel bot temporarily deployed to Preview – consul July 16, 2021 15:18 Inactive

vercel bot temporarily deployed to Preview – consul-ui-staging July 16, 2021 15:18 Inactive

dhiaayachi requested review from banks and a team July 16, 2021 15:20

dnephin reviewed Jul 16, 2021

View reviewed changes

agent/consul/leader_connect_ca.go Outdated Show resolved Hide resolved

dnephin added theme/certificates Related to creating, distributing, and rotating certificates in Consul type/bug Feature does not function as expected labels Jul 16, 2021

move comment with the right if statement

9052975

vercel bot temporarily deployed to Preview – consul July 16, 2021 20:15 Inactive

vercel bot temporarily deployed to Preview – consul-ui-staging July 16, 2021 20:15 Inactive

ca: report state transition error from setSTate

584e4c5

vercel bot temporarily deployed to Preview – consul July 22, 2021 16:05 Inactive

vercel bot temporarily deployed to Preview – consul-ui-staging July 22, 2021 16:05 Inactive

dhiaayachi commented Jul 22, 2021

View reviewed changes

agent/consul/leader_connect_ca.go Show resolved Hide resolved

banks reviewed Jul 23, 2021

View reviewed changes

agent/consul/leader_connect_ca.go Show resolved Hide resolved

dnephin approved these changes Aug 5, 2021

View reviewed changes

update comment to reflect state transition

ed59747

vercel bot temporarily deployed to Preview – consul August 5, 2021 17:56 Inactive

vercel bot temporarily deployed to Preview – consul-ui-staging August 5, 2021 17:56 Inactive

dhiaayachi merged commit b495036 into main Aug 5, 2021

dhiaayachi deleted the dhia/ca-provider-init-state branch August 5, 2021 18:51

dnephin added backport/1.10 labels Aug 5, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

defer setting the state before returning to avoid stuck in `INITIALIZING` state #10630

defer setting the state before returning to avoid stuck in `INITIALIZING` state #10630

dhiaayachi commented Jul 16, 2021

dnephin left a comment

dhiaayachi commented Jul 16, 2021

banks commented Jul 16, 2021 •

edited

Loading

dnephin commented Jul 16, 2021 •

edited

Loading

dhiaayachi commented Jul 16, 2021

banks commented Jul 19, 2021 •

edited

Loading

dnephin commented Jul 19, 2021 •

edited

Loading

dhiaayachi commented Jul 19, 2021

banks commented Jul 19, 2021 via email

dhiaayachi commented Jul 19, 2021

dhiaayachi left a comment

banks left a comment

dnephin commented Jul 26, 2021

dhiaayachi commented Aug 5, 2021

dnephin left a comment

dnephin commented Aug 5, 2021 •

edited

Loading

hc-github-team-consul-core commented Aug 5, 2021

hc-github-team-consul-core commented Aug 5, 2021

hc-github-team-consul-core commented Aug 5, 2021

hc-github-team-consul-core commented Aug 5, 2021

hc-github-team-consul-core commented Aug 5, 2021

defer setting the state before returning to avoid stuck in INITIALIZING state #10630

defer setting the state before returning to avoid stuck in INITIALIZING state #10630

Conversation

dhiaayachi commented Jul 16, 2021

dnephin left a comment

Choose a reason for hiding this comment

dhiaayachi commented Jul 16, 2021

banks commented Jul 16, 2021 • edited Loading

dnephin commented Jul 16, 2021 • edited Loading

dhiaayachi commented Jul 16, 2021

banks commented Jul 19, 2021 • edited Loading

dnephin commented Jul 19, 2021 • edited Loading

dhiaayachi commented Jul 19, 2021

banks commented Jul 19, 2021 via email

dhiaayachi commented Jul 19, 2021

dhiaayachi left a comment

Choose a reason for hiding this comment

banks left a comment

Choose a reason for hiding this comment

dnephin commented Jul 26, 2021

dhiaayachi commented Aug 5, 2021

dnephin left a comment

Choose a reason for hiding this comment

dnephin commented Aug 5, 2021 • edited Loading

hc-github-team-consul-core commented Aug 5, 2021

hc-github-team-consul-core commented Aug 5, 2021

hc-github-team-consul-core commented Aug 5, 2021

hc-github-team-consul-core commented Aug 5, 2021

hc-github-team-consul-core commented Aug 5, 2021

defer setting the state before returning to avoid stuck in `INITIALIZING` state #10630

defer setting the state before returning to avoid stuck in `INITIALIZING` state #10630

banks commented Jul 16, 2021 •

edited

Loading

dnephin commented Jul 16, 2021 •

edited

Loading

banks commented Jul 19, 2021 •

edited

Loading

dnephin commented Jul 19, 2021 •

edited

Loading

dnephin commented Aug 5, 2021 •

edited

Loading