Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

defer setting the state before returning to avoid stuck in INITIALIZING state #10630

Merged
merged 5 commits into from
Aug 5, 2021

Conversation

dhiaayachi
Copy link
Contributor

fix #10625

@vercel vercel bot temporarily deployed to Preview – consul July 16, 2021 15:18 Inactive
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging July 16, 2021 15:18 Inactive
@dhiaayachi dhiaayachi requested review from banks and a team July 16, 2021 15:20
Copy link
Contributor

@dnephin dnephin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code being changed here was introduced in d9d4c49, and looking at the changelog entries for that commit it seems to suggest it was attempting to fix this very bug.

Maybe the user experiencing this problem was not running v1.10 yet (where this fix was first introduced) ?

I think this PR may fix the issue, but I'm not confident I understand all the repercussions of this change. It's possible we are reverting a fix for some other bug.

I think it might be valuable to create a state diagram. I suspect setState could be changed to better model valid states which would help make the correct behaviour easier to reason about.

agent/consul/leader_connect_ca.go Outdated Show resolved Hide resolved
@dnephin dnephin added theme/certificates Related to creating, distributing, and rotating certificates in Consul type/bug Feature does not function as expected labels Jul 16, 2021
@dhiaayachi
Copy link
Contributor Author

The code being changed here was introduced in d9d4c49, and looking at the changelog entries for that commit it seems to suggest it was attempting to fix this very bug.

Maybe the user experiencing this problem was not running v1.10 yet (where this fix was first introduced) ?

I think this PR may fix the issue, but I'm not confident I understand all the repercussions of this change. It's possible we are reverting a fix for some other bug.

I think it might be valuable to create a state diagram. I suspect setState could be changed to better model valid states which would help make the correct behaviour easier to reason about.

Thank you for looking at this @dnephin

My understanding is we should never leave InitializeCA method with the provider state other then INITIALIZED or UNINITIALIZED which make sense as InitializeCA is supposed to do all the ca initializing and set the state to INITIALIZED or fail and set the state to UNINITIALIZED.
The current implementation have a possibility to leave the InitializeCA method with the state INITIALIZING which will break the state machine and keep it stuck in INITIALIZING, this happen in the following conditions:

  • we start InitializeCA with state INITIALIZED
  • set the state to INITIALIZING which return oldState = INITIALIZED
  • leave InitializeCA in the if statement

@banks
Copy link
Member

banks commented Jul 16, 2021

[edit] posted this before I saw @dhiaayachi's response which perfectly sums it up...

@dnephin yeah I think it was the fixes for those other issues that introduced this.

The user was running 1.9.7 which contains the same lines

// Update the state before doing anything else.
oldState, err := c.setState(caStateInitializing, true)
// if we were already in the initialized state then there is nothing to be done.
if oldState == caStateInitialized {
return nil
}
if err != nil {
return err
}

I think this specific issue was not the same as any of the ones Matt found and was trying to fix there although it's a similar class of problem. In general, I think his fixes made this code much more robust and worked well, I think this one issue of returning before the defer has introduced this new broken state since we are breaking the invariant that we always exit InitializeCA in a valid state for further changes (i.e. either it worked and we should now be Initialized or it failed and we should be back to uninitialized.

@vercel vercel bot temporarily deployed to Preview – consul July 16, 2021 20:15 Inactive
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging July 16, 2021 20:15 Inactive
@dnephin
Copy link
Contributor

dnephin commented Jul 16, 2021

I added some thoughts and a diagram in #10634.

I forgot that the commit would not show if the cherry-pick was needed for a backport, good call looking at the actual release branch.

I think on the surface this change does fix the problem, but is it the right fix? Looking at the model I put into #10634, it seems like this is one of the "INVALID" state transitions. Maybe setState should actually reject this state transition, in which case the code in InitializeCA would already be correct.

@dhiaayachi
Copy link
Contributor Author

I added some thoughts and a diagram in #10634.

I forgot that the commit would not show if the cherry-pick was needed for a backport, good call looking at the actual release branch.

I think on the surface this change does fix the problem, but is it the right fix? Looking at the model I put into #10634, it seems like this is one of the "INVALID" state transitions. Maybe setState should actually reject this state transition, in which case the code in InitializeCA would already be correct.

I agree, we need to understand more what got us to call InitializeCA while we are at INITIALIZED state first and then we can decide if it's better to forbid that transition in setState and remove the if statement (because it should never happen) or just move the if statement

@banks
Copy link
Member

banks commented Jul 19, 2021

I think on the surface this change does fix the problem, but is it the right fix? Looking at the model I put into #10634, it seems like this is one of the "INVALID" state transitions. Maybe setState should actually reject this state transition, in which case the code in InitializeCA would already be correct.

Hmm yeah, I think cleaning up the state machine is a great idea. One thing I'd say is that even if setState disallowed this transition, I'd still prefer to move the return as in this PR - having it be "correct" to exit early before the defer only because of the implementation of another method somewhere else feels unnecessarily fragile when there are no downsides to moving it afterwards.

we need to understand more what got us to call InitializeCA while we are at INITIALIZED state

The thing that causes it is the background retrying of Initialize due to a misconfiguration which will only be resolved by the operator reconfiguring the CA. If we fix this by changing the state machine rules to be more restrictive then we will need to be very careful that the background Initializer is modified too. Right now if you only change setState then I think the background init routine will get into an infinite retry backoff and keep trying to initialize forever but fail because it's not allowed to initialize when the state is already "initialized". Provided that case is handled specially then it seems fine to me.

I'd still also be fine with merging the PR as it is and continuing state machine cleanup as a separate task though as I think the defer is meant to make it easier to reason about the exit state regardless of code path and this ordering violates that unnecessarily but see what you think @dhiaayachi.

@dnephin
Copy link
Contributor

dnephin commented Jul 19, 2021

One thing I'd say is that even if setState disallowed this transition, I'd still prefer to move the return as in this PR - having it be "correct" to exit early before the defer only because of the implementation of another method somewhere else feels unnecessarily fragile

Why would we allow an invalid state transition, just to check the state, then transition back? That's really unnecessarily confusing.

The way this should work (without having to do all the cleanup proposed by #10634) is that setState should return an error. The error should indicate the current state. In InitializeCA we should handle the error and check the previous state to determine if we return nil or err.

err := c.setState(caStateInitializing, true)
var errState stateError
switch {
case errors.As(err, &errState) && errState.Current == caStateInitialized:
    return nil
case err != nil:
    return err
}
defer func() { ... }

This way we never perform unnecessary and invalid state transitions. And the logic is in the right places. The state transition logic is in a single place (setState) and the "stop looping when initialized" logic is in InitializeCA. defer would only be used when there is a nil error, which is consistent with how defer is generally used in most Go code.

@dhiaayachi
Copy link
Contributor Author

@banks @dnephin
I went through the code again today taking into account the fact that the issue happen when the retry loop is running and we perform a reconfigure at the same time and I have 2 observations:

  • I think the root issue is that we let a retry loop run while we do reconfigure concurrently, this can lead to multiple race conditions other then the one we caught (Could explain the fact that the key was not updated in raft in our case). In my opinion reconfigure should stop the retry loop before starting and restart it after if the reconfigure fail.
  • Fixing the state machine is a bit risky for this PR specially with the fact that we have state == caStateInitialized || at the validation check and removing it impact other use cases. also most of the calls use validateState = false which bypass the state machine check and remove any guaranty on the transitions

I propose to:

  • keep this PR as is to fix that specific case while limiting the risk
  • investigate the possible issues around having a retry loop run while we do reconfigure
  • start a series of PRs to guarantee the consistency of the state machine as proposed by @dnephin in ca: general improvements to the CAMananger state machine #10634 and remove completely the option to disable the validation check.

@banks
Copy link
Member

banks commented Jul 19, 2021 via email

@dhiaayachi
Copy link
Contributor Author

My thinking is that we can have a state machine with a clear transition path combined with an init that return the state machine to an initial state if something bad happen and we want to get back to a known state.

@vercel vercel bot temporarily deployed to Preview – consul July 22, 2021 16:05 Inactive
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging July 22, 2021 16:05 Inactive
Copy link
Contributor Author

@dhiaayachi dhiaayachi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dnephin for changing this. I think this way of fixing the bug bring us closer to fixing the state machine and is less error prone.
I have a comment inline about the signature of SetState that became a bit weird.

agent/consul/leader_connect_ca.go Show resolved Hide resolved
Copy link
Member

@banks banks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Daniel, that does look cleaner!

One minor nit inline and I think @dhiaayachi has a good point provided no other callers use that previous state argument that it's now unnecessary. I didn't check other callers though so there may be a reason to leave it?

agent/consul/leader_connect_ca.go Show resolved Hide resolved
@dnephin
Copy link
Contributor

dnephin commented Jul 26, 2021

provided no other callers use that previous state argument that it's now unnecessary. I didn't check other callers though so there may be a reason to leave it?

There is another caller that uses it to restore the previous state on error, so I think we still need it in both places until we make a larger change.

we should probably update the comment above that lists the allowed state transitions

That comment is pretty misleading because it doesn't account for all the calls with validateState=false, as we can see from the diagram in #10634, the actual state transitions are more involved. Maybe it would be better to delete the comment instead of having it be misleading.

@dhiaayachi
Copy link
Contributor Author

There is another caller that uses it to restore the previous state on error, so I think we still need it in both places until we make a larger change.

What about doing something like this?

func (c *CAManager) setState(newState caState, validateState bool) error {
	c.stateLock.Lock()
	defer c.stateLock.Unlock()
	state := c.state

	if !validateState ||
		(state == caStateInitialized && newState != caStateInitializing) ||
		(state == caStateUninitialized && newState == caStateInitializing) ||
		(state == caStateUninitialized && newState == caStateReconfig) {
		c.previousState = c.state
		c.state = newState
	} else {
		return &caStateError{Current: state}
	}
	return nil
}
func (c *CAManager) revertState(state caState) error {
	c.stateLock.Lock()
	defer c.stateLock.Unlock()
	if c.state != state {
		return &caStateError{Current: c.state}
	}
	previousState := c.previousState
	c.previousState = c.state
	c.state = previousState
	return nil
}

I think we should have the state managed internally through clear interfaces (set,reset,revert) and not let the state user meddle with it, and this would be a step toward it.
That said, I agree to keep this as a future step in enhancing this code.

Copy link
Contributor

@dnephin dnephin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@dnephin
Copy link
Contributor

dnephin commented Aug 5, 2021

What about doing something like this?

I think the interface proposed at the bottom of #10634 does a better job of demonstrating the operations, and lets us keep more of the logic in a centralized place. With this approach the caller doesn't need to know anything about the states, or what transitions are allowed.

I think there are definitely similarities between the two proposals.

I think we should have the state managed internally through clear interfaces

Definitely agree with this, but I think set, revert is too low level. It forces the caller to be much more aware of the states.

Instead of validateState and revertState we should provide the caller a function to restore to the appropriate state. That is called Complete in the proposal in #10634

@vercel vercel bot temporarily deployed to Preview – consul August 5, 2021 17:56 Inactive
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging August 5, 2021 17:56 Inactive
@dhiaayachi dhiaayachi merged commit b495036 into main Aug 5, 2021
@dhiaayachi dhiaayachi deleted the dhia/ca-provider-init-state branch August 5, 2021 18:51
@hc-github-team-consul-core
Copy link
Contributor

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/422659.

@hc-github-team-consul-core
Copy link
Contributor

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/422703.

@hc-github-team-consul-core
Copy link
Contributor

🍒✅ Cherry pick of commit b495036 onto release/1.10.x succeeded!

hc-github-team-consul-core pushed a commit that referenced this pull request Aug 5, 2021
…ING` state (#10630)

* defer setting the state before returning to avoid being stuck in `INITIALIZING` state

* add changelog

* move comment with the right if statement

* ca: report state transition error from setSTate

* update comment to reflect state transition

Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
@hc-github-team-consul-core
Copy link
Contributor

🍒✅ Cherry pick of commit b495036 onto release/1.9.x succeeded!

hc-github-team-consul-core pushed a commit that referenced this pull request Aug 5, 2021
…ING` state (#10630)

* defer setting the state before returning to avoid being stuck in `INITIALIZING` state

* add changelog

* move comment with the right if statement

* ca: report state transition error from setSTate

* update comment to reflect state transition

Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
@hc-github-team-consul-core
Copy link
Contributor

🍒✅ Cherry pick of commit b495036 onto release/1.8.x succeeded!

hc-github-team-consul-core pushed a commit that referenced this pull request Aug 5, 2021
…ING` state (#10630)

* defer setting the state before returning to avoid being stuck in `INITIALIZING` state

* add changelog

* move comment with the right if statement

* ca: report state transition error from setSTate

* update comment to reflect state transition

Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
theme/certificates Related to creating, distributing, and rotating certificates in Consul type/bug Feature does not function as expected
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CA provider state stuck in INITIALIZING after provider initialization fail
4 participants