Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add MustRevalidate flag to connect_ca_leaf cache type; always use on non-blocking queries #11693

Merged
merged 3 commits into from
Dec 2, 2021

Conversation

acpana
Copy link
Contributor

@acpana acpana commented Nov 30, 2021

Overview

PR addresses #10871 / #9862 .

Today, once we HIT/ load up a leaf cert in the agent cache, we can return expired certs or certs signed by a previously rotated CA.

These changes introduces a MustRevalidate flag to the connect_ca_leaf cache type that we always use on non-blocking queries.

Notes

Always setting MustRevalidate on non blocking queries means we always "MISS" the cache on non blocking queries. In effect, the headers are now redundant.

PR checklist

  • gofmt
  • testing
  • pr changelog
  • [ ] docs update --> I want to pull this into a separate PR bc I want to change some of the docs wording

@dnephin dnephin added the theme/certificates Related to creating, distributing, and rotating certificates in Consul label Nov 30, 2021
Copy link
Contributor

@dnephin dnephin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! Some thoughts/suggestions

Comment on lines 1523 to 1528
// We don't want non-blocking queries to return expired leaf certs
// or leaf certs not valid under the current CA. So always revalidate
// the leaf cert on non-blocking queries
if args.MinQueryIndex == 0 {
args.MustRevalidate = true
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this logic might be better in ConnectCALeafRequest.CacheInfo(). I believe that method should have access to all the same data.

By moving it there we can be sure that all users of the cache do the right thing, instead of only this one HTTP API endpoint.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm going to walk this back.

Here's what I ran into -- there's a non trivial amount of ConnectCALeaf testing that would need to change as a result of always passing in MustRevalidate assignment logic in the CacheInfo(). It kinda feels like this behavior is "breaking" with previous one. It seems, ergonomically, this func was designed more as a "getter".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, that is unfortunate. I think those tests must not be accurately testing the production behaviour very well, but that is cleanup we can leave for another time I think. Thanks for trying.

Maybe leave a TODO here instead? Not critical

IPSAN []net.IP
MinQueryIndex uint64
MaxQueryTime time.Duration
MustRevalidate bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we move the logic into CacheInfo (suggested in the other comment) then we may be able to avoid adding this field.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed walking back this change here --> #11693 (comment)

Comment on lines 408 to 416
// If we called Fetch() with MustRevalidate then this call came from a non-blocking query.
// To pickup any CA rotations that may have happened prior to this call, we will reduce the timeout
// to ~ 1 sec. Otherwise, we risk non-blocking queries taking ~ 10 minutes to wait on a CA rotation
// that could never happen.
// see https://github.com/hashicorp/consul/issues/10871, https://github.com/hashicorp/consul/issues/9862
// This is a hack, so, ideally, we can revisit the caching/ fetching logic in this case
if req.CacheInfo().MustRevalidate {
timeoutCh = time.After(time.Second)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this works, but I also wonder if we could make the non-blocking case a bit more obvious by avoiding the select entirely.

What if we were to:

  • move this if above where we create rootUpdateCh around line 382
  • remove the second if below (we'll handle it in this same if block)
  • instad of timeoutCh = time.After(time.Second) we do something like this:
roots, err := c.rootsFromCache()
if err != nil {
    return lastResultWithNewState(), err
}
if activeRootHasKey(roots, state.authorityKeyID) {
        return lastResultWithNewState(), err
}
... // some shared logic here to calculate expiresAt
if state.forceExpireAfter.Before(expiresAt) {
    return c.generateNewLeaf(reqReal, lastResultWithNewState())
}
return lastResultWithNewState(), err

The ... about calculating expiresAt could be extracted from the 461-470 into a new function and called from both places.

I haven't fully gone through that to be sure that works, but my impression is that something like that would work and allow us to avoid the select which makes it a bit more obvious that we are handling a non-blocking request, without having to set a short timeout.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey thanks for bringing this up again -- I recall we chatted about this approach and dropped the ball as I was debugging the "vault issue". 💯

I think it makes a lot of sense (readability, maintainability) to treat the MustRevalidate/ non-blocking case as "separate" from the select stmt.

Immediately after we check :

if expiresAt.Equal(now) || expiresAt.Before(now) {
// Already expired, just make a new one right away
return c.generateNewLeaf(reqReal, lastResultWithNewState())
}

If we add something like ~:

	if req.CacheInfo().MustRevalidate {
		roots, err := c.rootsFromCache()
		if err != nil {
			return lastResultWithNewState(), err
		}
		if activeRootHasKey(roots, state.authorityKeyID) {
			return lastResultWithNewState(), nil
		}

		// if we reach here then the current leaf was not signed by the same CAs, just regen
		return c.generateNewLeaf(reqReal, lastResultWithNewState())
	}

That will catch most occurrences of #10871. Sometimes, the CA may take a bit to be rotated but that's ok. "Certainty" / " strong consistency" is what the blocking queries are for.

Also note, that we don't need to "calculate" when to regen the cert. If for whatever reason we don't have a CA that signed the leaf, we need to regen the leaf cert.

Wdyt?

@vercel vercel bot temporarily deployed to Preview – consul December 1, 2021 02:31 Inactive
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging December 1, 2021 02:31 Inactive
@vercel vercel bot temporarily deployed to Preview – consul December 1, 2021 02:32 Inactive
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging December 1, 2021 02:32 Inactive
Signed-off-by: FFMMM <FFMMM@users.noreply.github.com>
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging December 1, 2021 18:07 Inactive
@vercel vercel bot temporarily deployed to Preview – consul December 1, 2021 18:07 Inactive
Copy link
Contributor

@dnephin dnephin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, LGTM!

One small suggestion for the loop in the test. I think it might be worth to look into moving the MustRevalidate into the CacheInfo stuff, but that is not urgent, so would be fine as a TODO I think.

agent/agent_endpoint_test.go Outdated Show resolved Hide resolved
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
@vercel vercel bot temporarily deployed to Preview – consul December 2, 2021 18:46 Inactive
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging December 2, 2021 18:46 Inactive
@vercel vercel bot temporarily deployed to Preview – consul December 2, 2021 18:53 Inactive
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging December 2, 2021 18:53 Inactive
Signed-off-by: FFMMM <FFMMM@users.noreply.github.com>
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging December 2, 2021 19:04 Inactive
@vercel vercel bot temporarily deployed to Preview – consul December 2, 2021 19:04 Inactive
@acpana acpana merged commit 384d497 into main Dec 2, 2021
@acpana acpana deleted the ffmmm/b-10871-2 branch December 2, 2021 19:32
@hc-github-team-consul-core
Copy link
Contributor

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/513605.

@hc-github-team-consul-core
Copy link
Contributor

🍒❌ Cherry pick of commit 384d497 onto release/1.10.x failed! Build Log

@hc-github-team-consul-core
Copy link
Contributor

🍒❌ Cherry pick of commit 384d497 onto release/1.9.x failed! Build Log

acpana pushed a commit that referenced this pull request Dec 2, 2021
…non-blocking queries (#11693)

* always use MustRevalidate on non-blocking queries for connect ca leaf

Signed-off-by: FFMMM <FFMMM@users.noreply.github.com>

* Update agent/agent_endpoint_test.go

Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>

* pr feedback

Signed-off-by: FFMMM <FFMMM@users.noreply.github.com>

Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
acpana pushed a commit that referenced this pull request Dec 2, 2021
…non-blocking queries (#11693)

* always use MustRevalidate on non-blocking queries for connect ca leaf

Signed-off-by: FFMMM <FFMMM@users.noreply.github.com>

* Update agent/agent_endpoint_test.go

Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>

* pr feedback

Signed-off-by: FFMMM <FFMMM@users.noreply.github.com>

Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
acpana pushed a commit that referenced this pull request Dec 2, 2021
…non-blocking queries (#11693) (#11714)

* always use MustRevalidate on non-blocking queries for connect ca leaf

Signed-off-by: FFMMM <FFMMM@users.noreply.github.com>

* Update agent/agent_endpoint_test.go

Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>

* pr feedback

Signed-off-by: FFMMM <FFMMM@users.noreply.github.com>

Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>

Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
acpana pushed a commit that referenced this pull request Dec 2, 2021
…non-blocking queries (#11693) (#11715)

* always use MustRevalidate on non-blocking queries for connect ca leaf

Signed-off-by: FFMMM <FFMMM@users.noreply.github.com>

* Update agent/agent_endpoint_test.go

Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>

* pr feedback

Signed-off-by: FFMMM <FFMMM@users.noreply.github.com>

Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>

Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
rboyer pushed a commit that referenced this pull request Apr 19, 2022
…king leaf cert query to block

Fixes #12048
Fixes #12319

Regression introduced in #11693
rboyer added a commit that referenced this pull request Apr 20, 2022
…king leaf cert query to block (#12820)

Fixes #12048

Fixes #12319

Regression introduced in #11693

Local reproduction steps:
1. `consul agent -dev`
2. `curl -sLiv 'localhost:8500/v1/agent/connect/ca/leaf/web'`
3. make note of the `X-Consul-Index` header returned 
4. `curl -sLi 'localhost:8500/v1/agent/connect/ca/leaf/web?index=<VALUE_FROM_STEP_3>'`
5. Kill the above curl when it hangs with Ctrl-C
6. Repeat (2) and it should not hang.
hc-github-team-consul-core pushed a commit that referenced this pull request Apr 20, 2022
…king leaf cert query to block (#12820)

Fixes #12048

Fixes #12319

Regression introduced in #11693

Local reproduction steps:
1. `consul agent -dev`
2. `curl -sLiv 'localhost:8500/v1/agent/connect/ca/leaf/web'`
3. make note of the `X-Consul-Index` header returned 
4. `curl -sLi 'localhost:8500/v1/agent/connect/ca/leaf/web?index=<VALUE_FROM_STEP_3>'`
5. Kill the above curl when it hangs with Ctrl-C
6. Repeat (2) and it should not hang.
hc-github-team-consul-core pushed a commit that referenced this pull request Apr 20, 2022
…king leaf cert query to block (#12820)

Fixes #12048

Fixes #12319

Regression introduced in #11693

Local reproduction steps:
1. `consul agent -dev`
2. `curl -sLiv 'localhost:8500/v1/agent/connect/ca/leaf/web'`
3. make note of the `X-Consul-Index` header returned 
4. `curl -sLi 'localhost:8500/v1/agent/connect/ca/leaf/web?index=<VALUE_FROM_STEP_3>'`
5. Kill the above curl when it hangs with Ctrl-C
6. Repeat (2) and it should not hang.
hc-github-team-consul-core pushed a commit that referenced this pull request Apr 20, 2022
…king leaf cert query to block (#12820)

Fixes #12048

Fixes #12319

Regression introduced in #11693

Local reproduction steps:
1. `consul agent -dev`
2. `curl -sLiv 'localhost:8500/v1/agent/connect/ca/leaf/web'`
3. make note of the `X-Consul-Index` header returned 
4. `curl -sLi 'localhost:8500/v1/agent/connect/ca/leaf/web?index=<VALUE_FROM_STEP_3>'`
5. Kill the above curl when it hangs with Ctrl-C
6. Repeat (2) and it should not hang.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
theme/certificates Related to creating, distributing, and rotating certificates in Consul
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants