Stop processing ACME verifications when active node is stepped down #23278
Conversation
- If a node is stepped down from being the active node while it is actively processing a verification, we could get into an infinite loop due to an ErrReadOnly error attempting to clean up a challenge entry
- Add a maximum number of retries for errors around attempting to decode/fetch challenge/authorization entries from disk. We use double the number of "normal" max attempts for these types of errors than we would for normal ACME retry attempts, to avoid collision issues. Note that these additional retry attempts are not persisted to disk and will restart on every node start
- Add a 5 second backoff to any disk-related error so we do not immediately spin on disk/IO errors for challenges.
Looks great! I asked some clarifying questions.
// data store, we would re-attempt them forever and never clean them up. We want to leverage
// the retry counts within the authorization/challenge objects first which is why we give ourselves
// twice the attempts here.
if retries > MaxRetryAttempts*2 {
I am not super familiar with the internals of the ACME challenge engine, so please bear with me with my questions. What is the correlation between MaxRetryAttempts and the retry count within the authorization/challenge object? Would this doubling here cause confusion in the future, as in it may cause two different behaviours for a single thing?
I am not super familiar with the internals of the ACME challenge engine, so please bear with me with my questions.
No worries, I appreciate the feedback!
What is the correlation between MaxRetryAttempts and the retry count within the authorization/challenge object?
There isn't really a correlation, but you are right that it may lead to confusion. Basically, with this new counter I'm just trying to track whether we are looping continuously on a challenge that we can't seem to decode/process, outside of the normal ACME processing pipeline. It's kind of a catch-22: if we can't decode/read the event from storage, how are we supposed to know how many times we've attempted it...
I realized that if we don't do this and for some reason these errors start to accumulate, we may get stuck continuously processing them and never pick up anything new (roughly sketched below).
Would this doubling here cause confusion in the future, as in it may cause two different behaviours for a single thing?
Maybe, but again this is here as a stop gap; technically it "should" never fire.
One thing to keep in mind with ACME challenges/orders is that it isn't the end of the world to fail/lose them. Not great, mind you, but all the clients I've seen can just restart the process again with minimal effort.
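For illustration, here is a minimal Go sketch of the kind of in-memory cap being described; the map, the helper name, and the constant's value are hypothetical stand-ins, with only MaxRetryAttempts and the *2 factor coming from the quoted diff.

package acmesketch

// Hypothetical sketch of the in-memory decode/fetch retry cap discussed above.
const MaxRetryAttempts = 5 // illustrative value, not the real constant's value

// decodeRetries counts, per challenge id, how many times we failed to even
// load/decode the entry from storage. It is never persisted, so it resets on
// every node start.
var decodeRetries = map[string]int{}

func exceededDecodeRetries(id string) bool {
	decodeRetries[id]++
	// Give ourselves twice the normal attempts so this safety net only fires
	// well after the retry count stored inside the challenge itself would have.
	return decodeRetries[id] > MaxRetryAttempts*2
}

In the actual engine this bookkeeping presumably lives alongside the queued validation entry rather than in a package-level map; the point is only that the count is memory-only and intentionally looser than the normal ACME retry limit.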
This is good context! Thanks!
@@ -343,23 +372,23 @@ func (ace *ACMEChallengeEngine) VerifyChallenge(runnerSc *storageContext, id str

func (ace *ACMEChallengeEngine) _verifyChallenge(sc *storageContext, id string, config *acmeConfigEntry) (bool, time.Time, error) {
	now := time.Now()
	backoffTime := now.Add(5 * time.Second)
Should this backoff be exponentially growing instead of a flat 5 second delay? Maybe there is a glitch in the IO or decoding the JSON, etc.; should it wait that long to retry? Could this cause ace.Validations to be filled up without doing the actual work and cause issues down the road?
Should this backoff be exponentially growing instead of a flat 5 second delay?
The original design was to try to get through the ACME challenges as quickly as possible to avoid long-standing entries within that validation queue in memory. I did pick the 5 second delay somewhat arbitrarily, but I wouldn't want it to grow excessively either.
Maybe there is a glitch in the IO or decoding the JSON, etc.; should it wait that long to retry?
It wouldn't be great, but then again IO glitches aren't really something I'd special-case for. Note that there is an exponential backoff for the "normal" validation failure workflow, just not for this one. I don't think 5 seconds is excessive for us to come back to the item either, but I'm happy to be proven wrong here as well.
Could this cause ace.Validations to be filled up without doing the actual work and cause issues down the road?
I don't think so, to get into this queue in the first place an ACME client needs to create a new order and then acknowledge they have solved the given challenge. All those things require I/O so I think they would fail as well.
But that is also partly why I have this new retry mechanism/timeout in place: in case there is something I'm not seeing, we have an upper bound on how many times we try to process an item before we give up.
What do you think of using https://github.com/cenkalti/backoff, which lets you set a backoff that starts small but grows exponentially up to a certain cap? This package is used elsewhere in Vault as well.
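For reference, a capped exponential backoff with that package could look roughly like the following; the specific intervals are placeholders, not values suggested anywhere in this thread.

package acmesketch

import (
	"time"

	"github.com/cenkalti/backoff/v4"
)

// newChallengeBackOff builds an exponential backoff that ramps up quickly but
// is capped, per the suggestion above. The intervals are placeholders.
func newChallengeBackOff() *backoff.ExponentialBackOff {
	bo := backoff.NewExponentialBackOff()
	bo.InitialInterval = 250 * time.Millisecond // retry quickly at first
	bo.MaxInterval = 5 * time.Second            // never wait longer than the cap
	bo.MaxElapsedTime = 0                       // zero means it never gives up on its own
	return bo
}

Each NextBackOff() call then returns a delay that grows toward MaxInterval (with the package's default jitter), which matches the slow-start, capped behaviour being suggested.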
I thought about this last night; I don't think we want the exponential backoff here, as we'd like to clear that in-memory queue quickly, so even if we used it I'd limit its maximum to 5 seconds. That said, beyond say randomization across multiple items, which doesn't really matter in this use-case, I don't see the benefit.
After more thought I've lowered the backoffs to 1 second; the main purpose of bumping them was to avoid an immediate retry of the same thing.
Or is there some aspect I'm missing?
// Remove our ChallengeValidation entry only.
if deleteErr := sc.Storage.Delete(sc.Context, acmeValidationPrefix+id); deleteErr != nil {
-	return true, now.Add(-1 * time.Second), fmt.Errorf("error deleting challenge %v (error prior to cleanup, if any: %v): %w", id, err, deleteErr)
+	return true, backoff, fmt.Errorf("error deleting challenge %v (error prior to cleanup, "+"if any: %v): %w", id, err, deleteErr)
I am not sure if we would need to change this bit? If we are not successful we should get rid of that entry, and if we hit the readonly error we bail anyway.
Actually, it does not seem to me that we would need to change anything in this function. The last return returns false for retry, so we don't retry at all. Again, please take these comments with a grain of salt as I am not familiar with the workflow that much!
I am not sure if we would need to change this bit? If we are not successful we should get rid of that entry, and if we hit the readonly error we bail anyway.
By "get rid of that entry", I assume you mean the in-memory version of it? We could do that, but then the item is still within storage, which in theory will be reloaded on the next restart. So if the error we got back from the Delete wasn't an ErrReadOnly but, say, an I/O error, we wouldn't re-attempt the delete until the next restart.
Actually, it does not seem to me that we would need to change anything in this function. The last return, returns false for retry, so we don't retry at all.
You are correct; I wanted the backoff behavior to apply to the I/O error use-case, and ideally this nudges anyone adding statements here in the future to apply the backoff as well. In retrospect it would have been better to return the actual errors and keep the retry/backoff logic in a single location within VerifyChallenge, but that was a bigger refactoring I didn't want to get into for a backport.
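To make that refactor idea concrete (it is explicitly not done in this PR), here is a hypothetical shape where the helper just reports errors and one call site owns the retry/backoff decision; apart from the role of _verifyChallenge and the 1 second backoff mentioned in this thread, every name is illustrative.

package acmesketch

import "time"

// verifyOnce stands in for a _verifyChallenge that only reports what happened
// and never chooses backoff values itself.
func verifyOnce(id string) (retry bool, err error) {
	// ... the actual verification work would live here ...
	return false, nil
}

// scheduleChallenge is the single place that applies retry/backoff policy.
func scheduleChallenge(id string) (requeue bool, retryAfter time.Time) {
	retry, err := verifyOnce(id)
	if err != nil {
		// Any error, I/O or otherwise, gets the same small backoff here
		// instead of each error branch picking its own return values.
		return true, time.Now().Add(1 * time.Second)
	}
	return retry, time.Now()
}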
I actually meant from storage. So, the original code means: get rid of that storage entry as soon as possible, since the retryAfter is set to the past! Then, not changing anything here is fine for all intents and purposes.
Sorry, perhaps I'm not understanding, but if we don't change anything in this function, wouldn't we be looping very quickly on I/O errors on the deletion call?
You are right about the looping. So far, what we have been seeing is the "cannot write to readonly storage" error:
2023-09-22T18:23:34.4464531Z 2023-09-22T18:23:00.674Z [ERROR] Test_ACME.core-2.secrets.pki.pki_7114d24b: ACME validation failed for 5dcc6663-8b3c-5826-7f62-b9018987fc5a-dns-01: error deleting challenge 5dcc6663-8b3c-5826-7f62-b9018987fc5a-dns-01 (error prior to cleanup, if any: <nil>): cannot write to readonly storage
So, other than that error, do you think we should be backing off that much?
Hmm, maybe we should! What I like about the original behaviour is the priority it gives to cleaning up.
Maybe we should also consider the workload here. How much workload do you think we might expect? Is there a use case in which there are going to be millions of challenge cleanups bursting all at once? If so, maybe we should reconsider the design. We have a similar issue with lease revocation right now and that has caused several problems.
The in-memory queue contains all the in-flight ACME validation requests; if we were servicing, say, LetsEncrypt-scale workloads this definitely would not fly. This isn't the entire ACME lifecycle, just the validation of a request, so it should be relatively quick, and we clean up the requests within the same processing.
Now, we did have that issue with possibly corrupted items that would never be removed, which should be resolved within this PR.
runnerSc.Backend.Logger().Debug("Starting verification of challenge", "id", id)

if retry, retryAfter, err := ace._verifyChallenge(sc, id, config); err != nil {
	if errors.Is(err, logical.ErrReadOnly) {
One thing that I am wondering about: would the caller of VerifyChallenge need to check whether core is an active node or not? I don't see that anywhere there, and delaying it until here seems unnecessary. The ACME workload might be large enough that it could take some time to get rid of every item in the list with the readonly error (at least this is what I understood reading the caller code).
One thing that I am wondering about: would the caller of VerifyChallenge need to check whether core is an active node or not?
Yup, you are correct; I think this bug is still here if we have just enough read-only errors to keep us within the second for loop. Great catch. The backoffs might not be enough to pop us out of that loop to then re-evaluate the ace.Closing status and shut down the thread completely.
ACME work load might be large enough that it could be taking some time to get rid of every item in the list with the readonly
It's a little weird, but yeah, we only have a maximum of 5 items in flight at any given time. The plugin's backend Clean method signals the Closing channel so we don't need to process all items within the queue, but I do think we should add a check for Closing and the current state to break out of the second for loop.
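A rough Go sketch of that loop-level guard follows; the channel-plus-state pattern is generic, and isActiveNode and next are hypothetical stand-ins for however the engine checks its role and dequeues work.

package acmesketch

// processQueue bails out when the engine is closing or the node has been
// stepped down, instead of spinning on read-only storage errors.
func processQueue(closing <-chan struct{}, isActiveNode func() bool, next func() (string, bool)) {
	for {
		select {
		case <-closing:
			// Backend cleanup has been signalled; stop processing the queue.
			return
		default:
		}

		if !isActiveNode() {
			// Stepped down mid-processing; further cleanup writes would fail
			// with ErrReadOnly, so stop and let the new active node take over.
			return
		}

		id, ok := next()
		if !ok {
			return // queue drained
		}
		_ = id // verification of this challenge would happen here
	}
}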
Yup, you were correct. I've removed the special handling for ErrReadOnly here and instead added a check within the loop that we are running on the active node, as well as not starting up the ACME thread or loading any persisted challenges to attempt processing on non-active nodes.
- This reverts some of the previous changes in the PR, mainly around trapping the ErrReadOnly
- Do not schedule the ACME worker thread if we aren't an active node, which will prevent us from reading the queue from disk (roughly sketched below)
- Bail out early within the for loop prior to scheduling the verification if we are no longer an active node
- Readjust the retries for deletion errors
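As a rough illustration of the second bullet (not the PR's actual code), the engine start could be gated on replication state; treating the SDK's consts flags as the way standby/secondary status is detected is an assumption here.

package acmesketch

import (
	"github.com/hashicorp/vault/sdk/helper/consts"
	"github.com/hashicorp/vault/sdk/logical"
)

// shouldStartChallengeEngine is a hypothetical helper: only a node with
// writable storage (the active node) should start the ACME worker and load
// persisted challenges; otherwise every cleanup write fails with ErrReadOnly.
func shouldStartChallengeEngine(sys logical.SystemView) bool {
	state := sys.ReplicationState()
	return !state.HasState(consts.ReplicationDRSecondary) &&
		!state.HasState(consts.ReplicationPerformanceStandby)
}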
My only question: if a secondary node becomes the primary, will the challenge engine start up?
Yes, confirmed that when a secondary node becomes the active node, the plugin's Initialize function is called, so we will re-evaluate and start it up. I've also manually tested that it works when stepping down nodes in a main cluster as well as nodes within a performance replica cluster.
Backport labels?
LogConsumer: func(s string) {
	ts.log.Info(s)
},
// DNS container logging was disabled to reduce content within CI logs.
Should this be removed?
Yes, we will try it for now; it was another request from Paul Banks to reduce logging in CI.
Do not load existing ACME challenges persisted within storage on non-active nodes. This was the main culprit of the issues: secondary nodes would load existing persisted challenges and try to resolve them, but writes would fail, leading to the excessive logging.
- If a node is stepped down from being the active node while it is actively processing a verification, we could get into an infinite loop due to an ErrReadOnly error attempting to clean up a challenge entry
- Add a maximum number of retries for errors around attempting to decode/fetch challenge/authorization entries from disk. We use double the number of "normal" max attempts for these types of errors than we would for normal ACME retry attempts, to avoid collision issues. Note that these additional retry attempts are not persisted to disk and will restart on every node start
- Add a 1 second backoff to any disk-related error so we do not immediately spin on disk/IO errors for challenges.