fix: Retry check workflow query to be more resilient to backend failures #1740
Overview
The check_workflow commands currently assume that the request returns a success value. If there is a transient issue or blip on the backend while it's unavailable, the entire check polling fails, which appears to the end user as a failure to run a check workflow even though the check was submitted successfully. The other adverse effect is that the gateway exposes backend/subgraph error messages to the client when it probably shouldn't.
This change augments the loop a bit: if there is an error, it sleeps and retries until hitting the configured checks_timeout_seconds (default 5 min), at which point it returns an E031 timeout. This should make us a little more resilient to backend failures, since this query is retryable and can return the completed check if queried again.
How this was tested
For testing, I added an artificial sleep after submitting the check request but before polling for the results started. This gave me time to disable my internet connection, simulating request failure, and the loop continued to poll every 5 seconds until reaching the checks_timeout_seconds value (defaulted to 5 minutes). For a second test, I disabled and then re-enabled my internet connection during polling, and the loop was able to fetch the results response once the connection was re-established. From the user's perspective, the request simply appeared to take longer.
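The retry behaviour described in the overview can be sketched roughly as follows. This is an illustrative Python sketch, not the code in this PR: `poll_check_workflow`, `fetch_status`, `CheckTimeout`, and the injectable `sleep`/`clock` parameters are hypothetical names, and the 5-second interval and 300-second default are taken from the description above.

```python
import time
from typing import Callable, Optional


class CheckTimeout(Exception):
    """Hypothetical stand-in for the E031 timeout mentioned above."""


def poll_check_workflow(
    fetch_status: Callable[[], Optional[dict]],
    checks_timeout_seconds: float = 300.0,
    poll_interval_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll until a completed result arrives or the timeout elapses.

    fetch_status is assumed to return None while the check is still
    running and to raise on request failures.
    """
    deadline = clock() + checks_timeout_seconds
    while True:
        try:
            result = fetch_status()
        except Exception:
            # Assumption: any request failure is treated as transient,
            # so it is handled like "not ready yet" instead of aborting
            # and surfacing the backend error to the user.
            result = None
        if result is not None:
            return result
        if clock() >= deadline:
            raise CheckTimeout("check workflow polling timed out")
        sleep(poll_interval_seconds)
```

Swallowing every exception is deliberately coarse here; a real implementation would probably want to distinguish retryable failures from permanent ones, such as authentication errors.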