fix: Retry check workflow query to be more resilient to backend failures #1740

swcollard · 2023-09-12T17:44:07Z

Overview

The check_workflow commands currently assume that every polling request succeeds. If the backend has a transient issue or is briefly unavailable, the entire check polling fails, which appears to the end user as a failure to run a check workflow even though the check was submitted successfully. The other adverse effect is that the gateway exposes backend/subgraph error messages to the client when it probably shouldn't.

This change augments the polling loop: if a request returns an error, it sleeps and retries until it hits the configured checks_timeout_seconds (default 5 minutes), at which point it returns an E031 timeout. This should make us a little more resilient to backend failures, since the query is retryable and returns the completed check when queried again.
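As a rough illustration of the retry behavior described above, here is a minimal, standalone sketch. It is not the code in runner.rs: `poll_until_done`, `PollError`, and the closure-based `query`/`sleep` parameters are hypothetical names chosen for the example, and the real implementation may differ in how it tracks time and reports errors.

```rust
use std::time::{Duration, Instant};

/// Hypothetical error type; `Timeout` stands in for the E031 timeout.
#[derive(Debug, PartialEq)]
pub enum PollError {
    Timeout { last_error: Option<String> },
}

/// Polls `query` until it returns `Ok(Some(result))`.
/// `Ok(None)` means the check is still running; `Err(_)` is treated as a
/// transient failure and retried instead of aborting the whole poll.
pub fn poll_until_done<T, Q, S>(
    mut query: Q,
    mut sleep: S,
    interval: Duration,
    timeout: Duration,
) -> Result<T, PollError>
where
    Q: FnMut() -> Result<Option<T>, String>,
    S: FnMut(Duration),
{
    let start = Instant::now();
    let mut last_error = None;
    loop {
        match query() {
            // The check finished: hand the result back to the caller.
            Ok(Some(done)) => return Ok(done),
            // Still pending: fall through and poll again.
            Ok(None) => {}
            // Request failed: log it, remember it, and keep retrying.
            Err(e) => {
                eprintln!("check query failed, will retry: {e}");
                last_error = Some(e);
            }
        }
        // Give up once the configured timeout has elapsed.
        if start.elapsed() >= timeout {
            return Err(PollError::Timeout { last_error });
        }
        sleep(interval);
    }
}
```

In real use, `sleep` would be something like `std::thread::sleep` with a 5 second interval and a 300 second timeout. Passing it in as a closure is only a convenience that lets the loop be exercised without actually waiting.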

How this was tested

For testing, I added an artificial sleep after submitting the check request but before polling for the results started. This gave me time to disable my internet connection, simulating request failure, and the loop continued to poll every 5 seconds until reaching the checks_timeout_seconds value (defaulted to 5 minutes). For a second test, I disabled and then re-enabled my internet connection during polling, and it was able to fetch the results once the connection was re-established. From the user's perspective, the request simply appeared to take longer.

crates/rover-client/src/operations/graph/check_workflow/runner.rs

Co-authored-by: Avery Harnish <avery@apollographql.com>

Retry check workflow query to be more resilient to backend failures

c512909

swcollard requested a review from EverlastingBugstopper as a code owner September 12, 2023 17:44

Lint fixes

3792ff0

EverlastingBugstopper reviewed Sep 15, 2023

crates/rover-client/src/operations/graph/check_workflow/runner.rs (outdated)

crates/rover-client/src/operations/graph/check_workflow/runner.rs

Match result instead of if is_ok()

35888d4

Co-authored-by: Avery Harnish <avery@apollographql.com>

EverlastingBugstopper mentioned this pull request Sep 18, 2023

chore: store info about last error in check workflow retry loop #1745

Merged

EverlastingBugstopper added 2 commits September 18, 2023 17:33

chore: store info about last error in check workflow retry loop (#1745)

de0c741

Merge branch 'main' into swcollard/check_workflow_loop_retry_on_failure

7a47487

EverlastingBugstopper approved these changes Sep 19, 2023

EverlastingBugstopper enabled auto-merge (squash) September 19, 2023 13:59

EverlastingBugstopper disabled auto-merge September 19, 2023 14:16

EverlastingBugstopper added 2 commits September 19, 2023 10:29

chore: log error on each retry instead of only at the end

4918981

chore: make it compile

8c894a1

EverlastingBugstopper merged commit 5e06d54 into main Sep 19, 2023

EverlastingBugstopper deleted the swcollard/check_workflow_loop_retry_on_failure branch September 19, 2023 15:04

EverlastingBugstopper assigned swcollard and EverlastingBugstopper Sep 19, 2023

EverlastingBugstopper added the feature 🎉 new commands, flags, functionality, and improved error messages label Sep 19, 2023

EverlastingBugstopper added this to the v0.19.0 milestone Sep 19, 2023

EverlastingBugstopper mentioned this pull request Sep 19, 2023

release: v0.19.0 #1748

Merged

WontonSam mentioned this pull request Jul 16, 2024

[Snyk] Upgrade @apollo/rover from 0.14.2 to 0.23.0 WontonSam/apollo-federation-subgraph-compatibility#204

Open
