Gateway: retry logic for requests to GCS #3836

Merged
merged 6 commits into from
Mar 23, 2020

Conversation

trevor-scheer
Member

@trevor-scheer trevor-scheer commented Feb 27, 2020

This PR utilizes make-fetch-happen's built-in retry capabilities in order to retry failed requests to GCS.

It's worth noting that retries only occur on certain types of failures. For additional details, please see the docs.
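As a rough illustration of the idea (not make-fetch-happen's actual implementation, which this PR relies on instead), retrying with exponential backoff can be sketched as below. The helper `withRetries`, the `RetryOptions` shape, and the `isRetryable` predicate are all hypothetical names introduced for this example; the predicate stands in for the "certain types of failures" rule mentioned above.

```typescript
// Hypothetical helper for illustration only; not part of make-fetch-happen
// or the gateway.
interface RetryOptions {
  retries: number; // maximum number of retries after the first attempt
  minTimeoutMs: number; // delay before the first retry
  factor: number; // multiplier applied to the delay after each retry
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function withRetries<T>(
  attempt: () => Promise<T>,
  isRetryable: (error: unknown) => boolean,
  options: RetryOptions,
): Promise<T> {
  let delayMs = options.minTimeoutMs;
  for (let tries = 0; ; tries++) {
    try {
      return await attempt();
    } catch (error) {
      // Give up on non-retryable errors, or once the retry budget is spent.
      if (tries >= options.retries || !isRetryable(error)) throw error;
      await sleep(delayMs);
      delayMs *= options.factor; // exponential backoff
    }
  }
}
```

Because the delay grows with each retry, a single failing request can take far longer to settle than a healthy one, which is what motivates the polling change described next.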

Additionally, now that we've added retries, this PR adjusts how polling is done in order to prevent the possibility of multiple in-flight updates. The next "tick" only begins after a full round of updating is completed rather than on a perfectly regular interval. Thanks to @abernix for suggesting this change.

To elaborate a bit: previously the gateway would fire off a series of fetches to GCS every 10s (unless specified otherwise). It was (pretty safely) assumed that this wouldn't be problematic - though technically a race condition exists if the fetches were to take a number of seconds each. With the introduction of retries, this becomes considerably more likely due to exponential backoff. To prevent this condition, the gateway starts the next 10s "tick" once it's finished its round of requests to GCS.
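A minimal sketch of that scheduling pattern, assuming a hypothetical `Poller` class (the gateway's real code lives in `packages/apollo-gateway/src/index.ts` and differs from this): each round re-arms a one-shot `setTimeout` only after the update, including any retries, has settled, so two rounds cannot overlap.

```typescript
// Hypothetical class for illustration; names do not match the gateway's code.
class Poller {
  private timer: ReturnType<typeof setTimeout> | undefined;
  private stopped = false;
  private readonly update: () => Promise<void>;
  private readonly intervalMs: number;

  constructor(update: () => Promise<void>, intervalMs = 10000) {
    this.update = update;
    this.intervalMs = intervalMs;
  }

  start(): void {
    this.stopped = false;
    this.scheduleNext();
  }

  stop(): void {
    this.stopped = true;
    if (this.timer !== undefined) clearTimeout(this.timer);
    this.timer = undefined;
  }

  private scheduleNext(): void {
    this.timer = setTimeout(async () => {
      try {
        await this.update(); // resolves only after every fetch has settled
      } catch (error) {
        console.error("update failed", error);
      }
      // Re-arm only now, so the next tick starts after this round finishes.
      if (!this.stopped) this.scheduleNext();
    }, this.intervalMs);
  }
}
```

With `setInterval`, a round that takes longer than the interval would overlap with the next one; here the effective period is the interval plus however long the round took.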

@trevor-scheer trevor-scheer force-pushed the abernix/gateway-minor-qol-improvements branch from be8e5a8 to 4f2fbff Compare March 3, 2020 19:27
@trevor-scheer trevor-scheer force-pushed the trevor/gateway-retries branch 2 times, most recently from 605f47c to 7389f98 Compare March 6, 2020 23:15
@trevor-scheer trevor-scheer changed the base branch from abernix/gateway-minor-qol-improvements to release-2.12.0 March 6, 2020 23:15
@trevor-scheer trevor-scheer marked this pull request as ready for review March 11, 2020 00:56
Member

@abernix abernix left a comment


I think all of the implementation here looks good, with a couple of comments about clarity/intent. However, I'm concerned that we're not doing anything explicit or programmatic to ensure that we don't have multiple concurrent retry-able requests being retried at the same time.

For example, the default polling interval for pollingTimer is 10000. Could there be more than one at a time, e.g. the second of the five requests to GCS to obtain a composed schema fails and starts retrying with 30 seconds to go, but another invocation of pollingTimer is kicked off before the retries elapse and starts its own process?

Perhaps the answer here is that the definitive resolution/rejection of the entirety of a fetch pass is what sets the next interval into motion. In other words, setInterval changes to setTimeout, and the new timer is created by the resolution/rejection of the totality of the multiple fetches (or rather, fetchApolloGcses) after all retries are fully resolved, which all currently happens within updateComposition here:

await this.updateComposition();

Does that make sense?

packages/apollo-gateway/src/index.ts (outdated, resolved)
@trevor-scheer trevor-scheer force-pushed the trevor/gateway-retries branch from ed6446b to b13fdd3 Compare March 20, 2020 00:34
@trevor-scheer
Member Author

@abernix I've incorporated your larger feedback into b13fdd3. Please take a look and let me know how you feel about the changes!

Member

@abernix abernix left a comment


Lovely!

Comment on lines +460 to +463
// Prevent the Node.js event loop from remaining active (and preventing,
// e.g. process shutdown) by calling `unref` on the `Timeout`. For more
// information, see https://nodejs.org/api/timers.html#timers_timeout_unref.
this.pollingTimer?.unref();
Member


Substantially less critical to have this now that it's not a (forever) interval, but probably worth keeping.
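For readers unfamiliar with `unref`, here is a small standalone sketch of its effect, assuming a Node.js runtime (`unref` and `hasRef` are methods on Node's `Timeout` object, not on browser timers). The variable name and the logged message are placeholders for this example.

```typescript
// Node.js only. A pending timer normally keeps the event loop, and so the
// process, alive until it fires.
const pollingTimer = setTimeout(() => {
  console.log("polling tick");
}, 10000);

// After unref(), this timer alone no longer prevents the process from exiting.
// It still fires if something else keeps the process running that long.
pollingTimer.unref();
```

With a one-shot timer the process would be held open for at most one interval, whereas an un-`unref`ed `setInterval` would hold it open indefinitely, which is the difference the comment above points at.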

packages/apollo-gateway/src/index.ts (resolved)
@trevor-scheer trevor-scheer force-pushed the trevor/gateway-retries branch from 16fef9a to ea10d29 Compare March 23, 2020 15:19
@trevor-scheer trevor-scheer changed the title [WIP] Gateway: retry logic for requests to GCS Gateway: retry logic for requests to GCS Mar 23, 2020
@abernix abernix added this to the Release 2.12.0 milestone Mar 23, 2020
@trevor-scheer trevor-scheer merged commit cdee9d6 into release-2.12.0 Mar 23, 2020
@trevor-scheer trevor-scheer deleted the trevor/gateway-retries branch March 23, 2020 23:27
abernix pushed a commit to apollographql/federation that referenced this pull request Sep 4, 2020
…#3836)

Implement gateway retry logic for requests to GCS. Failed requests will retry up to 5 times.

Additionally, this PR adjusts how polling is done in order to prevent the possibility of multiple in-flight updates. The next "tick" only begins after a full round of updating is completed rather than on a perfectly regular interval. Thanks to @abernix for suggesting this change. For more details please see the PR description.
Apollo-Orig-Commit-AS: apollographql/apollo-server@cdee9d6
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 16, 2023
2 participants