Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable Retries for Transient Errors Connecting to Graphs/Subgraphs #1936

Merged
merged 4 commits into from
Jul 15, 2024

Conversation

jonathanrainer
Copy link
Contributor

@jonathanrainer jonathanrainer commented Jun 20, 2024

This PR fixes two separate issues identified with the way Rover handles retries for communicating with graphs/subgraphs when using rover dev.

  1. Under normal operations subgraphs will get disconnected from the federation upon the first receipt of an error, this is slightly too eager and now is configurable with the new subgraph-retries option.
    1. This had the side effect that composition was running a huge amount of the time causing CPU spikes.
  2. Under normal operations we did not have retries enabled for transient network issues, this is now fixed
    1. If this persists there is another solution which is to turn off connection pooling completely, however fixing these transient issues may render this unnecessary.

self.message_sender.update_subgraph(&subgraph_definition)?;
}
}
None => {
self.message_sender.add_subgraph(&subgraph_definition)?;
}
}
self.subgraph_retry_countdown = self.subgraph_retries;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Should we document the logic on how retry_countdown works?

For my understanding:

  1. It is normally at 0
  2. On success, we re-initialise it to subgraph_retries
  3. On every attempt, we decrement until it reaches 0

This means that if the first attempt is a failure, we stay at 0, thus go into the branch at line 225

Comment on lines 67 to 69
#[arg(long = "subgraph-retry", short = 'r', default_value = "1")]
#[serde(skip_serializing)]
pub subgraph_retry: u64,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: should this be pluralised? There is a bit of naming inconsistency in this PR between retry and retries

Suggested change
#[arg(long = "subgraph-retry", short = 'r', default_value = "1")]
#[serde(skip_serializing)]
pub subgraph_retry: u64,
#[arg(long = "subgraph-retries", short = 'r', default_value = "1")]
#[serde(skip_serializing)]
pub subgraph_retries: u64,

@@ -64,6 +64,7 @@ impl ClientBuilder {
.danger_accept_invalid_hostnames(self.accept_invalid_hostnames)
.timeout(self.timeout)
.user_agent(format!("{}/{}", PKG_NAME, PKG_VERSION))
.pool_max_idle_per_host(0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: should we keep this change in this PR?

There might be unknown side-effects with this change, so I'd prefer to remove it, but re-add it at a later time if we see this is needed.

@jonathanrainer jonathanrainer force-pushed the jr/bug/ROVER-44 branch 2 times, most recently from 0f9b427 to e740f7f Compare June 20, 2024 11:54
@jonathanrainer jonathanrainer marked this pull request as ready for review June 20, 2024 12:04
@jonathanrainer jonathanrainer requested a review from a team as a code owner June 20, 2024 12:04
//
if self.subgraph_retry_countdown > 0 {
self.subgraph_retry_countdown -= 1;
eprintln!("{} error detected communicating with subgraph '{}', schema changes will not be reflected.\nWill retry but subgraph logs should be inspected", Emoji::Warn, &self.subgraph_key.0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to confirm, schema changes won't be saved because introspection couldn't run or because rover couldn't confirm the schema with the subgraph?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So more likely the former, essentially this happens when the introspection query fails for whatever reason (4xx, 5xx and so on), so we don't get back a new schema to check against.

Some(e.to_string())
} else {
eprintln!(
"{} retries exhausted for subgraph {}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine. how many retries does the user get?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's configurable with the --subgraph-retries parameter but I can easily add that into the log message so people know it's being respected?

@jonathanrainer jonathanrainer force-pushed the jr/bug/ROVER-44 branch 3 times, most recently from d7306c4 to 38bf054 Compare July 1, 2024 12:50
@jonathanrainer jonathanrainer added this to the vNext milestone Jul 10, 2024
@jonathanrainer jonathanrainer added the feature 🎉 new commands, flags, functionality, and improved error messages label Jul 13, 2024
@jonathanrainer jonathanrainer force-pushed the jr/bug/ROVER-44 branch 2 times, most recently from 5673f1a to 3bd20d1 Compare July 15, 2024 13:23
This way we avoid subgraphs potentially being
removed for spurious reasons. In addition enable
it for 'graphs` as well so that we don't end up
only partially fixing this
This should not be so eager to do so, and is
now in fact configurable.
@jonathanrainer jonathanrainer merged commit da07a74 into main Jul 15, 2024
18 checks passed
@jonathanrainer jonathanrainer deleted the jr/bug/ROVER-44 branch July 15, 2024 14:13
@jonathanrainer jonathanrainer mentioned this pull request Jul 19, 2024
jonathanrainer added a commit that referenced this pull request Jul 22, 2024
# [0.25.0] - 2024-07-22

## 🚀 Features

- **Enable Retries For Transient Errors Connecting To Graphs/Subgraphs -
@jonathanrainer PR #1936**

This turns on retries at the HTTP level for connections to
graphs/subgraphs to minimize connection resets and cancellations. Also,
a new --subgraph-retries flag for rover dev lets you set the number of
retries allowed when trying to re-establish a connection.

- **Add `--graph-ref` flag to `rover dev` - @dotdat PR #1984**

Introduces subgraph mirroring to rover dev. Subgraph mirroring inherits
the subgraph routing URLs and schemas from an existing Studio graphref.
This makes it easy to spin up a locally running supergraph without
maintaining a supergraph config. [See
here](https://www.apollographql.com/docs/rover/commands/dev#starting-a-session-from-a-graphos-studio-variant)
for more information.

## 🐛 Fixes

- **Fixes issues related to passing filenames to `--output` -
@jonathanrainer PR #1996**

An issue was raised whereby previous versions of Rover supported passing
filenames to the `--output` flag but this was
broken in v0.24.0. This has now been fixed and the previous
functionality restored.

## 🛠 Maintenance

- **Expand Smoke Tests To Run On All Supported Platforms -
@jonathanrainer PR #1980**
- **Fix cron expression, so it runs only once per day - @jonathanrainer
PR #1986**
- **Ensure we always use the correct version of Federation when testing
- @jonathanrainer PR #1987**
- **Add manual Smoke test invocation and pin Windows to `npm@9` for
testing - @jonathanrainer PR #1989**
- **Update apollographql/router to v1.51.0 - @jonathanrainer PR #1988**
- **Update node.js packages - @jonathanrainer PR #1979**

Includes `@eslint/compat` to v1.1.1, `eslint` to v9.7.0, `node.js` to
v20.15.1, `npm` to v10.8.2 and `prettier` to v3.3.3

- **Make sure x86 Mac Tests use 'latest' supergraph plugin version -
@jonathanrainer PR #1990**
- **Make sure homebrew runs `brew update` when we use it -
@jonathanrainer PR #1993**



## 📚 Documentation

- **Adds `graph-ref` flag to dev subcommand docs - @jackonawalk PR
#1945**
- **Update schema proposals capabilities docs - @Meschreiber PR #1949**
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature 🎉 new commands, flags, functionality, and improved error messages
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants