ref(project-upstream): Emit error on multiple project fetch failures #2700
This PR makes Relay emit an error when project config fetches keep failing for a configured interval. Currently, Relay emits errors in some specific failure cases, but there's a metric to tell whether it's failing all project config fetches or just a few of them. The interval can be configured with `http.project_failure_interval`.

Approach
Relay tracks the time of a failed fetch, and resets it on a successful one. If some time elapses and Relay hasn't reset the time, it emits an error for each failed request. This should create a big spike of errors after a continued time of failed fetches, demanding attention to the issue.
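
The tracking described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `FetchFailureTracker` type, its field names, and its methods are hypothetical, and the strict `>` comparison against the interval is an assumption. The current time is passed in explicitly so the behavior is easy to check.

```rust
use std::time::{Duration, Instant};

/// Hypothetical tracker for illustration; not Relay's actual type.
#[derive(Debug)]
struct FetchFailureTracker {
    /// Time of the first failed fetch since the last success.
    /// `None` means no failure is currently being tracked.
    first_failure: Option<Instant>,
    /// How long failures may continue before errors are emitted.
    failure_interval: Duration,
}

impl FetchFailureTracker {
    fn new(failure_interval: Duration) -> Self {
        Self {
            first_failure: None,
            failure_interval,
        }
    }

    /// A successful fetch resets the tracked failure time.
    fn on_success(&mut self) {
        self.first_failure = None;
    }

    /// Records a failed fetch at `now`. Returns `true` if failures have
    /// continued for longer than the interval, so an error should be emitted.
    fn on_failure(&mut self, now: Instant) -> bool {
        // Keep the earliest failure time; only set it if none is tracked yet.
        let first = *self.first_failure.get_or_insert(now);
        // Assumption: a strict comparison against the interval.
        now.saturating_duration_since(first) > self.failure_interval
    }
}
```

With a 90-second interval, failures at 0 and 60 seconds would not emit an error in this sketch, while a failure at 91 seconds would. A success in between clears the tracked time, so the next failure starts a fresh interval.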
The fetch time is represented as `Option<Instant>`. Without the `Option`, resetting the instant would mean setting it to the current time. In that case, Relay would emit an error if a fetch fails after some time of not making any fetches at all. This scenario is more likely to happen in low-volume environments, like some Self-Hosted instances.

Default interval
I've set an arbitrary default value of 90 seconds. This should be short enough to determine Relay is having issues without false positives, and long enough to retry the same request twice with default values (`max_retry_interval = 60s`).