-
Notifications
You must be signed in to change notification settings - Fork 805
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add retries into Scanner BlobWriter #5471
Conversation
retryDelay := InitialRetryDelay | ||
// The idea is to implement a loop that retries the write operation a certain number of times before finally failing. | ||
// We'll also include a delay between retries to give transient issues (like temporary network glitches) a chance to resolve. | ||
for attempt := 0; attempt < MaxRetries; attempt++ { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't we have a retry library for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if not we can try this one. I used it before without any issues
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not that I know of. Will try it. Thanks for the suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have one or two backoff-things through the code, backoff-for-retry is used very heavily throughout cadence.
// This looks like one of the contributors of the high number of skipped_scans in the system. | ||
// Upon closer investigation we discovered that there is no retry mechanism in the Blobstore write function, which might | ||
// result in unnecessary skip of some scans. Especially, the ones that might have happened due to network blip. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably worth removing this, as it's (at least partially) no longer true after this merges.
useful as a commit message / review context, but not really for long-term purposes
|
||
"github.com/uber/cadence/common/metrics/mocks" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
inconsistent new lines in the import section. One common practice that I followed in the past is to group imports like this:
- builtin imports (e.g. fmt)
- local imports (stuff from current repo)
- external imports (vendor libraries)
There would be new line between each group.
I know we don't follow everywhere this but it can help organize imports when unsure.
What changed?
The scanner environment consistently encounters issues with broken shards, which remain unresolved and likely contain erroneous data. This issue is highlighted by the
cadence_shard_failure metric
, which not only signals these specific shard failures but also other scan-related issues.A primary contributor to these failures, among various factors, is the malfunctioning of the Blobwriter during the process of writing scan outputs into the database. The root causes for this malfunction could be varied, but a notable and addressable concern is the writer's inability to handle transient errors or network issues effectively.
To mitigate this, we are implementing a retry mechanism in the writer.
Why?
This enhancement aims to offer the writer additional opportunities to successfully upload data, especially in scenarios where failures are due to temporary network disruptions or similar transient issues. By allowing retries, we can potentially reduce the frequency of these scan failures and improve the overall stability of the system.
How did you test it?
Tested locally.
Potential risks
might cause delay in the blobstore uploads.
Release notes
NA
Documentation Changes
NA