
Add retries into Scanner BlobWriter #5471

Merged
10 commits merged into cadence-workflow:master on Dec 11, 2023

Conversation

agautam478
Contributor

What changed?
The scanner environment consistently encounters issues with broken shards, which remain unresolved and likely contain erroneous data. This issue is highlighted by the cadence_shard_failure metric, which not only signals these specific shard failures but also other scan-related issues.

A primary contributor to these failures, among various factors, is the Blobwriter failing while writing scan outputs to the blobstore. The root causes can vary, but a notable and addressable one is the writer's inability to handle transient errors and network issues.

To mitigate this, we are implementing a retry mechanism in the writer.

  • Inside the getBlobstoreWriteFn, we added a loop that attempts the write operation up to MaxRetries times.
  • After each failed attempt (if an error occurs), we wait for RetryDelay before retrying.
  • If all attempts fail, we return the last error encountered (a minimal sketch of this flow follows the list).
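The following Go sketch illustrates the retry flow described in the bullets above. MaxRetries and InitialRetryDelay mirror the constants visible in the diff further down; the wrapper name, the write-function signature, and the constant values are illustrative assumptions rather than the PR's actual code.

package scanner

import (
	"fmt"
	"time"
)

const (
	MaxRetries        = 3                      // illustrative value
	InitialRetryDelay = 100 * time.Millisecond // illustrative value
)

// withRetries wraps a blobstore write so transient failures (e.g. a brief
// network blip) get up to MaxRetries attempts before the scan is skipped.
func withRetries(write func(data []byte) error) func(data []byte) error {
	return func(data []byte) error {
		retryDelay := InitialRetryDelay
		var lastErr error
		for attempt := 0; attempt < MaxRetries; attempt++ {
			if lastErr = write(data); lastErr == nil {
				return nil
			}
			// Wait before the next attempt to let transient issues clear.
			time.Sleep(retryDelay)
		}
		// All attempts failed; surface the last error encountered.
		return fmt.Errorf("blobstore write failed after %d attempts: %w", MaxRetries, lastErr)
	}
}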

Why?
This enhancement aims to offer the writer additional opportunities to successfully upload data, especially in scenarios where failures are due to temporary network disruptions or similar transient issues. By allowing retries, we can potentially reduce the frequency of these scan failures and improve the overall stability of the system.

How did you test it?
Tested locally.

Potential risks
Retries may delay blobstore uploads, since each failed attempt now waits for the retry delay before trying again.

Release notes
NA

Documentation Changes
NA

retryDelay := InitialRetryDelay
// The idea is to implement a loop that retries the write operation a certain number of times before finally failing.
// We'll also include a delay between retries to give transient issues (like temporary network glitches) a chance to resolve.
for attempt := 0; attempt < MaxRetries; attempt++ {
Member
don't we have a retry library for this?

Member

if not we can try this one. I used it before without any issues

Contributor Author

Not that I know of. Will try it. Thanks for the suggestion.

@Groxx (Member) Dec 7, 2023

we have one or two backoff-things through the code, backoff-for-retry is used very heavily throughout cadence.
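As a rough illustration of the backoff-for-retry pattern the reviewers point to, the sketch below hand-rolls an exponentially growing delay. In the actual codebase cadence's existing backoff utilities would be reused rather than reimplemented; the helper name, parameters, and isRetryable hook here are hypothetical.

package scanner

import "time"

// retryWithBackoff retries op with an exponentially growing delay, up to
// maxAttempts, and only while isRetryable reports the error as transient.
func retryWithBackoff(op func() error, maxAttempts int, initial time.Duration, isRetryable func(error) bool) error {
	delay := initial
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = op(); err == nil || !isRetryable(err) {
			return err
		}
		if attempt == maxAttempts {
			break
		}
		// Double the delay between attempts so repeated failures back off
		// instead of retrying at a fixed interval.
		time.Sleep(delay)
		delay *= 2
	}
	return err
}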

Comment on lines 165 to 167
// This looks like one of the contributors of the high number of skipped_scans in the system.
// Upon closer investigation we discovered that there is no retry mechanism in the Blobstore write function, which might
// result in unnecessary skip of some scans. Especially, the ones that might have happened due to network blip.
Member

probably worth removing this, as it's (at least partially) no longer true after this merges.

useful as a commit message / review context, but not really for long-term purposes

agautam478 requested a review from Groxx December 8, 2023 22:04
agautam478 requested a review from Groxx December 8, 2023 23:57
Comment on lines +32 to +34

"github.com/uber/cadence/common/metrics/mocks"

Member

inconsistent new lines in the import section. One common practice that I followed in the past is to group imports like this:

  • builtin imports (e.g. fmt)
  • local imports (stuff from current repo)
  • external imports (vendor libraries)

There would be a new line between each group.
I know we don't follow this everywhere, but it can help organize imports when unsure.
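For illustration, grouped imports in the shape this comment describes might look like the following; the package name, the test body, the mocks.Client type, and the testify dependency are assumptions made only so the example compiles.

package scanner_test

import (
	"testing"

	"github.com/uber/cadence/common/metrics/mocks"

	"github.com/stretchr/testify/assert"
)

func TestGroupedImportsExample(t *testing.T) {
	// Trivial assertion so the file compiles; the real point is the blank
	// line separating builtin, repo-local, and external import groups above.
	assert.NotNil(t, new(mocks.Client))
}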

agautam478 merged commit f6433e0 into cadence-workflow:master Dec 11, 2023
16 checks passed