
Add retries into Scanner BlobWriter #5471

Merged
10 commits merged into cadence-workflow:master on Dec 11, 2023

Conversation

agautam478
Contributor

What changed?
The scanner environment consistently encounters issues with broken shards, which remain unresolved and likely contain erroneous data. This issue is highlighted by the cadence_shard_failure metric, which not only signals these specific shard failures but also other scan-related issues.

A primary contributor to these failures, among various factors, is the Blobwriter failing while writing scan outputs to the blobstore. The root causes can vary, but a notable and addressable one is the writer's inability to handle transient errors and network issues.

To mitigate this, we are implementing a retry mechanism in the writer.

  • Inside the getBlobstoreWriteFn, we added a loop that attempts the write operation up to MaxRetries times.
  • After each failed attempt (if an error occurs), we wait for RetryDelay before retrying.
  • If all attempts fail, we return the last error encountered (a minimal sketch of this flow follows the list).
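The following Go sketch illustrates the retry flow described in the bullets above. MaxRetries and InitialRetryDelay mirror the constants visible in the diff further down; the wrapper name, the write-function signature, and the constant values are illustrative assumptions rather than the PR's actual code.

package scanner

import (
	"fmt"
	"time"
)

const (
	MaxRetries        = 3                      // illustrative value
	InitialRetryDelay = 100 * time.Millisecond // illustrative value
)

// withRetries wraps a blobstore write so transient failures (e.g. a brief
// network blip) get up to MaxRetries attempts before the scan is skipped.
func withRetries(write func(data []byte) error) func(data []byte) error {
	return func(data []byte) error {
		retryDelay := InitialRetryDelay
		var lastErr error
		for attempt := 0; attempt < MaxRetries; attempt++ {
			if lastErr = write(data); lastErr == nil {
				return nil
			}
			// Wait before the next attempt to let transient issues clear.
			time.Sleep(retryDelay)
		}
		// All attempts failed; surface the last error encountered.
		return fmt.Errorf("blobstore write failed after %d attempts: %w", MaxRetries, lastErr)
	}
}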

Why?
This enhancement aims to offer the writer additional opportunities to successfully upload data, especially in scenarios where failures are due to temporary network disruptions or similar transient issues. By allowing retries, we can potentially reduce the frequency of these scan failures and improve the overall stability of the system.

How did you test it?
Tested locally.

Potential risks
Retries may delay blobstore uploads, since each failed attempt now waits for the retry delay before trying again.

Release notes
NA

Documentation Changes
NA

retryDelay := InitialRetryDelay
// The idea is to implement a loop that retries the write operation a certain number of times before finally failing.
// We'll also include a delay between retries to give transient issues (like temporary network glitches) a chance to resolve.
for attempt := 0; attempt < MaxRetries; attempt++ {
Member
don't we have a retry library for this?

Member

if not we can try this one. I used it before without any issues

Contributor Author

Not that I know of. Will try it. Thanks for the suggestion.

@Groxx (Member) Dec 7, 2023

we have one or two backoff-things through the code, backoff-for-retry is used very heavily throughout cadence.
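As a rough illustration of the backoff-for-retry pattern the reviewers point to, the sketch below hand-rolls an exponentially growing delay. In the actual codebase cadence's existing backoff utilities would be reused rather than reimplemented; the helper name, parameters, and isRetryable hook here are hypothetical.

package scanner

import "time"

// retryWithBackoff retries op with an exponentially growing delay, up to
// maxAttempts, and only while isRetryable reports the error as transient.
func retryWithBackoff(op func() error, maxAttempts int, initial time.Duration, isRetryable func(error) bool) error {
	delay := initial
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = op(); err == nil || !isRetryable(err) {
			return err
		}
		if attempt == maxAttempts {
			break
		}
		// Double the delay between attempts so repeated failures back off
		// instead of retrying at a fixed interval.
		time.Sleep(delay)
		delay *= 2
	}
	return err
}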

Comment on lines 165 to 167
// This looks like one of the contributors of the high number of skipped_scans in the system.
// Upon closer investigation we discovered that there is no retry mechanism in the Blobstore write function, which might
// result in unnecessary skip of some scans. Especially, the ones that might have happened due to network blip.
Member

probably worth removing this, as it's (at least partially) no longer true after this merges.

useful as a commit message / review context, but not really for long-term purposes

agautam478 requested a review from Groxx December 8, 2023 22:04
agautam478 requested a review from Groxx December 8, 2023 23:57
Comment on lines +32 to +34

"github.com/uber/cadence/common/metrics/mocks"

Member

inconsistent new lines in the import section. One common practice that I followed in the past is to group imports like this:

  • builtin imports (e.g. fmt)
  • local imports (stuff from current repo)
  • external imports (vendor libraries)

There would be a new line between each group.
I know we don't follow this everywhere, but it can help organize imports when unsure.
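For illustration, grouped imports in the shape this comment describes might look like the following; the package name, the test body, the mocks.Client type, and the testify dependency are assumptions made only so the example compiles.

package scanner_test

import (
	"testing"

	"github.com/uber/cadence/common/metrics/mocks"

	"github.com/stretchr/testify/assert"
)

func TestGroupedImportsExample(t *testing.T) {
	// Trivial assertion so the file compiles; the real point is the blank
	// line separating builtin, repo-local, and external import groups above.
	assert.NotNil(t, new(mocks.Client))
}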

agautam478 merged commit f6433e0 into cadence-workflow:master Dec 11, 2023
16 checks passed