changefeedccl: retry posting to confluent registry up to 10 times #48759
Conversation
}
var res confluentSchemaVersionResponse
if err := gojson.NewDecoder(resp.Body).Decode(&res); err != nil {
var id int32
Another option here is to mark the error as retriable and bubble back up to the retry loop of the job.
❌ The GitHub CI (Cockroach) build has failed on d3c71fc0.
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
Reviewed 1 of 1 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt and @miretskiy)
pkg/ccl/changefeedccl/encoder.go, line 514 at r1 (raw file):
return nil
}); err != nil {
	return 0, err
To make this more concrete: consider wrapping the returned error here with MarkRetryableError. That being said, we may want to do some deeper inspection.
Another concern I have is about visibility. Should we have a log.Every on the encoder and then log on errors?
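For concreteness, here is a self-contained sketch of the marking idea being suggested: a sentinel wrapper type that an outer job-level loop could detect with errors.As. The type and helper names are illustrative stand-ins, not the actual changefeedccl MarkRetryableError implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// retryableError is an illustrative stand-in for the marker the reviewers
// refer to as MarkRetryableError; the real changefeedccl type may differ.
type retryableError struct{ wrapped error }

func (e *retryableError) Error() string { return "retryable: " + e.wrapped.Error() }
func (e *retryableError) Unwrap() error { return e.wrapped }

// markRetryableError tags an error so an outer retry loop treats it as
// transient instead of failing the changefeed.
func markRetryableError(err error) error { return &retryableError{wrapped: err} }

// isRetryableError reports whether err was tagged as retryable.
func isRetryableError(err error) bool {
	var r *retryableError
	return errors.As(err, &r)
}

func main() {
	err := markRetryableError(fmt.Errorf("registry returned 503"))
	fmt.Println(isRetryableError(err)) // true
}
```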
I think I'm fine with retrying here, around the specific request, instead of just bubbling the failure up and making it tear everything down and try again -- if we know we have a localized source of flakes, why not deploy a local solution? We have similar retries around other sometimes-flaky connections to external network resources (e.g. in cloud storage connectors), so I'm ~fine sticking with the pattern.

I can add a log line, but if there is a transient error, won't we just move on? And if it is a persistent one, won't it exhaust the retry limit and bubble up after 30s? I think eventually what we want is a counter that can go into the job status itself, in case you're consistently getting, say, 5 retries that look like silent slowness, but I'm dubious that you'd have such a consistent issue that you notice the slowness yet never see an error bubble up.
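A minimal standalone sketch of the local retry being described: a bounded number of attempts around the registry request, with simple backoff and a log line per failure. The real change presumably uses the repo's retry and logging utilities; the function names, backoff, and failure simulation here are assumptions.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// postToRegistry stands in for the HTTP POST to the Confluent schema
// registry; it fails randomly to simulate transient flakes.
func postToRegistry() (int32, error) {
	if rand.Intn(3) != 0 {
		return 0, fmt.Errorf("registry unreachable")
	}
	return 42, nil // schema id
}

// registerWithRetry retries the registry request locally, with simple
// linear backoff, before giving up and letting the caller decide what to do.
func registerWithRetry(maxAttempts int) (int32, error) {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		id, err := postToRegistry()
		if err == nil {
			return id, nil
		}
		lastErr = err
		fmt.Printf("attempt %d failed: %v\n", attempt, err)
		time.Sleep(time.Duration(attempt) * 100 * time.Millisecond)
	}
	return 0, fmt.Errorf("giving up after %d attempts: %w", maxAttempts, lastErr)
}

func main() {
	id, err := registerWithRetry(10)
	fmt.Println(id, err)
}
```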
Force-pushed from 710beb0 to 8a71dcf
and if it is a persistent one, won't it exhaust the retry limit and bubble up after 30s?
Well, I think there are classes of errors where the registry may be down for minutes to hours where we don't obviously want to kill the changefeed (i.e. move it to FAILED). I am totally on board with some internal retries here but I still think we want to mark the error as retriable for the higher level.
I think there's a question around what sort of error should move a changefeed to failed. My sense is that it shouldn't be that many. You should be able to kill the sink server for a relatively extended period of time (hours?) and have the changefeed recover on its own.
That's my 2c; if you feel differently, I'm happy to acquiesce.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @dt, and @miretskiy)
hm. I'm not sure I currently agree: right now a "running" job means "I'm doing the thing you asked", but if something is broken and we can't do that thing, claiming to be running while we sit in a never-ending retry loop seems bad. The user may justifiably blame us for the fact they're not seeing rows in their sink. If something outside our control is busted and means we can't do what we were asked to do, I'd rather throw up our hands and explain the situation than appear to just do nothing.

Hopefully a user monitors their jobs and will be alerted by a failure, or, if they have other monitoring that causes them to come look at the jobs, a failure should clearly identify what went wrong. Maybe in the future we can add more fine-grained job states to communicate failed-but-will-try-again, but right now, given the boolean choice between running fine and failed, I think if we can't contact the registry after 10 attempts in a row, we should assume we're not going to be able to until an operator intervenes, and say so.
My counter argument there is that the logic I'm proposing, to retry forever, already exists for the Kafka sink if your error is in talking to the sink rather than the schema registry. I fully agree that the visibility story on internally retrying changefeeds is abysmal - we should have some state for changefeeds which are hitting these errors and surface it to the user. That being said, I don't think we want to roll back the internal retry behavior for sinks, so not retrying here would be inconsistent.

See this commit from Dan: danhhz@7f33dd7, and specifically the errorWrapperSink. The reason that logic doesn't apply here is that we abstracted the confluent schema registry under the encoder rather than the sink.
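A rough sketch of the errorWrapperSink pattern being referenced: a wrapper that marks every error from the underlying sink as retryable, so sink flakes restart the feed instead of failing it. The interface and the wrapping shown here are simplified assumptions, not the actual changefeedccl code.

```go
package main

import "fmt"

// Sink is a pared-down stand-in for the changefeed sink interface; the
// real interface has more methods (EmitResolvedTimestamp, Flush, ...).
type Sink interface {
	EmitRow(key, value []byte) error
}

// errorWrapperSink illustrates the pattern from the referenced commit:
// every error from the wrapped sink is tagged as retryable before it is
// returned, so the job-level loop keeps the changefeed alive.
type errorWrapperSink struct{ wrapped Sink }

func (s errorWrapperSink) EmitRow(key, value []byte) error {
	if err := s.wrapped.EmitRow(key, value); err != nil {
		// Illustrative tagging; the real code wraps with its retryable marker.
		return fmt.Errorf("retryable changefeed error: %w", err)
	}
	return nil
}

// flakySink always fails, standing in for a temporarily unreachable broker.
type flakySink struct{}

func (flakySink) EmitRow(key, value []byte) error {
	return fmt.Errorf("kafka: broker unavailable")
}

func main() {
	s := errorWrapperSink{wrapped: flakySink{}}
	fmt.Println(s.EmitRow([]byte("k"), []byte("v")))
}
```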
Force-pushed from c4f567e to 526cb99
Okay, tweaked this to mark the error returned, if it makes it out of the retry loop, as retryable. Kept the local loop though, since if we can recover from a transient issue with just a local retry, we can avoid tearing the whole thing down and starting over, which seems nice. Also put my thoughts on why we should do something better than retry forever in a comment, to make myself feel better about adding it.
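Putting the two pieces together, a compact sketch of the approach described: a bounded local retry, with the final error marked retryable only if the local loop is exhausted, so the job-level loop takes over. Names and shapes are illustrative, not the merged code.

```go
package main

import "fmt"

// errRetryable wraps an error so an outer changefeed retry loop would treat
// it as transient; illustrative only, not the actual changefeedccl type.
type errRetryable struct{ cause error }

func (e *errRetryable) Error() string { return e.cause.Error() }
func (e *errRetryable) Unwrap() error { return e.cause }

// register attempts the registry call a bounded number of times; if every
// attempt fails, the final error is marked retryable so the job-level loop
// can keep the changefeed alive instead of moving it to FAILED.
func register(post func() (int32, error), attempts int) (int32, error) {
	var err error
	for i := 0; i < attempts; i++ {
		var id int32
		if id, err = post(); err == nil {
			return id, nil
		}
	}
	return 0, &errRetryable{cause: fmt.Errorf("after %d attempts: %w", attempts, err)}
}

func main() {
	_, err := register(func() (int32, error) { return 0, fmt.Errorf("503") }, 10)
	fmt.Println(err)
}
```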
At some point in this release, I'd like to update the job status shown in SHOW JOBS every time we fail in the outermost retry loop. Also, we have a chart showing the restarts in the Changefeed dashboard in the Admin UI, if it makes you feel a bit better.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @dt, and @miretskiy)
bors r+
Build succeeded
Sometimes we see timeouts or other issues contacting the schema registry.
However, it would be nice if transient errors could be handled gracefully
without bubbling up to the user.
Release note (enterprise change): Changefeeds now retry after encountering transient errors contacting the Confluent Schema Registry.