
changefeedccl: retry posting to confluent registry up to 10 times #48759

Merged: 1 commit merged into cockroachdb:master from the retry-confluent branch on May 30, 2020

Conversation

@dt (Member) commented May 12, 2020

Sometimes we see timeouts or other issues contacting the schema registry.
However it would be nice if transient errors could be handled gracefully
without bubbling up to the user.

Release note (enterprise change): Changefeeds now retry after encountering transient errors contacting the Confluent Schema Registry.

@dt dt requested review from ajwerner, a team and miretskiy and removed request for a team May 12, 2020 23:43
@cockroach-teamcity (Member) commented: This change is Reviewable

Inline review comment context:

	var res confluentSchemaVersionResponse
	if err := gojson.NewDecoder(resp.Body).Decode(&res); err != nil {

Contributor (inline review comment):

Another option here is to mark the error as retriable and bubble back up to the retry loop of the job.

@blathers-crl bot commented May 13, 2020

❌ The GitHub CI (Cockroach) build has failed on d3c71fc0.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@ajwerner (Contributor) left a comment

Reviewed 1 of 1 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt and @miretskiy)


pkg/ccl/changefeedccl/encoder.go, line 514 at r1 (raw file):

		return nil
	}); err != nil {
		return 0, err

To make this more concrete:

Consider wrapping the returned error here with MarkRetryableError. That being said, we may want to do some deeper inspection.

Another concern I have is about visibility. Should we have a log.Every on the encoder and then log on errors?

@dt (Member, Author) commented May 16, 2020

I think I'm fine with retrying here, around the specific request, instead of just bubbling the failure up and making it tear everything down and try again -- if we know we have a localized source of flakes why not deploy a local solution?

We have similar retries around other sometimes-flaky connections to external network resources (e.g. in cloud storage connectors) so I'm ~fine sticking with the pattern. I can add a log line, but if there is a transient error, won't we just move on? And if it is a persistent one, won't it exhaust the retry limit and bubble up after 30s? I think eventually what we want is a counter that can go into the job status itself, in case you're consistently getting, say, 5 retries that look like silent slowness, but I'm dubious that you'd have such a consistent issue that you'd notice the slowness without ever seeing one bubble up.

@dt dt force-pushed the retry-confluent branch 3 times, most recently from 710beb0 to 8a71dcf Compare May 17, 2020 22:43
@ajwerner (Contributor) left a comment

and if it is a persistent one, won't it exhaust the retry limit and bubble up after 30s?

Well, I think there are classes of errors where the registry may be down for minutes to hours where we don't obviously want to kill the changefeed (i.e. move it to FAILED). I am totally on board with some internal retries here but I still think we want to mark the error as retriable for the higher level.

I think there's a question around what sort of error should move a changefeed to failed. My sense is that it shouldn't be that many. You should be able to kill the sink server for a relatively extended period of time (hours?) and have the changefeed recover on its own.

That's my 2c; if you feel differently, I'm happy to acquiesce.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @dt, and @miretskiy)

@dt (Member, Author) commented May 18, 2020

Hm, I'm not sure I currently agree: right now a "running" job means "I'm doing the thing you asked", but if something is broken and we can't do that thing, claiming to be running while we sit in a never-ending retry loop seems bad. The user may justifiably blame us for the fact that they're not seeing rows in their sink. If something outside our control is busted and means we can't do what we were asked to do, I'd rather throw up our hands and explain the situation than appear to just do nothing. Hopefully a user monitors their jobs and will be alerted by a failure, or, if they have other monitoring that causes them to come look at the jobs, a failure should clearly identify what went wrong.

Maybe in the future we can add more fine-grained job states to communicate failed-but-will-try-again, but right now, given the boolean choice between running fine and failed, I think if we can't contact the registry after 10 attempts in a row, we should assume we're not going to be able to until an operator intervenes, and say so.

@ajwerner (Contributor) commented:

My counterargument there is that the retry-forever logic I'm proposing already exists for the Kafka sink when your error is in talking to the sink rather than the schema registry. I fully agree that the visibility story on internally retrying changefeeds is abysmal; we should have some state for changefeeds that are hitting these errors and surface it to the user. That said, I don't think we want to roll back internal retry behavior for sinks, so not retrying here would be inconsistent.

See this PR from Dan: danhhz@7f33dd7 and specifically the errorWrapperSink

The reason that logic doesn’t apply here is that we abstracted the confluent schema registry under the encoder rather than the sink.

@dt dt force-pushed the retry-confluent branch 2 times, most recently from c4f567e to 526cb99 Compare May 29, 2020 01:49
@dt (Member, Author) commented May 29, 2020

Okay, tweaked this to mark the error returned, if it makes it out of the retry loop, as retryable. Kept the local loop, though, since if we can recover from a transient issue with just a local retry, we can avoid tearing the whole thing down and starting over, which seems nice. Also put my thoughts on why we should do something better than retry forever in a comment, to make myself feel better about adding it.

@ajwerner (Contributor) left a comment

:lgtm: - thanks

I'd like to update the job status on display in SHOW JOBS every time we fail in the outermost retry loop at some point in this release. Also, we have a chart showing the restarts in the Changefeed dashboard in the Admin UI if it makes you feel a bit better.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @dt, and @miretskiy)

@dt (Member, Author) commented May 30, 2020

bors r+

@craig bot commented May 30, 2020

Build succeeded

@craig craig bot merged commit 47a6a02 into cockroachdb:master May 30, 2020
@dt dt deleted the retry-confluent branch July 12, 2020 13:37