changefeedccl: retry posting to confluent registry up to 10 times #48759
Conversation
}
var res confluentSchemaVersionResponse
if err := gojson.NewDecoder(resp.Body).Decode(&res); err != nil {
var id int32
Another option here is to mark the error as retriable and bubble back up to the retry loop of the job.
❌ The GitHub CI (Cockroach) build has failed on d3c71fc0.
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
Reviewed 1 of 1 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt and @miretskiy)
pkg/ccl/changefeedccl/encoder.go, line 514 at r1 (raw file):
return nil
}); err != nil {
	return 0, err
To make this more concrete: consider wrapping the returned error here with MarkRetryableError. That being said, we may want to do some deeper inspection.
Another concern I have is about visibility. Should we have a log.Every on the encoder and then log on errors?
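For concreteness, here is a self-contained sketch of the marking idea being suggested: a sentinel wrapper type that an outer job-level loop could detect with errors.As. The type and helper names are illustrative stand-ins, not the actual changefeedccl MarkRetryableError implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// retryableError is an illustrative stand-in for the marker the reviewers
// refer to as MarkRetryableError; the real changefeedccl type may differ.
type retryableError struct{ wrapped error }

func (e *retryableError) Error() string { return "retryable: " + e.wrapped.Error() }
func (e *retryableError) Unwrap() error { return e.wrapped }

// markRetryableError tags an error so an outer retry loop treats it as
// transient instead of failing the changefeed.
func markRetryableError(err error) error { return &retryableError{wrapped: err} }

// isRetryableError reports whether err was tagged as retryable.
func isRetryableError(err error) bool {
	var r *retryableError
	return errors.As(err, &r)
}

func main() {
	err := markRetryableError(fmt.Errorf("registry returned 503"))
	fmt.Println(isRetryableError(err)) // true
}
```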
I think I'm fine with retrying here, around the specific request, instead of just bubbling the failure up and making it tear everything down and try again -- if we know we have a localized source of flakes, why not deploy a local solution? We have similar retries around other sometimes-flaky connections to external network resources (e.g. in cloud storage connectors), so I'm ~fine sticking with the pattern.

I can add a log line, but if there is a transient error, won't we just move on? And if it is a persistent one, won't it exhaust the retry limit and bubble up after 30s? I think eventually what we want is a counter that can go into the job status itself, in case you're consistently getting, say, 5 retries that look like silent slowness, but I'm dubious that you'd have such a consistent issue that you notice the slowness yet never see an error bubble up.
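A minimal standalone sketch of the local retry being described: a bounded number of attempts around the registry request, with simple backoff and a log line per failure. The real change presumably uses the repo's retry and logging utilities; the function names, backoff, and failure simulation here are assumptions.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// postToRegistry stands in for the HTTP POST to the Confluent schema
// registry; it fails randomly to simulate transient flakes.
func postToRegistry() (int32, error) {
	if rand.Intn(3) != 0 {
		return 0, fmt.Errorf("registry unreachable")
	}
	return 42, nil // schema id
}

// registerWithRetry retries the registry request locally, with simple
// linear backoff, before giving up and letting the caller decide what to do.
func registerWithRetry(maxAttempts int) (int32, error) {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		id, err := postToRegistry()
		if err == nil {
			return id, nil
		}
		lastErr = err
		fmt.Printf("attempt %d failed: %v\n", attempt, err)
		time.Sleep(time.Duration(attempt) * 100 * time.Millisecond)
	}
	return 0, fmt.Errorf("giving up after %d attempts: %w", maxAttempts, lastErr)
}

func main() {
	id, err := registerWithRetry(10)
	fmt.Println(id, err)
}
```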
Force-pushed from 710beb0 to 8a71dcf
and if it is a persistent one, won't it exhaust the retry limit and bubble up after 30s?
Well, I think there are classes of errors where the registry may be down for minutes to hours where we don't obviously want to kill the changefeed (i.e. move it to FAILED). I am totally on board with some internal retries here but I still think we want to mark the error as retriable for the higher level.
I think there's a question around what sort of error should move a changefeed to failed. My sense is that it shouldn't be that many. You should be able to kill the sink server for a relatively extended period of time (hours?) and have the changefeed recover on its own.
That's my 2c; if you feel differently, I'm happy to acquiesce.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @dt, and @miretskiy)
hm. I'm not sure I currently agree: right now a "running" job means "I'm doing the thing you asked", but if something is broken and we can't do that thing, claiming to be running while we sit in a never-ending retry loop seems bad. The user may justifiably blame us for the fact they're not seeing rows in their sink. If something outside our control is busted and means we can't do what we were asked to do, I'd rather throw up our hands and explain the situation than appear to just do nothing.

Hopefully a user monitors their jobs and will be alerted by a failure, or, if they have other monitoring that causes them to come look at the jobs, a failure should clearly identify what went wrong. Maybe in the future we can add more fine-grained job states to communicate failed-but-will-try-again, but right now, given the boolean choice between running fine and failed, I think if we can't contact the registry after 10 attempts in a row, we should assume we're not going to be able to until an operator intervenes, and say so.
My counter argument there is that the logic I'm proposing, to retry forever, already exists for the Kafka sink if your error is in talking to the sink rather than the schema registry. I fully agree that the visibility story on internally retrying changefeeds is abysmal - we should have some state for changefeeds which are hitting these errors and surface it to the user. That being said, I don't think we want to roll back the internal retry behavior for sinks, so not retrying here would be inconsistent.

See this commit from Dan: danhhz@7f33dd7, and specifically the errorWrapperSink. The reason that logic doesn't apply here is that we abstracted the confluent schema registry under the encoder rather than the sink.
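A rough sketch of the errorWrapperSink pattern being referenced: a wrapper that marks every error from the underlying sink as retryable, so sink flakes restart the feed instead of failing it. The interface and the wrapping shown here are simplified assumptions, not the actual changefeedccl code.

```go
package main

import "fmt"

// Sink is a pared-down stand-in for the changefeed sink interface; the
// real interface has more methods (EmitResolvedTimestamp, Flush, ...).
type Sink interface {
	EmitRow(key, value []byte) error
}

// errorWrapperSink illustrates the pattern from the referenced commit:
// every error from the wrapped sink is tagged as retryable before it is
// returned, so the job-level loop keeps the changefeed alive.
type errorWrapperSink struct{ wrapped Sink }

func (s errorWrapperSink) EmitRow(key, value []byte) error {
	if err := s.wrapped.EmitRow(key, value); err != nil {
		// Illustrative tagging; the real code wraps with its retryable marker.
		return fmt.Errorf("retryable changefeed error: %w", err)
	}
	return nil
}

// flakySink always fails, standing in for a temporarily unreachable broker.
type flakySink struct{}

func (flakySink) EmitRow(key, value []byte) error {
	return fmt.Errorf("kafka: broker unavailable")
}

func main() {
	s := errorWrapperSink{wrapped: flakySink{}}
	fmt.Println(s.EmitRow([]byte("k"), []byte("v")))
}
```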
Force-pushed from c4f567e to 526cb99
Okay, tweaked this to mark the error returned, if it makes it out of the retry loop, as retryable. Kept the local loop though, since if we can recover from a transient issue with just a local retry, we can avoid tearing the whole thing down and starting over, which seems nice. Also put my thoughts on why we should do something better than retry forever in a comment, to make myself feel better about adding it.
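Putting the two pieces together, a compact sketch of the approach described: a bounded local retry, with the final error marked retryable only if the local loop is exhausted, so the job-level loop takes over. Names and shapes are illustrative, not the merged code.

```go
package main

import "fmt"

// errRetryable wraps an error so an outer changefeed retry loop would treat
// it as transient; illustrative only, not the actual changefeedccl type.
type errRetryable struct{ cause error }

func (e *errRetryable) Error() string { return e.cause.Error() }
func (e *errRetryable) Unwrap() error { return e.cause }

// register attempts the registry call a bounded number of times; if every
// attempt fails, the final error is marked retryable so the job-level loop
// can keep the changefeed alive instead of moving it to FAILED.
func register(post func() (int32, error), attempts int) (int32, error) {
	var err error
	for i := 0; i < attempts; i++ {
		var id int32
		if id, err = post(); err == nil {
			return id, nil
		}
	}
	return 0, &errRetryable{cause: fmt.Errorf("after %d attempts: %w", attempts, err)}
}

func main() {
	_, err := register(func() (int32, error) { return 0, fmt.Errorf("503") }, 10)
	fmt.Println(err)
}
```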
At some point in this release, I'd like to update the job status shown in SHOW JOBS every time we fail in the outermost retry loop. Also, we have a chart showing the restarts in the Changefeed dashboard in the Admin UI, if it makes you feel a bit better.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @dt, and @miretskiy)
bors r+
Build succeeded
Sometimes we see timeouts or other issues contacting the schema registry.
However, it would be nice if transient errors could be handled gracefully
without bubbling up to the user.
Release note (enterprise change): Changefeeds now retry after encountering transient errors contacting the Confluent Schema Registry.