Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql,pgerror: improve IsSQLRetryableError and IsDistSQLRetryable #82847

Open
ajwerner opened this issue Jun 13, 2022 · 3 comments
Open

sql,pgerror: improve IsSQLRetryableError and IsDistSQLRetryable #82847

ajwerner opened this issue Jun 13, 2022 · 3 comments
Labels
A-error-handling Error messages, error propagations/annotations A-sql-pgwire pgwire protocol issues. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)

Comments

@ajwerner
Copy link
Contributor

ajwerner commented Jun 13, 2022

Describe the problem

The string error matching is brittle. The goals are good. Retrying when an error might be permanent no longer hurts in the way it once did: jobs use exponential backoff and record a log of failures. The user can cancel a job if the system makes the wrong decision systematically.

matched, merr := regexp.MatchString(
"(no inbound stream connection|connection reset by peer|connection refused|failed to send RPC|rpc error: code = Unavailable|EOF|result is ambiguous)",
errString)

Jira issue: CRDB-16692

@ajwerner ajwerner added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Jun 13, 2022
@ajwerner
Copy link
Contributor Author

https://cockroachlabs.slack.com/archives/C0HM2DZ34/p1655150171201549 contains some internal discussion on the topic.

@knz knz added A-sql-pgwire pgwire protocol issues. A-error-handling Error messages, error propagations/annotations labels Jun 13, 2022
@blathers-crl blathers-crl bot added T-sql-queries SQL Queries Team T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) labels Jun 13, 2022
@knz
Copy link
Contributor

knz commented Jun 14, 2022

@dt's summary:

currently the jobs system has a retry loop with built-in backoff, and many jobs themselves have retry loops either around their entire work function or specific phases. The generic jobs system one and its exponential backoff is not always appropriate for all jobs, nor is its generic “should retry” logic easy to maintain to the satisfaction of all different jobs. On the other hand, if a job’s code crashes a node, the job’s retry/backoff/etc code doesn't have the option to do its own, more specific thing, and we need this system level backoff around even invoking the jobs code at all to avoid crashing all nodes in quick succession.

One proposed answer is to simplify/dumb down the system-builtin logic to have no retry on error, and thus no need to try to classify different errors from different jobs in one place. The system would instead just rate limit how often it is willing to call the job code between getting some sort of return value from it: record timestamp, invoke job. If you notice job needs to be invoked again (because the node that invoked it crashed), wait some minimum time (60s?). In any other situation, it is up to the job itself to elect to retry certain errors, how long to wait between those tries and when to emit a terminal success or failure. As one special case, emitting an error with a particular marker could be non-terminal and just surrender the execution lease which would clear the invoke backoff and allow another node to invoke immediately

@ajwerner
Copy link
Contributor Author

I think that summary is somewhat orthogonal to this issue; it is something of an aside related to retry responsibility of jobs. The function here is not generic for the whole job system, it's just used by a couple of jobs to decide whether whether to move to the failed state or not.

@exalate-issue-sync exalate-issue-sync bot removed the T-sql-queries SQL Queries Team label Jun 1, 2023
@yuzefovich yuzefovich changed the title sql,pgerror: improve IsSQLRetryableError sql,pgerror: improve IsSQLRetryableError and IsDistSQLRetryable Aug 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-error-handling Error messages, error propagations/annotations A-sql-pgwire pgwire protocol issues. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)
Projects
None yet
Development

No branches or pull requests

2 participants