
release-24.3: colexec: harden eager cancellation in parallel unordered sync #134609

Merged
merged 1 commit into from
Nov 8, 2024

Conversation

@blathers-crl blathers-crl bot commented Nov 8, 2024

Backport 1/1 commits from #133893 on behalf of @yuzefovich.

/cc @cockroachdb/release


This commit hardens the eager cancellation mechanism in the parallel unordered synchronizer (PUS). It was recently fixed in dda8b3a, but the newly added test exposed a bug where the eager cancellation in a child PUS could poison the query execution of another input of the parent PUS, incorrectly failing the query altogether. A more detailed description can be found in #127942 (comment), but in short, due to the sharing of the same leaf txn between most operators in a flow, eager cancellation of one operator could poison the execution of another operator, which could only happen with a hierarchy of PUSes. This commit fixes this situation by swallowing all context cancellation errors in the draining state of a PUS, even if that particular PUS didn't eagerly cancel its inputs.

The rationale for why this behavior is safe is the following:

  • if the query should result in an error, then some other error must have been propagated to the client, and this is what caused the sync to transition into the draining state in the first place. (We do replace errors for the client in one case - see `DistSQLReceiver.SetError`, where some errors from KV have higher priority than others - but it isn't applicable here.)
  • if the query should not result in an error and should succeed, yet we have some pending context cancellation errors, then it must be the case that query execution was short-circuited (e.g. because of the LIMIT), so we can pretend the part of the execution that hit the pending error didn't actually run since clearly it wasn't necessary to compute the query result.

Note that we cannot swallow all types of errors in the draining state (e.g. a ReadWithinUncertaintyIntervalError coming from the KV layer "poisons" the txn, so we need to propagate it to the client), so we swallow only a single error type: context cancellation.

Also note that a second PUS is needed for this problem to occur, because we must have concurrency between the child PUS that performs the eager cancellation and the operator that gets poisoned. Of the two sources of concurrency within a single flow, only the PUS is applicable (the other is outboxes, but eager cancellation applies only to local plans).

Additionally, while working on this change I realized another reason why we don't want to lift the restriction of having eager cancellation only on "leaf" PUSes, so I extended the comment. This commit also adds a few more logic tests.

Fixes: #127942.

Release note: None


Release justification: bug fix.

@blathers-crl blathers-crl bot force-pushed the blathers/backport-release-24.3-133893 branch from 0bf865a to 0818fe3 Compare November 8, 2024 02:06
@blathers-crl blathers-crl bot requested a review from a team as a code owner November 8, 2024 02:06
@blathers-crl blathers-crl bot requested review from mw5h and removed request for a team November 8, 2024 02:06
@blathers-crl blathers-crl bot added blathers-backport This is a backport that Blathers created automatically. O-robot Originated from a bot. labels Nov 8, 2024
blathers-crl bot commented Nov 8, 2024

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Backports should only be created for serious issues or test-only changes.
  • Backports should not break backwards-compatibility.
  • Backports should change as little code as possible.
  • Backports should not change on-disk formats or node communication protocols.
  • Backports should not add new functionality (except as defined here).
  • Backports must not add, edit, or otherwise modify cluster versions; or add version gates.
  • All backports must be reviewed by the owning area's TL. For more information as to how that review should be conducted, please consult the backport policy.
If your backport adds new functionality, please ensure that the following additional criteria are satisfied:
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters. State changes must be further protected such that nodes running old binaries will not be negatively impacted by the new state (with a mixed version test added).
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.
  • Your backport must be accompanied by a post to the appropriate Slack channel (#db-backports-point-releases or #db-backports-XX-X-release) for awareness and discussion.

Also, please add a brief release justification to the body of your PR to justify this backport.

@blathers-crl blathers-crl bot added the backport Label PR's that are backports to older release branches label Nov 8, 2024
@cockroach-teamcity

This change is Reviewable

@yuzefovich yuzefovich removed the request for review from mw5h November 8, 2024 02:07
@yuzefovich yuzefovich merged commit 8c0dabf into release-24.3 Nov 8, 2024
20 of 21 checks passed
@yuzefovich yuzefovich deleted the blathers/backport-release-24.3-133893 branch November 8, 2024 16:37