pkg/sql/logictest/tests/local-mixed-23.2/local-mixed-23_2_test: TestLogic_union failed [aborted in DistSender: result is ambiguous] #127942
I can hit this after about 400 runs using the following on a gceworker:
Ohh I see, these failures are all on branch-master. I got confused because the test is the 23.2 mixed-version one (or 24.1 in the other issue).
In that case, we should definitely address this issue before 24.3. I'll mark this as a GA-blocker and apply P-2. We might need to look into this before @yuzefovich gets back (especially if these test failures turn out to be relatively common).
This commit hardens the eager cancellation mechanism in the parallel unordered synchronizer (PUS). It was recently fixed in dda8b3a, but the newly added test exposed a bug where eager cancellation in a child PUS could poison the query execution of another input of the parent PUS, incorrectly failing the query altogether. A more detailed description can be found [here](cockroachdb#127942 (comment)), but in short: because most operators in a flow share the same leaf txn, eager cancellation of one operator could poison the execution of another operator, which could only happen with a hierarchy of PUSes. This commit fixes the situation by swallowing all context cancellation errors in the draining state of a PUS _even if_ that particular PUS didn't eagerly cancel its inputs.

The rationale for why this behavior is safe is the following:
- If the query should result in an error, then some other error must have been propagated to the client, and that is what caused the sync to transition into the draining state in the first place. (We do replace errors for the client in one case - see `DistSQLReceiver.SetError`, where some errors from KV have higher priority than others - but it isn't applicable here.)
- If the query should not result in an error and should succeed, yet we have some pending context cancellation errors, then query execution must have been short-circuited (e.g. because of a LIMIT), so we can pretend the part of the execution that hit the pending error didn't actually run, since it clearly wasn't necessary to compute the query result.

Note that we couldn't swallow all types of errors in the draining state (e.g. a ReadWithinUncertaintyIntervalError coming from the KV layer "poisons" the txn, so we need to propagate it to the client), so we only swallow a single error type.

Also note that having another PUS is needed for this problem to occur because we must have concurrency between the child PUS that performs the eager cancellation and another operator that gets poisoned, and while we have two sources of concurrency within a single flow, only the PUS is applicable (the other being outboxes, but we only have eager cancellation for local plans).

Additionally, while working on this change I realized another reason why we don't want to lift the restriction of having eager cancellation only on "leaf" PUSes, so I extended the comment. This commit also adds a few more logic tests.

Release note: None
Re-opening till the backport merges.
134560: colexec: add session variable to disable eager cancellation r=yuzefovich a=yuzefovich

This commit adds a session variable that allows us to disable the eager cancellation performed by the parallel unordered synchronizer in local flows in some cases when it transitions into the draining state. This will serve as an escape hatch in case we find more issues with this feature. Informs: #127043. Informs: #127942. Epic: None. Release note: None

134759: sql: fire DELETE triggers for cascading deletes r=DrewKimball a=DrewKimball

This commit fixes a bug in `onDeleteFastCascadeBuilder` that caused the incorrect type of trigger to fire: `TriggerEventInsert` was supplied to `buildRowLevelBeforeTriggers` rather than `TriggerEventDelete`. Fixes #134741. Fixes #134745. Release note (bug fix): Fixed a bug that could cause DELETE triggers not to fire on a cascading delete, and which could cause INSERT triggers to match incorrectly in the same scenario.

134781: changefeedccl: add log tag for processor type r=asg0451 a=andyyang890

This patch adds a log tag for the processor type so that we can easily tell whether a log came from a changefeed's frontier or one of its aggregators. Epic: CRDB-37337. Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Drew Kimball <drewk@cockroachlabs.com>
Co-authored-by: Andy Yang <yang@cockroachlabs.com>
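The trigger bug in 134759 comes down to passing the wrong event constant. The following Go sketch mirrors the identifiers named in the PR description (`onDeleteFastCascadeBuilder`, `buildRowLevelBeforeTriggers`, the event constants), but the surrounding code is illustrative, not CockroachDB's actual implementation.

```go
package main

import "fmt"

type triggerEvent string

const (
	triggerEventInsert triggerEvent = "INSERT"
	triggerEventDelete triggerEvent = "DELETE"
)

// buildRowLevelBeforeTriggers stands in for the real builder; it just
// reports which kind of row-level BEFORE triggers would be matched.
func buildRowLevelBeforeTriggers(ev triggerEvent) string {
	return fmt.Sprintf("firing BEFORE %s triggers", ev)
}

func onDeleteFastCascade() string {
	// The bug: the cascade builder supplied triggerEventInsert here, so
	// DELETE triggers never fired on cascading deletes and INSERT triggers
	// could match instead. The fix is to pass triggerEventDelete.
	return buildRowLevelBeforeTriggers(triggerEventDelete)
}

func main() {
	fmt.Println(onDeleteFastCascade()) // firing BEFORE DELETE triggers
}
```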
pkg/sql/logictest/tests/local-mixed-23.2/local-mixed-23_2_test.TestLogic_union failed with artifacts on master @ c7d1ceed692d36e9d7c5a71e7122408e2e3e3b8b:
See also: How To Investigate a Go Test Failure (internal)
Jira issue: CRDB-40681