
pkg/sql/logictest/tests/local-mixed-23.2/local-mixed-23_2_test: TestLogic_union failed [aborted in DistSender: result is ambiguous] #127942

Closed
cockroach-teamcity opened this issue Jul 31, 2024 · 13 comments · Fixed by #133893
Labels
- branch-master: Failures and bugs on the master branch.
- branch-release-24.3: Used to mark GA and release blockers, technical advisories, and bugs for 24.3
- C-test-failure: Broken test (automatically or manually discovered).
- GA-blocker
- O-robot: Originated from a bot.
- P-2: Issues/test failures with a fix SLA of 3 months
- T-sql-queries: SQL Queries Team

Comments

cockroach-teamcity (Member) commented Jul 31, 2024

pkg/sql/logictest/tests/local-mixed-23.2/local-mixed-23_2_test.TestLogic_union failed with artifacts on master @ c7d1ceed692d36e9d7c5a71e7122408e2e3e3b8b:

=== RUN   TestLogic_union
    test_log_scope.go:170: test logs captured to: /artifacts/tmp/_tmp/5f7b4dcd658a4aaead5502f982e54f03/logTestLogic_union1559883951
    test_log_scope.go:81: use -show-logs to present logs inline
    test_server_shim.go:144: cluster virtualization disabled in global scope due to issue: #76378 (expected label: C-bug)
    logic.go:2968: 
         pq: txn already encountered an error; cannot be used anymore (previous err: aborted in DistSender: result is ambiguous: context canceled)
[05:48:47] --- done: /home/roach/.cache/bazel/_bazel_roach/c5a4e7d36696d9cd970af2045211a7df/sandbox/processwrapper-sandbox/3513/execroot/com_github_cockroachdb_cockroach/bazel-out/k8-fastbuild/bin/pkg/sql/logictest/tests/local-mixed-23.2/local-mixed-23_2_test_/local-mixed-23_2_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/union with config local-mixed-23.2: 102 tests, 1 failures
[05:48:49] --- total progress: 102 statements
--- total: 102 tests, 1 failures
    logic.go:4308: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/5f7b4dcd658a4aaead5502f982e54f03/logTestLogic_union1559883951
--- FAIL: TestLogic_union (4.19s)
Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/sql-foundations

This test on roachdash | Improve this report!

Jira issue: CRDB-40681

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) labels Jul 31, 2024
@rafiss rafiss changed the title pkg/sql/logictest/tests/local-mixed-23.2/local-mixed-23_2_test: TestLogic_union failed pkg/sql/logictest/tests/local-mixed-23.2/local-mixed-23_2_test: TestLogic_union failed [aborted in DistSender: result is ambiguous] Jul 31, 2024
@rafiss rafiss added T-sql-queries SQL Queries Team and removed T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) labels Jul 31, 2024
@github-project-automation github-project-automation bot moved this to Triage in SQL Queries Jul 31, 2024
michae2 (Collaborator) commented Aug 1, 2024

The failure in #128064 shows the testcase that was added by #127076, so that's a hint.

michae2 (Collaborator) commented Aug 9, 2024

I can hit this after about 400 runs using the following on a gceworker:

./dev testlogic base --config=local-legacy-schema-changer --files=union --ignore-cache --stress

rytaft (Collaborator) commented Aug 21, 2024

The failure in #128064 shows the testcase that was added by #127076, so that's a hint.

The backports of #127076 are still open. Do you think merging them will fix this issue, @michae2? cc @mgartner who is assigned to review those backports and merge them when ready.

michae2 (Collaborator) commented Aug 21, 2024

The backports of #127076 are still open. Do you think merging them will fix this issue, @michae2?

No, I think it was likely #127076 that caused this failure, so we should hold off on merging the backports until we fix this.

rytaft (Collaborator) commented Aug 21, 2024

Ohh I see, these failures are all on branch-master. I got confused because the test is 23.2 mixed version (or 24.1 in the other issue).

rytaft (Collaborator) commented Aug 21, 2024

In that case, we should definitely address this issue before 24.3. I'll mark this as a GA-blocker and apply P-2. We might need to look into this before @yuzefovich gets back (especially if these test failures seem to be relatively common).

@rytaft rytaft added GA-blocker P-2 Issues/test failures with a fix SLA of 3 months and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Aug 21, 2024
@rytaft rytaft moved this from Triage to 24.3 Release in SQL Queries Aug 27, 2024
@michae2 michae2 self-assigned this Sep 3, 2024
rafiss (Collaborator) commented Oct 2, 2024

I made this PR to attempt to deflake the test: #131680 (since it also caused #131324 to be opened, which landed on the SQL Foundations board). I haven't been able to repro under stress since merging that.

edit: looks like it failed in #132038

yuzefovich added a commit to yuzefovich/cockroach that referenced this issue Oct 30, 2024
This commit hardens the eager cancellation mechanism in the parallel
unordered synchronizer (PUS). It was recently fixed in dda8b3a, but the
newly added test exposed a bug where eager cancellation in a child PUS
could poison the query execution of another input of the parent PUS,
incorrectly failing the query altogether. A more detailed description can
be found [here](cockroachdb#127942 (comment)), but in short, because most
operators in a flow share the same leaf txn, eager cancellation of one
operator could poison the execution of another operator, which can only
happen with a hierarchy of PUSes. This commit fixes such a situation by
swallowing all context cancellation errors in the draining state of a PUS
_even if_ that particular PUS didn't eagerly cancel its inputs.

The rationale for why this behavior is safe is the following:
- If the query should result in an error, then some other error must have
been propagated to the client, and that is what caused the sync to
transition into the draining state in the first place. (We do replace
errors for the client in one case - see `DistSQLReceiver.SetError`, where
some errors from KV have higher priority than others - but it isn't
applicable here.)
- If the query should not result in an error and should succeed, yet we
have some pending context cancellation errors, then it must be the case
that query execution was short-circuited (e.g. because of a LIMIT), so we
can pretend the part of the execution that hit the pending error didn't
actually run, since clearly it wasn't necessary to compute the query
result.

Note that we can't swallow all types of errors in the draining state
(e.g. a ReadWithinUncertaintyIntervalError coming from the KV layer
"poisons" the txn, so we need to propagate it to the client), so we only
swallow a single error type.

Also note that another PUS is needed for this problem to occur because we
must have concurrency between the child PUS that performs the eager
cancellation and another operator that gets poisoned, and of the two
sources of concurrency within a single flow, only the PUS is applicable
(the other being outboxes, but we only have eager cancellation for local
plans).

Additionally, while working on this change I realized another reason why
we don't want to lift the restriction of having eager cancellation only
on "leaf" PUSes, so I extended the comment. This commit also adds a few
more logic tests.

Release note: None
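For illustration only, here is a minimal Go sketch of the swallowing rule described in the commit message above. It is not the actual parallel unordered synchronizer code; the type and method names are invented. It only captures the idea that, once draining, pending context cancellation errors are dropped while every other error (e.g. a ReadWithinUncertaintyIntervalError) is still propagated.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// drainingErrorFilter mimics the rule described above: once the synchronizer
// has transitioned into the draining state, pending context cancellation
// errors from its inputs are swallowed, while any other error is still
// propagated to the consumer.
type drainingErrorFilter struct {
	draining bool
}

// filter returns the error that should be propagated, or nil if it can be
// safely dropped.
func (f *drainingErrorFilter) filter(err error) error {
	if err == nil {
		return nil
	}
	// Only context cancellation is swallowed, and only while draining;
	// anything else (e.g. an error that "poisons" the txn) must still reach
	// the client.
	if f.draining && errors.Is(err, context.Canceled) {
		return nil
	}
	return err
}

func main() {
	f := &drainingErrorFilter{draining: true}
	fmt.Println(f.filter(context.Canceled))                // <nil>: swallowed
	fmt.Println(f.filter(errors.New("uncertainty error"))) // propagated
}
```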
yuzefovich added a commit to yuzefovich/cockroach that referenced this issue Nov 7, 2024
@craig craig bot closed this as completed in b37e365 Nov 7, 2024
@github-project-automation github-project-automation bot moved this from Active to Done in SQL Queries Nov 7, 2024
yuzefovich (Member) commented Nov 8, 2024

Re-opening till the backport merges.

@yuzefovich yuzefovich reopened this Nov 8, 2024
@github-project-automation github-project-automation bot moved this from Done to Triage in SQL Queries Nov 8, 2024
@yuzefovich yuzefovich moved this from Triage to Active in SQL Queries Nov 8, 2024
blathers-crl bot pushed a commit that referenced this issue Nov 8, 2024
@github-project-automation github-project-automation bot moved this from Active to Done in SQL Queries Nov 8, 2024
craig bot pushed a commit that referenced this issue Nov 11, 2024
134560: colexec: add session variable to disable eager cancellation r=yuzefovich a=yuzefovich

This commit adds a session variable that allows us to disable the eager cancellation that is performed by the parallel unordered synchronizer in local flows in some cases when it transitions into the draining state. This will serve as an escape hatch in case we find more issues with this feature.

Informs: #127043.
Informs: #127942.
Epic: None

Release note: None

134759: sql: fire DELETE triggers for cascading deletes r=DrewKimball a=DrewKimball

This commit fixes a bug in `onDeleteFastCascadeBuilder` which caused the incorrect type of trigger to fire. The issue is that `TriggerEventInsert` was supplied to `buildRowLevelBeforeTriggers` rather than `TriggerEventDelete` (a schematic sketch follows after this commit message).

Fixes #134741
Fixes #134745

Release note (bug fix): Fixed a bug that could cause DELETE triggers not to fire on cascading delete, and which could cause INSERT triggers to match incorrectly in the same scenario.

134781: changefeedccl: add log tag for processor type r=asg0451 a=andyyang890

This patch adds a log tag for the processor type so that we can easily
tell whether a log came from a changefeed's frontier or one of its
aggregators.

Epic: CRDB-37337

Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Drew Kimball <drewk@cockroachlabs.com>
Co-authored-by: Andy Yang <yang@cockroachlabs.com>
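For illustration only, a schematic Go sketch of the one-line fix described for #134759 above. This is not CockroachDB's optbuilder code: the surrounding types and signatures are stand-ins, and only the identifiers named in the commit message (`onDeleteFastCascadeBuilder`, `buildRowLevelBeforeTriggers`, `TriggerEventInsert`, `TriggerEventDelete`) come from it. The bug amounted to passing the INSERT event kind where the DELETE event kind was intended.

```go
package main

import "fmt"

// TriggerEventType is a stand-in for the trigger event kinds named in the
// PR description; it is not the real CockroachDB type.
type TriggerEventType int

const (
	TriggerEventInsert TriggerEventType = iota
	TriggerEventDelete
)

// buildRowLevelBeforeTriggers is a stand-in that only reports which event
// kind it was asked to fire BEFORE row-level triggers for.
func buildRowLevelBeforeTriggers(ev TriggerEventType) {
	fmt.Printf("firing BEFORE row-level triggers for event kind %d\n", ev)
}

// onDeleteFastCascadeBuilder sketches the cascading-delete path. Before the
// fix the INSERT event kind was supplied here; the fix is to supply the
// DELETE event kind so that DELETE triggers fire for cascaded rows.
func onDeleteFastCascadeBuilder() {
	buildRowLevelBeforeTriggers(TriggerEventDelete) // was: TriggerEventInsert
}

func main() {
	onDeleteFastCascadeBuilder()
}
```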