
pkg/sql/logictest/tests/local-mixed-23.2/local-mixed-23_2_test: TestLogic_union failed [aborted in DistSender: result is ambiguous] #127942

Closed
cockroach-teamcity opened this issue Jul 31, 2024 · 13 comments · Fixed by #133893
Labels
- branch-master: Failures and bugs on the master branch.
- branch-release-24.3: Used to mark GA and release blockers, technical advisories, and bugs for 24.3
- C-test-failure: Broken test (automatically or manually discovered).
- GA-blocker
- O-robot: Originated from a bot.
- P-2: Issues/test failures with a fix SLA of 3 months
- T-sql-queries: SQL Queries Team

Comments

cockroach-teamcity (Member) commented Jul 31, 2024

pkg/sql/logictest/tests/local-mixed-23.2/local-mixed-23_2_test.TestLogic_union failed with artifacts on master @ c7d1ceed692d36e9d7c5a71e7122408e2e3e3b8b:

=== RUN   TestLogic_union
    test_log_scope.go:170: test logs captured to: /artifacts/tmp/_tmp/5f7b4dcd658a4aaead5502f982e54f03/logTestLogic_union1559883951
    test_log_scope.go:81: use -show-logs to present logs inline
    test_server_shim.go:144: cluster virtualization disabled in global scope due to issue: #76378 (expected label: C-bug)
    logic.go:2968: 
         pq: txn already encountered an error; cannot be used anymore (previous err: aborted in DistSender: result is ambiguous: context canceled)
[05:48:47] --- done: /home/roach/.cache/bazel/_bazel_roach/c5a4e7d36696d9cd970af2045211a7df/sandbox/processwrapper-sandbox/3513/execroot/com_github_cockroachdb_cockroach/bazel-out/k8-fastbuild/bin/pkg/sql/logictest/tests/local-mixed-23.2/local-mixed-23_2_test_/local-mixed-23_2_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/union with config local-mixed-23.2: 102 tests, 1 failures
[05:48:49] --- total progress: 102 statements
--- total: 102 tests, 1 failures
    logic.go:4308: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/5f7b4dcd658a4aaead5502f982e54f03/logTestLogic_union1559883951
--- FAIL: TestLogic_union (4.19s)
Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/sql-foundations

This test on roachdash | Improve this report!

Jira issue: CRDB-40681

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) labels Jul 31, 2024
@rafiss rafiss changed the title pkg/sql/logictest/tests/local-mixed-23.2/local-mixed-23_2_test: TestLogic_union failed pkg/sql/logictest/tests/local-mixed-23.2/local-mixed-23_2_test: TestLogic_union failed [aborted in DistSender: result is ambiguous] Jul 31, 2024
@rafiss rafiss added T-sql-queries SQL Queries Team and removed T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) labels Jul 31, 2024
@github-project-automation github-project-automation bot moved this to Triage in SQL Queries Jul 31, 2024
michae2 (Collaborator) commented Aug 1, 2024

The failure in #128064 shows the testcase that was added by #127076, so that's a hint.

michae2 (Collaborator) commented Aug 9, 2024

I can hit this after about 400 runs using the following on a gceworker:

./dev testlogic base --config=local-legacy-schema-changer --files=union --ignore-cache --stress

rytaft (Collaborator) commented Aug 21, 2024

The failure in #128064 shows the testcase that was added by #127076, so that's a hint.

The backports of #127076 are still open. Do you think merging them will fix this issue, @michae2? cc @mgartner who is assigned to review those backports and merge them when ready.

michae2 (Collaborator) commented Aug 21, 2024

The backports of #127076 are still open. Do you think merging them will fix this issue, @michae2?

No, I think it was likely #127076 that caused this failure, so we should hold off on merging the backports until we fix this.

rytaft (Collaborator) commented Aug 21, 2024

Ohh I see, these failures are all on branch-master. I got confused because the test is 23.2 mixed version (or 24.1 in the other issue).

rytaft (Collaborator) commented Aug 21, 2024

In that case, we should definitely address this issue before 24.3. I'll mark this as a GA-blocker and apply P-2. We might need to look into this before @yuzefovich gets back (especially if these test failures seem to be relatively common).

@rytaft rytaft added GA-blocker P-2 Issues/test failures with a fix SLA of 3 months and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Aug 21, 2024
@rytaft rytaft moved this from Triage to 24.3 Release in SQL Queries Aug 27, 2024
@michae2 michae2 self-assigned this Sep 3, 2024
rafiss (Collaborator) commented Oct 2, 2024

I made this PR to attempt to deflake the test: #131680 (since it also caused #131324 to be opened, which landed on the SQL Foundations board). I haven't been able to repro under stress since merging that.

edit: looks like it failed in #132038

yuzefovich added a commit to yuzefovich/cockroach that referenced this issue Oct 30, 2024
This commit hardens the eager cancellation mechanism in the parallel
unordered synchronizer (PUS). It was recently fixed in dda8b3a, but the
newly added test exposed a bug where eager cancellation in a child PUS
could poison the query execution of another input of the parent PUS,
incorrectly failing the query altogether. A more detailed description can
be found [here](cockroachdb#127942 (comment)), but in short, because most
operators in a flow share the same leaf txn, eager cancellation of one
operator could poison the execution of another operator, which can only
happen with a hierarchy of PUSes. This commit fixes such a situation by
swallowing all context cancellation errors in the draining state of a PUS
_even if_ that particular PUS didn't eagerly cancel its inputs.

The rationale for why this behavior is safe is the following:
- If the query should result in an error, then some other error must have
been propagated to the client, and that is what caused the sync to
transition into the draining state in the first place. (We do replace
errors for the client in one case - see `DistSQLReceiver.SetError`, where
some errors from KV have higher priority than others - but it isn't
applicable here.)
- If the query should not result in an error and should succeed, yet we
have some pending context cancellation errors, then it must be the case
that query execution was short-circuited (e.g. because of a LIMIT), so we
can pretend the part of the execution that hit the pending error didn't
actually run, since clearly it wasn't necessary to compute the query
result.

Note that we can't swallow all types of errors in the draining state
(e.g. a ReadWithinUncertaintyIntervalError coming from the KV layer
"poisons" the txn, so we need to propagate it to the client), so we only
swallow a single error type.

Also note that another PUS is needed for this problem to occur because we
must have concurrency between the child PUS that performs the eager
cancellation and another operator that gets poisoned, and of the two
sources of concurrency within a single flow, only the PUS is applicable
(the other being outboxes, but we only have eager cancellation for local
plans).

Additionally, while working on this change I realized another reason why
we don't want to lift the restriction of having eager cancellation only
on "leaf" PUSes, so I extended the comment. This commit also adds a few
more logic tests.

Release note: None
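For illustration only, here is a minimal Go sketch of the swallowing rule described in the commit message above. It is not the actual parallel unordered synchronizer code; the type and method names are invented. It only captures the idea that, once draining, pending context cancellation errors are dropped while every other error (e.g. a ReadWithinUncertaintyIntervalError) is still propagated.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// drainingErrorFilter mimics the rule described above: once the synchronizer
// has transitioned into the draining state, pending context cancellation
// errors from its inputs are swallowed, while any other error is still
// propagated to the consumer.
type drainingErrorFilter struct {
	draining bool
}

// filter returns the error that should be propagated, or nil if it can be
// safely dropped.
func (f *drainingErrorFilter) filter(err error) error {
	if err == nil {
		return nil
	}
	// Only context cancellation is swallowed, and only while draining;
	// anything else (e.g. an error that "poisons" the txn) must still reach
	// the client.
	if f.draining && errors.Is(err, context.Canceled) {
		return nil
	}
	return err
}

func main() {
	f := &drainingErrorFilter{draining: true}
	fmt.Println(f.filter(context.Canceled))                // <nil>: swallowed
	fmt.Println(f.filter(errors.New("uncertainty error"))) // propagated
}
```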
yuzefovich added a commit to yuzefovich/cockroach that referenced this issue Nov 7, 2024
@craig craig bot closed this as completed in b37e365 Nov 7, 2024
@github-project-automation github-project-automation bot moved this from Active to Done in SQL Queries Nov 7, 2024
yuzefovich (Member) commented Nov 8, 2024

Re-opening till the backport merges.

@yuzefovich yuzefovich reopened this Nov 8, 2024
@github-project-automation github-project-automation bot moved this from Done to Triage in SQL Queries Nov 8, 2024
@yuzefovich yuzefovich moved this from Triage to Active in SQL Queries Nov 8, 2024
blathers-crl bot pushed a commit that referenced this issue Nov 8, 2024
@github-project-automation github-project-automation bot moved this from Active to Done in SQL Queries Nov 8, 2024
craig bot pushed a commit that referenced this issue Nov 11, 2024
134560: colexec: add session variable to disable eager cancellation r=yuzefovich a=yuzefovich

This commit adds a session variable that allows us to disable the eager cancellation that is performed by the parallel unordered synchronizer in local flows in some cases when it transitions into the draining state. This will serve as an escape hatch in case we find more issues with this feature.

Informs: #127043.
Informs: #127942.
Epic: None

Release note: None

134759: sql: fire DELETE triggers for cascading deletes r=DrewKimball a=DrewKimball

This commit fixes a bug in `onDeleteFastCascadeBuilder` which caused the incorrect type of trigger to fire. The issue is that `TriggerEventInsert` was supplied to `buildRowLevelBeforeTriggers` rather than `TriggerEventDelete` (a schematic sketch follows after this commit message).

Fixes #134741
Fixes #134745

Release note (bug fix): Fixed a bug that could cause DELETE triggers not to fire on cascading delete, and which could cause INSERT triggers to match incorrectly in the same scenario.

134781: changefeedccl: add log tag for processor type r=asg0451 a=andyyang890

This patch adds a log tag for the processor type so that we can easily
tell whether a log came from a changefeed's frontier or one of its
aggregators.

Epic: CRDB-37337

Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Drew Kimball <drewk@cockroachlabs.com>
Co-authored-by: Andy Yang <yang@cockroachlabs.com>
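For illustration only, a schematic Go sketch of the one-line fix described for #134759 above. This is not CockroachDB's optbuilder code: the surrounding types and signatures are stand-ins, and only the identifiers named in the commit message (`onDeleteFastCascadeBuilder`, `buildRowLevelBeforeTriggers`, `TriggerEventInsert`, `TriggerEventDelete`) come from it. The bug amounted to passing the INSERT event kind where the DELETE event kind was intended.

```go
package main

import "fmt"

// TriggerEventType is a stand-in for the trigger event kinds named in the
// PR description; it is not the real CockroachDB type.
type TriggerEventType int

const (
	TriggerEventInsert TriggerEventType = iota
	TriggerEventDelete
)

// buildRowLevelBeforeTriggers is a stand-in that only reports which event
// kind it was asked to fire BEFORE row-level triggers for.
func buildRowLevelBeforeTriggers(ev TriggerEventType) {
	fmt.Printf("firing BEFORE row-level triggers for event kind %d\n", ev)
}

// onDeleteFastCascadeBuilder sketches the cascading-delete path. Before the
// fix the INSERT event kind was supplied here; the fix is to supply the
// DELETE event kind so that DELETE triggers fire for cascaded rows.
func onDeleteFastCascadeBuilder() {
	buildRowLevelBeforeTriggers(TriggerEventDelete) // was: TriggerEventInsert
}

func main() {
	onDeleteFastCascadeBuilder()
}
```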