release-23.2: sql: don't parallelize *any* FK and unique checks containing locking #118931

Merged
merged 3 commits into cockroachdb:release-23.2 from backport23.2-118722 on Feb 13, 2024

Conversation

@michae2 michae2 (Collaborator) commented Feb 8, 2024

Backport 3/3 commits from #118722.

/cc @cockroachdb/release


sql, execbuilder: move plan flags from planTop to planComponents

Execbuild produces a sql.planComponents containing the main query
plan, plus plans for any subqueries, checks, and cascades. In the normal
SQL code path (sql.(*optPlanningCtx).runExecBuilder) we copy this
planComponents into a sql.planTop, which is then decorated with extra
information gathered from the Builder.

However, there are other users of execbuild that only work with the
planComponents, and don't have a planTop. In the next commit, one such
user, PlanAndRunCascadesAndChecks, needs to see some of the plan flags
set when building FK check plans, but doesn't have access to the Builder
which gathered those plans.

To solve this, this commit moves plan flags from the planTop to
planComponents, and then refactors the execbuilder and exec factory to
set some of these flags when creating the planComponents.

Informs: #118720

Epic: None

Release note: None
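
As a rough illustration of the shape of this refactor, here is a minimal sketch; the field and flag names below are assumptions for illustration, not the actual CockroachDB definitions:

package sql

// planFlags is a bitfield of properties recorded while building a plan.
// (Sketch only; names are assumed.)
type planFlags uint32

// planComponents holds the main plan plus subquery, check, and cascade plans.
// After the refactor it also carries the flags, so code paths that only have
// a planComponents (such as cascade planning) can inspect them.
type planComponents struct {
	flags planFlags
	// main, subqueryPlans, checkPlans, cascades, ...
}

// planTop wraps planComponents with extra per-statement information gathered
// from the Builder; before the refactor, the flags lived here instead.
type planTop struct {
	planComponents
	// instrumentation, auditing info, ...
}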


sql: don't parallelize any FK and unique checks containing locking

In #111896 we disallowed parallel execution of FK and unique checks that
contain locking, to avoid usage of LeafTxns. But #111896 only considered
checks built during planning of the main query, and overlooked checks
built during planning of FK cascades.

This commit also considers checks built during planning of FK cascades.

Fixes: #118720

Epic: None

Release note (bug fix): This commit fixes an internal error with a message
like "LeafTxn ... incompatible with locking request" that occurs when
performing an update under read committed isolation which cascades to a
table with multiple other foreign keys.


sql: rename *ContainsNonDefaultKeyLocking to *ContainsLocking

"Default" locking here simply means no locking. I find this convention
of saying "non-default key locking" a little verbose and confusing.

Epic: None

Release note: None


Release justification: fixes a bug that prevents some update statements from being executed under read committed isolation.

@michae2 michae2 requested a review from a team as a code owner February 8, 2024 01:10
@michae2 michae2 requested review from rharding6373 and removed request for a team February 8, 2024 01:10

blathers-crl bot commented Feb 8, 2024

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Backports should only be created for serious issues or test-only changes.
  • Backports should not break backwards-compatibility.
  • Backports should change as little code as possible.
  • Backports should not change on-disk formats or node communication protocols.
  • Backports should not add new functionality (except as defined here).
  • Backports must not add, edit, or otherwise modify cluster versions; or add version gates.
  • All backports must be reviewed by the owning area's TL and one additional TL. For more information as to how that review should be conducted, please consult the backport policy.

If your backport adds new functionality, please ensure that the following additional criteria are satisfied:

  • There is a high-priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters. State changes must be further protected such that nodes running old binaries will not be negatively impacted by the new state (with a mixed-version test added).
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.
  • Your backport must be accompanied by a post to the appropriate Slack channel (#db-backports-point-releases or #db-backports-XX-X-release) for awareness and discussion.

Also, please add a brief release justification to the body of your PR to justify this backport.

@blathers-crl blathers-crl bot added the backport label (PRs that are backports to older release branches) on Feb 8, 2024
@michae2 michae2 requested a review from yuzefovich February 8, 2024 01:10
@cockroach-teamcity (Member) commented

This change is Reviewable

@michae2 michae2 requested review from mgartner and nvanbenschoten and removed request for rharding6373 February 8, 2024 01:10
@michae2 michae2 (Collaborator, Author) commented Feb 8, 2024

Are all of the changes protected via a flag or option, new syntax, an environment variable or default-disabled cluster setting?

No:

The first commit is a small refactor of planTop, planComponents, and execbuilder that is not behind any protection. This is the riskiest part of the change. Unfortunately this part is necessary to make the fix possible.

The second commit is behind the sql.distsql.parallelize_checks.enabled cluster setting, but this setting is default-enabled in 23.2, so not much protection.

The third commit is a rename of some constants.

What are the risks to backporting this change? Can the risks of backporting be mitigated?

There is a small risk that the first commit (the refactor) breaks something that depends on the flags in planTop. These flags are used by distsql physical planning, metrics collection, and various checks.

What are the risks to not backporting?

Due to "LeafTxn ... incompatible with locking request" errors, it would be impossible to execute some UPDATE statements on some tables under read committed isolation with the sql.distsql.parallelize_checks.enabled cluster setting enabled (default in 23.2).

Does this change really need to be backported? Or can it reasonably wait until the next major release to be addressed?

If we don't backport, there are two workarounds to allow affected UPDATE statements to run (both shown as SQL after this list):

  • disabling cluster setting sql.distsql.parallelize_checks.enabled, but some customers rely on this parallelization for adequate query performance
  • running the affected UPDATE statements under serializable isolation, but this might not be possible for some workloads
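
For reference, both workarounds amount to ordinary SQL statements; the table and column names in the second one are hypothetical:

-- Workaround 1: turn off parallel FK/unique check execution cluster-wide.
SET CLUSTER SETTING sql.distsql.parallelize_checks.enabled = false;

-- Workaround 2: run the affected statement under serializable isolation
-- (hypothetical table/column names).
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
UPDATE parent SET v = v + 1 WHERE id = 1;
COMMIT;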

I believe it is worth the small risk of backporting the refactor in the first commit to make sure customers have a good experience with read committed isolation in 23.2.

Is this the smallest possible change that can be made?

The third commit (rename of some constants) could be skipped. I've included it to reduce future backport skew, but it doesn't have to be included. Curious to know what people think.

@mgartner mgartner (Collaborator) left a comment

:lgtm: I like the third commit and I think it's fine to backport. I left a few optional comments that you can consider in the future - no need to change anything in this PR.

Reviewed 15 of 15 files at r1, 2 of 2 files at r2, 7 of 7 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @michae2, @nvanbenschoten, and @yuzefovich)


pkg/sql/distsql_running.go line 2103 at r2 (raw file):

			return false
		}
		cp := cascadePlan.(*planComponents)

🤔 this comment could probably use an update:

// Plan represents the plan for a query (currently maps to sql.planTop).


pkg/sql/opt/exec/factory.go line 41 at r1 (raw file):

// PlanFlags tracks various properties of the built plan.
type PlanFlags uint32

nit: Maybe a better name is PlanProperties or PlanProps? It won't match the sql.planFlags naming, though.


pkg/sql/opt/exec/factory.go line 80 at r1 (raw file):

func (pf PlanFlags) IsSet(flag PlanFlags) bool {
	return (pf & flag) != 0
}

super nit: I don't think we're terribly consistent with this, but I think it's safer to do return pf&flag == flag. That way, if flag has multiple bits set, IsSet will only return true if pf has all of those bits set rather than returning true if it has one bit set.
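
To make the difference concrete, here is a small standalone sketch; the flag constants are made up for illustration and are not the actual CockroachDB flag names:

package main

import "fmt"

// PlanFlags mirrors the real type in pkg/sql/opt/exec/factory.go; the flag
// values below are illustrative only.
type PlanFlags uint32

const (
	flagA PlanFlags = 1 << iota // 0b01
	flagB                       // 0b10
)

// isSetAny reports whether any bit of flag is set (the original behavior).
func (pf PlanFlags) isSetAny(flag PlanFlags) bool { return pf&flag != 0 }

// isSetAll reports whether every bit of flag is set (the suggested behavior).
func (pf PlanFlags) isSetAll(flag PlanFlags) bool { return pf&flag == flag }

func main() {
	pf := flagA // only one of the two bits is set
	mask := flagA | flagB

	fmt.Println(pf.isSetAny(mask)) // true: at least one bit of the mask is set
	fmt.Println(pf.isSetAll(mask)) // false: the mask requires both bits
}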

@michae2 michae2 (Collaborator, Author) commented Feb 10, 2024

The failures look like known flakes #116702 and #112550. I'll try rebasing.

@michae2 michae2 (Collaborator, Author) left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @mgartner, @nvanbenschoten, and @yuzefovich)


pkg/sql/opt/exec/factory.go line 80 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

super nit: I don't think we're terribly consistent with this, but I think it's safer to do return pf&flag == flag. That way, if flag has multiple bits set, IsSet will only return true if pf has all of those bits set rather than returning true if it has one bit set.

Good call! This inspired me to perform an audit of all such helper functions: #119035

@michae2 michae2 force-pushed the backport23.2-118722 branch from 741c335 to d788ffa on February 12, 2024 23:49
@michae2 michae2 merged commit ef05ff5 into cockroachdb:release-23.2 Feb 13, 2024
5 of 6 checks passed
@michae2 michae2 deleted the backport23.2-118722 branch February 13, 2024 18:14
craig bot pushed a commit that referenced this pull request Oct 23, 2024
119035: sql, sem, opt, explain, memo, kv: audit bit-flag-checking helpers r=DrewKimball,mgartner a=michae2

@mgartner's comment on #118931 inspired me to audit all the other helper functions that check whether a flag is set in a bitfield. I might have found a couple bugs.

See individual commits for details.

Epic: None
Release note: None

133270: raft: deflake non-deterministic raft node tests r=iskettaneh a=iskettaneh

We sporadically see that some raft node_test tests fail due to the leader not being stable. This commit should reduce the chances of that happening by increasing the election timeout to 250ms (instead of 50ms).

I couldn't reproduce the bug locally with this change.

If the bug still happens, we can try to force leadership to make it more deterministic.

Fixes: #132992, #131676, #132205, #133048

Release note: None

Co-authored-by: Michael Erickson <michae2@cockroachlabs.com>
Co-authored-by: Ibrahim Kettaneh <ibrahim.kettaneh@cockroachlabs.com>