upgrades: fix txn retry bug in upgrade batching #104752

adityamaru · 2023-06-12T22:07:08Z

In #104545 we broke up the txn that is responsible for backfill the system.job_info table as part of an upgrade. That diff had a bug where a txn retry inside the db.Txn closure could result in us skipping rows to backfill. The consequence of this is that some jobs will not have their payload and progress copied over from the system.jobs table to the system.job_info table. This is bad because once the cluster is fully upgraded, the job system will only consult the system.job_info table during execution. When it does so, the job is destined to fail as there will be no payload or progress entry corresponding to that job.

Fixes: #104653
Release note (bug fix): fixes a bug where a txn retry during the backfill of the jobs info table could result in job rows being missed

Fixes: cockroachdb#104653 Release note (bug fix): fixes a bug where a txn retry during the backfill of the jobs info table could result in job rows being missed

blathers-crl · 2023-06-12T22:07:12Z

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.}

cockroach-teamcity · 2023-06-12T22:07:16Z

This change is

dt · 2023-06-12T22:08:30Z

blathers backport 23.1

blathers-crl · 2023-06-12T22:08:34Z

Encountered an error creating backports. Some common things that can go wrong:

The backport branch might have already existed.
There was a merge conflict.
The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.

error creating merge commit from 947e34f to blathers/backport-release-23.1-104752: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1 failed. See errors above.

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.}

adityamaru · 2023-06-13T00:52:38Z

bors r=dt,stevendanna

adityamaru · 2023-06-13T00:54:46Z

bors r-

craig · 2023-06-13T00:54:48Z

Canceled.

adityamaru · 2023-06-13T00:55:33Z

bors r+ p=999 single-on

craig · 2023-06-13T02:19:00Z

Timed out.

healthy-pod · 2023-06-13T04:46:18Z

bors r+ single on

craig · 2023-06-13T05:16:51Z

This PR was included in a batch that successfully built, but then failed to merge into master. It will not be retried.

Additional information:

{"message":"You're not authorized to push to this branch. Visit https://docs.github.com/articles/about-protected-branches/ for more information.","documentation_url":"https://docs.github.com/articles/about-protected-branches"}

healthy-pod · 2023-06-13T06:27:44Z

bors r+ single on

craig · 2023-06-13T06:47:57Z

Build failed:

Bazel Essential CI (Cockroach)

stevendanna · 2023-06-13T07:13:18Z

bors r+ single on

craig · 2023-06-13T07:42:07Z

Build succeeded:

Bazel Essential CI (Cockroach)

In cockroachdb#105750 we split the backfill of the job_type column across multiple transactions. This introduced a bug in which we would modify the resumeAfter variable that controlled the batching before the transaction succeeded. In the face of a transaction retry, this would result in some rows not having their job_type column populated. This was caught in nightly tests that attempted to use the crdb_internal.system_jobs virtual index on the job_type column. Here, we apply the same fix that we applied in cockroachdb#104752 for the same type of bug. Fixes cockroachdb#106347 Fixes cockroachdb#106246 Release note (bug fix): Fixes a bug where a transaction retry during the backfill of the job_type column in the jobs table could result some job records with no job_type value.

106236: admission: avoid recursive grant chain r=irfansharif a=irfansharif Fixes #105185. Fixes #105613. In #97599 we introduced a non-blocking admission interface for below-raft, replication admission control. When doing so, we unintentionally violated the 'requester' interface -- when 'requester.granted()' is invoked, the granter expects to admit a single queued request. The code layering made it so that after granting one, when doing post-hoc token adjustments, if we observed the granter was exhausted but now no longer was so, we'd try to grant again. This resulted in admitting work recursively, with a callstack as deep as the admit chain. Not only is that undesirable design-wise, it also triggered panics in the granter that wasn't expecting more than one request being admitted. Recursively we were incrementing the grant chain index, which overflowed (it was in int8, so happened readily with long enough admit chains), after which we panic-ed when using a negative index to access an array. We add a test that fails without the changes. The failure can also be triggered by applying the diff below (which reverts to the older, buggy behavior): ``` dev test pkg/kv/kvserver -f TestFlowControlGranterAdmitOneByOne -v --show-logs ``` ```diff diff --git i/pkg/util/admission/granter.go w/pkg/util/admission/granter.go index ba42213c375..7c526fbb3d8 100644 --- i/pkg/util/admission/granter.go +++ w/pkg/util/admission/granter.go `@@` -374,7 +374,7 `@@` func (cg *kvStoreTokenChildGranter) storeWriteDone( func (cg *kvStoreTokenChildGranter) storeReplicatedWorkAdmittedLocked( originalTokens int64, admittedInfo storeReplicatedWorkAdmittedInfo, ) (additionalTokens int64) { - return cg.parent.storeReplicatedWorkAdmittedLocked(cg.workClass, originalTokens, admittedInfo, false /* canGrantAnother */) + return cg.parent.storeReplicatedWorkAdmittedLocked(cg.workClass, originalTokens, admittedInfo, true /* canGrantAnother */) } ``` Release note: None 106378: upgrades: fix txn retry bug in upgrade batching r=adityamaru a=stevendanna In #105750 we split the backfill of the job_type column across multiple transactions. This introduced a bug in which we would modify the resumeAfter variable that controlled the batching before the transaction succeeded. In the face of a transaction retry, this would result in some rows not having their job_type column populated. This was caught in nightly tests that attempted to use the crdb_internal.system_jobs virtual index on the job_type column. Here, we apply the same fix that we applied in #104752 for the same type of bug. Fixes #106347 Fixes #106246 Release note (bug fix): Fixes a bug where a transaction retry during the backfill of the job_type column in the jobs table could result some job records with no job_type value. 106408: ci: remove `lint` job from GitHub CI r=rail a=rickystewart With `staticcheck` and `unused` working identically under `lint` in Bazel and `make` now, it's time! Delete this file so that GitHub CI lint stops running. This is the *last* GitHub CI job. :) Now only Bazel builds and tests will run on PR's. Epic: CRDB-15060 Release note: None Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com> Co-authored-by: Steven Danna <danna@cockroachlabs.com> Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>

In #105750 we split the backfill of the job_type column across multiple transactions. This introduced a bug in which we would modify the resumeAfter variable that controlled the batching before the transaction succeeded. In the face of a transaction retry, this would result in some rows not having their job_type column populated. This was caught in nightly tests that attempted to use the crdb_internal.system_jobs virtual index on the job_type column. Here, we apply the same fix that we applied in #104752 for the same type of bug. Fixes #106347 Fixes #106246 Release note (bug fix): Fixes a bug where a transaction retry during the backfill of the job_type column in the jobs table could result some job records with no job_type value.

upgrades: fix txn retry bug in upgrade batching

947e34f

Fixes: cockroachdb#104653 Release note (bug fix): fixes a bug where a txn retry during the backfill of the jobs info table could result in job rows being missed

adityamaru requested a review from dt June 12, 2023 22:07

adityamaru requested a review from a team as a code owner June 12, 2023 22:07

dt approved these changes Jun 12, 2023

View reviewed changes

stevendanna approved these changes Jun 12, 2023

View reviewed changes

adityamaru mentioned this pull request Jun 13, 2023

release-23.1: upgrades: fix txn retry bug in upgrade batching #104757

Merged

craig bot merged commit 322faef into cockroachdb:master Jun 13, 2023

adityamaru mentioned this pull request Jun 13, 2023

jobs: TestStartableJobMixedVersion failed #104654

Closed

dt mentioned this pull request Jun 13, 2023

jobs: some rows may be dropped during 22.2 -> 23.1 upgrade if txn retried #104798

Closed

stevendanna mentioned this pull request Jul 7, 2023

upgrades: fix txn retry bug in upgrade batching #106378

Merged

adityamaru deleted the fix-batch branch July 7, 2023 11:57

blathers-crl bot mentioned this pull request Jul 7, 2023

release-23.1: upgrades: fix txn retry bug in upgrade batching #106412

Merged

stevendanna mentioned this pull request Jul 7, 2023

kv,testserver: more easily inject transaction retries #106417

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

upgrades: fix txn retry bug in upgrade batching #104752

upgrades: fix txn retry bug in upgrade batching #104752

adityamaru commented Jun 12, 2023 •

edited

Loading

blathers-crl bot commented Jun 12, 2023

cockroach-teamcity commented Jun 12, 2023

dt commented Jun 12, 2023

blathers-crl bot commented Jun 12, 2023

adityamaru commented Jun 13, 2023

adityamaru commented Jun 13, 2023

craig bot commented Jun 13, 2023

adityamaru commented Jun 13, 2023

craig bot commented Jun 13, 2023

healthy-pod commented Jun 13, 2023

craig bot commented Jun 13, 2023

healthy-pod commented Jun 13, 2023

craig bot commented Jun 13, 2023

stevendanna commented Jun 13, 2023

craig bot commented Jun 13, 2023

upgrades: fix txn retry bug in upgrade batching #104752

upgrades: fix txn retry bug in upgrade batching #104752

Conversation

adityamaru commented Jun 12, 2023 • edited Loading

blathers-crl bot commented Jun 12, 2023

cockroach-teamcity commented Jun 12, 2023

dt commented Jun 12, 2023

blathers-crl bot commented Jun 12, 2023

adityamaru commented Jun 13, 2023

adityamaru commented Jun 13, 2023

craig bot commented Jun 13, 2023

adityamaru commented Jun 13, 2023

craig bot commented Jun 13, 2023

healthy-pod commented Jun 13, 2023

craig bot commented Jun 13, 2023

healthy-pod commented Jun 13, 2023

craig bot commented Jun 13, 2023

stevendanna commented Jun 13, 2023

craig bot commented Jun 13, 2023

adityamaru commented Jun 12, 2023 •

edited

Loading