Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

upgrades: fix txn retry bug in upgrade batching #106378

Merged

Conversation

stevendanna
Copy link
Collaborator

@stevendanna stevendanna commented Jul 7, 2023

In #105750 we split the backfill of the job_type column across multiple transactions. This introduced a bug in which we would modify the resumeAfter variable that controlled the batching before the transaction succeeded. In the face of a transaction retry, this would result in some rows not having their job_type column populated.

This was caught in nightly tests that attempted to use the crdb_internal.system_jobs virtual index on the job_type column.

Here, we apply the same fix that we applied in #104752 for the same type of bug.

Fixes #106347
Fixes #106246

Release note (bug fix): Fixes a bug where a transaction retry during the backfill of the job_type column in the jobs table could result some job records with no job_type value.

@stevendanna stevendanna requested a review from a team as a code owner July 7, 2023 11:00
@blathers-crl
Copy link

blathers-crl bot commented Jul 7, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@stevendanna stevendanna added the backport-23.1.x Flags PRs that need to be backported to 23.1 label Jul 7, 2023
@stevendanna
Copy link
Collaborator Author

As far as I can see, this bug has not yet been released, so my comment in the commit message about doing follow-up work to repair existing rows should actually not be necessary, I will remove it from the commit message.

I am wondering if in general we should default to not doing types of backfills manually. For instance, could we have instead used the schema changer to do this work for us by adding the new column with a DEFAULT expression?

Copy link
Contributor

@adityamaru adityamaru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😢

I agree with your comment re: establishing some best practices when writing a backfill in an upgrade. We've now seen two identical issues:

  • Backfill doesn't batch and so runs into contention.
  • We add batching to the backfill but tricky txn retry semantics make them susceptible to bugs that are not caught in regular CI.

We should consider writing this down somewhere where folks would look when writing a new ugprade. Maybe a best practices/README file we check in and point authors to via comments?

@stevendanna
Copy link
Collaborator Author

Test failure looks unrelated. Filed here: #106383

In cockroachdb#105750 we split the backfill of the job_type column across
multiple transactions. This introduced a bug in which we would modify
the resumeAfter variable that controlled the batching before the
transaction succeeded. In the face of a transaction retry, this would
result in some rows not having their job_type column populated.

This was caught in nightly tests that attempted to use the
crdb_internal.system_jobs virtual index on the job_type column.

Here, we apply the same fix that we applied in cockroachdb#104752 for the same
type of bug.

Fixes cockroachdb#106347
Fixes cockroachdb#106246

Release note (bug fix): Fixes a bug where a transaction retry during
the backfill of the job_type column in the jobs table could result
some job records with no job_type value.
@stevendanna stevendanna force-pushed the ssd/job-type-backfill-retry-bug branch from c0f1853 to a7356bc Compare July 7, 2023 12:29
@knz
Copy link
Contributor

knz commented Jul 7, 2023

would it make sense to add a metamorphic test knob that randomly inserts retry errors through kv and/or sql?

@stevendanna
Copy link
Collaborator Author

@knz I believe so.

I'm working on a hacky version at the moment just to see how many things fail.

I was also thinking of a linter that might help. For example, I think anytime you read and write to a variable captured in a Txn func that is probably code worth a second look. But, I such a linter may be a bit too noisy.

@stevendanna
Copy link
Collaborator Author

Going to merge this to avoid more nightly failures.

bors r=adityamaru

@craig
Copy link
Contributor

craig bot commented Jul 7, 2023

Build succeeded:

@craig craig bot merged commit d802aaf into cockroachdb:master Jul 7, 2023
@blathers-crl
Copy link

blathers-crl bot commented Jul 7, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error setting reviewers, but backport branch blathers/backport-release-23.1-106378 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/106412/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. []

Backport to branch 23.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.1.x Flags PRs that need to be backported to 23.1 db-cy-23
Projects
None yet
Development

Successfully merging this pull request may close these issues.

jobs: TestJobInfoUpgradeRegressionTests failed jobs: TestJobInfoUpgradeRegressionTests failed
5 participants