-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
upgrades: fix txn retry bug in upgrade batching #106378
upgrades: fix txn retry bug in upgrade batching #106378
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
As far as I can see, this bug has not yet been released, so my comment in the commit message about doing follow-up work to repair existing rows should actually not be necessary, I will remove it from the commit message. I am wondering if in general we should default to not doing types of backfills manually. For instance, could we have instead used the schema changer to do this work for us by adding the new column with a DEFAULT expression? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😢
I agree with your comment re: establishing some best practices when writing a backfill in an upgrade. We've now seen two identical issues:
- Backfill doesn't batch and so runs into contention.
- We add batching to the backfill but tricky txn retry semantics make them susceptible to bugs that are not caught in regular CI.
We should consider writing this down somewhere where folks would look when writing a new ugprade. Maybe a best practices/README file we check in and point authors to via comments?
Test failure looks unrelated. Filed here: #106383 |
In cockroachdb#105750 we split the backfill of the job_type column across multiple transactions. This introduced a bug in which we would modify the resumeAfter variable that controlled the batching before the transaction succeeded. In the face of a transaction retry, this would result in some rows not having their job_type column populated. This was caught in nightly tests that attempted to use the crdb_internal.system_jobs virtual index on the job_type column. Here, we apply the same fix that we applied in cockroachdb#104752 for the same type of bug. Fixes cockroachdb#106347 Fixes cockroachdb#106246 Release note (bug fix): Fixes a bug where a transaction retry during the backfill of the job_type column in the jobs table could result some job records with no job_type value.
c0f1853
to
a7356bc
Compare
would it make sense to add a metamorphic test knob that randomly inserts retry errors through kv and/or sql? |
@knz I believe so. I'm working on a hacky version at the moment just to see how many things fail. I was also thinking of a linter that might help. For example, I think anytime you read and write to a variable captured in a Txn func that is probably code worth a second look. But, I such a linter may be a bit too noisy. |
Going to merge this to avoid more nightly failures. bors r=adityamaru |
Build succeeded: |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error setting reviewers, but backport branch blathers/backport-release-23.1-106378 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/106412/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. [] Backport to branch 23.1.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
In #105750 we split the backfill of the job_type column across multiple transactions. This introduced a bug in which we would modify the resumeAfter variable that controlled the batching before the transaction succeeded. In the face of a transaction retry, this would result in some rows not having their job_type column populated.
This was caught in nightly tests that attempted to use the crdb_internal.system_jobs virtual index on the job_type column.
Here, we apply the same fix that we applied in #104752 for the same type of bug.
Fixes #106347
Fixes #106246
Release note (bug fix): Fixes a bug where a transaction retry during the backfill of the job_type column in the jobs table could result some job records with no job_type value.