Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

upgrades: batch the job_info backfill upgrade #104545

Merged
merged 1 commit into from
Jun 8, 2023

Conversation

dt
Copy link
Member

@dt dt commented Jun 7, 2023

Release note (bug fix): The backfill of system.job_info upgrade migration that runs during upgrades from 22.2 now processes rows in batches to avoid cases where it could become stuck due to contention and transaction retries.
Epic: none.

@dt dt requested a review from adityamaru June 7, 2023 19:21
@dt dt requested review from a team as code owners June 7, 2023 19:21
@blathers-crl
Copy link

blathers-crl bot commented Jun 7, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@adityamaru adityamaru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes LGTM, thanks for doing this. If you decide to make the batch size a cluster setting, we have a test TestBackfillJobsInfoTable that should maybe metamorphic set that cluster setting to [1, num(jobs)] just to ensure the resumeAfter logic works fine?

@adityamaru adityamaru self-requested a review June 7, 2023 19:39
@dt dt force-pushed the job_info_batched branch from d93ca3e to 1bea18d Compare June 7, 2023 19:53
@dt
Copy link
Member Author

dt commented Jun 7, 2023

yeah, I ran the test by hand with the const version set to 1 locally before switching it to 100, so I did make sure the resuming worked

Release note (bug fix): The backfill of system.job_info upgrade migration that runs during upgrades from 22.2 now processes rows in batches to avoid cases where it could become stuck due to contention and transaction retries.
Epic: none.
@dt dt force-pushed the job_info_batched branch from 1bea18d to 3850b83 Compare June 7, 2023 20:30
@dt dt added the backport-23.1.x Flags PRs that need to be backported to 23.1 label Jun 7, 2023
@dt
Copy link
Member Author

dt commented Jun 7, 2023

TFTR!

bors r+

@craig
Copy link
Contributor

craig bot commented Jun 7, 2023

This PR was included in a batch that successfully built, but then failed to merge into master. It will not be retried.

Additional information:

{"message":"Changes must be made through a pull request.","documentation_url":"https://docs.github.com/articles/about-protected-branches"}

@healthy-pod
Copy link
Contributor

bors r+ single on

@craig
Copy link
Contributor

craig bot commented Jun 8, 2023

Build succeeded:

@craig craig bot merged commit 31898e7 into cockroachdb:master Jun 8, 2023
@blathers-crl
Copy link

blathers-crl bot commented Jun 8, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error setting reviewers, but backport branch blathers/backport-release-23.1-104545 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/104574/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. []

Backport to branch 23.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@dt dt deleted the job_info_batched branch June 12, 2023 16:09
craig bot pushed a commit that referenced this pull request Jun 13, 2023
104752: upgrades: fix txn retry bug in upgrade batching r=stevendanna a=adityamaru

In #104545 we broke up the txn that is responsible for backfill the `system.job_info` table as part of an upgrade. That diff had a bug where a txn retry inside the `db.Txn` closure could result in us skipping rows to backfill. The consequence of this is that some jobs will not have their payload and progress copied over from the `system.jobs` table to the `system.job_info` table. This is bad because once the cluster is fully upgraded, the job system will **only** consult the `system.job_info` table during execution. When it does so, the job is destined to fail as there will be no payload or progress entry corresponding to that job.

Fixes: #104653
Release note (bug fix): fixes a bug where a txn retry during the backfill of the jobs info table could result in job rows being missed

Co-authored-by: adityamaru <adityamaru@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.1.x Flags PRs that need to be backported to 23.1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants