sql,migration: ensure cluster version never regresses #78705

Merged
craig[bot] merged 1 commit into cockroachdb:master from ajwerner:ajwerner/fix-version-upgrade
Mar 31, 2022

Conversation

@ajwerner
Contributor

In #68074 (which is in 21.2), we added logic to bump the version stored in the
system.settings table to intermediate versions as we run migrations. This was
critical to provide any sort of invariant when upgrading secondary tenants. The
logic to do this bumping works through a callback plumbed into the
migrationmanager from the sql package. Unfortunately, this callback did not
ensure that the version being written was greater than the existing version;
it just checked that it was different. This was previously made safe by some
transactional properties of the version upgrade.

Fixing the check to ensure that the version does indeed go up solves the flake
decisively. The question that remains is: why did the flake start January 8th?
It seems that it flaked earlier, on December 4th, with #73468, which we never
solved. I hypothesize that the flake becomes more likely the more versions we put into
play. Right after we cut the release branch for 22.1, the flake was less common.
I think that explains why it got worse over time.

The release note is also not great because I don't quite know the
repercussions.

Fixes #74599.

Release note (bug fix): Fixed a bug whereby the cluster version could regress
due to a race condition.
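The corrected check described above can be sketched as follows. This is a minimal illustration in Go of "write only if strictly greater, not merely different"; the `Version` type, its fields, and `maybeBumpVersion` are hypothetical stand-ins, not CockroachDB's actual clusterversion API.

```go
package main

import "fmt"

// Version is a simplified stand-in for a cluster version. The real type and
// its comparison logic live in CockroachDB's clusterversion package; the field
// layout here is illustrative only.
type Version struct {
	Major, Minor, Internal int32
}

// Less reports whether v orders strictly before o.
func (v Version) Less(o Version) bool {
	if v.Major != o.Major {
		return v.Major < o.Major
	}
	if v.Minor != o.Minor {
		return v.Minor < o.Minor
	}
	return v.Internal < o.Internal
}

// maybeBumpVersion models the fixed callback: accept the proposed version only
// if it is strictly greater than the existing one. The buggy behavior was to
// accept any version that was merely different, which let a racing stale
// writer regress the stored value.
func maybeBumpVersion(existing, proposed Version) (Version, bool) {
	if existing.Less(proposed) {
		return proposed, true // monotonic bump: accept
	}
	return existing, false // equal or regressing: reject, keep existing
}

func main() {
	cur := Version{Major: 21, Minor: 2, Internal: 40}

	// A racing writer proposes a stale (lower) intermediate version: rejected.
	_, ok := maybeBumpVersion(cur, Version{Major: 21, Minor: 2, Internal: 38})
	fmt.Println("stale write accepted:", ok)

	// A genuinely newer intermediate version: accepted.
	next, ok := maybeBumpVersion(cur, Version{Major: 21, Minor: 2, Internal: 42})
	fmt.Println("newer write accepted:", ok, next)
}
```

With a difference-only check, both writes above would have been applied, so ordering of the two racing writers would determine the final stored version; the strict greater-than check makes the stored version monotone regardless of write order.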

@ajwerner ajwerner requested a review from a team as a code owner March 29, 2022 05:14
@cockroach-teamcity
Member

This change is Reviewable

@ajwerner
Contributor Author

In somewhat better news, it turns out that this is less severe than I first thought. On dedicated clusters, the value in the system.settings table isn't really ever used. We instead use the value written to disk on each store, at least, so far as I can tell. For multi-tenant clusters, we don't allow more than one sql pod to be active during upgrades.

@ajwerner ajwerner force-pushed the ajwerner/fix-version-upgrade branch from 7a76ef4 to 6432398 on March 29, 2022 17:42
@ajwerner ajwerner force-pushed the ajwerner/fix-version-upgrade branch from 6432398 to 7d3415b on March 31, 2022 00:33
@ajwerner
Contributor Author

TFTR!

bors r+

@craig
Contributor

craig bot commented Mar 31, 2022

Build succeeded:

@craig craig bot merged commit 4426742 into cockroachdb:master Mar 31, 2022
@blathers-crl

blathers-crl bot commented Mar 31, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 7d3415b to blathers/backport-release-21.2-78705: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 21.2.x failed. See errors above.


error creating merge commit from 7d3415b to blathers/backport-release-22.1-78705: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@adityamaru
Contributor

adityamaru commented Apr 28, 2022

@ajwerner doesn't look like the backports made it to the branches? I seem to have hit the related test failure in a recent backport to 22.1.0 - #80224

@ajwerner
Contributor Author

ack, will backport

Development

Successfully merging this pull request may close these issues.

server: TestClusterVersionUpgrade failed
