Skip to content

release-22.1.0: sql,migration: ensure cluster version never regresses#80712

Merged
ajwerner merged 1 commit intocockroachdb:release-22.1.0from
ajwerner:backport22.1.0-80711
Apr 28, 2022
Merged

release-22.1.0: sql,migration: ensure cluster version never regresses#80712
ajwerner merged 1 commit intocockroachdb:release-22.1.0from
ajwerner:backport22.1.0-80711

Conversation

@ajwerner
Copy link
Contributor

@ajwerner ajwerner commented Apr 28, 2022

Backport 1/1 commits from #78705.

/cc @cockroachdb/release


In #68074 (which is in 21.2), we added logic to bump the version stored in the
system.settings table to intermediate versions as we run migrations. This was
critical to provide any sort of invariant when upgrading secondary tenants. The
logic to do this bumping works through a callback plumbed into the
migrationmanager from the sql pacakge. Unfortunately, this callback did not
ensure that the version being written was greater than the exisiting version;
it just checked that it was different. This was previously made safe by some
transactional properties of the version upgrade.

Fixing the check to ensure that the version does indeed go up solves the flake
decisively. The question which remains is: why did the flake start January 8th?
It seems that it flaked earlier, on December 4th, with #73468 which we never
solved. I hypothesize that it becomes more likely the more versions we put into
play. Right after we cut the release branch for 22.1, the flake was less common.
I think that explains why it got worse over time.

The release note is also not great because I don't quite know the
repercussions.

Fixes #74599.

Release note (bug fix): Fixed a bug whereby the cluster version could regress
due to a race condition.

Release justification: Small fix which corrects a potentially serious bug.

In cockroachdb#68074 (which is in 21.2), we added logic to bump the version stored in the
system.settings table to intermediate versions as we run migrations. This was
critical to provide any sort of invariant when upgrading secondary tenants. The
logic to do this bumping works through a callback plumbed into the
migrationmanager from the sql pacakge. Unfortunately, this callback did not
ensure that the version being written was greater than the exisiting version;
it just checked that it was different. This was previously made safe by some
transactional properties of the version upgrade.

Fixing the check to ensure that the version does indeed go up solves the flake
decisively. The question which remains is: why did the flake start January 8th?
It seems that it flaked earlier, on December 4th, with cockroachdb#73468 which we never
solved. I hypothesize that it becomes more likely the more versions we put into
play. Right after we cut the release branch for 22.1, the flake was less common.
I think that explains why it got worse over time.

The release note is also not great because I don't quite know the
repercussions.

Fixes cockroachdb#74599.

Release note (bug fix): Fixed a bug whereby the cluster version could regress
due to a race condition.
@blathers-crl
Copy link

blathers-crl bot commented Apr 28, 2022

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@ajwerner
Copy link
Contributor Author

TFTR!

@ajwerner ajwerner merged commit 6556002 into cockroachdb:release-22.1.0 Apr 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants