sql,migration: ensure cluster version never regresses#78705
craig[bot] merged 1 commit into cockroachdb:master
Conversation
In somewhat better news, it turns out that this is less severe than I first thought. On dedicated clusters, the value in the system.settings table isn't really ever used; we instead use the value written to disk on each store, as far as I can tell. For multi-tenant clusters, we don't allow more than one SQL pod to be active during upgrades.
TFTR! bors r+
Build succeeded.
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.
error creating merge commit from 7d3415b to blathers/backport-release-21.2-78705: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 21.2.x failed. See errors above.
error creating merge commit from 7d3415b to blathers/backport-release-22.1-78705: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 22.1.x failed. See errors above.
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
ack, will backport |
In #68074 (which is in 21.2), we added logic to bump the version stored in the
system.settings table to intermediate versions as we run migrations. This was
critical to provide any sort of invariant when upgrading secondary tenants. The
logic to do this bumping works through a callback plumbed into the
migrationmanager from the sql package. Unfortunately, this callback did not
ensure that the version being written was greater than the existing version;
it just checked that it was different. This was previously made safe by some
transactional properties of the version upgrade.
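The shape of the fix can be sketched as follows. This is a simplified illustration, not CockroachDB's actual API: `Version`, `writeVersionSetting`, and the field layout are hypothetical stand-ins for the real cluster-version machinery, showing the difference between the buggy "is it different?" check and the fixed "is it greater?" check.

```go
package main

import "fmt"

// Version is a simplified stand-in for a cluster version
// (major, minor, internal); the real type has more fields.
type Version struct {
	Major, Minor, Internal int32
}

// Less reports whether v orders strictly before o.
func (v Version) Less(o Version) bool {
	if v.Major != o.Major {
		return v.Major < o.Major
	}
	if v.Minor != o.Minor {
		return v.Minor < o.Minor
	}
	return v.Internal < o.Internal
}

// writeVersionSetting models the settings-table callback. The buggy
// variant only skipped equal versions, so a stale proposal could
// overwrite a newer stored value; the fixed variant also rejects any
// proposed version below the current one, keeping the stored value
// monotonically non-decreasing.
func writeVersionSetting(current, proposed Version, requireMonotonic bool) (written bool) {
	if proposed == current {
		return false // no-op either way: nothing changed
	}
	if requireMonotonic && proposed.Less(current) {
		return false // fixed behavior: never write a regression
	}
	return true
}

func main() {
	cur := Version{Major: 22, Minor: 1, Internal: 4}
	stale := Version{Major: 22, Minor: 1, Internal: 2}
	// Buggy check ("is it different?") would write the regression:
	fmt.Println(writeVersionSetting(cur, stale, false)) // true
	// Fixed check ("does it go up?") refuses it:
	fmt.Println(writeVersionSetting(cur, stale, true)) // false
}
```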
Fixing the check to ensure that the version does indeed go up solves the flake
decisively. The question that remains is: why did the flake start on January 8th?
It seems that it flaked earlier, on December 4th, with #73468, which we never
solved. I hypothesize that the flake becomes more likely the more versions we
put into play; right after we cut the release branch for 22.1, it was less
common. I think that explains why it got worse over time.
The release note is also not great because I don't quite know the
repercussions.
Fixes #74599.
Release note (bug fix): Fixed a bug whereby the cluster version could regress
due to a race condition.