Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: persist correct cluster version on join #27639

Merged
merged 2 commits into from
Jul 18, 2018

Conversation

tbg
Copy link
Member

@tbg tbg commented Jul 17, 2018

Prior to this commit, when a node joined an existing cluster, it would
likely persist its MinSupportedVersion as the cluster version (even
though it would operate at the correct version). This is problematic
since this node's data directories wouldn't be regarded compatible with
the successor version of the real version, even though they are.

The problem was that the callback which receives the real cluster
version would fire before the stores were bootstrapped, and bootstrap
would erroneously pick the minimum supported version. We now "backfill"
the correct cluster version after having bootstrapped all stores.

Even though the fixtures will need to be regenerated:
Fixes #27525.

Optimistically:
Fixes #27162.

Release note: None

@tbg tbg requested a review from a team July 17, 2018 14:22
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@tbg
Copy link
Member Author

tbg commented Jul 18, 2018

Interestingly we'll have to do something about the mixed version tests still, since that test involves nodes that don't have the code. My preference would be to add an extra stop-restart cycle in the test which will update the cluster version in-place even on the old binaries.

@tbg
Copy link
Member Author

tbg commented Jul 18, 2018

Still playing with the roachtest -- got it failing immediately with a

> /Users/tschottdorf/go/bin/roachprod run local:1 -- ./cockroach quit --insecure --port {pgport:1}
Error: Failed to connect to the node: initial connection heartbeat failed: rpc error: code = Unknown desc = version compatibility check failed on ping request: cluster requires at least version 2.0-9, but peer has version 2.0

which is odd because the cluster version is supposedly 2.0. Looking.

Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/stores.go, line 447 at r1 (raw file):

	}
	// If the update downgrades the minimum version, ignore it. Must be
	// a reordering. Note that we do carry out the upgrade if the MinVersion

Can you elaborate on the reordering that is possible here?

Copy link
Member Author

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RFAL, the new commits take care of the roachtest failures.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/stores.go, line 447 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Can you elaborate on the reordering that is possible here?

Done. (Along with a small refactor).

tbg added 2 commits July 18, 2018 17:16
Prior to this commit, when a node joined an existing cluster, it would
likely persist its MinSupportedVersion as the cluster version (even
though it would operate at the correct version). This is problematic
since this node's data directories wouldn't be regarded compatible with
the successor version of the real version, even though they are.

The problem was that the callback which receives the real cluster
version would fire before the stores were bootstrapped, and bootstrap
would erroneously pick the minimum supported version. We now "backfill"
the correct cluster version after having bootstrapped all stores.

Even though the fixtures will need to be regenerated:
Fixes cockroachdb#27525.

Optimistically:
Fixes cockroachdb#27162.

Release note: None
Change the test slightly so that it works around cockroachdb#27162 until we
backport cockroachdb#27639 to release-2.0.

Release note: None
@tbg tbg force-pushed the fix/version-unpersisted branch from 272be82 to fa7677f Compare July 18, 2018 15:16
Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/cmd/roachtest/upgrade.go, line 55 at r3 (raw file):

		// TODO(tschottdorf): this is a hack similar to the one in the mixed version
		// test. Remove it when we have a 2.0.x binary that has #27639 fixed.

Might want to file an issue so you don't forget about this. Or just don't forget about this.

@tbg
Copy link
Member Author

tbg commented Jul 18, 2018

Reminder filed: #27717

bors r=petermattis
TFTR!

craig bot pushed a commit that referenced this pull request Jul 18, 2018
27639: server: persist correct cluster version on join r=petermattis a=tschottdorf

Prior to this commit, when a node joined an existing cluster, it would
likely persist its MinSupportedVersion as the cluster version (even
though it would operate at the correct version). This is problematic
since this node's data directories wouldn't be regarded compatible with
the successor version of the real version, even though they are.

The problem was that the callback which receives the real cluster
version would fire before the stores were bootstrapped, and bootstrap
would erroneously pick the minimum supported version. We now "backfill"
the correct cluster version after having bootstrapped all stores.

Even though the fixtures will need to be regenerated:
Fixes #27525.

Optimistically:
Fixes #27162.

Release note: None

27712: rpc: avoid clock offset fatal on unvalidated connection r=a-robinson a=tschottdorf

I think we were seeing rare flakes in CI in which a heartbeat service
got connections from another test that used a different clock offset
due to port reuse.
Since the assertion didn't wait for the connection to be validated (i.e.
cluster versions matching), this could crash. This is only a theory,
but we're never able to reproduce these failures.

There are lots of places that use a 1ns maxoffset, including
bootstrapCluster. If we continue to see these failures, we should make
all of those unique numbers to lend credibility to the port reuse
theory.

Speculatively:
Fixes #27555.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
@craig
Copy link
Contributor

craig bot commented Jul 18, 2018

Build succeeded

@craig craig bot merged commit fa7677f into cockroachdb:master Jul 18, 2018
craig bot pushed a commit that referenced this pull request Jul 19, 2018
27741: backport-2.0: server: persist correct cluster version on join r=petermattis a=tschottdorf

Backport 1/2 commits from #27639.

I had to adapt the test, and I have verified that it fails without the main change.

/cc @cockroachdb/release

---

Prior to this commit, when a node joined an existing cluster, it would
likely persist its MinSupportedVersion as the cluster version (even
though it would operate at the correct version). This is problematic
since this node's data directories wouldn't be regarded compatible with
the successor version of the real version, even though they are.

The problem was that the callback which receives the real cluster
version would fire before the stores were bootstrapped, and bootstrap
would erroneously pick the minimum supported version. We now "backfill"
the correct cluster version after having bootstrapped all stores.

Even though the fixtures will need to be regenerated:
Fixes #27525.

Optimistically:
Fixes #27162.

Release note: None


Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
@tbg tbg deleted the fix/version-unpersisted branch July 26, 2018 09:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants