
server: clean up migration story via unstable negative versions #33578

Closed
cockroach-teamcity opened this issue Jan 9, 2019 · 24 comments

Labels: C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception), no-issue-activity, O-robot (Originated from a bot), T-kv (KV Team)

Comments

@cockroach-teamcity (Member) commented Jan 9, 2019:

SHA: https://github.com/cockroachdb/cockroach/commits/12e28159b1d8b63b56d6a48f22ebbb5c75e8ee5c

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=upgrade/oldVersion=v2.0.5/nodes=5 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1085451&tab=buildLog

The test failed on provisional_201901081731_v2.2.0-alpha-20190114:
	test.go:696,upgrade.go:149,upgrade.go:265: dial tcp 35.193.121.106:26257: connect: connection refused

Jira issue: CRDB-4682

@cockroach-teamcity cockroach-teamcity added this to the 2.2 milestone Jan 9, 2019
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Jan 9, 2019
@tbg tbg assigned nvanbenschoten and unassigned tbg Jan 9, 2019
@tbg (Member) commented Jan 9, 2019:

Hey, glad to see this test catch stuff. @knz, blame tells me that you might know what to make of this, though looking a bit more I end up here

```go
func (n *Node) batchInternal(
	ctx context.Context, args *roachpb.BatchRequest,
) (*roachpb.BatchResponse, error) {
	if detail := checkNoUnknownRequest(args.Requests); detail != nil {
```
and am starting to think that the "unsupported request" part is really the lynchpin. Fantastic that the error isn't telling you what the unsupported request is, no?
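
For reference, a minimal toy sketch of why a check like this can't say more. The types and function below are made up for illustration and are not the actual CockroachDB implementation: a request whose concrete type the receiver doesn't know decodes to nothing, so the only signal left is that some request was unsupported.

```go
package main

import (
	"errors"
	"fmt"
)

// requestUnion stands in for roachpb.RequestUnion: a oneof wrapper whose
// inner request ends up nil when the receiving binary does not recognize
// the request type the sender serialized.
type requestUnion struct {
	inner interface{} // nil means "unknown request type"
}

// checkUnknown mirrors the idea of the check quoted above: reject the whole
// batch if any request failed to decode into a known type. By this point the
// concrete type is already gone, which is why the error cannot name the
// offending request.
func checkUnknown(reqs []requestUnion) error {
	for _, ru := range reqs {
		if ru.inner == nil {
			return errors.New("unsupported request")
		}
	}
	return nil
}

func main() {
	batch := []requestUnion{{inner: "Get"}, {inner: nil}} // second entry: a type this binary doesn't know
	fmt.Println(checkUnknown(batch))                      // unsupported request
}
```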

I think the node that crashed with the below error is running 2.2-alpha, so the "unknown request" originated at a node running 2.0.5. So, what requests did we remove between the two? This sounds more like a @nvanbenschoten question now. I didn't find anything recent in git log -p pkg/roachpb/api.proto, though. Any cache hit on your end?

* WARNING: neither --listen-addr nor --advertise-addr was specified.
* The server will advertise "teamcity-1085451-upgrade-oldversion-v2-0-5-nodes-5-0001" to other nodes, is this routable?
* 
* Consider using:
* - for local-only servers:  --listen-addr=localhost
* - for multi-node clusters: --advertise-addr=<host/IP addr>
* 
*
F190109 06:49:23.660601 14 server/server.go:1617  [n1] failed to run migration "create default databases": create-default-db: unsupported request
goroutine 14 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc420056300, 0xc420056300, 0x4760500, 0x10)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:1005 +0xcf
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x4ed7400, 0xc400000004, 0x476059e, 0x10, 0x651, 0xc4259a53e0, 0x5f)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:872 +0x939
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x35f1ce0, 0xc420d4a4b0, 0x4, 0x2, 0x0, 0x0, 0xc4252acbe8, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:85 +0x2e5
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x35f1ce0, 0xc420d4a4b0, 0x1, 0xc400000004, 0x0, 0x0, 0xc4252acbe8, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:71 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatal(0x35f1ce0, 0xc420d4a4b0, 0xc4252acbe8, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:191 +0x6c
github.com/cockroachdb/cockroach/pkg/server.(*Server).Start(0xc420250000, 0x35f1ce0, 0xc420d4a4b0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/server/server.go:1617 +0x307d
github.com/cockroachdb/cockroach/pkg/cli.runStart.func3.2(0xc420534090, 0xc4204ee108, 0xc4202fe550, 0x35f1ce0, 0xc420ac17a0, 0x106e53b1, 0xed3c7906b, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/cli/start.go:605 +0x109
github.com/cockroachdb/cockroach/pkg/cli.runStart.func3(0xc4204ee108, 0x35f1ce0, 0xc420ac17a0, 0x3624f20, 0xc4209b1ee0, 0xc420534090, 0xc4202fe550, 0x106e53b1, 0xed3c7906b, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/cli/start.go:710 +0x11c
created by github.com/cockroachdb/cockroach/pkg/cli.runStart
	/go/src/github.com/cockroachdb/cockroach/pkg/cli/start.go:561 +0x6f8

@nvanbenschoten (Member) commented:

> I think the node that crashed with the below error is running 2.2-alpha, so the "unknown request" originated at a node running 2.0.5.

How is a 2.0.5 node even talking to a node running the 2.2-alpha? Is this because we haven't bumped the minor version on the alphas yet? I don't know what is causing this, but I can see how things would slip in if people are developing against a model where they only need to support compatibility with 2.1 going forward for these alphas.

@tbg (Member) commented Jan 9, 2019:

Right. I think the right thing is to bump 2.0.5 to 2.1 (at which point the test should stop being run in 2.1, or we fork it to retain 2.0 there), but we should still figure out what request caused this to make sure that's all there is to it.

@knz (Contributor) commented Jan 9, 2019:

The SQL query you've blamed is a 2.1 SQL migration. This should work fine in mixed 2.0/2.1 configurations.

@nvanbenschoten (Member) commented:

> I think the right thing is to bump 2.0.5 to 2.1 (at which point the test should stop being run in 2.1, or we fork it to retain 2.0 there)

Making sure I got that - you're saying that we should bump the test's oldVersion to 2.1? That sounds good to me. The release-2.1 branch will hold the current state of this test going forward, so there's no need to explicitly fork.

I'll throw in some logging and pull out the offending request type.

@nvanbenschoten (Member) commented:

> I think the node that crashed with the below error is running 2.2-alpha, so the "unknown request" originated at a node running 2.0.5. So, what requests did we remove between the two?

The node that crashed was running the 2.2-alpha. It crashed after 5 of its migrations failed due to the "unknown request" error. I think this means that the 2.0.5 node was rejecting a request that it didn't know about, not the other way around. So the question is, what did we add? That's great because it will make it easier to debug.

@nvanbenschoten (Member) commented:

Looks like we were sending QueryIntent requests to the 2.0.5 node. This is a result of #33450.

We can change the test, but my question still remains about 2.2 alphas connecting to 2.0 nodes. Since we haven't bumped the minor cluster version, do we have any protection against this in production clusters?

@tbg (Member) commented Jan 9, 2019:

It depends. If that commit ever landed on release-2.1, it would cause this problem in production. If it doesn't, it'll only ever interact with 2.1.x clusters which can handle it.

This seems easy enough to get wrong accidentally, so perhaps we ought to adopt a policy of not removing version checks unless there's a non-patch cluster version between them.
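
For illustration, here is a hedged sketch of the kind of version gate that policy protects. The types and the gate values are made up (the real code uses cluster settings and roachpb request types): a newer request type such as QueryIntent is only sent once the active cluster version guarantees every node understands it, and such a gate is only safe to remove once a non-patch cluster version separates the two.

```go
package main

import "fmt"

// clusterVersion is a stand-in for the real cluster version plumbing.
type clusterVersion struct{ major, minor int32 }

// isActive reports whether the cluster version has reached the given gate,
// i.e. every node in the cluster is guaranteed to understand features behind it.
func (cv clusterVersion) isActive(major, minor int32) bool {
	return cv.major > major || (cv.major == major && cv.minor >= minor)
}

func main() {
	cv := clusterVersion{2, 0} // mixed cluster whose active version is still 2.0

	// Hypothetical gate for the newer QueryIntent request: while any node may
	// still be running 2.0, fall back to behavior the 2.0 binary understands.
	if cv.isActive(2, 1) {
		fmt.Println("attach QueryIntent request")
	} else {
		fmt.Println("fall back to the pre-QueryIntent protocol")
	}
}
```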

@tbg (Member) commented Jan 9, 2019:

Hmm, now that I read this again, I'm not 100% confident what I'm saying is even true. We have version checks in the rpc handlers that are supposed to avoid connections across non-adjacent versions, but there seem to be some exceptions.

```go
func checkVersion(
	clusterVersion *cluster.ExposedClusterVersion, peerVersion roachpb.Version,
) error {
	if !clusterVersion.IsInitialized() {
		// Cluster version has not yet been determined.
		return nil
	}
	if !clusterVersion.IsActive(cluster.VersionRPCVersionCheck) {
		// Cluster version predates this version check.
		return nil
	}
	minVersion := clusterVersion.Version().MinimumVersion
	if peerVersion == (roachpb.Version{}) {
		return errors.Errorf(
			"cluster requires at least version %s, but peer did not provide a version", minVersion)
	}
	if peerVersion.Less(minVersion) {
		return errors.Errorf(
			"cluster requires at least version %s, but peer has version %s", minVersion, peerVersion)
	}
	return nil
}
```

@nvanbenschoten (Member) commented:

> It depends. If that commit ever landed on release-2.1, it would cause this problem in production. If it doesn't, it'll only ever interact with 2.1.x clusters which can handle it.

Even without the exceptions, won't 2.0 binaries and 2.2-alpha binaries look compatible? The only difference between 2.2 alpha binaries and 2.1 release binaries is an Unstable version, which is explicitly ignored by Version.CanBump.
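
To make that concrete, here is a toy sketch (not the real roachpb.Version or CanBump logic) of why a 2.2 alpha looks identical to a 2.1 release once the unstable component is ignored:

```go
package main

import "fmt"

// version is a stand-in for roachpb.Version.
type version struct {
	major, minor, patch, unstable int32
}

// sameStableVersion mimics a compatibility comparison that, as described
// above, ignores the unstable component. A 2.2 alpha (2.1 plus an unstable
// bump) then compares equal to a 2.1 release, so a 2.0 node accepts both.
func sameStableVersion(a, b version) bool {
	return a.major == b.major && a.minor == b.minor && a.patch == b.patch
}

func main() {
	release21 := version{2, 1, 0, 0}
	alpha22 := version{2, 1, 0, 124} // roughly what a 2.2 alpha advertises
	fmt.Println(sameStableVersion(release21, alpha22)) // true
}
```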

@tbg (Member) commented Jan 10, 2019:

That's true. Maybe we should be introducing the 2.2 tag before the first alpha, and make the 2.2 released version an unstable version bump. But then we have to make sure `SET CLUSTER SETTING version = 2.2` actually tries to upgrade to 2.2-n, where n is the "right" unstable (so probably that of the gateway node).

Or we allow negative unstables, so that we can tag alphas as 2.2.0-"-124" (so that 2.2 can still be 2.2.0).

Or we leave everything as-is and mandate that it's simply a caveat to watch out for when testing alphas.

@bdarnell, wdyt?
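
A toy ordering sketch of the negative-unstable idea above (illustrative numbers; not the actual roachpb.Version.Less implementation):

```go
package main

import "fmt"

// version is a stand-in for roachpb.Version; the numbers below are made up.
type version struct {
	major, minor, patch, unstable int32
}

// less orders versions lexicographically with the unstable component last.
// Allowing negative unstables makes an alpha such as {2,2,0,-124} sort before
// the final release {2,2,0,0} without renaming the release itself.
func less(a, b version) bool {
	if a.major != b.major {
		return a.major < b.major
	}
	if a.minor != b.minor {
		return a.minor < b.minor
	}
	if a.patch != b.patch {
		return a.patch < b.patch
	}
	return a.unstable < b.unstable
}

func main() {
	alpha := version{2, 2, 0, -124}
	final := version{2, 2, 0, 0}
	fmt.Println(less(version{2, 1, 0, 0}, alpha)) // true: the alpha is newer than 2.1
	fmt.Println(less(alpha, final))               // true: the alpha still precedes the 2.2.0 release
}
```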

craig bot pushed a commit that referenced this issue Jan 10, 2019
33598: roachtest: update `upgrade` roachtest to start at v2.1.3 r=nvanbenschoten a=nvanbenschoten

Fixes #33578.

I've run this twice successfully.

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@craig craig bot closed this as completed in #33598 Jan 10, 2019
@nvanbenschoten (Member) commented:

Keeping this open until we resolve the other issue.

@nvanbenschoten (Member) commented:

> Or we allow negative unstables, so that we can tag alphas as 2.2.0-"-124" (so that 2.2 can still be 2.2.0).

@andreimatei @ajwerner and I discussed this and the negative unstable version was the best solution we could come up with.

andreimatei added a commit to andreimatei/cockroach that referenced this issue Jan 10, 2019
... making 2.2 nodes incompatible with 2.0 nodes.
This lets me assume 2.1 nodes (and do the next commit). The instructions
for when to bump this say to bump this when we introduce 2_2, but I
think that's off (or, rather, we should already have introduced 2.2).
Also see cockroachdb#33578.

Release note: None
@bdarnell (Contributor) commented:

👍 to negative unstable versions (not that it matters here, but it's what I've been doing in Tornado for a long time as well)

@tbg tbg changed the title roachtest: upgrade/oldVersion=v2.0.5/nodes=5 failed server: clean up migration story via unstable negative versoins Jan 22, 2019
@tbg tbg changed the title server: clean up migration story via unstable negative versoins server: clean up migration story via unstable negative versions Jan 22, 2019
@tbg tbg assigned andreimatei and unassigned nvanbenschoten Jan 22, 2019
@tbg tbg added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) and removed C-test-failure Broken test (automatically or manually discovered). labels Jan 22, 2019
@github-actions bot commented Jun 5, 2021:

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
5 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@knz (Contributor) commented Jun 5, 2021:

@tbg @nvanbenschoten do we still need this?

@tbg (Member) commented Jun 7, 2021:

I mean, we can, but it doesn't seem necessary, I think. We are worried that (translating to current versions) a v20.2 node might be able to interact with a v21.2-alpha. During RPC pings, the 20.2 node will send its binary version (20.2) and the recipient will compare that to its local cluster version. But that has to be at least 21.1 (since that's the min supported binary version in a v21.2-alpha binary), so that can't work.
The other direction does work: if the new binary reaches out to the old, we'll have v21.1 (the binary version of a v21.2 alpha) hit v20.2, and the v20.2 node will be fine seeing a newer version it doesn't know about, as long as it's ahead of its cluster version.

However, for a ping to be successful, both checks have to go through (since we check the version on both sides of a ping):

cockroach/pkg/rpc/context.go (lines 1199 to 1206 in eba03c4):

```go
if err == nil {
	err = errors.Wrap(
		checkVersion(goCtx, ctx.Settings, response.ServerVersion),
		"version compatibility check failed on ping response")
	if err != nil {
		returnErr = true
	}
}
```

This should mean that a 20.2 and a 21.2-alpha node can't ever connect to each other. However, I don't know if that code has changed materially since 2019. What we do now is put a start unstable version early in the release (VersionStart21_2); perhaps we only ever had issues when the new version was on the master branch but didn't have an unstable version minted yet. In that case, we could have code incompatible with v20.2, but the new build still looked exactly like a v21.1 build (i.e. cluster version 21.1.0-0) and so was allowed to interact with v20.2, despite having code changes incompatible with it.
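
A worked toy example of that reasoning, with simplified version values and an assumed minimum supported version for the 20.2 binary (this is not the actual rpc code):

```go
package main

import "fmt"

type version struct{ major, minor int32 }

func less(a, b version) bool {
	return a.major < b.major || (a.major == b.major && a.minor < b.minor)
}

// checkPeer mirrors the shape of the check above: reject a peer whose binary
// version is older than the local minimum supported version.
func checkPeer(minSupported, peer version) error {
	if less(peer, minSupported) {
		return fmt.Errorf("peer %v below minimum supported %v", peer, minSupported)
	}
	return nil
}

func main() {
	v202 := version{20, 2}
	v211 := version{21, 1} // binary version advertised by a 21.2 alpha

	// New node dialing old: a 20.2 node (assume min supported 20.1) accepts the newer peer.
	fmt.Println(checkPeer(version{20, 1}, v211)) // <nil>

	// Old node dialing new: the 21.2-alpha node (min supported 21.1) rejects the 20.2 peer.
	fmt.Println(checkPeer(v211, v202)) // error

	// A ping only succeeds if the check passes on both sides, so the two
	// binaries cannot connect in either direction.
}
```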

@knz knz removed this from the 19.1 milestone Jun 7, 2021
@knz (Contributor) commented Jun 7, 2021:

So are we saying the incompatibility can show up if anyone sends a PR with an incompatible change on the new master branch and triggers a CI run before the unstable version is added?

I think that's a realistic scenario under current processes.

We can either add the protection described in this issue, or change the release process to ensure that no PR gets onto master after a release fork until the start unstable version is added in the code. What do you think?

@tbg (Member) commented Jun 7, 2021:

Even if we add the protection, the tests will fail. Either way, things cannot work in this setting. The real task is to make sure our release process performs the various tasks around the version change in a way that avoids this problem. That involves pushing the new tag to the master branch at the right moment. I recently went over these and updated them.

@andreimatei (Contributor) commented Jun 7, 2021:

> What we do now is put a start unstable version early in the release (VersionStart21_2); perhaps we only ever had issues when the new version was on the master branch but didn't have an unstable version minted yet. In that case, we could have code incompatible with v20.2, but the new build still looked exactly like a v21.1 build (i.e. cluster version 21.1.0-0) and so was allowed to interact with v20.2, despite having code changes incompatible with it.

Does creating the VersionStart21_2 version play a role in any safety? You've brought it up, but I don't know why it matters. I think it's just the bumping of binaryMinSupportedVersion that makes it impossible for a 20.2 node to join a 21.2-alpha?

In either case, I would certainly find things way less confusing if VersionStart21_2, as well as every other 21.2 version after it, started with Major: 21, Minor: 2 instead of Major: 21, Minor: 1 as they do now.
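
To illustrate the two naming schemes being compared (hypothetical values; not the actual cluster version definitions):

```go
package main

import "fmt"

// version mirrors the fields under discussion; the values are illustrative.
type version struct{ major, minor, patch, unstable int32 }

func main() {
	// Current scheme: a 21.2 development gate such as VersionStart21_2 still
	// carries Major 21, Minor 1, plus an internal (unstable) component.
	current := version{21, 1, 0, 2}

	// Suggested scheme: 21.2 development gates carry Major 21, Minor 2 from
	// the start; a negative unstable would keep them ordered before the final 21.2.0.
	suggested := version{21, 2, 0, -100}

	fmt.Printf("current: %+v\nsuggested: %+v\n", current, suggested)
}
```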

@jlinder jlinder added the T-kv KV Team label Jun 16, 2021
@tbg (Member) commented Jun 21, 2021:

Yeah, you're right. Also agree about the UX improvement. It would be nice to fix this.

@github-actions bot commented:

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@knz (Contributor) commented Sep 20, 2023:

We have fixed this in a roundabout way: we require a bidirectional handshake for RPCs now. So the version check goes both ways, and the scenario at the top is prevented.

@knz knz closed this as completed Sep 20, 2023