server: clean up migration story via unstable negative versions #33578
Comments
Hey, glad to see this test catch stuff. @knz, blame tells me that you might know what to make of this, though looking a bit more I end up here Lines 963 to 966 in 352bdf4
I think the node that crashed with the below error is running 2.2-alpha, so the "unknown request" originated at a node running 2.0.5. So, what requests did we remove between the two? This sounds more like a @nvanbenschoten question now. I didn't find anything recent in
|
How is a 2.0.5 node even talking to a node running the 2.2-alpha? Is this because we haven't bumped the minor version on the alphas yet? I don't know what is causing this, but I can see how things would slip in if people are developing against a model where they only need to support compatibility with 2.1 going forward for these alphas. |
Right. I think the right thing is to bump 2.0.5 to 2.1. (at which point the test should stop being run in 2.1, or we fork it to retain 2.0 there), but we should still figure out what request caused this to make sure that's all there is to it. |
The SQL query you've blamed is a 2.1 SQL migration. This should work fine in mixed 2.0/2.1 configurations. |
Making sure I got that - you're saying that we should bump the test's I'll throw in some logging and pull out the offending request type. |
The node that crashed was running the 2.2-alpha. It crashed after 5 of its migrations failed due to the "unknown request" error. I think this means that the 2.0.5 node was rejecting a request that it didn't know about, not the other way around. So the question is, what did we add? That's great because it will make it easier to debug. |
Looks like we were sending We can change the test, but my question still remains about 2.2 alphas connecting to 2.0 nodes. Since we haven't bumped the minor cluster version, do we have any protection against this in production clusters? |
It depends. If that commit ever landed on release-2.1, it would cause this problem in production. If it doesn't, it'll only ever interact with 2.1.x clusters which can handle it. This seems easy enough to get wrong accidentally, so perhaps we ought to adopt a policy of not removing version checks unless there's a non-patch cluster version between them. |
Hmm, now that I read this again, I'm not 100% confident what I'm saying is even true. We have version checks in the rpc handlers that are supposed to avoid connections across non-adjacent versions, but there seem to be some exceptions. cockroach/pkg/rpc/heartbeat.go Lines 54 to 75 in 2558dcc
|
Even without the exceptions, won't 2.0 binaries and 2.2-alpha binaries look compatible? The only difference between 2.2 alpha binaries and 2.1 release binaries is an |
Fixes cockroachdb#33578. Release note: None
That's true. Maybe we should be introducing the 2.2 tag before the first alpha, and make the 2.2 released version an unstable version bump. But then we have to make sure `SET CLUSTER SETTING version = 2.2` actually tries to upgrade to 2.2-n, where n is the "right" unstable (so probably that of the gateway node). Or we allow negative unstables, so that we can tag alphas as Or we leave everything as-is and mandate that it's simply a caveat to watch out for when testing alphas. @bdarnell, wdyt? |
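To make the negative-unstable idea concrete, here is a minimal sketch in Go. Everything in it is an assumption for illustration: the `Version` struct, the `Less` ordering, and the `Adjacent` check are hypothetical and are not the types or checks used in the CockroachDB codebase. It only shows why a signed unstable component would let an alpha carry the 2.2 tag (and so look two minors away from 2.0) while still sorting below the 2.2 release.

```go
package main

import "fmt"

// Version is a hypothetical, simplified version triple for illustration only;
// it is not the type CockroachDB uses.
type Version struct {
	Major    int
	Minor    int
	Unstable int // assumed convention: 0 = release, >0 = dev step after a release, <0 = pre-release
}

// Less orders versions lexicographically by (Major, Minor, Unstable).
// Because Unstable is signed, 2.2 with Unstable -1 sorts after every 2.1-n
// and before the 2.2 release.
func (v Version) Less(o Version) bool {
	if v.Major != o.Major {
		return v.Major < o.Major
	}
	if v.Minor != o.Minor {
		return v.Minor < o.Minor
	}
	return v.Unstable < o.Unstable
}

// Adjacent is a hypothetical compatibility rule: same major version and at
// most one minor version apart.
func Adjacent(a, b Version) bool {
	if a.Major != b.Major {
		return false
	}
	d := a.Minor - b.Minor
	if d < 0 {
		d = -d
	}
	return d <= 1
}

func main() {
	old := Version{Major: 2, Minor: 0}                     // a 2.0 node
	alphaToday := Version{Major: 2, Minor: 1, Unstable: 3} // alpha tagged as 2.1-n
	alphaNeg := Version{Major: 2, Minor: 2, Unstable: -1}  // alpha tagged with a negative unstable
	release := Version{Major: 2, Minor: 2}

	fmt.Println("2.1-n alpha adjacent to 2.0:", Adjacent(old, alphaToday))
	fmt.Println("negative-unstable alpha adjacent to 2.0:", Adjacent(old, alphaNeg))
	fmt.Println("alpha sorts below release:", alphaNeg.Less(release))
}
```

Under these assumptions, an alpha tagged as 2.1-n still looks adjacent to 2.0, whereas one tagged with a negative unstable on 2.2 does not, and no special handling is needed for the final `version = 2.2` upgrade target.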
Keeping this open until we resolve the other issue. |
@andreimatei @ajwerner and I discussed this and the negative unstable version was the best solution we could come up with. |
... making 2.2 nodes incompatible with 2.0 nodes. This lets me assume 2.1 nodes (and do the next commit). The instructions for when to bump this say to bump this when we introduce 2_2, but I think that's off (or, rather, we should already have introduced 2.2). Also see cockroachdb#33578. Release note: None
👍 to negative unstable versions (not that it matters here, but it's what I've been doing in Tornado for a long time as well) |
We have marked this issue as stale because it has been inactive for |
@tbg @nvanbenschoten do we still need this? |
I mean, we can, but it doesn't seem necessary, I think. We are worried (translating to current versions) that a v20.2 node might be able to interact with v21.2-alpha. During RPC pings, the 20.2 node will send its binary version (20.2) and the recipient will compare that to its local cluster version. But that has to be at least 21.1 (since that's the min supported binary version in a v21.2-alpha binary), so that can't work. However, for a ping to be successful, both checks have to go through (since we check the version on both sides of a ping): Lines 1199 to 1206 in eba03c4
This should mean that a 20.2 and 21.2-alpha node can't ever connect to each other. However, I don't know if that code has changed materially since 2019. What we do now is put a start unstable version early in the release ( |
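The two-sided check described in this comment can be sketched roughly as follows. This is an illustrative model only: `Node`, `checkPeer`, and `handshake` are hypothetical names, and the rule "peer binary version must not be older than the local active cluster version" is a simplification of what the comment describes, not a copy of the code in the rpc package.

```go
package main

import "fmt"

// Version is a hypothetical major.minor pair used only for this sketch.
type Version struct {
	Major, Minor int
}

// Less reports whether v is older than o.
func (v Version) Less(o Version) bool {
	if v.Major != o.Major {
		return v.Major < o.Major
	}
	return v.Minor < o.Minor
}

// Node is a hypothetical model of one side of a ping.
type Node struct {
	Name           string
	BinaryVersion  Version // version the binary reports in pings
	ClusterVersion Version // active cluster version as known locally
}

// checkPeer models one direction of the check: the local node rejects a peer
// whose binary version is older than the local active cluster version.
func checkPeer(local Node, peerBinary Version) error {
	if peerBinary.Less(local.ClusterVersion) {
		return fmt.Errorf("%s: cluster requires at least %v, peer has %v",
			local.Name, local.ClusterVersion, peerBinary)
	}
	return nil
}

// handshake succeeds only if both sides accept each other.
func handshake(a, b Node) error {
	if err := checkPeer(a, b.BinaryVersion); err != nil {
		return err
	}
	return checkPeer(b, a.BinaryVersion)
}

func main() {
	n202 := Node{"n20.2", Version{20, 2}, Version{20, 2}}
	n211 := Node{"n21.1", Version{21, 1}, Version{21, 1}}
	// Assumption: the 21.2-alpha binary has min supported version 21.1,
	// so its active cluster version is at least 21.1.
	alpha := Node{"n21.2-alpha", Version{21, 2}, Version{21, 1}}

	fmt.Println("20.2 <-> 21.2-alpha:", handshake(n202, alpha))
	fmt.Println("21.1 <-> 21.2-alpha:", handshake(n211, alpha))
}
```

In this model the 20.2 node passes its own check of the alpha, but the alpha rejects the 20.2 binary, so the ping as a whole fails. That is the property a one-directional check would not guarantee.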
So are we saying the incompatibility can show up if anyone sends a PR with an incompatible change on the new I think that's a realistic scenario under current processes. We can either add the protection described in this issue, or change the release process to ensure that no PR gets onto |
Even if we add the protection, the tests will fail. Either way, things cannot work in this setting. The real task is to make sure our release process is performing the various tasks around the version change in a way that avoids this problem. That involves pushing the new tag to the |
Does creating the In either case, I would certainly find things way less confusing if |
Yeah, you're right. Also agree about the UX improvement. It would be nice to fix this. |
We have marked this issue as stale because it has been inactive for |
We have fixed this in a roundabout way: we require a bidirectional handshake for RPCs now. So the version check goes both ways and the scenario at top is prevented. |
SHA: https://github.com/cockroachdb/cockroach/commits/12e28159b1d8b63b56d6a48f22ebbb5c75e8ee5c
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1085451&tab=buildLog
Jira issue: CRDB-4682