
roachtest: de-flake decommission/mixed-versions #56568

Merged: 1 commit into cockroachdb:master on Nov 16, 2020

Conversation

@tbg (Member) commented on Nov 11, 2020

The test was fully decommissioning and then recommissioning nodes, which
no longer works now that the master branch is 21.1 and the predecessor
version in this test is therefore 20.2, which does know about fully
decommissioning (decommission attempts run via 20.1 nodes never set
the status to fully decommissioned, so we got away with it so far).

Pare down the test so that it carries out one random partial and full
decommission cycle alongside a partial version upgrade. That is
reasonably all we can do unless we want to get into the business of
wiping and re-adding nodes to the cluster in this test, which I don't
think we want to do, as it doesn't work well with the step-driven
format this test uses.

Fixes #55798.

Release note: None

@cockroach-teamcity (Member):

This change is Reviewable

@tbg (Member, Author) commented on Nov 12, 2020

Still issues here, btw. When we're decommissioning a node through itself, it's possible that the cluster will kick the node out before it manages to convince itself it's good to go. We need to terminate this through the RPC layer, which in turn motivated me to work on #54939.

@andreimatei (Contributor) left a comment

:lgtm: 🐶

When we're decommissioning a node through itself, it's possible that the cluster will kick the node out before it manages to convince itself it's good to go.

What does this mean? Why does it matter what node we're decommissioning "through"? That's just about which machine we ssh into to run the decommission command, right?

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @irfansharif, and @tbg)


pkg/cmd/roachtest/mixed_version_decommission.go, line 62 at r1 (raw file):

		// Partially decommission a random node from another random node. We
		// use the predecessor CLI to do so.
		partialDecommissionStep(h.getRandNode(), h.getRandNode(), predecessorVersion),

I think this "from another node" part is confusing. Wouldn't it be much clearer if we always ran the cockroach node decommission commands locally (on the test runner machine)? Even if that means downloading the predecessor version locally.

@tbg requested a review from andreimatei on Nov 16, 2020, 08:41

@tbg (Member, Author) left a comment

Yes, but what matters more is which node the decommission process talks to.

Here, the decommission process is on n1:
client ---gRPC---> n1.Decommission(n1)

You could also decommission n1 through n2, in which case n2 would be orchestrating the process (i.e. scanning the metas, marking liveness as decommissioning, etc.):

client ---gRPC---> n2.Decommission(n1)
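
To make the distinction concrete, here is a minimal, self-contained Go sketch. The admin type and Decommission signature are illustrative stand-ins, not CockroachDB's actual serverpb API; the point is only that whichever node receives the RPC is the one orchestrating the decommission, independently of which node is being decommissioned.

    package main

    import "fmt"

    // nodeID identifies a node in this illustration.
    type nodeID int

    // admin stands in for the admin RPC endpoint of whichever node the
    // client dials. It is a hypothetical type, not the real serverpb client.
    type admin struct {
        dialedNode nodeID // the node that receives the RPC and does the work
    }

    // Decommission models the request "decommission target" as handled by
    // the dialed node (which in a real cluster would scan the metas, mark
    // liveness, etc.); here it only prints who orchestrates whom.
    func (a admin) Decommission(target nodeID) {
        fmt.Printf("n%d orchestrates decommissioning of n%d\n", a.dialedNode, target)
    }

    func main() {
        // Self-decommission: the request lands on n1 itself. If the cluster
        // marks n1 as fully decommissioned before n1 finishes its own
        // bookkeeping, n1 can get cut off mid-flight (the race mentioned
        // earlier in this thread).
        admin{dialedNode: 1}.Decommission(1)

        // Decommission n1 "through" n2: n2 orchestrates, and n2 survives
        // even once n1 has been kicked out.
        admin{dialedNode: 2}.Decommission(1)
    }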

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei and @irfansharif)


pkg/cmd/roachtest/mixed_version_decommission.go, line 62 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I think this "from another node" part is confusing. Wouldn't it be much clearer if we always ran the cockroach node decommission commands locally (on the test runner machine)? Even if that means downloading the predecessor version locally.

While we could do this for the predecessor version (since it is always downloaded), in general we try to stay away from this pattern: the local architecture can be, and often is, different from the cluster's, and we would need to manually construct a more complicated command line (host, port, etc.). When run on node X, the CLI goes through node X by default, which is convenient.
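
For illustration, here is a hedged sketch of that pattern. The runOnNode helper, binary path, and exact flags are assumptions made for this example rather than the test's actual code; the point is that the decommission command executes on the node we want to go "through", so no host/port flags have to be wired up.

    package main

    import (
        "fmt"
        "strconv"
    )

    // runOnNode is a hypothetical stand-in for the roachtest harness's
    // "run this command on cluster node n" primitive; here it just prints.
    func runOnNode(node int, args ...string) {
        fmt.Printf("node %d runs: %v\n", node, args)
    }

    // partialDecommission sketches the idea behind partialDecommissionStep:
    // run the predecessor-version CLI on node `from`, targeting node
    // `target`. Because the command executes on `from` and no --host is
    // passed, the CLI talks to the local server on `from`, which then
    // orchestrates decommissioning `target`.
    func partialDecommission(target, from int, predecessorVersion string) {
        binary := "./cockroach-" + predecessorVersion // assumed binary layout
        runOnNode(from, binary, "node", "decommission",
            "--wait=none", "--insecure", strconv.Itoa(target))
    }

    func main() {
        // e.g. have n2 partially decommission n3 using the 20.2 CLI.
        partialDecommission(3, 2, "v20.2.0")
    }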

@tbg (Member, Author) commented on Nov 16, 2020

bors r=andreimatei

I'm avoiding the self-decommission in this test. Instead, I filed an issue and added a skipped acceptance test that only decommissions self. Will look into that separately.

@craig (bot) commented on Nov 16, 2020

Build failed:

The test was fully decommissioning and then recommissioning nodes, which
no longer works now that the master branch is 21.1 and the predecessor
version in this test is therefore 20.2, which does know about fully
decommissioning (decommission attempts run via 20.1 nodes never set
the status to fully decommissioned, so we got away with it so far).

Pare down the test so that it carries out one random partial and full
decommission cycle alongside a partial version upgrade. That is
reasonably all we can do unless we want to get into the business of
wiping and re-adding nodes to the cluster in this test, which I don't
think we want to do, as it doesn't work well with the step-driven
format this test uses.

While looking at this test, I also discovered cockroachdb#56718 (hang
during self-decommission), which I will look into separately; for now
the mixed-version test avoids this case. I did add an acceptance test
that decommissions self in this commit, but am keeping it skipped for
now, as it was not immediately able to reproduce the problem.

Fixes cockroachdb#55798.

Release note: None

@tbg (Member, Author) commented on Nov 16, 2020

bors r=andreimatei

forgot bazel update

@craig (bot) commented on Nov 16, 2020

Build succeeded:

@craig (bot) merged commit 5403e54 into cockroachdb:master on Nov 16, 2020