# Cluster upgrade finalization fails with decommissioned n1 node #66468
Hello, I am Blathers. I am here to help you get the issue triaged. Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here. I have CC'd a few people who may be able to assist you.

If we have not gotten back to your issue within a few business days, you can try the following:

🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is otan.
Hm, there have been some recent changes in the area that could've made this possible. What does … show?

The reason for these annoying UX quirks is that we didn't even have this terminal "decommissioned" state in earlier releases; it was introduced only as recently as #50329. For nodes that were decommissioned in earlier releases, there's no way for CRDB to distinguish between nodes that have truly gone for good and nodes that are in the holdover state. CRDB makes the conservative assumption that it's the latter, requiring operators to finalize the decommission process for these nodes. To add more confusion to the matter, our UI doesn't distinguish between "decommissioning" and "decommissioned" (part of #50707).
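For anyone checking where their nodes stand, something like the following should surface the per-node membership status (connection flags here are illustrative; adjust host and certs for your deployment):

```sh
# List all nodes, including decommissioned ones. On 20.2+ the output
# includes a "membership" column that reads "active", "decommissioning",
# or "decommissioned".
cockroach node status --decommission --certs-dir=certs --host=<any-live-node>:26257
```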
Indeed, a number of those older decommissioned nodes still have membership `decommissioning`.

I've now actively decommissioned nodes 1 through 9, and they all show up as `decommissioned`. The logs now show: …
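For anyone else in this state, explicitly re-decommissioning the stale node IDs is something like the following (node IDs and connection flags are illustrative):

```sh
# Re-run decommissioning for nodes stuck in the pre-20.2 holdover state.
# For nodes whose replicas are long gone, this should simply move their
# membership status to the terminal "decommissioned" state.
cockroach node decommission 1 2 3 4 5 6 7 8 9 \
  --certs-dir=certs --host=<any-live-node>:26257
```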
Thanks for pointing this out, issue resolved! :)
Sweet! Thanks for filing it, hopefully it's google-able for others running into the same. @jseldess / @rmloveland, could we do anything on the docs side to surface this information better?
@irfansharif, did this log message mean the upgrade did not, in fact, finish? If so, how exactly should we guide users here? To make sure that any decommissioned nodes have finished decommissioning before you finalize the upgrade?
@jseldess @irfansharif unfortunately I haven't checked if the version number next to the cluster id in the UI showed the new version.
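One way to double-check whether finalization actually went through, independent of the UI, is to ask for the active cluster version over SQL (a sketch; connection flags are illustrative):

```sh
# Prints the active cluster version. After a fully finalized 21.1 upgrade,
# this should report 21.1 rather than the previous release's version.
cockroach sql --certs-dir=certs --host=<any-live-node>:26257 \
  -e "SHOW CLUSTER SETTING version;"
```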
@jseldess: I think our docs should say something like the following. If your upgrade process has stalled due to errors of the form:

…

where the nodes listed have long been decommissioned, we recommend first checking the output of `cockroach node status --decommission` for nodes still showing a `decommissioning` membership status.

Alternatively, as part of our upgrade docs, we could recommend checking the output of `cockroach node status --decommission` before attempting to finalize.
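As a concrete pre-flight check along those lines, the docs could suggest something like this (illustrative; assumes the CLI's `--format=csv` output and standard connection flags):

```sh
# Print any nodes whose membership is still the non-terminal
# "decommissioning" state; anything listed here needs to finish
# decommissioning before the upgrade can be finalized.
cockroach node status --decommission --format=csv \
  --certs-dir=certs --host=<any-live-node>:26257 | grep -w decommissioning
```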
That makes sense, @irfansharif. Thank you. I like the approach of asking users to proactively check if any nodes are in the `decommissioning` state.

@joshimhoff, is this not an issue for CC clusters and self-hosted Kubernetes clusters, based on the way we remove nodes? E.g., https://www.cockroachlabs.com/docs/stable/operate-cockroachdb-kubernetes.html#remove-nodes
It's also an issue for CC and self-hosted Kubernetes clusters, if they had nodes decommissioned prior to 20.2. We tracked one such issue at https://github.com/cockroachlabs/support/issues/1007, which we resolved concurrently with this issue.
Interesting. @DuskEagle, does CC's decommissioning process differ from what we have in the self-hosted K8s docs? https://www.cockroachlabs.com/docs/stable/operate-cockroachdb-kubernetes.html#remove-nodes
We don't use the operator in CC currently. Instead, our decommissioning process follows these steps: https://www.cockroachlabs.com/docs/stable/remove-nodes.html#step-2-start-the-decommissioning-process-on-the-node
**Describe the problem**

The following error gets logged in my cockroachdb pods:

…

The `n1` node it refers to was decommissioned more than a year ago. The GUI shows the current nodes `n16`, `n17`, and `n18` to still be around, and all prior nodes, `n1` to `n15`, to be decommissioned.

**To Reproduce**

Ran a cluster for multiple years, upgraded through practically every patch version between September 2017 (not sure about the starting version) and `21.1.2`. Scaled up, scaled down, and moved to a different Kubernetes cluster, leading to a lot of older nodes getting decommissioned.

**Expected behavior**

Cluster upgrade finalization takes the current nodes into account, not relying on `n1` to still be around.

**Environment:**

**Additional context**

Apart from this log line it seems the upgrade has gone fine, but it might mean newer features haven't been unlocked yet.