
Cluster upgrade finalization fails with decommissioned n1 node #66468

Closed
JorritSalverda opened this issue Jun 15, 2021 · 11 comments
Labels
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
O-community: Originated from the community
X-blathers-triaged: blathers was able to find an owner

Comments

@JorritSalverda

Describe the problem

The following error gets logged in my cockroachdb pods. The n1 node it refers to was decommissioned more than a year ago.

I210615 09:11:52.025059 5710 server/auto_upgrade.go:75 ⋮ [n18] 362  error when finalizing cluster version upgrade: ‹set-version›: n1 required, but unavailable

The GUI shows the current nodes n16, n17 and n18 as still around, and all prior nodes (n1 to n15) as decommissioned.

To Reproduce

Ran a cluster for multiple years, upgraded through practically every patch version between September 2017 (not sure about the exact version) and 21.1.2. Scaled up, scaled down, and moved to a different Kubernetes cluster, leading to a lot of older nodes getting decommissioned.

Expected behavior

The cluster upgrade should finalize taking the current nodes into account, not relying on n1 to still be around.

Environment:

  • CockroachDB version 21.1.2
  • CockroachDB helm chart 6.0.3

Additional context

Apart from this log line it seems the upgrade has gone fine, but it might mean newer features haven't been unlocked yet.

JorritSalverda added the C-bug label Jun 15, 2021
@blathers-crl

blathers-crl bot commented Jun 15, 2021

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

If we have not gotten back to your issue within a few business days, you can try the following:

  • Join our community slack channel and ask on #cockroachdb.
  • Try to find someone from here if you know they worked closely on the area and CC them.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

blathers-crl bot added the O-community and X-blathers-triaged labels Jun 15, 2021
@irfansharif
Contributor

irfansharif commented Jun 15, 2021

Hm, there have been some recent changes in the area that could've made this possible. What does cockroach node status --decommission show for n1-15? If it just shows up as "decommissioning" and not, say, "decommissioned", we'll want to finalize the decommission process so these nodes are in the terminal state, rather than the holdover state. By finalizing, I just mean decommissioning these nodes again so that their status gets marked as "decommissioned". It's perfectly acceptable to decommission these nodes if they're no longer around -- decommissioning nodes in absentia is a safe operation.

The reason for these annoying UX quirks is that we didn't even have this terminal state in earlier releases. This was introduced only as recently as #50329. For nodes that were decommissioned in earlier releases, there's no way for CRDB to distinguish between nodes that have truly gone for good, vs. nodes that are in the holdover state. CRDB takes the conservative assumption that it's the latter, requiring operators to finalize the decommission process for these nodes.

To add more confusion to the matter, our UI doesn't distinguish between "decommissioning" and "decommissioned" (part of #50707). cockroach node status --decommission however will be authoritative.
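
For concreteness, a minimal sketch of those two steps, assuming an insecure cluster and a placeholder address for a live node (substitute --certs-dir for --insecure on a secure cluster):

# 1. Check the "membership" column for nodes stuck in "decommissioning".
cockroach node status --decommission --insecure --host=<address-of-a-live-node>

# 2. Re-issue the decommission for those node IDs; this is safe even though the
#    machines are long gone (decommissioning in absentia).
cockroach node decommission <node-id> [<node-id> ...] --insecure --host=<address-of-a-live-node>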

@JorritSalverda
Author

Indeed, a number of those older decommissioned nodes still have membership "decommissioning":

  id | is_available | is_live | gossiped_replicas | is_decommissioning |   membership    | is_draining
-----+--------------+---------+-------------------+--------------------+-----------------+--------------
   1 | false        | false   | NULL              | true               | decommissioning | true
   2 | false        | false   | NULL              | true               | decommissioning | true
   3 | false        | false   | NULL              | true               | decommissioning | true
   4 | false        | false   | NULL              | true               | decommissioning | true
   5 | false        | false   | NULL              | true               | decommissioning | true
   6 | false        | false   | NULL              | true               | decommissioning | true
   7 | false        | false   | NULL              | true               | decommissioning | true
   8 | false        | false   | NULL              | true               | decommissioning | false
   9 | false        | false   | NULL              | true               | decommissioning | true
  10 | false        | false   | NULL              | true               | decommissioned  | true
  11 | false        | false   | NULL              | true               | decommissioned  | true
  12 | false        | false   | NULL              | true               | decommissioned  | true
  13 | false        | false   | NULL              | true               | decommissioned  | true
  14 | false        | false   | NULL              | true               | decommissioned  | true
  15 | false        | false   | NULL              | true               | decommissioned  | true
  16 | true         | true    |              2380 | false              | active          | false
  17 | true         | true    |              2380 | false              | active          | false
  18 | true         | true    |              2380 | false              | active          | false

I've now actively decommissioned nodes 1 through 9, and they all show up as decommissioned.

The logs now show:

I210615 19:58:08.816649 5585 server/auto_upgrade.go:77 ⋮ [n16] 22444  successfully upgraded cluster version

Thanks for pointing this out, issue resolved! :)

@irfansharif
Contributor

Sweet! Thanks for filing it, hopefully it's google-able for others running into the same. @jseldess / @rmloveland, could we do anything on the docs side to surface this information better?

@jseldess
Contributor

@irfansharif, did this log message mean the upgrade did not, in fact, finish? If so, how exactly should we guide users here? To make sure that any decommissioned nodes have finished decommissioning before you finalize the upgrade?

@JorritSalverda
Author

@jseldess @irfansharif unfortunately I haven't checked whether the version number next to the cluster id in the UI showed v21.1.2 before I managed to fix this error. All the nodes were definitely at that latest version. But I don't know what the logic behind that cluster version number is anyway, so it might have shown the latest version regardless of whether the upgrade was finalized under the hood.
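
One way to check that, independent of the UI, is to read the version cluster setting through SQL; a minimal sketch, again assuming an insecure cluster and a placeholder host:

# Prints the active cluster version; it only advances to 21.1 once the
# upgrade has actually been finalized.
cockroach sql --insecure --host=<address-of-a-live-node> \
  --execute="SHOW CLUSTER SETTING version;"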

@irfansharif
Contributor

@jseldess: I think our docs should say something like the following:

If your upgrade process has stalled due to errors of the form:

I210615 09:11:52.025059 5710 server/auto_upgrade.go:75 ⋮ [n18] 362  error when finalizing cluster version upgrade: ‹set-version›: n1 required, but unavailable

Where the nodes listed have long been decommissioned, we recommend first checking the output of cockroach node status --decommission, using the newer version binary and pointed at a server running the newer version binary. Look at the "membership" column: if any nodes appear in the "decommissioning" state rather than the "decommissioned" state, that's an indication that the decommissioning process has not been finalized. Nodes that were decommissioned in releases prior to 20.2 will appear as "decommissioning". It's safe to decommission those nodes again, and we'll have to do that in order to get them to the terminal "decommissioned" state. The upgrade will go through only once there are no nodes left in the "decommissioning" state. For example:

  id | is_available | is_live | gossiped_replicas | is_decommissioning |   membership    | is_draining
-----+--------------+---------+-------------------+--------------------+-----------------+--------------
   1 | false        | false   | NULL              | true               | decommissioning | true
   2 | false        | false   | NULL              | true               | decommissioning | true
   3 | false        | false   | NULL              | true               | decommissioning | true
   4 | false        | false   | NULL              | true               | decommissioning | true
   5 | false        | false   | NULL              | true               | decommissioning | true
   6 | false        | false   | NULL              | true               | decommissioning | true
   7 | false        | false   | NULL              | true               | decommissioning | true
   8 | false        | false   | NULL              | true               | decommissioning | false
   9 | false        | false   | NULL              | true               | decommissioning | true
  10 | false        | false   | NULL              | true               | decommissioned  | true
  11 | false        | false   | NULL              | true               | decommissioned  | true
  12 | false        | false   | NULL              | true               | decommissioned  | true
  13 | false        | false   | NULL              | true               | decommissioned  | true
  14 | false        | false   | NULL              | true               | decommissioned  | true
  15 | false        | false   | NULL              | true               | decommissioned  | true
  16 | true         | true    |              2380 | false              | active          | false
  17 | true         | true    |              2380 | false              | active          | false
  18 | true         | true    |              2380 | false              | active          | false

Alternatively, as part of our upgrade docs, we could recommend checking the output of cockroach node status --decommission to ensure that there are no nodes with a membership status of "decommissioning". If there are (possible if nodes were decommissioned prior to 20.2), try decommissioning them again through live nodes in the cluster. The upgrade will only be able to make progress once no nodes remain in the "decommissioning" state.
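
A sketch of what that proactive pre-upgrade check could look like, assuming an insecure cluster, a placeholder address, and CSV output so the membership column is easy to filter on:

# Should print nothing; any output means some node is still only
# "decommissioning" and needs to be decommissioned again before finalization.
cockroach node status --decommission --insecure --host=<address-of-a-live-node> \
  --format=csv | grep ',decommissioning,'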

@jseldess
Contributor

That makes sense, @irfansharif. Thank you. I like the approach of asking users to proactively check if any nodes are in the decommissioning state rather than react to a log message. I created this docs issue: cockroachdb/docs#10790.

@joshimhoff, is this not an issue for CC clusters and self-hosted Kubernetes clusters based on the way we remove nodes, e.g., https://www.cockroachlabs.com/docs/stable/operate-cockroachdb-kubernetes.html#remove-nodes?

@DuskEagle
Member

It's also an issue for CC and self-hosted Kubernetes clusters, if they had nodes decommissioned prior to 20.2. We tracked one such issue at https://github.com/cockroachlabs/support/issues/1007, which we resolved concurrently with this issue.

@jseldess
Contributor

Interesting. @DuskEagle, does CC's decommissioning process differ from what we have in the self-hosted K8s docs? https://www.cockroachlabs.com/docs/stable/operate-cockroachdb-kubernetes.html#remove-nodes

@DuskEagle
Member

We don't use the operator in CC currently. Instead, our decommissioning process follows these steps: https://www.cockroachlabs.com/docs/stable/remove-nodes.html#step-2-start-the-decommissioning-process-on-the-node.
