
Cluster upgrade finalization fails with decommissioned n1 node #66468

Closed
JorritSalverda opened this issue Jun 15, 2021 · 11 comments
Labels
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
O-community: Originated from the community
X-blathers-triaged: blathers was able to find an owner

Comments

@JorritSalverda

Describe the problem

The following error gets logged in my cockroachdb pods. The n1 node it refers to was decommissioned more than a year ago.

I210615 09:11:52.025059 5710 server/auto_upgrade.go:75 ⋮ [n18] 362  error when finalizing cluster version upgrade: ‹set-version›: n1 required, but unavailable

The GUI shows the current nodes n16, n17 and n18 as still around, and all prior nodes (n1 to n15) as decommissioned.

To Reproduce

Ran a cluster for multiple years, upgraded through practically every patch version between September 2017 (not sure about the exact version) and 21.1.2. Scaled up, scaled down, and moved to a different Kubernetes cluster, leading to a lot of older nodes getting decommissioned.

Expected behavior

The cluster upgrade should finalize taking the current nodes into account, not relying on n1 to still be around.

Environment:

  • CockroachDB version 21.1.2
  • CockroachDB helm chart 6.0.3

Additional context

Apart from this log line it seems the upgrade has gone fine, but it might mean newer features haven't been unlocked yet.

JorritSalverda added the C-bug label Jun 15, 2021
@blathers-crl

blathers-crl bot commented Jun 15, 2021

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

If we have not gotten back to your issue within a few business days, you can try the following:

  • Join our community slack channel and ask on #cockroachdb.
  • Try to find someone from here if you know they worked closely on the area and CC them.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

blathers-crl bot added the O-community and X-blathers-triaged labels Jun 15, 2021
@irfansharif
Contributor

irfansharif commented Jun 15, 2021

Hm, there have been some recent changes in the area that could've made this possible. What does cockroach node status --decommission show for n1-15? If it just shows up as "decommissioning" and not, say, "decommissioned", we'll want to finalize the decommission process so these nodes are in the terminal state, rather than the holdover state. By finalizing, I just mean decommissioning these nodes again so that their status gets marked as "decommissioned". It's perfectly acceptable to decommission these nodes if they're no longer around -- decommissioning nodes in absentia is a safe operation.

The reason for these annoying UX quirks is that we didn't even have this terminal state in earlier releases. This was introduced only as recently as #50329. For nodes that were decommissioned in earlier releases, there's no way for CRDB to distinguish between nodes that have truly gone for good, vs. nodes that are in the holdover state. CRDB takes the conservative assumption that it's the latter, requiring operators to finalize the decommission process for these nodes.

To add more confusion to the matter, our UI doesn't distinguish between "decommissioning" and "decommissioned" (part of #50707). cockroach node status --decommission however will be authoritative.
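
For concreteness, a minimal sketch of those two steps, assuming an insecure cluster and a placeholder address for a live node (substitute --certs-dir for --insecure on a secure cluster):

# 1. Check the "membership" column for nodes stuck in "decommissioning".
cockroach node status --decommission --insecure --host=<address-of-a-live-node>

# 2. Re-issue the decommission for those node IDs; this is safe even though the
#    machines are long gone (decommissioning in absentia).
cockroach node decommission <node-id> [<node-id> ...] --insecure --host=<address-of-a-live-node>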

@JorritSalverda
Author

Indeed, a number of those older decommissioned nodes still have membership "decommissioning":

  id | is_available | is_live | gossiped_replicas | is_decommissioning |   membership    | is_draining
-----+--------------+---------+-------------------+--------------------+-----------------+--------------
   1 | false        | false   | NULL              | true               | decommissioning | true
   2 | false        | false   | NULL              | true               | decommissioning | true
   3 | false        | false   | NULL              | true               | decommissioning | true
   4 | false        | false   | NULL              | true               | decommissioning | true
   5 | false        | false   | NULL              | true               | decommissioning | true
   6 | false        | false   | NULL              | true               | decommissioning | true
   7 | false        | false   | NULL              | true               | decommissioning | true
   8 | false        | false   | NULL              | true               | decommissioning | false
   9 | false        | false   | NULL              | true               | decommissioning | true
  10 | false        | false   | NULL              | true               | decommissioned  | true
  11 | false        | false   | NULL              | true               | decommissioned  | true
  12 | false        | false   | NULL              | true               | decommissioned  | true
  13 | false        | false   | NULL              | true               | decommissioned  | true
  14 | false        | false   | NULL              | true               | decommissioned  | true
  15 | false        | false   | NULL              | true               | decommissioned  | true
  16 | true         | true    |              2380 | false              | active          | false
  17 | true         | true    |              2380 | false              | active          | false
  18 | true         | true    |              2380 | false              | active          | false

I've now actively decommissioned nodes 1 through 9, and they all show up as decommissioned.

The logs now show:

I210615 19:58:08.816649 5585 server/auto_upgrade.go:77 ⋮ [n16] 22444  successfully upgraded cluster version

Thanks for pointing this out, issue resolved! :)

@irfansharif
Contributor

Sweet! Thanks for filing it, hopefully it's google-able for others running into the same. @jseldess / @rmloveland, could we do anything on the docs side to surface this information better?

@jseldess
Contributor

@irfansharif, did this log message mean the upgrade did not, in fact, finish? If so, how exactly should we guide users here? To make sure that any decommissioned nodes have finished decommissioning before you finalize the upgrade?

@JorritSalverda
Author

@jseldess @irfansharif unfortunately I haven't checked whether the version number next to the cluster id in the UI showed v21.1.2 before I managed to fix this error. All the nodes were definitely at that latest version. But I don't know what the logic behind that cluster version number is anyway, so it might have shown the latest version regardless of whether the upgrade was finalized under the hood.
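
One way to check that, independent of the UI, is to read the version cluster setting through SQL; a minimal sketch, again assuming an insecure cluster and a placeholder host:

# Prints the active cluster version; it only advances to 21.1 once the
# upgrade has actually been finalized.
cockroach sql --insecure --host=<address-of-a-live-node> \
  --execute="SHOW CLUSTER SETTING version;"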

@irfansharif
Contributor

@jseldess: I think our docs should say something like the following:

If your upgrade process has stalled due to errors of the form:

I210615 09:11:52.025059 5710 server/auto_upgrade.go:75 ⋮ [n18] 362  error when finalizing cluster version upgrade: ‹set-version›: n1 required, but unavailable

Where the nodes listed have long been decommissioned, we recommend first checking the output of cockroach node status --decommission, using the newer version binary and pointed at a server running the newer version binary. Look at the "membership" column: if any nodes appear in the "decommissioning" state rather than the "decommissioned" state, that's an indication that the decommissioning process has not been finalized. Nodes that were decommissioned in releases prior to 20.2 will appear as "decommissioning". It's safe to decommission those nodes again, and we'll have to do that in order to get them to the terminal "decommissioned" state. The upgrade will go through only once there are no nodes left in the "decommissioning" state. For example:

  id | is_available | is_live | gossiped_replicas | is_decommissioning |   membership    | is_draining
-----+--------------+---------+-------------------+--------------------+-----------------+--------------
   1 | false        | false   | NULL              | true               | decommissioning | true
   2 | false        | false   | NULL              | true               | decommissioning | true
   3 | false        | false   | NULL              | true               | decommissioning | true
   4 | false        | false   | NULL              | true               | decommissioning | true
   5 | false        | false   | NULL              | true               | decommissioning | true
   6 | false        | false   | NULL              | true               | decommissioning | true
   7 | false        | false   | NULL              | true               | decommissioning | true
   8 | false        | false   | NULL              | true               | decommissioning | false
   9 | false        | false   | NULL              | true               | decommissioning | true
  10 | false        | false   | NULL              | true               | decommissioned  | true
  11 | false        | false   | NULL              | true               | decommissioned  | true
  12 | false        | false   | NULL              | true               | decommissioned  | true
  13 | false        | false   | NULL              | true               | decommissioned  | true
  14 | false        | false   | NULL              | true               | decommissioned  | true
  15 | false        | false   | NULL              | true               | decommissioned  | true
  16 | true         | true    |              2380 | false              | active          | false
  17 | true         | true    |              2380 | false              | active          | false
  18 | true         | true    |              2380 | false              | active          | false

Alternatively, as part of our upgrade docs, we could recommend checking the output of cockroach node status --decommission to ensure that there are no nodes with a membership status of "decommissioning". If there are (possible if nodes were decommissioned prior to 20.2), try decommissioning them again through live nodes in the cluster. The upgrade will only be able to make progress once no nodes remain in the "decommissioning" state.
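
A sketch of what that proactive pre-upgrade check could look like, assuming an insecure cluster, a placeholder address, and CSV output so the membership column is easy to filter on:

# Should print nothing; any output means some node is still only
# "decommissioning" and needs to be decommissioned again before finalization.
cockroach node status --decommission --insecure --host=<address-of-a-live-node> \
  --format=csv | grep ',decommissioning,'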

@jseldess
Contributor

That makes sense, @irfansharif. Thank you. I like the approach of asking users to proactively check if any nodes are in the decommissioning state rather than react to a log message. I created this docs issue: cockroachdb/docs#10790.

@joshimhoff, is this not an issue for CC clusters and self-hosted Kubernetes clusters based on the way we remove nodes, e.g., https://www.cockroachlabs.com/docs/stable/operate-cockroachdb-kubernetes.html#remove-nodes?

@DuskEagle
Member

It's also an issue for CC and self-hosted Kubernetes clusters, if they had nodes decommissioned prior to 20.2. We tracked one such issue at https://github.com/cockroachlabs/support/issues/1007, which we resolved concurrently with this issue.

@jseldess
Contributor

Interesting. @DuskEagle, does CC's decommissioning process differ from what we have in the self-hosted K8s docs? https://www.cockroachlabs.com/docs/stable/operate-cockroachdb-kubernetes.html#remove-nodes

@DuskEagle
Member

We don't use the operator in CC currently. Instead, our decommissioning process follows these steps: https://www.cockroachlabs.com/docs/stable/remove-nodes.html#step-2-start-the-decommissioning-process-on-the-node.
