backupccl: avoid div-by-zero crash on failed node count #55560

dt · 2020-10-14T19:02:51Z

We've seen a report of a node that crashed on a divide-by-zero
during metrics collection, specifically when computing the
throughput per node by dividing the backup size by the node count.

Since the node count is now used only for that metric, make a failure
to count nodes a warning in release builds (falling back to a count of
1), and make any error while counting, or a count that is not greater
than 0, a returned error in non-release builds.

Release note (bug fix): avoid crashing when BACKUP is unable to count the total nodes in the cluster.

cockroach-teamcity · 2020-10-14T19:02:58Z

This change is

pkg/ccl/backupccl/backup_job.go

pbardea · 2020-10-15T15:18:11Z

pkg/ccl/backupccl/restore_job.go

@@ -1169,7 +1170,11 @@ func (r *restoreResumer) Resume(

 	numClusterNodes, err := clusterNodeCount(p.ExecCfg().Gossip)


Should we also move this call closer to the telemetry, since that's the only thing it's used for?

I sorta considered that but then we'd need to pass Gossip into restore too. I think the real answer is that we should be using number of nodes we actually plan the flow on, so I think this should eventually go away rather than move. In the interest of keeping this backport-friendly I think just leave it as is until then?

👍 yep, that sounds good

dt · 2020-10-27T14:56:53Z

bors r+

craig · 2020-10-27T16:22:12Z

Build failed (retrying...):

GitHub CI (Cockroach)

craig · 2020-10-27T18:28:19Z

Build succeeded:

GitHub CI (Cockroach)

dt requested review from pbardea and a team October 14, 2020 19:02

pbardea reviewed Oct 15, 2020

dt force-pushed the div-by-zero branch from 3dd7d3f to 6451ce5 Compare October 26, 2020 04:03

dt force-pushed the div-by-zero branch from 6451ce5 to 63e79f3 Compare October 26, 2020 19:15

pbardea approved these changes Oct 27, 2020

craig bot merged commit bea5339 into cockroachdb:master Oct 27, 2020

This was referenced Oct 28, 2020

release-20.2: backupccl: avoid div-by-zero crash on failed node count #56050

Merged

release-20.1: backupccl: avoid div-by-zero crash on failed node count #56096

Merged

dt deleted the div-by-zero branch November 3, 2020 11:50
