Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

backupccl: avoid div-by-zero crash on failed node count #55560

Merged
merged 1 commit into from
Oct 27, 2020

Conversation

dt
Copy link
Member

@dt dt commented Oct 14, 2020

We've seen a report of a node that crashed due to a divide-by-zero
hit during metrics collection, specifically when computing the
throughput-per-node by dividing the backup size by node count.

Since this is only now used for that metric, make a failure to count
nodes a warning only for release builds (and fallback to 1), and make
any error while counting, or not counting to more than 0, a returned
error in non-release builds.

Release note (bug fix): avoid crashing when BACKUP is unable to count the total nodes in the cluster.

@dt dt requested review from pbardea and a team October 14, 2020 19:02
@cockroach-teamcity
Copy link
Member

This change is Reviewable

pkg/ccl/backupccl/backup_job.go Outdated Show resolved Hide resolved
pkg/ccl/backupccl/backup_job.go Outdated Show resolved Hide resolved
@@ -1169,7 +1170,11 @@ func (r *restoreResumer) Resume(

numClusterNodes, err := clusterNodeCount(p.ExecCfg().Gossip)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also move this call to closer to the telemetry since that's the only thing that it's used for?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I sorta considered that but then we'd need to pass Gossip into restore too. I think the real answer is that we should be using number of nodes we actually plan the flow on, so I think this should eventually go away rather than move. In the interest of keeping this backport-friendly I think just leave it as is until then?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 yep, that sounds good

We've seen a report of a node that crashed due to a divide-by-zero
hit during metrics collection, specifically when computing the
throughput-per-node by dividing the backup size by node count.

Since this is only now used for that metric, make a failure to count
nodes a warning only for release builds (and fallback to 1), and make
any error while counting, or not counting to more than 0, a returned
error in non-release builds.

Release note (bug fix): avoid crashing when BACKUP is unable to count the total nodes in the cluster.
@dt
Copy link
Member Author

dt commented Oct 27, 2020

bors r+

@craig
Copy link
Contributor

craig bot commented Oct 27, 2020

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Oct 27, 2020

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants