-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
backupccl: avoid div-by-zero crash on failed node count #55560
Conversation
@@ -1169,7 +1170,11 @@ func (r *restoreResumer) Resume( | |||
|
|||
numClusterNodes, err := clusterNodeCount(p.ExecCfg().Gossip) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also move this call to closer to the telemetry since that's the only thing that it's used for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I sorta considered that but then we'd need to pass Gossip into restore
too. I think the real answer is that we should be using number of nodes we actually plan the flow on, so I think this should eventually go away rather than move. In the interest of keeping this backport-friendly I think just leave it as is until then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 yep, that sounds good
We've seen a report of a node that crashed due to a divide-by-zero hit during metrics collection, specifically when computing the throughput-per-node by dividing the backup size by node count. Since this is only now used for that metric, make a failure to count nodes a warning only for release builds (and fallback to 1), and make any error while counting, or not counting to more than 0, a returned error in non-release builds. Release note (bug fix): avoid crashing when BACKUP is unable to count the total nodes in the cluster.
bors r+ |
Build failed (retrying...): |
Build succeeded: |
We've seen a report of a node that crashed due to a divide-by-zero
hit during metrics collection, specifically when computing the
throughput-per-node by dividing the backup size by node count.
Since this is only now used for that metric, make a failure to count
nodes a warning only for release builds (and fallback to 1), and make
any error while counting, or not counting to more than 0, a returned
error in non-release builds.
Release note (bug fix): avoid crashing when BACKUP is unable to count the total nodes in the cluster.