count_nonzero #23907
Labels
enhancement
Not as big of a feature, but technically not a bug. Should be easy to fix
module: numpy
Related to numpy support, and also numpy compatibility of our operators
module: performance
Issues related to performance, either of kernel code or framework glue
triaged
This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
🚀 Feature
An efficient implementation for counting nonzero elements
Pitch
In some situations (e.g. MaskRCNN) you don't need the exact positions of the nonzero elements, only how many there are, and the method is called quite frequently. So far, any workaround is faster than retrieving the indices of the elements and taking the length of the result.
Some may want a differentiable count of these values, which effectively rules out the current nonzero method.
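As a rough illustration of the gap described above, here is a minimal sketch that contrasts the index-based count with one possible workaround. The helper names are hypothetical, and the mask-and-sum approach is only an assumed example of a workaround, not necessarily one of the alternatives benchmarked in this issue.

```python
import torch


def count_nonzero_via_indices(x: torch.Tensor) -> int:
    # Baseline: materialize the full index tensor, then take its length.
    return x.nonzero().size(0)


def count_nonzero_via_sum(x: torch.Tensor) -> int:
    # Assumed workaround: compare against zero and sum the resulting mask,
    # so no index tensor is ever built.
    return int((x != 0).sum().item())


x = torch.tensor([[0.0, 1.5, 0.0], [2.0, 0.0, -3.0]])
# Both approaches should agree on the number of nonzero entries.
assert count_nonzero_via_indices(x) == count_nonzero_via_sum(x)
```

Both helpers return a Python int, which forces a device-to-host sync on CUDA tensors; keeping the result as a tensor would avoid that if the count is consumed on the GPU.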
Related links:
It was previously mentioned on Discuss [1, 2] and in #14848 and #15190.
Alternatives
On a 1080Ti, these take 10.9 ms, 3.54 ms, and 2.68 ms respectively (torch.cuda.synchronize was called before and after each operation).
Additional context
A few other non-trivial things that popped up when I dug into finding the fastest way:
torch.clamp_max() and torch.clamp_min() are 5x slower than torch.clamp() (time on x: 261 ms ± 78 ms, 202 ms ± 17.8 ms, 47.1 ms ± 437 µs)
uint8 or int64 dtype
Thanks @gchanan for asking me to report this issue; I hope it helps others.
cc @VitalyFedyunin @ngimel @mruberry