[TOPI] Rewrite GPU argwhere using exclusive scan #7314
Could we add a column for the performance of the PR without thrust (i.e., using the TIR exclusive scan)?
I'd like to include benchmarks without thrust in the PR for posterity, but otherwise this looks great, thanks! I'd wait to merge until @zhiics can review, since he wrote the existing kernel.
OK, I updated the numbers to include the TIR scan result.
👍 Not as fast as thrust, as expected, but it's good to see it's still a performance improvement.
Thanks for the improvement.
85a91e9 to 63469a6
Thanks @mbrookhart @zhiics
* use ex scan to write argwhere
* add doc
This PR improves the implementation of GPU `argwhere` added in #6868, using exclusive scan (see #7303).

The current implementation of `argwhere` is very inefficient because it uses an atomic operation to update the write location. Since all threads compete for that single location, the kernel is effectively sequential. Moreover, since the output indices need to be lexicographically sorted, the current implementation also involves sorting along each axis.

Since `argwhere` is literally an instance of stream compaction, it is a perfect application of exclusive scan. Now, `argwhere` simply consists of two steps (an exclusive scan to compute the write locations, followed by a compaction pass that writes the output indices), both of which are highly parallel operations. Thus, both the atomic and the sort are gone, vastly simplifying the implementation. Moreover, it brings a huge speedup, as the benchmark numbers below show.
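To make the stream-compaction idea concrete, here is a minimal sketch in plain Python. It is a sequential illustration only, not the TIR/thrust kernel from this PR; the helper names `exclusive_scan`, `unravel`, and `argwhere_scan` are hypothetical and chosen just for this example. On a GPU, the scan would be a parallel primitive and each iteration of the compaction loop would be handled by an independent thread.

```python
from itertools import accumulate
from math import prod


def exclusive_scan(values):
    """Exclusive prefix sum: out[i] = sum(values[:i])."""
    if not values:
        return []
    return [0] + list(accumulate(values))[:-1]


def unravel(flat_index, shape):
    """Convert a row-major flat index into a multi-dimensional index."""
    coords = []
    for dim in reversed(shape):
        coords.append(flat_index % dim)
        flat_index //= dim
    return tuple(reversed(coords))


def argwhere_scan(flat_data, shape):
    """Illustrative argwhere via exclusive scan (hypothetical helper)."""
    assert len(flat_data) == prod(shape)

    # Step 1: flag non-zero elements, then scan the flags.
    # write_pos[i] is the output row for element i, if it is non-zero.
    flags = [1 if x != 0 else 0 for x in flat_data]
    write_pos = exclusive_scan(flags)
    num_nonzero = (write_pos[-1] + flags[-1]) if flags else 0

    # Step 2: compaction. Every element writes to its own precomputed
    # location, so no atomics are needed and iterations are independent.
    out = [None] * num_nonzero
    for i, flag in enumerate(flags):
        if flag:
            out[write_pos[i]] = unravel(i, shape)
    return out


# A 2x3 input stored in row-major order.
result = argwhere_scan([0, 3, 0, 0, 5, 7], (2, 3))
print(result)
```

Because the scan preserves the row-major order of the input, the output indices come out lexicographically sorted without a separate sort.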
All numbers are in milliseconds.
Please review: @zhiics @Laurawly @mbrookhart @tkonolige @anijain2305 @trevor-m