Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[TOPI] Rewrite GPU argwhere using exclusive scan #7314

Merged
merged 2 commits into from
Jan 21, 2021

Conversation

masahi
Copy link
Member

@masahi masahi commented Jan 20, 2021

This PR improves the implementation of GPU argwhere added in #6868, using exclusive scan (see #7303).

The current implementation of argwhere is very inefficient, because it uses atomic to update the write location. Since all threads compete for the single location, this effectively makes it a sequential kernel. Moreover, since the output indices need to be lexicographically sorted, the current implementation involves sorting along each axis.

Since argwhere is literally an instance of stream compaction, this is a perfect application of exclusive scan. Now, argwhere simply consists of

  • A single call to exclusive scan on a boolean flag array to compute the write indices.
  • Compaction using the write indices (just copying elements with nonzero condition).

both of which are highly parallel operation. Thus, both atomic and sort are gone, vastly simplifying the implementation. Moreover, it also brings huge speed up, as shown below.

All numbers in milli sec

Shape current main using thrust exclusive scan using TIR exclusive scan
(128, 65) 0.240 0.024 0.132
(200, 500) 0.571 0.029 0.184
(1000, 50) 0.270 0.027 0.154
(100000,) 0.205 0.027 0.182
(500000,) 0.947 0.090 0.418
(1000000,) 1.63 0.170 0.848
(1000, 1000) 3.17 0.185 0.863
(1000, 5000) 15.69 0.976 3.689
(100, 100, 1000) 48.70 2.087 7.423
(256, 128, 64, 32) 397.21 15.58 51.438

please review @zhiics @Laurawly @mbrookhart @tkonolige @anijain2305 @trevor-m

@mbrookhart
Copy link
Contributor

Could we add a column for the performance of the PR without thrust (i.e., TIR exclusive scan?)

Copy link
Contributor

@mbrookhart mbrookhart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to include benchmarks without thrust in the PR for posterity, but otherwise this looks great, thanks! I'd wait to merge until @zhiics can review, since he wrote the existing kernel.

@masahi
Copy link
Member Author

masahi commented Jan 20, 2021

Ok updated the numbers to include TIR scan result.

@mbrookhart
Copy link
Contributor

👍 Not as fast as thrust, as expected, but it's good to see it's still a performance improvement.

Copy link
Member

@zhiics zhiics left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the improvement.

@masahi masahi merged commit f829403 into apache:main Jan 21, 2021
@masahi
Copy link
Member Author

masahi commented Jan 21, 2021

Thanks @mbrookhart @zhiics

alexwong pushed a commit to alexwong/tvm that referenced this pull request Feb 11, 2021
electriclilies pushed a commit to electriclilies/tvm that referenced this pull request Feb 18, 2021
Lokiiiiii pushed a commit to Lokiiiiii/tvm that referenced this pull request Mar 2, 2021
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Mar 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants