[TOPI] Parallelize GPU NMS inner loop #7172
Merged
Conversation
masahi commented on Dec 28, 2020
Laurawly reviewed on Dec 29, 2020
Laurawly approved these changes on Dec 30, 2020:
LGTM
thanks @Laurawly
Kudos!
tkonolige pushed a commit to tkonolige/incubator-tvm that referenced this pull request on Jan 11, 2021:

* make NMS inner loop parallel
* use one block to avoid global sync issue
* temp disable write by only thread 0
* leave a TODO on write by only one thread
* add some comments, remove the check on negative class id
* minor improvement when topk is available
* fix write by a single thread
TusharKanekiDey pushed a commit to TusharKanekiDey/tvm that referenced this pull request on Jan 20, 2021 (same commit).
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request on Jan 21, 2021 (same commit).
electriclilies pushed a commit to electriclilies/tvm that referenced this pull request on Feb 18, 2021 (same commit).
This is a follow-up to #7136. I found a simple way to parallelize the inner loop of GPU NMS, which has been done sequentially since #6839 and is hence extremely slow when the number of input boxes is large. This change brings massive speedup on object detection models from PyTorch and Gluon, as shown below.
Before I explain what I did, here is how we currently do the sequential, O(N^2) triangle loop. It is executed by a single thread.
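Concretely, the current scheme amounts to the following sketch (illustrative Python, not the actual TOPI IR; the `[x1, y1, x2, y2]` box layout and the `iou` helper are assumptions, and score sorting / class handling are omitted):

```python
def iou(a, b):
    # Intersection-over-union of two [x1, y1, x2, y2] boxes.
    xx1, yy1 = max(a[0], b[0]), max(a[1], b[1])
    xx2, yy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, xx2 - xx1) * max(0.0, yy2 - yy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms_sequential(boxes, iou_threshold):
    # O(N^2) triangle loop; on GPU this whole nest runs on a single thread.
    # boxes is (N, 4), assumed already sorted by descending score.
    n = len(boxes)
    valid = [True] * n
    for i in range(n):                # outer loop over candidate boxes
        if not valid[i]:
            continue
        for j in range(i + 1, n):     # inner triangle loop
            if valid[j] and iou(boxes[i], boxes[j]) > iou_threshold:
                valid[j] = False      # suppress the overlapping box
    return [i for i in range(n) if valid[i]]
```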
My parallelization instead does the above triangle in the following way:
The idea is that at the start of the inner loop, box j is assumed to be valid, and the inner loop invalidates the succeeding boxes that have high overlap with the newly found valid box j. The inner loop can be trivially done in parallel, and the number of IOU tests drops to O(# selected boxes * N).
Now, the inner loop is done in parallel while the outer loop remains sequential. All threads need to execute the outer loop in lockstep: the result of checking whether box j is still valid must be consistent across all threads. Since we cannot do a global sync inside kernels, I use only one thread block for parallelization and call `__syncthreads()` after each inner loop.

I have one more PR coming to optimize the GPU NMS IR further.
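The scheme above can be emulated on the CPU by replacing the per-thread inner work with one vectorized pass per outer step (a sketch under the same assumed box layout as before; names are illustrative, not the actual TOPI code, and the comments mark where the real kernel would place its barrier):

```python
import numpy as np

def iou_vec(box, others):
    # IOU of one [x1, y1, x2, y2] box against an (M, 4) array of boxes.
    xx1 = np.maximum(box[0], others[:, 0])
    yy1 = np.maximum(box[1], others[:, 1])
    xx2 = np.minimum(box[2], others[:, 2])
    yy2 = np.minimum(box[3], others[:, 3])
    inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (others[:, 2] - others[:, 0]) * (others[:, 3] - others[:, 1])
    return inter / (area + areas - inter)

def nms_parallel_inner(boxes, iou_threshold):
    # Outer loop stays sequential; each inner pass is a single vectorized
    # operation, standing in for the per-thread work the CUDA kernel does
    # between __syncthreads() barriers.
    n = len(boxes)
    valid = np.ones(n, dtype=bool)
    for j in range(n):          # all "threads" walk j in lockstep
        if not valid[j]:        # every thread must see the same valid[j]
            continue
        rest = np.arange(j + 1, n)
        # inner loop: all succeeding boxes tested against box j in parallel
        valid[rest] &= iou_vec(boxes[j], boxes[rest]) <= iou_threshold
        # the real kernel calls __syncthreads() here before advancing j
    return np.nonzero(valid)[0]
```

Using one thread block makes `__syncthreads()` a sufficient barrier, at the cost of capping parallelism at the block size; a grid-wide variant would need a global sync primitive instead.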
Please review @Laurawly @kevinthesun @zhiics @vinx13