
Faster GPU NMS operator #16542

Merged · 19 commits merged into apache:master on Nov 5, 2019

Conversation

@ptrendx (Member) commented Oct 19, 2019

Description

This PR significantly improves the performance of mx.sym.contrib.box_nms on GPU.

@zhreshold @Jerryzcn FYI

Test cases:

| input size | topk | old time | new time | speedup |
| --- | --- | --- | --- | --- |
| (1, 507, 5) | 507 | 0.78 ms | 0.14 ms | 5.57× |
| (1, 1875, 5) | 1875 | 2.34 ms | 0.22 ms | 10.63× |
| (1, 7500, 5) | 2400 | 4.07 ms | 0.41 ms | 9.93× |
| (1, 30000, 5) | 2400 | 4.17 ms | 0.41 ms | 10.17× |
| (1, 120000, 5) | 2400 | 5.81 ms | 0.47 ms | 12.36× |
| (1, 159882, 5) | 12001 | 10.48 ms | 1.17 ms | 8.96× |
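
For readers who want to reproduce a measurement in the spirit of the table above, a minimal micro-benchmark could look like the sketch below. The exact script and settings behind these numbers are not part of this PR page; the (1, 7500, 5) input, the assumed [score, x1, y1, x2, y2] column layout, and the overlap_thresh value are illustrative choices only.

```python
import time
import mxnet as mx

# Illustrative micro-benchmark, not the script used for the table above.
ctx = mx.gpu(0)
# Assumed layout of the 5-column input: [score, x1, y1, x2, y2] per box.
data = mx.nd.random.uniform(shape=(1, 7500, 5), ctx=ctx)

def bench(reps=100):
    mx.nd.waitall()                                      # drain pending GPU work
    start = time.time()
    for _ in range(reps):
        out = mx.nd.contrib.box_nms(data,
                                    overlap_thresh=0.5,  # assumed IoU threshold
                                    topk=2400,           # matches the (1, 7500, 5) row
                                    score_index=0,       # assumed score column
                                    coord_start=1)       # assumed first coordinate column
    mx.nd.waitall()                                      # wait for the async calls to finish
    return (time.time() - start) / reps * 1e3            # ms per call

print("box_nms: %.2f ms per call" % bench())
```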

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Code is well-documented:
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Comments

  • Currently the new function is called only when there is no requirement for a backward pass, but it is easily extendable to include that case (see the sketch below).
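
To make that condition concrete, the distinction is roughly whether a gradient can be requested for the call. The following is a hedged Python sketch only; the actual dispatch condition lives in the operator's C++ code, and the column-layout arguments are the same assumptions as in the benchmark sketch above.

```python
import mxnet as mx
from mxnet import autograd

ctx = mx.gpu(0)
data = mx.nd.random.uniform(shape=(1, 1875, 5), ctx=ctx)
data.attach_grad()

# Plain forward call: no backward pass will be requested, so the operator is
# free to take the faster inference-only path added in this PR.
out_infer = mx.nd.contrib.box_nms(data, overlap_thresh=0.5, topk=1875,
                                  score_index=0, coord_start=1)

# Recording with autograd signals that a backward pass may follow; per the
# comment above, that case still goes through the previous implementation.
with autograd.record():
    out_train = mx.nd.contrib.box_nms(data, overlap_thresh=0.5, topk=1875,
                                      score_index=0, coord_start=1)
```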

@ptrendx changed the title from "[WIP] Faster GPU NMS optimization" to "Faster GPU NMS operator" on Oct 24, 2019
@ptrendx (Member, Author) commented Oct 31, 2019

@zhreshold Do you have any other comments?

For the record - I think there are a few outstanding items to fully mark the GPU implementation of NMS as "done", namely:

  • handle the output required for backward in the new implementation
  • change the way sorting is done (currently sorting happens batch-size times, and it turns out that for large topk this is not beneficial; the previous implementation, which would always do 2 (larger) sorts, was better in that regard - see the sketch below)

That said, I will not be able to finish those 2 outstanding items before the 1.6 code freeze, and I believe the performance improvements are enough to merge this PR and address those 2 points in a future PR.
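
To illustrate the sorting trade-off in the second bullet, here is a schematic NumPy sketch, not the actual CUDA/Thrust code; the exact two-sort scheme attributed to the previous implementation is inferred from the comment above.

```python
import numpy as np

batch, n = 8, 120000
scores = np.random.rand(batch, n).astype(np.float32)

# New kernel (per the comment above): one independent sort per batch element,
# i.e. batch-size many smaller sorts.
per_batch_order = [np.argsort(-scores[b], kind="stable") for b in range(batch)]

# Previous implementation (schematically): a fixed number of larger sorts over
# the flattened batch, e.g. sort everything by score, then stable-sort by batch
# id so each batch's boxes end up contiguous while staying score-ordered.
flat_scores = scores.ravel()
batch_id = np.repeat(np.arange(batch), n)
order = np.argsort(-flat_scores, kind="stable")                # large sort #1: by score
order = order[np.argsort(batch_id[order], kind="stable")]      # large sort #2: regroup by batch
# For large topk, a couple of large sorts amortize per-sort overhead better
# than launching batch-size separate sorts, which is the regression noted above.
```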


template <typename DType>
__global__ void FilterAndPrepareAuxDataKernel(const DType* data, DType* out, DType* scores,
index_t num_elements_per_batch,

Contributor commented on this diff:

nit: indentation alignment?


template <bool check_topk, bool check_score, typename DType>
__global__ void CompactDataKernel(const index_t* indices, const DType* source,
DType* destination, const index_t topk,

Contributor commented on this diff:

nit: indentation alignment?

@zhreshold (Member) left a comment:

Sorry for the delay, I was traveling these days. The changes LGTM!

@zhreshold zhreshold merged commit 0c5677e into apache:master Nov 5, 2019
yajiedesign pushed a commit to yajiedesign/mxnet that referenced this pull request on Nov 6, 2019, with the following commit messages:
* Adding second NMS op

* NMS kernel

* Removing second sort

* Optimization

* Adding out-of-place ability to SortByKey

* Optimization pt2

* Optimizations pt3

* Do not recompute other boxes area every time

* Sort only topk results during second sorting

* Cleaning

* Fixes from rebase

* Fix lint and more fixes from rebase

* Fix typo

* Early exit in Triangle kernel

* Fixes

* Fix sort

* Fix from rebase

* Fix for the mixed naming convention

* Fix the index_t with int comparison