[TOPI] GPU scatter 1D via sorting based approach #7056

Merged (13 commits) on Dec 9, 2020


masahi
Member

@masahi masahi commented Dec 8, 2020

This is something of a follow-up to #7044. As I explained there, the current implementation of CUDA scatter 1D uses only one thread. Inspired by @tkonolige's comment at #7044 (comment), I came up with the following sorting-based approach, powered by thrust's stable_sort_by_key function. I think it enables maximum parallelism while guaranteeing determinism.

  • Sort indices and updates, using the indices as keys. Both arrays can be sorted in one pass with thrust's sort_by_key function.
  • Compare the index at thread i with the index at thread i + 1 in the sorted indices array. If thread i + 1 has a different index, thread i scatters its update to the output; the last thread has no successor and always scatters.
  • To guarantee deterministic output, the sort must be stable (stable_sort_by_key).
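The three steps above can be sketched in plain Python. This is only a sequential emulation for illustration, not the code in this PR: the function names are made up, the loop stands in for the GPU threads, Python's stable `sorted` stands in for thrust's stable sort-by-key, and wrapping negative indices NumPy-style is an assumption.

```python
def scatter_1d_via_sort(data, indices, updates):
    """Sequentially emulate the sort-based 1D scatter (illustrative only)."""
    out = list(data)
    n = len(indices)
    # Assumption: negative indices wrap around, as in NumPy-style indexing.
    keys = [i + len(data) if i < 0 else i for i in indices]
    # sorted() is stable, so equal keys keep their original relative order,
    # which is the property a stable sort-by-key provides.
    order = sorted(range(n), key=lambda j: keys[j])
    sorted_keys = [keys[j] for j in order]
    sorted_updates = [updates[j] for j in order]
    # Each iteration plays the role of one GPU thread.
    for i in range(n):
        is_last_of_run = (i == n - 1) or (sorted_keys[i] != sorted_keys[i + 1])
        if is_last_of_run:
            out[sorted_keys[i]] = sorted_updates[i]
    return out


def scatter_1d_sequential(data, indices, updates):
    """Reference: apply updates one by one, so the last write wins."""
    out = list(data)
    for idx, upd in zip(indices, updates):
        out[idx] = upd
    return out


data = [0.0] * 6
indices = [3, 1, 3, -1, 1]  # indices 3 and 1 are repeated
updates = [10.0, 20.0, 30.0, 40.0, 50.0]
result = scatter_1d_via_sort(data, indices, updates)
reference = scatter_1d_sequential(data, indices, updates)
```

Because the sort is stable, the last element of each run of equal indices is the update that appeared last in the original input, so `result` agrees with the one-by-one reference.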

Here is the timing comparison. As expected, the new approach performs much better for big inputs.

All numbers are in milliseconds, measured via time evaluator on a GTX 1070 Ti.

| size | current (sequential), ms | new (powered by thrust), ms |
| --- | --- | --- |
| 5000 | 0.169332 | 0.168660 |
| 10000 | 0.301014 | 0.172381 |
| 25000 | 0.730493 | 0.169007 |
| 50000 | 1.457996 | 0.315582 |
| 100000 | 2.943622 | 0.395291 |
| 500000 | 24.262521 | 1.254097 |
| 1000000 | 48.536304 | 2.244327 |
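To read the table as speedups, the ratios can be computed directly from the numbers above. The snippet below is just arithmetic on the reported timings; the variable names are illustrative.

```python
# (size, sequential ms, thrust-based ms), copied from the timing table.
timings_ms = [
    (5000, 0.169332, 0.168660),
    (10000, 0.301014, 0.172381),
    (25000, 0.730493, 0.169007),
    (50000, 1.457996, 0.315582),
    (100000, 2.943622, 0.395291),
    (500000, 24.262521, 1.254097),
    (1000000, 48.536304, 2.244327),
]

# Speedup = sequential time / new time for each input size.
speedups = [(size, seq / new) for size, seq, new in timings_ms]

for size, ratio in speedups:
    print(f"{size:>8d}  {ratio:6.2f}x")
```

The two implementations are roughly on par at 5000 elements, and the ratio grows to a bit over 21x at one million elements.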

Please review @mbrookhart @tkonolige @zhiics @Laurawly

Contributor

@mbrookhart mbrookhart left a comment

Very nice! I'd like to test this against the topi argsort as well, but I don't think that's stable at the moment. I'm poking around at it today to see if I can fix that kernel.

As a larger conversation, I don't love how much we're depending on thrust for functionality. I'd like to fix the issues in topi around sort so we don't have to lean on thrust so much. We're depending on the nominally CUDA topi kernels for a lot of other GPUs, so this trend makes it harder to support more diverse hardware.

@zhiics
Member

zhiics commented Dec 8, 2020

Yeah, I think we need to fix sort; otherwise thrust would be the better way for us to go. There are multiple dependencies on it already.

@mbrookhart
Contributor

I'm working on it :) I'll let you know what I come up with

@tkonolige
Contributor

Looks great! Just to clarify, this can handle repeated indices right?

Contributor

@Laurawly Laurawly left a comment

Great job! Agree that thrust sort works much better on larger inputs than sort ir, performance-wise. I would be glad to see the results with the current sort ir, and I look forward to seeing any improvements to it.

@masahi
Member Author

masahi commented Dec 8, 2020

> Looks great! Just to clarify, this can handle repeated indices right?

@tkonolige Yes, definitely (that's the whole point!). Repeated indices are grouped together by sorting, and only the last one in each group scatters its update to the output. The result should always be identical to that of the current scatter 1D and of the numpy reference used in the tests.

@masahi
Member Author

masahi commented Dec 8, 2020

> As a larger conversation, I don't love how much we're depending on thrust for functionality, I'd kind of like to fix the issues in topi around sort so we don't have to lean on thrust so much. We're depending on the nominally cuda topi kernels for a lot of other GPUs, so this trend makes it harder to support more diverse hardware.

@mbrookhart @zhiics While I fully agree with this in general, for fundamental, low-level GPU primitives such as sort and scan, I think it would be really hard for generic implementations to match or outperform platform-specific libraries. These libraries have years of development behind them and use platform-specific intrinsics to maximize performance. The same applies to cuDNN, but unlike the convolution op, I don't think AutoTVM or Ansor would help generate efficient sort or scan ops.

Sooner or later, I think we will introduce a cumsum op to TVM. On CUDA, cumsum can be implemented very efficiently via thrust's or cub's inclusive_scan. Without it, we would have to roll our own GPU scan implementation that can compete with the vendor-provided one, which I think would be a formidable or near-impossible task.
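For readers unfamiliar with scan: cumsum is an inclusive prefix sum. The sketch below shows the semantics with `itertools.accumulate`, plus a textbook step-doubling (Hillis-Steele) scan to hint at how the work parallelizes. It is not how thrust or cub implement their scans, and the function names are illustrative.

```python
from itertools import accumulate


def inclusive_scan_reference(xs):
    """Sequential inclusive prefix sum, i.e. what cumsum computes."""
    return list(accumulate(xs))


def inclusive_scan_hillis_steele(xs):
    """Step-doubling scan: about log2(n) rounds, each parallel over elements."""
    cur = list(xs)
    offset = 1
    while offset < len(cur):
        # On a GPU, every element of the next array would be computed by its
        # own thread; the list comprehension emulates one such round.
        cur = [
            cur[i] + cur[i - offset] if i >= offset else cur[i]
            for i in range(len(cur))
        ]
        offset *= 2
    return cur


values = [3, 1, 4, 1, 5, 9, 2, 6]
scanned = inclusive_scan_hillis_steele(values)
```

The step-doubling version does more total additions than the sequential one, which is one reason tuned library scans use more elaborate schemes.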

So my opinion is: while a native TVM solution is always what we should strive for, if there is a platform-specific library, we should embrace it. Sort, scan, etc. are standard enough that there is a good chance a platform-specific library is available. For example, ROCm has its own implementation of thrust, and on OpenCL there is Boost.Compute.

@mbrookhart
Contributor

I think I agree with you, @masahi. My only disagreement is the level at which we are currently implementing things. In general we should always have a generic implementation with reasonable performance, even if it's not great performance. Say, for instance, we have a stable implementation of sort. Then, when we find a faster kernel for sort via thrust, we specialize the topi sort implementation to return thrust instead of tir for that use case.

At that point, we only need one implementation of scatter, because it just calls into topi sort and topi does the specialization, instead of having to specialize scatter, topk, nms, etc. for vendor-specific versions of sort.
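The layering described above can be sketched as follows. Everything here is hypothetical: the registry, the target strings, and the function names are made up for illustration and are not TVM or TOPI APIs, and the "vendor" implementation is a plain-Python stand-in rather than a call into thrust. Indices are assumed to be non-negative.

```python
# Hypothetical registry mapping a target name to a stable sort-by-key.
_SORT_BY_KEY_IMPLS = {}


def register_stable_sort_by_key(target):
    def decorator(fn):
        _SORT_BY_KEY_IMPLS[target] = fn
        return fn
    return decorator


@register_stable_sort_by_key("generic")
def _generic_stable_sort_by_key(keys, values):
    # Generic fallback: reasonable, not necessarily fast.
    order = sorted(range(len(keys)), key=keys.__getitem__)
    return [keys[i] for i in order], [values[i] for i in order]


@register_stable_sort_by_key("cuda-thrust")
def _vendor_stable_sort_by_key(keys, values):
    # Stand-in for a call into a vendor library on this target.
    pairs = sorted(zip(keys, values), key=lambda kv: kv[0])
    return [k for k, _ in pairs], [v for _, v in pairs]


def stable_sort_by_key(keys, values, target="generic"):
    """The single place where the dispatch decision is made."""
    impl = _SORT_BY_KEY_IMPLS.get(target, _SORT_BY_KEY_IMPLS["generic"])
    return impl(keys, values)


def scatter_1d(data, indices, updates, target="generic"):
    """One scatter for every target; only the sort underneath is specialized."""
    out = list(data)
    keys, vals = stable_sort_by_key(indices, updates, target)
    for i in range(len(keys)):
        if i == len(keys) - 1 or keys[i] != keys[i + 1]:
            out[keys[i]] = vals[i]
    return out


generic_out = scatter_1d([0, 0, 0, 0], [2, 0, 2], [1, 2, 3])
vendor_out = scatter_1d([0, 0, 0, 0], [2, 0, 2], [1, 2, 3], target="cuda-thrust")
```

With this shape, ops such as topk or nms would also call `stable_sort_by_key` and pick up a faster backend without any changes of their own.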

@masahi
Member Author

masahi commented Dec 8, 2020

I see, you are right: the current way of directly calling thrust sort from a higher-level op like scatter is not ideal. Dispatching decisions should be left to topi.sort, yes.

One tricky bit is that what I need here is not exactly topi.sort, but a more specific op like topi.stable_sort_by_key. Similarly, I imagine each higher-level op could end up using sorting in a customized way, so a single topi.sort may not suffice.

Being able to introduce and use a customized sorting op by dropping directly into low-level libraries is certainly convenient, as demonstrated in this PR, although I have to admit it is a bit ad hoc.

@masahi masahi merged commit 465cd14 into apache:main Dec 9, 2020
@masahi
Member Author

masahi commented Dec 9, 2020

Thanks @mbrookhart @tkonolige @zhiics @Laurawly

@comaniac
Contributor

comaniac commented Dec 9, 2020

@masahi my build is broken due to this PR:

nvcc fatal   : Unknown option '-extended-lambda'
CMakeFiles/tvm_objs.dir/build.make:4859: recipe for target 'CMakeFiles/tvm_objs.dir/src/runtime/contrib/thrust/thrust.cu.o' failed
make[2]: *** [CMakeFiles/tvm_objs.dir/src/runtime/contrib/thrust/thrust.cu.o] Error 1
make[2]: *** Waiting for unfinished jobs....
CMakeFiles/Makefile2:72: recipe for target 'CMakeFiles/tvm_objs.dir/all' failed
make[1]: *** [CMakeFiles/tvm_objs.dir/all] Error 2
Makefile:129: recipe for target 'all' failed

Is there any new version requirement for nvcc?

@comaniac
Contributor

comaniac commented Dec 9, 2020

I found this issue saying that we should use --expt-extended-lambda to be compatible with CUDA < 10.1:
NVIDIA/MinkowskiEngine#207

@masahi
Member Author

masahi commented Dec 9, 2020

Oh, sorry about that. I will send a fix ASAP.

@comaniac
Contributor

comaniac commented Dec 9, 2020

No worries. The above solution worked for me: I simply changed the flag setting in CUDA.cmake to set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --expt-extended-lambda").

TusharKanekiDey pushed a commit to TusharKanekiDey/tvm that referenced this pull request Jan 20, 2021
* add thrust stable sort

* rename

* scatter via sort working

* correctly handles negative indices

* clean up, add some comments

* add doc string

* remove scatter benchmark stuff

* add more doc

* fix typo

* lint fix

* silence lint

* fix py format

* check for thrust availablity before test

Co-authored-by: masa <masa@pop-os.localdomain>
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Jan 21, 2021
electriclilies pushed a commit to electriclilies/tvm that referenced this pull request Feb 18, 2021