[TOPI] GPU scatter 1D via sorting based approach #7056
Conversation
Very nice! I'd like to test this against the topi argsort as well, but I don't think that's stable at the moment; I'm poking around at it today to see if I can fix that kernel.
As a larger conversation, I don't love how much we're depending on thrust for functionality. I'd kind of like to fix the issues in topi around sort so we don't have to lean on thrust so much. We're depending on the nominally CUDA topi kernels for a lot of other GPUs, so this trend makes it harder to support more diverse hardware.
Yeah, I think we need to fix sort; otherwise thrust would be the better way for us to go. There are multiple dependencies on it already.
I'm working on it :) I'll let you know what I come up with
Looks great! Just to clarify, this can handle repeated indices, right?
Great job! Agreed that thrust sort works much better on larger inputs than the sort IR performance-wise. Would be glad to see the results with the current sort IR, and I look forward to seeing any improvements to it.
@tkonolige Yes, definitely (that's the whole point!). Repeated indices are grouped together by sorting, and only the last one scatters its update to the output. The result should always be identical to the current scatter 1D and the numpy reference used in the tests.
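A minimal NumPy sketch of the semantics being discussed (my own illustration, not the actual test reference): with repeated indices, the update from the last occurrence is the one that lands in the output, which is what the sorting-based kernel has to reproduce.

```python
import numpy as np

def scatter_1d_ref(data, indices, updates):
    """Sequential reference: when an index repeats, the later update wins."""
    out = data.copy()
    for i, idx in enumerate(indices):
        out[idx] = updates[i]
    return out

data = np.zeros(5, dtype="float32")
indices = np.array([1, 3, 1], dtype="int64")           # index 1 appears twice
updates = np.array([10.0, 20.0, 30.0], dtype="float32")
print(scatter_1d_ref(data, indices, updates))           # [ 0. 30.  0. 20.  0.]
```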
@mbrookhart @zhiics While I fully agree with this in general, for fundamental, low level GPU primitives such as sorting, scan, etc., I think it would be really hard for generic implementations to match or outperform platform specific libraries. These libraries have years of development behind them and use platform specific intrinsics to maximize performance. This also applies to cuDNN, but unlike convolution ops, I don't think AutoTVM or Ansor would help generate efficient sort or scan ops. Sooner or later, I think we will introduce … So my opinion is: while a native TVM solution is always what we should strive for, if there is a platform specific library, we should embrace it. Sort, scan, etc. are standard enough that there is a good chance a platform specific library is available. For example, ROCm has its own implementation of thrust, and on OpenCL there is Boost.Compute.
I think I agree with you, @masahi; my only disagreement is the level at which we are currently implementing things. In general we should always have a generic implementation with reasonable performance, even if it's not great performance. Say, for instance, we have a stable implementation of sort. Then, when we find a faster kernel for sort via thrust, we specialize the topi sort implementation to return thrust instead of tir for that use case. At this point, we only need one implementation of scatter, because it just calls into topi sort, and topi does the specialization, instead of having to specialize scatter, topk, nms, etc. all for vendor-specific versions of sort.
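A hedged sketch of the layering described above, using NumPy stand-ins rather than real TVM/topi APIs (`can_use_thrust` and both sort bodies are placeholders I made up for illustration): the vendor dispatch happens once inside the sort primitive, so ops built on top of it stay generic.

```python
import numpy as np

def can_use_thrust(target):
    # Placeholder capability check; in TVM this would depend on the target
    # and on whether thrust support was enabled at build time.
    return target == "cuda"

def thrust_stable_sort_by_key(keys, values):
    # Stand-in for the vendor kernel (thrust::stable_sort_by_key).
    order = np.argsort(keys, kind="stable")
    return keys[order], values[order]

def generic_stable_sort_by_key(keys, values):
    # Stand-in for a generic TIR sort kernel with reasonable performance.
    order = np.argsort(keys, kind="stable")
    return keys[order], values[order]

def stable_sort_by_key(keys, values, target):
    # The specialization lives here, inside the sort primitive; scatter,
    # topk, nms, ... just call this and never mention thrust themselves.
    if can_use_thrust(target):
        return thrust_stable_sort_by_key(keys, values)
    return generic_stable_sort_by_key(keys, values)
```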
I see, you are right, the current way of directly calling thrust sort from a higher level op like scatter is not ideal. Dispatching decisions should be left to … One tricky bit is, what I need here is not exactly … Being able to introduce and use a customized sorting op by directly dropping into low level libs is certainly convenient, as demonstrated in this PR, although I have to admit it is a bit ad hoc.
Thanks @mbrookhart @tkonolige @zhiics @Laurawly |
@masahi my build is broken due to this PR:
Is there any new version requirement for nvcc?
I found this issue saying that we should use …
Oh, sorry about that. I will send a fix ASAP.
No worries. The above solution worked for me. Simply changing to …
* add thrust stable sort
* rename
* scatter via sort working
* correctly handles negative indices
* clean up, add some comments
* add doc string
* remove scatter benchmark stuff
* add more doc
* fix typo
* lint fix
* silence lint
* fix py format
* check for thrust availability before test

Co-authored-by: masa <masa@pop-os.localdomain>
This is somewhat a follow up to #7044. As I explained there, the current implementation of CUDA scatter 1D uses only one thread. Inspired by @tkonolige's comment at #7044 (comment), I came up with the following sorting based approach, powered by thrust's `stable_sort_by_key` function. I think it enables maximum parallelism while guaranteeing determinism. This PR also adds a `stable_sort` entry to the thrust wrapper, since the existing `sort_by_key` function does not guarantee stability.
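A NumPy sketch of the idea, written as my own illustration of the approach rather than the actual CUDA kernel (the real implementation does the sort with thrust's `stable_sort_by_key` and the final writes in a parallel GPU kernel):

```python
import numpy as np

def scatter_1d_via_sort(data, indices, updates):
    out = data.copy()
    # Normalize negative indices to their positive equivalents.
    idx = np.where(indices < 0, indices + data.shape[0], indices)

    # Stable sort (index, update) pairs by index; stability keeps repeated
    # indices in their original order, so the last occurrence stays last.
    order = np.argsort(idx, kind="stable")
    sorted_idx, sorted_upd = idx[order], updates[order]

    # Each sorted position writes its update only if it is the last
    # occurrence of that index, so all writes are independent and could
    # run in parallel without races while staying deterministic.
    n = sorted_idx.shape[0]
    for i in range(n):
        if i == n - 1 or sorted_idx[i] != sorted_idx[i + 1]:
            out[sorted_idx[i]] = sorted_upd[i]
    return out

data = np.zeros(5, dtype="float32")
indices = np.array([1, -2, 1], dtype="int64")   # -2 normalizes to 3
updates = np.array([10.0, 20.0, 30.0], dtype="float32")
print(scatter_1d_via_sort(data, indices, updates))  # [ 0. 30.  0. 20.  0.]
```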
Here is the timing comparison. As expected, for big inputs the new approach performs much better. All numbers are in milliseconds, measured via the time evaluator on a GTX 1070 Ti.
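For context, a hedged sketch of how such timings are typically collected with TVM's time evaluator (the workload shapes here are illustrative and assume a reasonably recent TVM; this is not necessarily the exact setup used for the numbers above):

```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# Illustrative 1D scatter workload; shapes are made up for this sketch.
n, m = 1_000_000, 100_000
data = relay.var("data", shape=(n,), dtype="float32")
indices = relay.var("indices", shape=(m,), dtype="int64")
updates = relay.var("updates", shape=(m,), dtype="float32")
func = relay.Function([data, indices, updates],
                      relay.scatter(data, indices, updates, axis=0))

dev = tvm.cuda(0)
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(tvm.IRModule.from_expr(func), target="cuda")

mod = graph_executor.GraphModule(lib["default"](dev))
mod.set_input("data", np.random.rand(n).astype("float32"))
mod.set_input("indices", np.random.randint(0, n, m).astype("int64"))
mod.set_input("updates", np.random.rand(m).astype("float32"))

# time_evaluator runs the compiled graph repeatedly and reports statistics.
timer = mod.module.time_evaluator("run", dev, number=10, repeat=3)
print("mean time (ms):", timer().mean * 1e3)
```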
Please review @mbrookhart @tkonolige @zhiics @Laurawly