[TOPI] GPU scatter 1D via sorting based approach #7056
Conversation
Very nice! I'd like to test this against the topi argsort as well, but I don't think that's stable at the moment; I'm poking around at it today to see if I can fix that kernel.
As a larger conversation, I don't love how much we're depending on thrust for functionality. I'd kind of like to fix the issues in topi around sort so we don't have to lean on thrust so much. We're depending on the nominally CUDA topi kernels for a lot of other GPUs, so this trend makes it harder to support more diverse hardware.
Yeah, I think we need to fix sort; otherwise thrust would be the better way for us to go. There are multiple dependencies on it already.
I'm working on it :) I'll let you know what I come up with
Looks great! Just to clarify, this can handle repeated indices, right?
Great job! Agreed that thrust sort works much better on larger inputs than the sort IR performance-wise. Would be glad to see the results with the current sort IR, and I look forward to seeing any improvements to it.
@tkonolige Yes, definitely (that's the whole point!). Repeated indices are grouped together by sorting, and only the last one scatters its update to the output. The result should always be identical to the current scatter 1D and the numpy reference used in the tests.
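A minimal NumPy sketch of the semantics being discussed (my own illustration, not the actual test reference): with repeated indices, the update from the last occurrence is the one that lands in the output, which is what the sorting-based kernel has to reproduce.

```python
import numpy as np

def scatter_1d_ref(data, indices, updates):
    """Sequential reference: when an index repeats, the later update wins."""
    out = data.copy()
    for i, idx in enumerate(indices):
        out[idx] = updates[i]
    return out

data = np.zeros(5, dtype="float32")
indices = np.array([1, 3, 1], dtype="int64")           # index 1 appears twice
updates = np.array([10.0, 20.0, 30.0], dtype="float32")
print(scatter_1d_ref(data, indices, updates))           # [ 0. 30.  0. 20.  0.]
```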
@mbrookhart @zhiics While I fully agree with this in general, for fundamental, low level GPU primitives such as sorting, scan, etc., I think it would be really hard for generic implementations to match or outperform platform specific libraries. These libraries have years of development behind them and use platform specific intrinsics to maximize performance. This also applies to cuDNN, but unlike convolution ops, I don't think AutoTVM or Ansor would help generate efficient sort or scan ops. Sooner or later, I think we will introduce … So my opinion is: while a native TVM solution is always what we should strive for, if there is a platform specific library, we should embrace it. Sort, scan, etc. are standard enough that there is a good chance a platform specific library is available. For example, ROCm has its own implementation of thrust, and on OpenCL there is Boost.Compute.
I think I agree with you, @masahi; my only disagreement is the level at which we are currently implementing things. In general we should always have a generic implementation with reasonable performance, even if it's not great performance. Say, for instance, we have a stable implementation of sort. Then, when we find a faster kernel for sort via thrust, we specialize the topi sort implementation to return thrust instead of tir for that use case. At this point, we only need one implementation of scatter, because it just calls into topi sort, and topi does the specialization, instead of having to specialize scatter, topk, nms, etc. all for vendor-specific versions of sort.
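A hedged sketch of the layering described above, using NumPy stand-ins rather than real TVM/topi APIs (`can_use_thrust` and both sort bodies are placeholders I made up for illustration): the vendor dispatch happens once inside the sort primitive, so ops built on top of it stay generic.

```python
import numpy as np

def can_use_thrust(target):
    # Placeholder capability check; in TVM this would depend on the target
    # and on whether thrust support was enabled at build time.
    return target == "cuda"

def thrust_stable_sort_by_key(keys, values):
    # Stand-in for the vendor kernel (thrust::stable_sort_by_key).
    order = np.argsort(keys, kind="stable")
    return keys[order], values[order]

def generic_stable_sort_by_key(keys, values):
    # Stand-in for a generic TIR sort kernel with reasonable performance.
    order = np.argsort(keys, kind="stable")
    return keys[order], values[order]

def stable_sort_by_key(keys, values, target):
    # The specialization lives here, inside the sort primitive; scatter,
    # topk, nms, ... just call this and never mention thrust themselves.
    if can_use_thrust(target):
        return thrust_stable_sort_by_key(keys, values)
    return generic_stable_sort_by_key(keys, values)
```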
I see, you are right, the current way of directly calling thrust sort from a higher level op like scatter is not ideal. Dispatching decisions should be left to … One tricky bit is, what I need here is not exactly … Being able to introduce and use a customized sorting op by directly dropping into low level libs is certainly convenient, as demonstrated in this PR, although I have to admit it is a bit ad hoc.
Thanks @mbrookhart @tkonolige @zhiics @Laurawly |
@masahi my build is broken due to this PR:
Is there any new version requirement for nvcc?
I found this issue saying that we should use …
Oh, sorry about that. I will send a fix ASAP.
No worries. The above solution worked for me. Simply changing to …
* add thrust stable sort
* rename
* scatter via sort working
* correctly handles negative indices
* clean up, add some comments
* add doc string
* remove scatter benchmark stuff
* add more doc
* fix typo
* lint fix
* silence lint
* fix py format
* check for thrust availability before test

Co-authored-by: masa <masa@pop-os.localdomain>
This is somewhat a follow up to #7044. As I explained there, the current implementation of CUDA scatter 1D uses only one thread. Inspired by @tkonolige's comment at #7044 (comment), I came up with the following sorting based approach, powered by thrust's `stable_sort_by_key` function. I think it enables maximum parallelism while guaranteeing determinism. This PR also adds a `stable_sort` entry to the thrust wrapper, since the existing `sort_by_key` function does not guarantee stability.
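A NumPy sketch of the idea, written as my own illustration of the approach rather than the actual CUDA kernel (the real implementation does the sort with thrust's `stable_sort_by_key` and the final writes in a parallel GPU kernel):

```python
import numpy as np

def scatter_1d_via_sort(data, indices, updates):
    out = data.copy()
    # Normalize negative indices to their positive equivalents.
    idx = np.where(indices < 0, indices + data.shape[0], indices)

    # Stable sort (index, update) pairs by index; stability keeps repeated
    # indices in their original order, so the last occurrence stays last.
    order = np.argsort(idx, kind="stable")
    sorted_idx, sorted_upd = idx[order], updates[order]

    # Each sorted position writes its update only if it is the last
    # occurrence of that index, so all writes are independent and could
    # run in parallel without races while staying deterministic.
    n = sorted_idx.shape[0]
    for i in range(n):
        if i == n - 1 or sorted_idx[i] != sorted_idx[i + 1]:
            out[sorted_idx[i]] = sorted_upd[i]
    return out

data = np.zeros(5, dtype="float32")
indices = np.array([1, -2, 1], dtype="int64")   # -2 normalizes to 3
updates = np.array([10.0, 20.0, 30.0], dtype="float32")
print(scatter_1d_via_sort(data, indices, updates))  # [ 0. 30.  0. 20.  0.]
```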
Here is the timing comparison. As expected, for big inputs the new approach performs much better. All numbers are in milliseconds, measured via the time evaluator on a GTX 1070 Ti.
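For context, a hedged sketch of how such timings are typically collected with TVM's time evaluator (the workload shapes here are illustrative and assume a reasonably recent TVM; this is not necessarily the exact setup used for the numbers above):

```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# Illustrative 1D scatter workload; shapes are made up for this sketch.
n, m = 1_000_000, 100_000
data = relay.var("data", shape=(n,), dtype="float32")
indices = relay.var("indices", shape=(m,), dtype="int64")
updates = relay.var("updates", shape=(m,), dtype="float32")
func = relay.Function([data, indices, updates],
                      relay.scatter(data, indices, updates, axis=0))

dev = tvm.cuda(0)
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(tvm.IRModule.from_expr(func), target="cuda")

mod = graph_executor.GraphModule(lib["default"](dev))
mod.set_input("data", np.random.rand(n).astype("float32"))
mod.set_input("indices", np.random.randint(0, n, m).astype("int64"))
mod.set_input("updates", np.random.rand(m).astype("float32"))

# time_evaluator runs the compiled graph repeatedly and reports statistics.
timer = mod.module.time_evaluator("run", dev, number=10, repeat=3)
print("mean time (ms):", timer().mean * 1e3)
```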
Please review @mbrookhart @tkonolige @zhiics @Laurawly