
[CUDA] Parallel Cuda Mergesort #7099

Merged: 11 commits merged into apache:main from cuda_mergesort on Dec 21, 2020

Conversation

@mbrookhart (Contributor) commented Dec 12, 2020

@Laurawly @zhiics @icemelon9 @csullivan @tkonolige

There have been many complaints recently about the stability and performance of the TIR-based CUDA sort kernel. I've spent a couple of days this week writing a CUDA version of parallel mergesort. It's a stable sort, so it fixes the flakiness we've seen with argsort and argwhere; it changes the threading to support dynamic shapes; and it significantly improves performance over the previous kernel.
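
To make the stability and parallelism claims concrete, here is a minimal NumPy sketch of the bottom-up mergesort idea (an illustrative assumption about the algorithm's structure, not the TIR code in `sort_ir`): at each pass, adjacent runs of `width` elements are merged independently, so every merge can be handed to its own block/thread group, and doubling `width` gives roughly log2(n) passes in total.

```python
# Minimal sketch (assumption, not the TIR kernel): stable bottom-up merge argsort.
import numpy as np

def stable_merge_argsort(keys):
    """Return indices that sort `keys` stably, via bottom-up mergesort."""
    n = len(keys)
    idx = np.arange(n)          # argsort output: start from identity indices
    tmp = np.empty_like(idx)
    width = 1
    while width < n:
        # Every (lo, lo + 2*width) block is merged independently; on the GPU each
        # of these merges can be mapped to its own group of threads.
        for lo in range(0, n, 2 * width):
            mid, hi = min(lo + width, n), min(lo + 2 * width, n)
            i, j, k = lo, mid, lo
            while i < mid and j < hi:
                # `<=` keeps equal keys in their original order: this is what
                # makes the sort stable.
                if keys[idx[i]] <= keys[idx[j]]:
                    tmp[k] = idx[i]
                    i += 1
                else:
                    tmp[k] = idx[j]
                    j += 1
                k += 1
            # Copy whichever run still has elements left over.
            tmp[k:hi] = np.concatenate((idx[i:mid], idx[j:hi]))
        idx, tmp = tmp, idx     # ping-pong buffers between passes
        width *= 2
    return idx
```

For example, `stable_merge_argsort(np.array([3, 1, 2, 1]))` returns `[1, 3, 2, 0]`: the two equal keys keep their original relative order.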

This PR only addresses the core sort_ir function; extending this to the other sort variants in this file is future work.

I tested performance on a variety of shapes using this script and obtained the numbers below on my 1070 Ti (a sketch of a comparable benchmark harness appears after the table). It's not as fast as Thrust, as expected, but it's much closer for all shapes tested here, and it even manages to beat Thrust on a few. Times are in milliseconds.

Thanks!

| Shape | main (ms) | thrust (ms) | this PR (ms) |
| --- | --- | --- | --- |
| (2000, 2, 2) | 7.77 | 0.58 | 1.67 |
| (2, 2000, 2) | 4.8 | 0.7 | 1.59 |
| (2, 2, 2000) | 3.24 | 0.63 | 1.54 |
| (4000, 2, 2) | 25.53 | 0.65 | 4.05 |
| (2, 4000, 2) | 13.78 | 0.62 | 3.3 |
| (2, 2, 4000) | 9.85 | 0.63 | 4.04 |
| (2, 12000, 2) | 369.99 | 0.68 | 13.87 |
| (2, 2, 12000) | 86.55 | 0.66 | 11.11 |
| (12000, 2, 2) | 486.65 | 0.66 | 13.69 |
| (2000, 8, 8) | 259.21 | 10.4 | 4.22 |
| (8, 2000, 8) | 111.14 | 8.45 | 3.43 |
| (8, 8, 2000) | 50.37 | 9.05 | 3.05 |
| (4000, 8, 8) | 671.53 | 8.24 | 9.58 |
| (8, 4000, 8) | 368.59 | 8.47 | 10.12 |
| (8, 8, 4000) | 171.18 | 8.74 | 6.27 |
| (12000, 8, 8) | 3571.97 | 15.22 | 42.99 |
| (8, 12000, 8) | 3517.72 | 15.07 | 45.84 |
| (8, 8, 12000) | 1417.97 | 15.03 | 27.57 |
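
For readers who want rough numbers of their own, here is a hypothetical harness (an assumption on my part; it is not the script linked above, and the sort axis, dtype, and the use of a recent TVM graph_executor API are all choices of this sketch) that times `relay.argsort` on CUDA:

```python
# Hypothetical benchmark sketch (not the script referenced in the PR): time
# relay.argsort on the GPU for one of the shapes from the table above.
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

def bench_argsort(shape, target="cuda", dtype="float32", repeat=20):
    data = relay.var("data", shape=shape, dtype=dtype)
    out = relay.argsort(data, axis=-1, is_ascend=True, dtype="int32")
    mod = tvm.IRModule.from_expr(relay.Function([data], out))
    dev = tvm.device(target, 0)
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target)
    m = graph_executor.GraphModule(lib["default"](dev))
    m.set_input("data", np.random.uniform(size=shape).astype(dtype))
    timer = m.module.time_evaluator("run", dev, repeat=repeat)
    return timer().mean * 1e3  # milliseconds

print(bench_argsort((2000, 2, 2)))
```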

@mbrookhart (Contributor, Author) commented:

I'm hitting some very odd segfaults, but only in the debug runtime with nvptx. While I figure out what's going on, I'll keep this as WIP until it's fixed.

@mbrookhart mbrookhart changed the title [CUDA] Parallel Cuda Mergesort [WIP][CUDA] Parallel Cuda Mergesort Dec 13, 2020
@tkonolige (Contributor) left a comment:

Looks great! I think the main implementation might benefit from a couple of comments describing what it is doing.

Review thread on python/tvm/topi/cuda/sort.py (resolved)
@csullivan (Contributor) left a comment:

Nice approach. I didn't spot an obvious reason for the segfault. I also agree with @tkonolige on docs: some high-level info on the problem flattening and the index mapping / thread assignment for each slice (and as slices are merged) would help others maintain this.
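
For illustration only (an assumption about the structure being discussed, not code from this PR): sorting along one axis of an N-d tensor can be viewed as many independent 1-d slices, each of which can be assigned to its own set of threads.

```python
# Hypothetical illustration, not the TIR implementation: sorting an N-d tensor along
# one axis is equivalent to independently sorting flattened 1-d slices, so each slice
# (and each merge within a slice) can be assigned to its own block/thread group.
import numpy as np

def flatten_to_slices(data, axis):
    # Move the sort axis to the end, then view the rest as a batch of 1-d slices.
    moved = np.moveaxis(data, axis, -1)
    return moved.reshape(-1, moved.shape[-1])   # shape: (num_slices, slice_len)

x = np.random.rand(8, 2000, 8).astype("float32")
slices = flatten_to_slices(x, axis=1)           # 64 independent slices of length 2000
```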

Two review threads on python/tvm/topi/cuda/sort.py (outdated, resolved)
@mbrookhart mbrookhart mentioned this pull request Dec 14, 2020
@mbrookhart mbrookhart force-pushed the cuda_mergesort branch 2 times, most recently from 6b8d79a to 8c6b03b on December 15, 2020 06:38
@mbrookhart mbrookhart changed the title [WIP][CUDA] Parallel Cuda Mergesort [CUDA] Parallel Cuda Mergesort Dec 15, 2020
@mbrookhart (Contributor, Author) commented:

Many thanks to @masahi for helping me find an issue with heterogeneous lowering and some overflow issues in how I was handling the threads. I think it should be ready for review now, thanks everyone!

@@ -277,7 +277,7 @@ def _build_for_device(input_mod, target, target_host):
             lambda f: "calling_conv" not in f.attrs
             or f.attrs["calling_conv"].value != CallingConv.DEVICE_KERNEL_LAUNCH
         ),
-        tvm.tir.transform.Apply(lambda f: f.with_attr("target", target)),
+        tvm.tir.transform.Apply(lambda f: f.with_attr("target", target_host)),
@masahi (Member) commented Dec 15, 2020:

For the record, the segfault with nvptx was happening because the generated host code was calling intrinsics registered for nvptx, like __nv_log2 or __nv_ceil. The reason it worked on CUDA was just coincidence: there are no CUDA intrinsics registered for fp64 log2 and ceil, so TVM fell back to the default lowering, which happens to be the right one (llvm).

This change fixes that issue.
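
A purely illustrative sketch (my assumption about where the fp64 log2/ceil calls can come from, not taken from the PR): the host-side launch logic of a bottom-up mergesort computes the number of merge passes as ceil(log2(n)), and that expression must be lowered with the host target's intrinsics rather than nvptx's.

```python
# Illustrative assumption only: this kind of host-side arithmetic (number of merge
# passes for a bottom-up mergesort) is where ceil/log2 calls show up. If it were
# lowered with the nvptx target, __nv_ceil/__nv_log2 would be emitted, which the
# host cannot call.
import math

def num_merge_passes(n):
    # Each pass doubles the sorted run width: 1, 2, 4, ... until it covers n elements.
    return 0 if n <= 1 else int(math.ceil(math.log2(n)))
```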

@masahi (Member) commented Dec 18, 2020

@mbrookhart I think we can revive some tests that are currently disabled due to flaky sort. See

# TODO(zhiics) Enable argwhere gpu test after sort is fixed. Otherwise, we have
# to use thrust to guarantee the correct results which has been tested locally.
# @tvm.testing.uses_gpu
def test_any_argwhere():

# TODO(zhiics) Enable argwhere gpu test after sort is fixed.
if ctx.device_type != 1:
    continue
check_device(target, ctx)

@masahi (Member) commented Dec 18, 2020

We should also remove

# TODO(zhiics) Enable argwhere gpu test after sort is fixed.
if ctx.device_type != 1:
    continue

@zhiics (Member) left a comment:

Thanks for the work. Please fix the unit test.

@mbrookhart (Contributor, Author) commented:

Oh no! A copy-paste error! Will fix

@Laurawly (Contributor) left a comment:

LGTM

@tqchen tqchen merged commit 38273ee into apache:main Dec 21, 2020
masahi pushed a commit to masahi/tvm that referenced this pull request Dec 24, 2020
@mbrookhart mbrookhart deleted the cuda_mergesort branch January 4, 2021 17:10
TusharKanekiDey pushed a commit to TusharKanekiDey/tvm that referenced this pull request Jan 20, 2021
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Jan 21, 2021
electriclilies pushed a commit to electriclilies/tvm that referenced this pull request Feb 18, 2021