-
Notifications
You must be signed in to change notification settings - Fork 156
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve Hausdorff perf and accept larger number of inputs. #424
Improve Hausdorff perf and accept larger number of inputs. #424
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please explain in the description what this PR does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sometimes a kernel is just clearer, cleaner and faster.
cpp/src/spatial/hausdorff.cu
Outdated
namespace { | ||
|
||
using size_type = cudf::size_type; | ||
|
||
constexpr cudf::size_type THREADS_PER_BLOCK = 64; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's only one use of this, right? Just define it in the function where you use it. This is a pretty small block size, have you tried other sizes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tried 1024, 512, 256, 128, 64, and 32. 64 was the sweet spot for the 6000 spaces with 100 points each use case we have in mind.
new 1024 threads
------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
------------------------------------------------------------------------------------------------------------
HausdorffBenchmark/hausdorff/100/10/manual_time 1.60 ms 1.77 ms 328 items_per_second=497.657M/ss
HausdorffBenchmark/hausdorff/6000/10/manual_time 114 ms 116 ms 6 items_per_second=25.6795G/s
HausdorffBenchmark/hausdorff/100/100/manual_time 15.4 ms 15.6 ms 45 items_per_second=6.22256G/s
HausdorffBenchmark/hausdorff/6000/100/manual_time 8717 ms 8717 ms 1 items_per_second=40.4615G/s
---------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
------------------------------------------------------------------------------------------------------------
HausdorffBenchmark/hausdorff/100/10/manual_time 0.888 ms 1.05 ms 580 items_per_second=894.137M/ss
HausdorffBenchmark/hausdorff/6000/10/manual_time 110 ms 112 ms 6 items_per_second=26.4986G/s
HausdorffBenchmark/hausdorff/100/100/manual_time 7.76 ms 7.93 ms 88 items_per_second=12.3754G/s
HausdorffBenchmark/hausdorff/6000/100/manual_time 8041 ms 8029 ms 1 items_per_second=43.8623G/s
new 256 threads
------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
------------------------------------------------------------------------------------------------------------
HausdorffBenchmark/hausdorff/100/10/manual_time 0.500 ms 0.671 ms 1136 items_per_second=1.58662G/s
HausdorffBenchmark/hausdorff/6000/10/manual_time 94.5 ms 96.8 ms 7 items_per_second=30.8406G/s
HausdorffBenchmark/hausdorff/100/100/manual_time 3.93 ms 4.09 ms 178 items_per_second=24.4323G/s
HausdorffBenchmark/hausdorff/6000/100/manual_time 7860 ms 7860 ms 1 items_per_second=44.8779G/s
new 128 threads
------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
------------------------------------------------------------------------------------------------------------
HausdorffBenchmark/hausdorff/100/10/manual_time 0.292 ms 0.457 ms 2219 items_per_second=2.71551G/s
HausdorffBenchmark/hausdorff/6000/10/manual_time 82.4 ms 84.5 ms 8 items_per_second=35.3905G/s
HausdorffBenchmark/hausdorff/100/100/manual_time 3.99 ms 4.16 ms 175 items_per_second=24.0617G/s
HausdorffBenchmark/hausdorff/6000/100/manual_time 7851 ms 7851 ms 1 items_per_second=44.9246G/s
new 64 threads
------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
------------------------------------------------------------------------------------------------------------
HausdorffBenchmark/hausdorff/100/10/manual_time 0.220 ms 0.395 ms 2562 items_per_second=3.60258G/ss
HausdorffBenchmark/hausdorff/6000/10/manual_time 87.6 ms 89.7 ms 8 items_per_second=33.2735G/s
HausdorffBenchmark/hausdorff/100/100/manual_time 2.92 ms 3.08 ms 239 items_per_second=32.8852G/s
HausdorffBenchmark/hausdorff/6000/100/manual_time 7658 ms 7658 ms 1 items_per_second=46.0564G/s
new 32 threads
------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
------------------------------------------------------------------------------------------------------------
HausdorffBenchmark/hausdorff/100/10/manual_time 0.181 ms 0.339 ms 3221 items_per_second=4.38609G/s
HausdorffBenchmark/hausdorff/6000/10/manual_time 82.1 ms 84.3 ms 8 items_per_second=35.4951G/s
HausdorffBenchmark/hausdorff/100/100/manual_time 2.55 ms 2.72 ms 272 items_per_second=37.6194G/s
HausdorffBenchmark/hausdorff/6000/100/manual_time 7764 ms 7764 ms 1 items_per_second=45.428G/s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had planned on using THREADS_PER_BLOCK
in some cross-block communication logic within the thread, but decided to see how atomicMax
would fare. It fares well, so I ended up not doing the complicated cross-block communication stuff. I'll move the var to a more local scope. 👍
One of the hausdorff tests was failing due to
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving ops-codeowner
file changes
…sdorff-partitioning-2
@gpucibot merge |
Fixes #393
We switched to the exclusive scan approach to Hausdorff because certain benchmarks indicated better performance. Apparently those benchmarks were inadequate or just plain badly written (by me), and performance was in fact worse. This became apparent while fixing the OOM error reported in #393.
I copied the 0.14 implementation in to the 21.08 branch to re-benchmark. here are the results:
cuspatial@0.14:
cuspatial@21.08 (with fix for OOM, as seen in previous commits of this PR)
The performance is bad, and this regression is my fault. Fortunately I was able to quickly reverse this regression and improve performance while getting rid of a bunch of code (and learning a lot in the process). This PR re-implements Hausdorff as a straightforward custom kernel that requires zero intermediate memory.
this pr: