Improve Hausdorff perf and accept larger number of inputs. #424

Merged 15 commits into rapidsai:branch-21.08 on Jul 2, 2021

@cwharris cwharris commented Jun 22, 2021

Fixes #393

We switched to the exclusive scan approach to Hausdorff because certain benchmarks indicated better performance. Apparently those benchmarks were inadequate or just plain badly written (by me), and performance was in fact worse. This became apparent while fixing the OOM error reported in #393.

I copied the 0.14 implementation into the 21.08 branch to re-benchmark. Here are the results:

cuspatial@0.14:

------------------------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------
HausdorffBenchmark/hausdorff/100/64/manual_time         1.62 ms         1.78 ms          428 items_per_second=23.9898G/s
HausdorffBenchmark/hausdorff/512/64/manual_time         43.9 ms         44.1 ms           16 items_per_second=23.6053G/s
HausdorffBenchmark/hausdorff/4096/64/manual_time        2810 ms         2810 ms            1 items_per_second=23.6845G/s
HausdorffBenchmark/hausdorff/6000/64/manual_time        6148 ms         6148 ms            1 items_per_second=23.2318G/s
HausdorffBenchmark/hausdorff/100/100/manual_time        3.31 ms         3.47 ms          210 items_per_second=29.0333G/s
HausdorffBenchmark/hausdorff/512/100/manual_time        88.9 ms         89.1 ms            8 items_per_second=28.7737G/s
HausdorffBenchmark/hausdorff/4096/100/manual_time       5842 ms         5842 ms            1 items_per_second=28.132G/s
HausdorffBenchmark/hausdorff/6000/100/manual_time      12698 ms        12698 ms            1 items_per_second=27.7783G/s

cuspatial@21.08 (with fix for OOM, as seen in previous commits of this PR):

------------------------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------
HausdorffBenchmark/hausdorff/100/64/manual_time         17.4 ms         17.6 ms           38 items_per_second=2.2391G/s
HausdorffBenchmark/hausdorff/512/64/manual_time          489 ms          490 ms            2 items_per_second=2.11979G/s
HausdorffBenchmark/hausdorff/4096/64/manual_time       37120 ms        37119 ms            1 items_per_second=1.79299G/s
HausdorffBenchmark/hausdorff/6000/64/manual_time       82732 ms        82729 ms            1 items_per_second=1.7265G/s
HausdorffBenchmark/hausdorff/100/100/manual_time        43.4 ms         43.7 ms           16 items_per_second=2.21402G/s
HausdorffBenchmark/hausdorff/512/100/manual_time        1341 ms         1341 ms            1 items_per_second=1.90885G/s
HausdorffBenchmark/hausdorff/4096/100/manual_time      94898 ms        94894 ms            1 items_per_second=1.7319G/s
HausdorffBenchmark/hausdorff/6000/100/manual_time     199120 ms       199115 ms            1 items_per_second=1.77138G/s

Performance in 21.08 is bad (roughly 11x to 16x slower than 0.14 in the runs above), and this regression is my fault. Fortunately, I was able to quickly reverse the regression and improve performance while getting rid of a bunch of code (and learning a lot in the process). This PR re-implements Hausdorff as a straightforward custom kernel that requires zero intermediate memory.

This PR:

------------------------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------
HausdorffBenchmark/hausdorff/100/64/manual_time         1.31 ms         1.47 ms          526 items_per_second=29.6763G/s
HausdorffBenchmark/hausdorff/512/64/manual_time         23.2 ms         23.3 ms           30 items_per_second=44.7567G/s
HausdorffBenchmark/hausdorff/4096/64/manual_time        1589 ms         1590 ms            1 items_per_second=41.8747G/s
HausdorffBenchmark/hausdorff/6000/64/manual_time        3170 ms         3170 ms            1 items_per_second=45.0638G/s
HausdorffBenchmark/hausdorff/100/100/manual_time        2.92 ms         3.08 ms          239 items_per_second=32.8852G/s
HausdorffBenchmark/hausdorff/512/100/manual_time        55.8 ms         55.8 ms           12 items_per_second=45.8415G/s
HausdorffBenchmark/hausdorff/4096/100/manual_time       3547 ms         3547 ms            1 items_per_second=46.3317G/s
HausdorffBenchmark/hausdorff/6000/100/manual_time       7658 ms         7658 ms            1 items_per_second=46.0564G/s
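
For readers unfamiliar with the computation being benchmarked, here is a minimal host-side C++ sketch of the directed Hausdorff distance between every pair of spaces. It is not the kernel from this PR. The `Point` struct, the function name, the offsets layout, and the row-major output are all assumptions made for illustration. Like the kernel described above, it writes each result straight into the output and allocates no intermediate buffers.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical point type used only in this sketch.
struct Point {
  float x;
  float y;
};

// Directed Hausdorff distance from every space to every other space.
// `points` holds all spaces back to back; `space_offsets[s]` is the index of the
// first point of space s. Every space is assumed to contain at least one point.
// Returns a row-major num_spaces x num_spaces matrix (an assumed layout) where
// entry (a, b) is the directed distance from space a to space b.
std::vector<float> directed_hausdorff_reference(std::vector<Point> const& points,
                                                std::vector<std::size_t> const& space_offsets)
{
  std::size_t const num_spaces = space_offsets.size();

  // One past the last point of space s.
  auto space_end = [&](std::size_t s) {
    return s + 1 < num_spaces ? space_offsets[s + 1] : points.size();
  };

  for (std::size_t s = 0; s < num_spaces; ++s) {
    assert(space_offsets[s] < space_end(s));  // non-empty, ordered spaces
  }

  // Distances are non-negative, so 0 is the identity for the max below.
  std::vector<float> result(num_spaces * num_spaces, 0.0f);

  for (std::size_t a = 0; a < num_spaces; ++a) {
    for (std::size_t i = space_offsets[a]; i < space_end(a); ++i) {
      for (std::size_t b = 0; b < num_spaces; ++b) {
        // Squared distance from point i to its nearest neighbour in space b.
        float min_sq = std::numeric_limits<float>::infinity();
        for (std::size_t j = space_offsets[b]; j < space_end(b); ++j) {
          float const dx = points[i].x - points[j].x;
          float const dy = points[i].y - points[j].y;
          min_sq = std::min(min_sq, dx * dx + dy * dy);
        }
        // The directed distance is the largest of these nearest-neighbour distances.
        float& out = result[a * num_spaces + b];
        out = std::max(out, std::sqrt(min_sq));
      }
    }
  }
  return result;
}
```

For two spaces {(0,0), (1,0)} and {(0,0), (0,3)} with offsets {0, 2}, this returns {0, 1, 3, 0}: the distance is not symmetric, which is why the full matrix is computed.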

@cwharris cwharris requested a review from trxcllnt June 22, 2021 20:42
@github-actions github-actions bot added the libcuspatial (Relates to the cuSpatial C++ library) and Python (Related to Python code) labels Jun 22, 2021
@cwharris cwharris added the 3 - Ready for Review (Ready for review by team) and non-breaking (Non-breaking change) labels Jun 22, 2021
@cwharris cwharris marked this pull request as ready for review June 22, 2021 22:52
@cwharris cwharris requested review from a team as code owners June 22, 2021 22:52
@cwharris cwharris requested a review from thomcom June 22, 2021 22:52
@cwharris cwharris added the bug (Something isn't working) label Jun 22, 2021
Member

@harrism harrism left a comment

Please explain in the description what this PR does.

cpp/src/spatial/hausdorff.cu (outdated, resolved)
@cwharris cwharris requested a review from a team as a code owner June 24, 2021 21:02
@github-actions github-actions bot added the cmake (Related to CMake code or build configuration) label Jun 24, 2021
@cwharris cwharris changed the title from "fix hausdorff input partitioning to align with natural key boundaries" to "Improve Hausdorff perf and accept larger number of inputs." Jun 24, 2021
@cwharris cwharris requested a review from harrism June 24, 2021 21:15
Member

@harrism harrism left a comment

Sometimes a kernel is just clearer, cleaner and faster.

namespace {

using size_type = cudf::size_type;

constexpr cudf::size_type THREADS_PER_BLOCK = 64;
Member

There's only one use of this, right? Just define it in the function where you use it. This is a pretty small block size; have you tried other sizes?

Contributor Author

I've tried 1024, 512, 256, 128, 64, and 32. 64 was the sweet spot for the use case we have in mind: 6000 spaces with 100 points each.

new 1024 threads
------------------------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------
HausdorffBenchmark/hausdorff/100/10/manual_time         1.60 ms         1.77 ms          328 items_per_second=497.657M/s
HausdorffBenchmark/hausdorff/6000/10/manual_time         114 ms          116 ms            6 items_per_second=25.6795G/s
HausdorffBenchmark/hausdorff/100/100/manual_time        15.4 ms         15.6 ms           45 items_per_second=6.22256G/s
HausdorffBenchmark/hausdorff/6000/100/manual_time       8717 ms         8717 ms            1 items_per_second=40.4615G/s

new 512 threads
------------------------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------
HausdorffBenchmark/hausdorff/100/10/manual_time        0.888 ms         1.05 ms          580 items_per_second=894.137M/s
HausdorffBenchmark/hausdorff/6000/10/manual_time         110 ms          112 ms            6 items_per_second=26.4986G/s
HausdorffBenchmark/hausdorff/100/100/manual_time        7.76 ms         7.93 ms           88 items_per_second=12.3754G/s
HausdorffBenchmark/hausdorff/6000/100/manual_time       8041 ms         8029 ms            1 items_per_second=43.8623G/s

new 256 threads
------------------------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------
HausdorffBenchmark/hausdorff/100/10/manual_time        0.500 ms        0.671 ms         1136 items_per_second=1.58662G/s
HausdorffBenchmark/hausdorff/6000/10/manual_time        94.5 ms         96.8 ms            7 items_per_second=30.8406G/s
HausdorffBenchmark/hausdorff/100/100/manual_time        3.93 ms         4.09 ms          178 items_per_second=24.4323G/s
HausdorffBenchmark/hausdorff/6000/100/manual_time       7860 ms         7860 ms            1 items_per_second=44.8779G/s

new 128 threads
------------------------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------
HausdorffBenchmark/hausdorff/100/10/manual_time        0.292 ms        0.457 ms         2219 items_per_second=2.71551G/s
HausdorffBenchmark/hausdorff/6000/10/manual_time        82.4 ms         84.5 ms            8 items_per_second=35.3905G/s
HausdorffBenchmark/hausdorff/100/100/manual_time        3.99 ms         4.16 ms          175 items_per_second=24.0617G/s
HausdorffBenchmark/hausdorff/6000/100/manual_time       7851 ms         7851 ms            1 items_per_second=44.9246G/s

new 64 threads
------------------------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------
HausdorffBenchmark/hausdorff/100/10/manual_time        0.220 ms        0.395 ms         2562 items_per_second=3.60258G/s
HausdorffBenchmark/hausdorff/6000/10/manual_time        87.6 ms         89.7 ms            8 items_per_second=33.2735G/s
HausdorffBenchmark/hausdorff/100/100/manual_time        2.92 ms         3.08 ms          239 items_per_second=32.8852G/s
HausdorffBenchmark/hausdorff/6000/100/manual_time       7658 ms         7658 ms            1 items_per_second=46.0564G/s

new 32 threads
------------------------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------
HausdorffBenchmark/hausdorff/100/10/manual_time        0.181 ms        0.339 ms         3221 items_per_second=4.38609G/s
HausdorffBenchmark/hausdorff/6000/10/manual_time        82.1 ms         84.3 ms            8 items_per_second=35.4951G/s
HausdorffBenchmark/hausdorff/100/100/manual_time        2.55 ms         2.72 ms          272 items_per_second=37.6194G/s
HausdorffBenchmark/hausdorff/6000/100/manual_time       7764 ms         7764 ms            1 items_per_second=45.428G/s

Contributor Author

I had planned on using THREADS_PER_BLOCK in some cross-block communication logic within the thread, but decided to see how atomicMax would fare. It fares well, so I ended up not doing the complicated cross-block communication stuff. I'll move the var to a more local scope. 👍
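
As a rough illustration of why an atomic max can replace explicit cross-block coordination, here is a host-side C++ sketch using `std::atomic<float>` and a compare-and-swap loop. It is an analogy only: the function names are hypothetical, and how the kernel in this PR applies `atomicMax` to its value type is not shown in this thread. Each contributor folds its partial maximum into the shared slot independently, so no contributor needs to know about the others.

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Atomically store max(target, value) into `target`.
// Returns the value observed just before the update (or before deciding that
// no update was needed).
float atomic_fetch_max(std::atomic<float>& target, float value)
{
  float observed = target.load();
  // On failure, compare_exchange_weak refreshes `observed` with the current
  // contents, so the loop re-checks against the latest value and stops as soon
  // as the stored value is already >= value.
  while (observed < value && !target.compare_exchange_weak(observed, value)) {}
  return observed;
}

// Merge per-block partial maxima into a single output slot. The calls are made
// sequentially here for simplicity, but atomic_fetch_max is safe to call from
// concurrent threads on the same target.
float merge_partial_maxima(std::vector<float> const& partial_maxima)
{
  std::atomic<float> out{0.0f};  // initialized before any atomic update
  for (float partial : partial_maxima) {
    atomic_fetch_max(out, partial);
  }
  return out.load();
}
```

The trade-off is contention on the shared slot, but the result does not depend on the order in which contributors arrive.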

@cwharris cwharris requested a review from a team as a code owner June 25, 2021 13:24
@github-actions github-actions bot added the conda (Related to conda and conda configuration) label Jun 25, 2021
@cwharris
Contributor Author

One of the hausdorff tests was failing due to atomicMax being used against some uninitialized memory. Initializing the output did not significantly impact performance.

------------------------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------
HausdorffBenchmark/hausdorff/32/4/manual_time          0.057 ms        0.215 ms         9811 items_per_second=152.51M/s
HausdorffBenchmark/hausdorff/64/4/manual_time          0.097 ms        0.255 ms         7132 items_per_second=368.497M/s
HausdorffBenchmark/hausdorff/512/4/manual_time         0.825 ms        0.982 ms          829 items_per_second=2.84929G/s
HausdorffBenchmark/hausdorff/4096/4/manual_time         8.81 ms         10.9 ms           68 items_per_second=17.1311G/s
HausdorffBenchmark/hausdorff/8192/4/manual_time         30.9 ms         33.0 ms           22 items_per_second=19.522G/s
HausdorffBenchmark/hausdorff/32/8/manual_time          0.075 ms        0.234 ms         9317 items_per_second=626.194M/s
HausdorffBenchmark/hausdorff/64/8/manual_time          0.133 ms        0.292 ms         5238 items_per_second=1.45975G/s
HausdorffBenchmark/hausdorff/512/8/manual_time          1.12 ms         1.27 ms          620 items_per_second=11.4586G/s
HausdorffBenchmark/hausdorff/4096/8/manual_time         27.8 ms         29.8 ms           25 items_per_second=29.5844G/s
HausdorffBenchmark/hausdorff/8192/8/manual_time          104 ms          106 ms            7 items_per_second=31.7485G/s
HausdorffBenchmark/hausdorff/32/64/manual_time         0.273 ms        0.432 ms         2555 items_per_second=13.9497G/s
HausdorffBenchmark/hausdorff/64/64/manual_time         0.536 ms        0.694 ms         1261 items_per_second=29.4053G/s
HausdorffBenchmark/hausdorff/512/64/manual_time         23.2 ms         23.3 ms           30 items_per_second=44.7398G/s
HausdorffBenchmark/hausdorff/4096/64/manual_time        1459 ms         1459 ms            1 items_per_second=45.6102G/s
HausdorffBenchmark/hausdorff/8192/64/manual_time        5917 ms         5917 ms            1 items_per_second=45.0033G/s
HausdorffBenchmark/hausdorff/32/128/manual_time        0.503 ms        0.660 ms         1372 items_per_second=30.8398G/s
HausdorffBenchmark/hausdorff/64/128/manual_time         1.65 ms         1.81 ms          421 items_per_second=38.8039G/s
HausdorffBenchmark/hausdorff/512/128/manual_time        96.7 ms         96.9 ms            7 items_per_second=43.5433G/s
HausdorffBenchmark/hausdorff/4096/128/manual_time       5849 ms         5849 ms            1 items_per_second=46.2435G/s
HausdorffBenchmark/hausdorff/8192/128/manual_time      23490 ms        23490 ms            1 items_per_second=46.0679G/s
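
The initialization issue mentioned above can be sketched on the host in a few lines of C++. A max-accumulation folds new values into whatever the output already holds, so the output must start at the identity for max. Here a large stale value stands in for uninitialized memory, since reading truly uninitialized memory is undefined behaviour in C++. The function name is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Fold non-negative distance contributions into an output slot that starts at
// `initial_output`. The starting value takes part in the result, exactly as the
// previous contents of memory would under an atomic max.
float max_with_initial(float initial_output, std::vector<float> const& contributions)
{
  float out = initial_output;
  for (float c : contributions) {
    assert(c >= 0.0f);  // distances are never negative
    out = std::max(out, c);
  }
  return out;
}
```

With contributions {1, 3, 2}, starting from 0 gives 3, while starting from a stale 1e30 returns 1e30: the stale value wins every comparison. Zero is a safe starting value because the distances are non-negative.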

Member

@ajschmidt8 ajschmidt8 left a comment

Approving ops-codeowner file changes

@github-actions github-actions bot removed the conda (Related to conda and conda configuration) label Jul 2, 2021
@cwharris
Contributor Author

cwharris commented Jul 2, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 17ccadd into rapidsai:branch-21.08 Jul 2, 2021
Labels
3 - Ready for Review: Ready for review by team
bug: Something isn't working
cmake: Related to CMake code or build configuration
libcuspatial: Relates to the cuSpatial C++ library
non-breaking: Non-breaking change
Python: Related to Python code
Development

Successfully merging this pull request may close these issues.

[BUG] Hausdorff fails with OOM for > intmax num inputs
4 participants