[BUG] Hausdorff fails with OOM for > intmax num inputs #393
Same fix seen here, but via patch: NVIDIA/thrust#1424. Also fixes rapidsai/cuspatial#393. Alternatively, we could wait and update our thrust version, rather than patching the existing one.

Authors:
- Christopher Harris (https://github.com/cwharris)

Approvers:
- Mark Harris (https://github.com/harrism)
- Paul Taylor (https://github.com/trxcllnt)

URL: #8199
The fix had to be rolled back due to a performance regression in libcudf. We will have to tackle this problem by partitioning inputs, rather than accepting a single large input. Partitions will have to begin/end at specific offsets to ensure …
I reapplied the thrust scan_by_key patch locally to allow for 64-bit indices, ran all @rapidsai/cudf benchmarks, and found little to no performance difference. From there, I was able to get Hausdorff working with 5000 spaces and 100 points. However, larger inputs resulted in an OOM from thrust's temporary memory allocations.
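For context on where those temporary allocations come from: thrust's scan dispatches to cub's single-pass `DeviceScan`, which uses a two-phase API where the first call (with a null temporary-storage pointer) only reports how many bytes it needs, and that requirement grows with the number of items. A minimal sketch of querying that size; this is a generic illustration, not cuspatial code, and the item count is an arbitrary assumption:

```cpp
#include <cub/device/device_scan.cuh>

#include <cstddef>
#include <cstdio>

int main()
{
  int num_items = 1 << 30;                   // arbitrary large input size (assumption)
  double *d_in = nullptr, *d_out = nullptr;  // not dereferenced by the size query
  void* d_temp_storage     = nullptr;
  std::size_t temp_storage_bytes = 0;

  // With d_temp_storage == nullptr, cub only computes the required size.
  cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

  std::printf("temp storage required: %zu bytes for %d items\n", temp_storage_bytes, num_items);
  return 0;
}
```

Note that `num_items` here is typed `int`, which is exactly the limit this issue runs into.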
I hypothesize that this is a legitimate OOM caused by the design of temporary memory usage in the single-pass scan algorithm within thrust/cub. However, I believe it is unreasonable to attempt to reduce the temporary memory requirements of the cub/thrust single-pass scan implementation at this time (specifically, as part of this bug). Therefore, I will be looking for alternatives that do not require patching thrust.

edit: For anyone coming across this comment and wondering why inclusive_scan is being used: the Hausdorff implementation uses an associative (non-commutative) reduction operator, but thrust/cub does not provide an associative reduction algorithm. Instead, inclusive_scan is used in conjunction with a custom "scatter output iterator" that compresses the output at write time, keeping only the last element of each scan segment (see the sketch below). Essentially, the thrust/cub scan algorithm was never designed to be used with this many inputs, and therefore makes a reasonable assumption about the maximum size of temporary storage. Patching thrust to use a 64-bit integer increased the maximum number of inputs, and therefore the maximum possible temporary storage size, resulting in an OOM for a large number of inputs. This could be alleviated by using a circular buffer for the single-pass scan algorithm, rather than retaining all intermediate results, but that is far out of scope for this bug.
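A minimal sketch of that pattern (not cuspatial's actual code): emulate a segmented reduction by scanning by key and keeping only the last element of each segment. For brevity this uses the default `plus` operator; cuspatial's operator is non-commutative, but the mechanics are the same. Here a full temporary buffer holds the scan result, which is the intermediate storage the scatter output iterator was written to eliminate:

```cpp
#include <thrust/device_vector.h>
#include <thrust/gather.h>
#include <thrust/scan.h>

#include <cstdio>

int main()
{
  // Three segments: {1, 2, 3}, {4, 5}, {6}
  int keys_h[] = {0, 0, 0, 1, 1, 2};
  int vals_h[] = {1, 2, 3, 4, 5, 6};
  thrust::device_vector<int> keys(keys_h, keys_h + 6);
  thrust::device_vector<int> vals(vals_h, vals_h + 6);

  // Full scan result: the per-segment running reduction.
  thrust::device_vector<int> scanned(6);
  thrust::inclusive_scan_by_key(keys.begin(), keys.end(), vals.begin(), scanned.begin());

  // The last element of each segment is that segment's reduction; the
  // indices are known from the segment offsets.
  int last_h[] = {2, 4, 5};
  thrust::device_vector<int> last_idx(last_h, last_h + 3);
  thrust::device_vector<int> result(3);
  thrust::gather(last_idx.begin(), last_idx.end(), scanned.begin(), result.begin());

  for (int i = 0; i < 3; ++i)
    std::printf("%d\n", static_cast<int>(result[i]));  // prints 6, 9, 6
}
```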
Fixes #393

We switched to the exclusive scan approach to Hausdorff because certain benchmarks indicated better performance. Apparently those benchmarks were inadequate or just plain badly written (by me), and performance was in fact worse. This became apparent while fixing the OOM error reported in #393. I copied the 0.14 implementation into the 21.08 branch to re-benchmark. Here are the results:

cuspatial@0.14:
```
------------------------------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------
HausdorffBenchmark/hausdorff/100/64/manual_time      1.62 ms         1.78 ms          428 items_per_second=23.9898G/s
HausdorffBenchmark/hausdorff/512/64/manual_time      43.9 ms         44.1 ms           16 items_per_second=23.6053G/s
HausdorffBenchmark/hausdorff/4096/64/manual_time     2810 ms         2810 ms            1 items_per_second=23.6845G/s
HausdorffBenchmark/hausdorff/6000/64/manual_time     6148 ms         6148 ms            1 items_per_second=23.2318G/s
HausdorffBenchmark/hausdorff/100/100/manual_time     3.31 ms         3.47 ms          210 items_per_second=29.0333G/s
HausdorffBenchmark/hausdorff/512/100/manual_time     88.9 ms         89.1 ms            8 items_per_second=28.7737G/s
HausdorffBenchmark/hausdorff/4096/100/manual_time    5842 ms         5842 ms            1 items_per_second=28.132G/s
HausdorffBenchmark/hausdorff/6000/100/manual_time   12698 ms        12698 ms            1 items_per_second=27.7783G/s
```

cuspatial@21.08 (with fix for OOM, as seen in previous commits of this PR):
```
------------------------------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------
HausdorffBenchmark/hausdorff/100/64/manual_time      17.4 ms         17.6 ms           38 items_per_second=2.2391G/s
HausdorffBenchmark/hausdorff/512/64/manual_time       489 ms          490 ms            2 items_per_second=2.11979G/s
HausdorffBenchmark/hausdorff/4096/64/manual_time    37120 ms        37119 ms            1 items_per_second=1.79299G/s
HausdorffBenchmark/hausdorff/6000/64/manual_time    82732 ms        82729 ms            1 items_per_second=1.7265G/s
HausdorffBenchmark/hausdorff/100/100/manual_time     43.4 ms         43.7 ms           16 items_per_second=2.21402G/s
HausdorffBenchmark/hausdorff/512/100/manual_time     1341 ms         1341 ms            1 items_per_second=1.90885G/s
HausdorffBenchmark/hausdorff/4096/100/manual_time   94898 ms        94894 ms            1 items_per_second=1.7319G/s
HausdorffBenchmark/hausdorff/6000/100/manual_time  199120 ms       199115 ms            1 items_per_second=1.77138G/s
```

The performance is bad, and this regression is my fault. Fortunately I was able to quickly reverse this regression and improve performance while getting rid of a bunch of code (and learning a lot in the process). This PR re-implements Hausdorff as a straightforward custom kernel that requires zero intermediate memory.

This PR:
```
------------------------------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------
HausdorffBenchmark/hausdorff/100/64/manual_time      1.31 ms         1.47 ms          526 items_per_second=29.6763G/s
HausdorffBenchmark/hausdorff/512/64/manual_time      23.2 ms         23.3 ms           30 items_per_second=44.7567G/s
HausdorffBenchmark/hausdorff/4096/64/manual_time     1589 ms         1590 ms            1 items_per_second=41.8747G/s
HausdorffBenchmark/hausdorff/6000/64/manual_time     3170 ms         3170 ms            1 items_per_second=45.0638G/s
HausdorffBenchmark/hausdorff/100/100/manual_time     2.92 ms         3.08 ms          239 items_per_second=32.8852G/s
HausdorffBenchmark/hausdorff/512/100/manual_time     55.8 ms         55.8 ms           12 items_per_second=45.8415G/s
HausdorffBenchmark/hausdorff/4096/100/manual_time    3547 ms         3547 ms            1 items_per_second=46.3317G/s
HausdorffBenchmark/hausdorff/6000/100/manual_time    7658 ms         7658 ms            1 items_per_second=46.0564G/s
```

Authors:
- Christopher Harris (https://github.com/cwharris)

Approvers:
- Mark Harris (https://github.com/harrism)
- Paul Taylor (https://github.com/trxcllnt)
- AJ Schmidt (https://github.com/ajschmidt8)

URL: #424
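For illustration, a minimal sketch of a zero-intermediate-memory directed-Hausdorff kernel in the spirit the PR describes. This is not the PR's actual kernel; the one-thread-per-space-pair mapping, names, and parameters are assumptions. Each thread computes max over the points of one space of the min squared distance to another space, reading only the inputs and writing a single output:

```cpp
#include <cfloat>
#include <cmath>

// One thread per ordered pair of spaces (a, b); no temporary storage at all.
// space_offsets has num_spaces + 1 entries delimiting each space's points in
// xs/ys; out holds num_spaces * num_spaces results.
__global__ void directed_hausdorff(double const* xs,
                                   double const* ys,
                                   int const* space_offsets,
                                   int num_spaces,
                                   double* out)
{
  int pair = blockIdx.x * blockDim.x + threadIdx.x;
  if (pair >= num_spaces * num_spaces) return;

  int a = pair / num_spaces;
  int b = pair % num_spaces;

  double max_min = 0.0;
  for (int i = space_offsets[a]; i < space_offsets[a + 1]; ++i) {
    double min_d = DBL_MAX;
    for (int j = space_offsets[b]; j < space_offsets[b + 1]; ++j) {
      double dx = xs[i] - xs[j];
      double dy = ys[i] - ys[j];
      double d  = dx * dx + dy * dy;  // compare squared distances
      min_d = d < min_d ? d : min_d;
    }
    max_min = min_d > max_min ? min_d : max_min;
  }
  out[pair] = sqrt(max_min);  // single write; no intermediate buffers
}
```

The key property is that memory use is exactly inputs plus outputs: unlike the scan approach, nothing scales with the number of point pairs, which is what sidesteps the OOM entirely.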
Hausdorff fails for a large number of inputs. The problem originates in Thrust, which stores `num_items` as `int`, but fixing that uncovers an illegal memory access bug: NVIDIA/cccl#766
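To see why an `int`-typed `num_items` is the limit here: the scan-based implementation feeds roughly one element per point pair into the scan. A back-of-the-envelope check, assuming the largest benchmark configuration above (6000 spaces of 100 points each) and one scan element per point pair:

```cpp
#include <climits>
#include <cstdint>
#include <cstdio>

int main()
{
  std::int64_t points = 6000LL * 100;    // 600,000 points total
  std::int64_t pairs  = points * points; // ~3.6e11 scan elements
  std::printf("pairs = %lld, INT_MAX = %d -> overflows int: %s\n",
              static_cast<long long>(pairs), INT_MAX,
              pairs > INT_MAX ? "yes" : "no");
}
```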