Improve robustness and performance of CCL #595
Conversation
This will need some updates to work with SYCL. Also, using a mutex is not the most efficient approach, but it is good enough. I would have preferred to simply use …
Note that this also makes the code more run-time configurable by removing some of the compile-time constexpr parameters.
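As an illustration of the run-time configurability mentioned above (a hedged sketch only; `ccl_kernel_config`, its fields, and the kernel stub are made-up names, not the actual traccc interface), the idea is to move sizing parameters out of `constexpr` constants and into a small struct that is filled at run time and passed to the kernel by value:

```cuda
// Illustrative only: the real traccc configuration type and kernel signature
// differ. Sizing parameters that used to be compile-time constants, e.g.
//   static constexpr unsigned int threads_per_partition = 128;
// become members of a run-time configuration object.
struct ccl_kernel_config {
    unsigned int threads_per_partition = 128;
    unsigned int max_cells_per_thread = 16;

    __host__ __device__ unsigned int target_partition_size() const {
        return threads_per_partition * max_cells_per_thread;
    }
};

__global__ void ccl_kernel_stub(const ccl_kernel_config cfg
                                /*, cell / cluster views ... */) {
    // The kernel reads its sizes from the config object instead of from
    // compile-time constants, so they can be tuned without recompiling.
    const unsigned int partition_size = cfg.target_partition_size();
    (void)partition_size;
}
```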
I'm fully on board with the proposed logic. Just have a gazillion technical comments... 😛
Resolved review threads (now outdated):
- device/common/include/traccc/clusterization/device/ccl_kernel.hpp
- device/common/include/traccc/clusterization/device/ccl_kernel_definitions.hpp
- device/common/include/traccc/clusterization/device/impl/ccl_kernel.ipp (two threads)
What's happening with this? 🤔 It would make my life a whole lot easier for running profiling if this was cleaned up / pushed in. (Right now this prevents me from "quickly" collecting a set of throughput numbers on a Hopper chip. 😦)
Force-pushed from d1c54f7 to a84cfd8.
The CUDA code is good to go; we'll have to see what the SYCL CI has to say about it.
Force-pushed from bb86f4c to 57096f8.
This should work now.
Force-pushed from fcd2bce to 5b79e50.
I'm very on board with the configuration changes. But as long as we do this, we should also update:
- https://github.com/acts-project/traccc/blob/main/examples/options/include/traccc/options/clusterization.hpp
- https://github.com/acts-project/traccc/blob/main/examples/options/src/clusterization.cpp
And of course also how it's used in all of our executables... 🤔
This now depends on #607.
Force-pushed from 0d310f2 to 199e0a6.
As I am finalizing acts-project#595, I am noticing that it is really quite a task to update the configuration of our algorithms. Not only do you need to update the configuration type and the ways your algorithm uses it, but you also need to update the way the CLI options are translated into those configuration files across _many_ executables. In this commit I am trying to make this process a little easier by specifying the `config_provider` mixin for command line options classes. This allows us to specify only once how options should be translated to configuration types, making the process easier and less prone to bugs.
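A minimal sketch of what such a `config_provider` mixin could look like (the member names, option values, and the `clustering_config` type below are illustrative assumptions, not the actual traccc classes): an options class declares once, via an implicit conversion operator, how its parsed values map onto the algorithm's configuration struct.

```cuda
#include <cstddef>

// Hypothetical algorithm configuration struct.
struct clustering_config {
    std::size_t threads_per_partition = 128;
    std::size_t max_cells_per_thread = 16;
};

// Hypothetical mixin: "this options class knows how to produce a CONFIG_T".
template <typename CONFIG_T>
class config_provider {
public:
    virtual ~config_provider() = default;
    /// Translate the parsed command line options into a configuration object.
    virtual operator CONFIG_T() const = 0;
};

// Hypothetical CLI options class for the clusterization algorithm.
class clusterization_options : public config_provider<clustering_config> {
public:
    operator clustering_config() const override {
        clustering_config cfg;
        cfg.threads_per_partition = m_threads_per_partition;
        cfg.max_cells_per_thread = m_max_cells_per_thread;
        return cfg;
    }

private:
    // In the real options classes these would be filled by the command line
    // parser.
    std::size_t m_threads_per_partition = 128;
    std::size_t m_max_cells_per_thread = 16;
};
```

Executables can then construct their algorithms directly from the options object through the implicit conversion, so the CLI-to-configuration translation lives in exactly one place.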
Force-pushed from a1876fd to ffc043c.
Force-pushed from 6db08a5 to b60a2ed.
Okay, you convinced me about the reentrancy aspect. 🤔
Force-pushed from 4621cbf to f259481.
There are a few important caveats with using `unique_lock` in device code as I found out in acts-project#595. This commit adds a few warnings to the documentation to more clearly explain how this type should be used.
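The exact warnings that went into the vecmem documentation are not quoted in this thread; the following is only a generic sketch, with made-up `device_mutex` / `device_unique_lock` types rather than the real vecmem API, of the usual caveats with RAII locks in device code: only one thread per block should acquire the lock, and the rest of the block has to be synchronised explicitly around the critical section.

```cuda
// Hypothetical types for illustration; the real vecmem device mutex and
// unique_lock have a different interface.
struct device_mutex {
    unsigned int* word;  // one word in global memory: 0 = free, 1 = taken

    __device__ void lock() {
        // Spin until we flip the word from "free" to "taken".
        while (atomicCAS(word, 0u, 1u) != 0u) {
        }
        __threadfence();
    }
    __device__ void unlock() {
        __threadfence();
        atomicExch(word, 0u);
    }
};

// RAII guard analogous to std::unique_lock.
struct device_unique_lock {
    device_mutex& m;
    __device__ explicit device_unique_lock(device_mutex& mtx) : m(mtx) {
        m.lock();
    }
    __device__ ~device_unique_lock() { m.unlock(); }
};

__global__ void guarded_kernel(unsigned int* mutex_word) {
    device_mutex mtx{mutex_word};
    // Caveat 1: only one thread per block may construct the guard. If every
    // thread of a warp spins on the same mutex, the kernel can deadlock.
    if (threadIdx.x == 0) {
        device_unique_lock guard(mtx);
        // ... thread-0-only work on the protected resource ...
    }  // the lock is released here, by thread 0 alone
    // Caveat 2: the guard lives in a single thread, so the rest of the block
    // must synchronise with it explicitly.
    __syncthreads();
}
```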
Let's put this on hold for a hot second while we finish up #558.
Force-pushed from be7aac2 to 693fd52.
In acts-project#595, I equipped the CCA code with some edge case handling which allows it to handle oversized partitions. Although this makes sure the algorithm works, it also risks to slow down execution. In order to better understand how much performance we might be losing, this commit adds the ability for the SYCL and CUDA algorithms to print some warnings if they ever encounter this edge case.
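One straightforward way such a warning can be produced (a sketch under assumptions; `backup_used`, the kernel stub, and the message text are illustrative, not the actual traccc code) is to count the oversized partitions in a device-side atomic counter and have the host report it after the launch:

```cuda
#include <cstdio>

#include <cuda_runtime.h>

// Illustrative only: a global counter that the kernel bumps whenever a
// partition is too large for the fast path.
__device__ unsigned int backup_used = 0u;

__global__ void ccl_kernel_stub(unsigned int partition_size,
                                unsigned int max_partition_size) {
    if (threadIdx.x == 0 && partition_size > max_partition_size) {
        atomicAdd(&backup_used, 1u);
    }
    // ... rest of the kernel ...
}

// Host side: emit a warning if the slow path was ever taken.
void warn_if_backup_used() {
    unsigned int count = 0u;
    cudaMemcpyFromSymbol(&count, backup_used, sizeof(count));
    if (count > 0u) {
        std::printf(
            "WARNING: %u partition(s) exceeded the configured size and fell "
            "back to global scratch memory; consider re-tuning the "
            "clusterization parameters.\n",
            count);
    }
}
```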
Force-pushed from 43062f8 to 2a3d874.
Force-pushed from b8b0c13 to dda6b8b.
I think the Alpaka code was just left with some outdated stuff, since the CUDA and SYCL code sets up the "mutex variable" much more simply than the Alpaka code does. 🤔
Other than cleaning up the Alpaka code, and resolving the merge conflict, I'm on board with the PR. 😉
Resolved review threads (now outdated):
- core/include/traccc/clusterization/clusterization_algorithm.hpp
- device/alpaka/include/traccc/alpaka/clusterization/clusterization_algorithm.hpp
This commit partially addresses acts-project#567. In the past, the CCL kernel was unable to deal with extremely large partitions. Although this is very unlikely to happen, our ODD samples contain a few cases of partitions so large it crashes the code. This commit equips the CCL code with some scratch memory which it can reserve using a mutex. This allows it enough space to do its work in global memory. Although this is, of course, slower, it should happen very infrequently. Parameters can be tuned to determine that frequency. This commit also contains a few optimizations to the code which reduce the running time on a μ = 200 event from about 1100 microseconds to 700 microseconds on an RTX A5000.
Force-pushed from dda6b8b to 7e35167.
Let's do it finally.
Since acts-project#595 was merged, some of the throughput examples started to fail. After some investigation, it turned out that this was not actually a mistake in acts-project#595, but rather a long-standing bug in the full chain algorithms. The situation was such that the full chain algorithms had custom destructors which destroyed the caching memory resources, which are pointers that need to be destroyed before the underlying memory resource they use. This creates a problem, namely that the cached memory resource is destroyed before _any other_ members of the class, including any long-standing memory allocations. When those allocations are then destroyed, the memory resource no longer exists and the program segfaults. Thankfully, the fix for this was very easy as the aforementioned destructors are not necessary at all, as the C++ standard guarantees that members are destroyed in reverse initialization order, and since our full chain algorithms always (correctly) specify the caching memory resource _after_ the base memory resource, the default destructors are more than sufficient. In order to fix the segmentation fault, this commit simply removes the offending destructors.
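For illustration, the language rule the fix relies on: non-static data members are destroyed in reverse order of declaration, so a caching memory resource declared after the resource it wraps is torn down first by the implicitly generated destructor. The class below is a simplified stand-in, not one of the actual full chain algorithms, and assumes vecmem's `host_memory_resource` and `binary_page_memory_resource` as the base and caching resources.

```cuda
#include <vecmem/memory/binary_page_memory_resource.hpp>
#include <vecmem/memory/host_memory_resource.hpp>

// Simplified stand-in for a full chain algorithm; the real classes hold many
// more members, but the ordering argument is the same.
class full_chain_example {
public:
    full_chain_example() : m_cached_mr(m_host_mr) {}
    // No user-provided destructor: the implicit one destroys m_cached_mr
    // first (declared last) and only afterwards m_host_mr, which is exactly
    // the order the caching resource needs.

private:
    vecmem::host_memory_resource m_host_mr;           // declared first
    vecmem::binary_page_memory_resource m_cached_mr;  // declared second
};
```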
This commit partially addresses #567. In the past, the CCL kernel was unable to deal with extremely large partitions. Although this is very unlikely to happen, our ODD samples contain a few cases of partitions so large it crashes the code. This commit equips the CCL code with some scratch memory which it can reserve using a mutex. This allows it enough space to do its work in global memory. Although this is, of course, slower, it should happen very infrequently. Parameters can be tuned to determine that frequency. This commit also contains a few optimizations to the code which reduce the running time on a μ = 200 ODD ttbar event from about 1100 microseconds to 700 microseconds on an RTX A5000.
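A hedged sketch of the fallback logic described above (the names, the locking scheme, and the buffer types are illustrative assumptions rather than the actual traccc kernel): a partition that fits the per-block shared-memory workspace is processed there, while an oversized partition first reserves a single global backup buffer through the mutex and runs the same algorithm out of global memory.

```cuda
__device__ void lock(unsigned int* mutex) {
    // Spin until the word flips from 0 (free) to 1 (taken).
    while (atomicCAS(mutex, 0u, 1u) != 0u) {
    }
    __threadfence();
}

__device__ void unlock(unsigned int* mutex) {
    __threadfence();
    atomicExch(mutex, 0u);
}

__global__ void ccl_kernel_stub(unsigned int partition_size,
                                unsigned int max_cells_in_shared,
                                unsigned short* backup_buffer,
                                unsigned int* backup_mutex) {
    // Fast path workspace: dynamic shared memory sized for the common case.
    extern __shared__ unsigned short shared_workspace[];

    __shared__ bool use_backup;
    if (threadIdx.x == 0) {
        use_backup = (partition_size > max_cells_in_shared);
        if (use_backup) {
            // Rare path: reserve the single global scratch buffer for this
            // block. Other oversized partitions wait here, which is slow but
            // should happen very infrequently with sensible parameters.
            lock(backup_mutex);
        }
    }
    __syncthreads();

    unsigned short* workspace = use_backup ? backup_buffer : shared_workspace;

    // ... run the CCL iterations on `workspace`, wherever it lives ...
    (void)workspace;
    __syncthreads();

    if (threadIdx.x == 0 && use_backup) {
        unlock(backup_mutex);
    }
}
```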