Improve robustness and performance of CCL #595
Conversation
This will need some updates to work with SYCL. Also, using a mutex is not the most efficient approach, but it is good enough. I would have preferred to simply use …
Note that this also makes the code more run-time configurable by removing some of the compile-time constexpr parameters.
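As an illustration of the run-time configurability mentioned above (a hedged sketch only; `ccl_kernel_config`, its fields, and the kernel stub are made-up names, not the actual traccc interface), the idea is to move sizing parameters out of `constexpr` constants and into a small struct that is filled at run time and passed to the kernel by value:

```cuda
// Illustrative only: the real traccc configuration type and kernel signature
// differ. Sizing parameters that used to be compile-time constants, e.g.
//   static constexpr unsigned int threads_per_partition = 128;
// become members of a run-time configuration object.
struct ccl_kernel_config {
    unsigned int threads_per_partition = 128;
    unsigned int max_cells_per_thread = 16;

    __host__ __device__ unsigned int target_partition_size() const {
        return threads_per_partition * max_cells_per_thread;
    }
};

__global__ void ccl_kernel_stub(const ccl_kernel_config cfg
                                /*, cell / cluster views ... */) {
    // The kernel reads its sizes from the config object instead of from
    // compile-time constants, so they can be tuned without recompiling.
    const unsigned int partition_size = cfg.target_partition_size();
    (void)partition_size;
}
```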
I'm fully on board with the proposed logic. Just have a gazillion technical comments... 😛
Resolved review threads (now outdated):
- device/common/include/traccc/clusterization/device/ccl_kernel.hpp
- device/common/include/traccc/clusterization/device/ccl_kernel_definitions.hpp
- device/common/include/traccc/clusterization/device/impl/ccl_kernel.ipp (two threads)
What's happening with this? 🤔 It would make my life a whole lot easier for running profiling if this was cleaned up / pushed in. (Right now this prevents me from "quickly" collecting a set of throughput numbers on a Hopper chip. 😦)
Force-pushed from d1c54f7 to a84cfd8.
The CUDA code is good to go; we'll have to see what the SYCL CI has to say about it.
Force-pushed from bb86f4c to 57096f8.
This should work now.
Force-pushed from fcd2bce to 5b79e50.
I'm very on board with the configuration changes. But as long as we do this, we should also update:
- https://github.com/acts-project/traccc/blob/main/examples/options/include/traccc/options/clusterization.hpp
- https://github.com/acts-project/traccc/blob/main/examples/options/src/clusterization.cpp
And of course also how it's used in all of our executables... 🤔
This now depends on #607.
Force-pushed from 0d310f2 to 199e0a6.
As I am finalizing acts-project#595, I am noticing that it is really quite a task to update the configuration of our algorithms. Not only do you need to update the configuration type and the ways your algorithm uses it, but you also need to update the way the CLI options are translated into those configuration files across _many_ executables. In this commit I am trying to make this process a little easier by specifying the `config_provider` mixin for command line options classes. This allows us to specify only once how options should be translated to configuration types, making the process easier and less prone to bugs.
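A minimal sketch of what such a `config_provider` mixin could look like (the member names, option values, and the `clustering_config` type below are illustrative assumptions, not the actual traccc classes): an options class declares once, via an implicit conversion operator, how its parsed values map onto the algorithm's configuration struct.

```cuda
#include <cstddef>

// Hypothetical algorithm configuration struct.
struct clustering_config {
    std::size_t threads_per_partition = 128;
    std::size_t max_cells_per_thread = 16;
};

// Hypothetical mixin: "this options class knows how to produce a CONFIG_T".
template <typename CONFIG_T>
class config_provider {
public:
    virtual ~config_provider() = default;
    /// Translate the parsed command line options into a configuration object.
    virtual operator CONFIG_T() const = 0;
};

// Hypothetical CLI options class for the clusterization algorithm.
class clusterization_options : public config_provider<clustering_config> {
public:
    operator clustering_config() const override {
        clustering_config cfg;
        cfg.threads_per_partition = m_threads_per_partition;
        cfg.max_cells_per_thread = m_max_cells_per_thread;
        return cfg;
    }

private:
    // In the real options classes these would be filled by the command line
    // parser.
    std::size_t m_threads_per_partition = 128;
    std::size_t m_max_cells_per_thread = 16;
};
```

Executables can then construct their algorithms directly from the options object through the implicit conversion, so the CLI-to-configuration translation lives in exactly one place.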
Force-pushed from a1876fd to ffc043c.
Force-pushed from 6db08a5 to b60a2ed.
Okay, you convinced me about the reentrancy aspect. 🤔
Force-pushed from 4621cbf to f259481.
There are a few important caveats with using `unique_lock` in device code as I found out in acts-project#595. This commit adds a few warnings to the documentation to more clearly explain how this type should be used.
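The exact warnings that went into the vecmem documentation are not quoted in this thread; the following is only a generic sketch, with made-up `device_mutex` / `device_unique_lock` types rather than the real vecmem API, of the usual caveats with RAII locks in device code: only one thread per block should acquire the lock, and the rest of the block has to be synchronised explicitly around the critical section.

```cuda
// Hypothetical types for illustration; the real vecmem device mutex and
// unique_lock have a different interface.
struct device_mutex {
    unsigned int* word;  // one word in global memory: 0 = free, 1 = taken

    __device__ void lock() {
        // Spin until we flip the word from "free" to "taken".
        while (atomicCAS(word, 0u, 1u) != 0u) {
        }
        __threadfence();
    }
    __device__ void unlock() {
        __threadfence();
        atomicExch(word, 0u);
    }
};

// RAII guard analogous to std::unique_lock.
struct device_unique_lock {
    device_mutex& m;
    __device__ explicit device_unique_lock(device_mutex& mtx) : m(mtx) {
        m.lock();
    }
    __device__ ~device_unique_lock() { m.unlock(); }
};

__global__ void guarded_kernel(unsigned int* mutex_word) {
    device_mutex mtx{mutex_word};
    // Caveat 1: only one thread per block may construct the guard. If every
    // thread of a warp spins on the same mutex, the kernel can deadlock.
    if (threadIdx.x == 0) {
        device_unique_lock guard(mtx);
        // ... thread-0-only work on the protected resource ...
    }  // the lock is released here, by thread 0 alone
    // Caveat 2: the guard lives in a single thread, so the rest of the block
    // must synchronise with it explicitly.
    __syncthreads();
}
```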
Let's put this on hold for a hot second while we finish up #558.
Force-pushed from be7aac2 to 693fd52.
In acts-project#595, I equipped the CCA code with some edge case handling which allows it to handle oversized partitions. Although this makes sure the algorithm works, it also risks to slow down execution. In order to better understand how much performance we might be losing, this commit adds the ability for the SYCL and CUDA algorithms to print some warnings if they ever encounter this edge case.
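One straightforward way such a warning can be produced (a sketch under assumptions; `backup_used`, the kernel stub, and the message text are illustrative, not the actual traccc code) is to count the oversized partitions in a device-side atomic counter and have the host report it after the launch:

```cuda
#include <cstdio>

#include <cuda_runtime.h>

// Illustrative only: a global counter that the kernel bumps whenever a
// partition is too large for the fast path.
__device__ unsigned int backup_used = 0u;

__global__ void ccl_kernel_stub(unsigned int partition_size,
                                unsigned int max_partition_size) {
    if (threadIdx.x == 0 && partition_size > max_partition_size) {
        atomicAdd(&backup_used, 1u);
    }
    // ... rest of the kernel ...
}

// Host side: emit a warning if the slow path was ever taken.
void warn_if_backup_used() {
    unsigned int count = 0u;
    cudaMemcpyFromSymbol(&count, backup_used, sizeof(count));
    if (count > 0u) {
        std::printf(
            "WARNING: %u partition(s) exceeded the configured size and fell "
            "back to global scratch memory; consider re-tuning the "
            "clusterization parameters.\n",
            count);
    }
}
```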
Force-pushed from 43062f8 to 2a3d874.
Force-pushed from b8b0c13 to dda6b8b.
I think the Alpaka code was just left with some outdated stuff, since the CUDA and SYCL code sets up the "mutex variable" much more simply than the Alpaka code does. 🤔
Other than cleaning up the Alpaka code, and resolving the merge conflict, I'm on board with the PR. 😉
Resolved review threads (now outdated):
- core/include/traccc/clusterization/clusterization_algorithm.hpp
- device/alpaka/include/traccc/alpaka/clusterization/clusterization_algorithm.hpp
This commit partially addresses acts-project#567. In the past, the CCL kernel was unable to deal with extremely large partitions. Although this is very unlikely to happen, our ODD samples contain a few cases of partitions so large it crashes the code. This commit equips the CCL code with some scratch memory which it can reserve using a mutex. This allows it enough space to do its work in global memory. Although this is, of course, slower, it should happen very infrequently. Parameters can be tuned to determine that frequency. This commit also contains a few optimizations to the code which reduce the running time on a μ = 200 event from about 1100 microseconds to 700 microseconds on an RTX A5000.
Force-pushed from dda6b8b to 7e35167.
Let's do it finally.
Since acts-project#595 was merged, some of the throughput examples started to fail. After some investigation, it turned out that this was not actually a mistake in acts-project#595, but rather a long-standing bug in the full chain algorithms. The situation was such that the full chain algorithms had custom destructors which destroyed the caching memory resources, which are pointers that need to be destroyed before the underlying memory resource they use. This creates a problem, namely that the cached memory resource is destroyed before _any other_ members of the class, including any long-standing memory allocations. When those allocations are then destroyed, the memory resource no longer exists and the program segfaults. Thankfully, the fix for this was very easy as the aforementioned destructors are not necessary at all, as the C++ standard guarantees that members are destroyed in reverse initialization order, and since our full chain algorithms always (correctly) specify the caching memory resource _after_ the base memory resource, the default destructors are more than sufficient. In order to fix the segmentation fault, this commit simply removes the offending destructors.
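For illustration, the language rule the fix relies on: non-static data members are destroyed in reverse order of declaration, so a caching memory resource declared after the resource it wraps is torn down first by the implicitly generated destructor. The class below is a simplified stand-in, not one of the actual full chain algorithms, and assumes vecmem's `host_memory_resource` and `binary_page_memory_resource` as the base and caching resources.

```cuda
#include <vecmem/memory/binary_page_memory_resource.hpp>
#include <vecmem/memory/host_memory_resource.hpp>

// Simplified stand-in for a full chain algorithm; the real classes hold many
// more members, but the ordering argument is the same.
class full_chain_example {
public:
    full_chain_example() : m_cached_mr(m_host_mr) {}
    // No user-provided destructor: the implicit one destroys m_cached_mr
    // first (declared last) and only afterwards m_host_mr, which is exactly
    // the order the caching resource needs.

private:
    vecmem::host_memory_resource m_host_mr;           // declared first
    vecmem::binary_page_memory_resource m_cached_mr;  // declared second
};
```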
This commit partially addresses #567. In the past, the CCL kernel was unable to deal with extremely large partitions. Although this is very unlikely to happen, our ODD samples contain a few cases of partitions so large it crashes the code. This commit equips the CCL code with some scratch memory which it can reserve using a mutex. This allows it enough space to do its work in global memory. Although this is, of course, slower, it should happen very infrequently. Parameters can be tuned to determine that frequency. This commit also contains a few optimizations to the code which reduce the running time on a μ = 200 ODD ttbar event from about 1100 microseconds to 700 microseconds on an RTX A5000.
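A hedged sketch of the fallback logic described above (the names, the locking scheme, and the buffer types are illustrative assumptions rather than the actual traccc kernel): a partition that fits the per-block shared-memory workspace is processed there, while an oversized partition first reserves a single global backup buffer through the mutex and runs the same algorithm out of global memory.

```cuda
__device__ void lock(unsigned int* mutex) {
    // Spin until the word flips from 0 (free) to 1 (taken).
    while (atomicCAS(mutex, 0u, 1u) != 0u) {
    }
    __threadfence();
}

__device__ void unlock(unsigned int* mutex) {
    __threadfence();
    atomicExch(mutex, 0u);
}

__global__ void ccl_kernel_stub(unsigned int partition_size,
                                unsigned int max_cells_in_shared,
                                unsigned short* backup_buffer,
                                unsigned int* backup_mutex) {
    // Fast path workspace: dynamic shared memory sized for the common case.
    extern __shared__ unsigned short shared_workspace[];

    __shared__ bool use_backup;
    if (threadIdx.x == 0) {
        use_backup = (partition_size > max_cells_in_shared);
        if (use_backup) {
            // Rare path: reserve the single global scratch buffer for this
            // block. Other oversized partitions wait here, which is slow but
            // should happen very infrequently with sensible parameters.
            lock(backup_mutex);
        }
    }
    __syncthreads();

    unsigned short* workspace = use_backup ? backup_buffer : shared_workspace;

    // ... run the CCL iterations on `workspace`, wherever it lives ...
    (void)workspace;
    __syncthreads();

    if (threadIdx.x == 0 && use_backup) {
        unlock(backup_mutex);
    }
}
```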