Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve robustness and performance of CCL #595

Merged
merged 1 commit into from
Jul 31, 2024

Conversation

stephenswat
Copy link
Member

@stephenswat stephenswat commented May 28, 2024

This commit partially addresses #567. In the past, the CCL kernel was unable to deal with extremely large partitions. Although this is very unlikely to happen, our ODD samples contain a few cases of partitions so large it crashes the code. This commit equips the CCL code with some scratch memory which it can reserve using a mutex. This allows it enough space to do its work in global memory. Although this is, of course, slower, it should happen very infrequently. Parameters can be tuned to determine that frequency. This commit also contains a few optimizations to the code which reduce the running time on a μ = 200 ODD ttbar event from about 1100 microseconds to 700 microseconds on an RTX A5000.

@stephenswat stephenswat added cuda Changes related to CUDA improvement Improve an existing feature sycl Changes related to SYCL labels May 28, 2024
@stephenswat stephenswat requested a review from krasznaa May 28, 2024 12:24
@stephenswat
Copy link
Member Author

This will need some updates to work with SYCL. Also, using a mutex is not the most efficient approach but it is good enough. I would have preferred to simple use malloc on the device, but I don't know if this works outside of CUDA.

@stephenswat
Copy link
Member Author

Note that this also makes the code more run-time configurable my removing some of the compile-time constexpr parameters.

Copy link
Member

@krasznaa krasznaa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fully on board with the proposed logic. Just have a gazillion of technical comments... 😛

@krasznaa
Copy link
Member

krasznaa commented Jun 5, 2024

What's happening with this? 🤔 It would make my life a whole lot easier with running profiling, if this was cleaned up / pushed in. (Right now this prevents me from "quickly" collecting a set of throughput numbers on a Hopper chip. 😦)

@stephenswat stephenswat force-pushed the fix/robust_ccl branch 2 times, most recently from d1c54f7 to a84cfd8 Compare June 5, 2024 10:06
@stephenswat
Copy link
Member Author

The CUDA code is good to go, have to see what the SYCL CI has to say about it.

@stephenswat stephenswat force-pushed the fix/robust_ccl branch 2 times, most recently from bb86f4c to 57096f8 Compare June 6, 2024 11:58
@stephenswat
Copy link
Member Author

This should work now.

@stephenswat stephenswat force-pushed the fix/robust_ccl branch 2 times, most recently from fcd2bce to 5b79e50 Compare June 6, 2024 12:19
Copy link
Member

@krasznaa krasznaa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm very on board with the configuration changes. But as long as we do this, we should also update:

And of course also how it's used in all of our executables... 🤔

@stephenswat
Copy link
Member Author

This now depends on #607.

@stephenswat stephenswat marked this pull request as draft June 7, 2024 13:51
@stephenswat stephenswat force-pushed the fix/robust_ccl branch 2 times, most recently from 0d310f2 to 199e0a6 Compare June 14, 2024 11:44
stephenswat added a commit to stephenswat/traccc that referenced this pull request Jun 21, 2024
As I am finalizing acts-project#595, I am noticing that it is really quite a task to
update the configuration of our algorithms. Not only do you need to
update the configuration type and the ways your algorithm uses it, but
you also need to update the way the CLI options are translated into
those configuration files across _many_ executables.

In this commit I am trying to make this process a little easier by
specifying the `config_provider` mixin for command line options classes.
This allows us to specify only once how options should be translated to
configuration types, making the process easier and less prone to bugs.
stephenswat added a commit to stephenswat/traccc that referenced this pull request Jun 24, 2024
As I am finalizing acts-project#595, I am noticing that it is really quite a task to
update the configuration of our algorithms. Not only do you need to
update the configuration type and the ways your algorithm uses it, but
you also need to update the way the CLI options are translated into
those configuration files across _many_ executables.

In this commit I am trying to make this process a little easier by
specifying the `config_provider` mixin for command line options classes.
This allows us to specify only once how options should be translated to
configuration types, making the process easier and less prone to bugs.
@stephenswat stephenswat force-pushed the fix/robust_ccl branch 2 times, most recently from a1876fd to ffc043c Compare June 24, 2024 13:25
@stephenswat stephenswat marked this pull request as ready for review June 24, 2024 13:25
@stephenswat stephenswat force-pushed the fix/robust_ccl branch 2 times, most recently from 6db08a5 to b60a2ed Compare June 24, 2024 13:58
@krasznaa
Copy link
Member

Okay, you convinced me about the reentrancy aspect. 🤔

@stephenswat stephenswat force-pushed the fix/robust_ccl branch 4 times, most recently from 4621cbf to f259481 Compare July 3, 2024 07:56
stephenswat added a commit to stephenswat/traccc that referenced this pull request Jul 3, 2024
There are a few important caveats with using `unique_lock` in device
code as I found out in acts-project#595. This commit adds a few warnings to the
documentation to more clearly explain how this type should be used.
stephenswat added a commit to stephenswat/traccc that referenced this pull request Jul 3, 2024
There are a few important caveats with using `unique_lock` in device
code as I found out in acts-project#595. This commit adds a few warnings to the
documentation to more clearly explain how this type should be used.
@stephenswat
Copy link
Member Author

Let's put this on hold for a hot second while we finish up #558.

@stephenswat stephenswat force-pushed the fix/robust_ccl branch 2 times, most recently from be7aac2 to 693fd52 Compare July 10, 2024 15:59
stephenswat added a commit to stephenswat/traccc that referenced this pull request Jul 10, 2024
In acts-project#595, I equipped the CCA code with some edge case handling which
allows it to handle oversized partitions. Although this makes sure the
algorithm works, it also risks to slow down execution. In order to
better understand how much performance we might be losing, this commit
adds the ability for the SYCL and CUDA algorithms to print some warnings
if they ever encounter this edge case.
@stephenswat stephenswat force-pushed the fix/robust_ccl branch 2 times, most recently from 43062f8 to 2a3d874 Compare July 16, 2024 14:04
stephenswat added a commit to stephenswat/traccc that referenced this pull request Jul 17, 2024
In acts-project#595, I equipped the CCA code with some edge case handling which
allows it to handle oversized partitions. Although this makes sure the
algorithm works, it also risks to slow down execution. In order to
better understand how much performance we might be losing, this commit
adds the ability for the SYCL and CUDA algorithms to print some warnings
if they ever encounter this edge case.
@stephenswat stephenswat force-pushed the fix/robust_ccl branch 2 times, most recently from b8b0c13 to dda6b8b Compare July 24, 2024 15:29
Copy link
Member

@krasznaa krasznaa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the Alpaka code was just left with some outdated stuff. 🤔 Since the CUDA and SYCL code sets up the "mutex variable" much more simply than the Alpaka code does.

Other than cleaning up the Alpaka code, and resolving the merge conflict, I'm on board with the PR. 😉

This commit partially addresses acts-project#567. In the past, the CCL kernel was
unable to deal with extremely large partitions. Although this is very
unlikely to happen, our ODD samples contain a few cases of partitions so
large it crashes the code. This commit equips the CCL code with some
scratch memory which it can reserve using a mutex. This allows it enough
space to do its work in global memory. Although this is, of course,
slower, it should happen very infrequently. Parameters can be tuned to
determine that frequency. This commit also contains a few optimizations
to the code which reduce the running time on a μ = 200 event from about
1100 microseconds to 700 microseconds on an RTX A5000.
Copy link
Member

@krasznaa krasznaa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do it finally.

@stephenswat stephenswat merged commit 87b4b1a into acts-project:main Jul 31, 2024
23 checks passed
stephenswat added a commit to stephenswat/traccc that referenced this pull request Aug 1, 2024
In acts-project#595, I equipped the CCA code with some edge case handling which
allows it to handle oversized partitions. Although this makes sure the
algorithm works, it also risks to slow down execution. In order to
better understand how much performance we might be losing, this commit
adds the ability for the SYCL and CUDA algorithms to print some warnings
if they ever encounter this edge case.
stephenswat added a commit to stephenswat/traccc that referenced this pull request Aug 1, 2024
In acts-project#595, I equipped the CCA code with some edge case handling which
allows it to handle oversized partitions. Although this makes sure the
algorithm works, it also risks to slow down execution. In order to
better understand how much performance we might be losing, this commit
adds the ability for the SYCL and CUDA algorithms to print some warnings
if they ever encounter this edge case.
stephenswat added a commit to stephenswat/traccc that referenced this pull request Aug 1, 2024
Since acts-project#595 was merged, some of the throughput examples started to fail.
After some investigation, it turned out that this was not actually a
mistake in acts-project#595, but rather a long-standing bug in the full chain
algorithms.

The situation was such that the full chain algorithms had custom
destructors which destroyed the caching memory resources, which are
pointers that need to be destroyed before the underlying memory resource
they use. This creates a problem, namely that the cached memory resource
is destroyed before _any other_ members of the class, including any
long-standing memory allocations. When those allocations are then
destroyed, the memory resource no longer exists and the program
segfaults.

Thankfully, the fix for this was very easy as the aforementioned
destructors are not necessary at all, as the C++ standard guarantees
that members are destroyed in reverse initialization order, and since
our full chain algorithms always (correctly) specify the caching memory
resource _after_ the base memory resource, the default destructors are
more than sufficient.

In order to fix the segmentation fault, this commit simply removes the
offending destructors.
stephenswat added a commit to stephenswat/traccc that referenced this pull request Aug 1, 2024
Since acts-project#595 was merged, some of the throughput examples started to fail.
After some investigation, it turned out that this was not actually a
mistake in acts-project#595, but rather a long-standing bug in the full chain
algorithms.

The situation was such that the full chain algorithms had custom
destructors which destroyed the caching memory resources, which are
pointers that need to be destroyed before the underlying memory resource
they use. This creates a problem, namely that the cached memory resource
is destroyed before _any other_ members of the class, including any
long-standing memory allocations. When those allocations are then
destroyed, the memory resource no longer exists and the program
segfaults.

Thankfully, the fix for this was very easy as the aforementioned
destructors are not necessary at all, as the C++ standard guarantees
that members are destroyed in reverse initialization order, and since
our full chain algorithms always (correctly) specify the caching memory
resource _after_ the base memory resource, the default destructors are
more than sufficient.

In order to fix the segmentation fault, this commit simply removes the
offending destructors.
stephenswat added a commit to stephenswat/traccc that referenced this pull request Aug 2, 2024
Since acts-project#595 was merged, some of the throughput examples started to fail.
After some investigation, it turned out that this was not actually a
mistake in acts-project#595, but rather a long-standing bug in the full chain
algorithms.

The situation was such that the full chain algorithms had custom
destructors which destroyed the caching memory resources, which are
pointers that need to be destroyed before the underlying memory resource
they use. This creates a problem, namely that the cached memory resource
is destroyed before _any other_ members of the class, including any
long-standing memory allocations. When those allocations are then
destroyed, the memory resource no longer exists and the program
segfaults.

Thankfully, the fix for this was very easy as the aforementioned
destructors are not necessary at all, as the C++ standard guarantees
that members are destroyed in reverse initialization order, and since
our full chain algorithms always (correctly) specify the caching memory
resource _after_ the base memory resource, the default destructors are
more than sufficient.

In order to fix the segmentation fault, this commit simply removes the
offending destructors.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cuda Changes related to CUDA improvement Improve an existing feature sycl Changes related to SYCL
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants