Access track parameters in a sorted order for KF #689
Conversation
One caveat is that this PR finds slightly fewer parameters than main. [Command and PR-vs-main comparison table]
The KF sort is done. As I need to use the sort function in the SYCL KF, I included oneDPL in this PR. I guess this is OK now that uxlfoundation/oneDPL#1060 is resolved (@krasznaa). But I did not actually run the SYCL KF, because the example executables are not being compiled at the moment (#655).

Following is the performance of this PR: [performance table: This PR vs. Main]
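(Editorial aside: for reference, a minimal sketch of what a oneDPL device-side sort looks like on SYCL, roughly what the SYCL KF would need. This is illustrative only, not the PR's code; the queue and USM buffer handling are assumptions.)

```cpp
// Minimal oneDPL sort sketch on a SYCL device (illustrative only).
#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>
#include <sycl/sycl.hpp>
#include <vector>

int main() {
    sycl::queue q;  // default-selected SYCL device
    std::vector<float> theta{1.2f, 0.3f, 2.9f, 0.7f};

    // Copy the sort keys into USM device memory.
    float* keys = sycl::malloc_device<float>(theta.size(), q);
    q.memcpy(keys, theta.data(), theta.size() * sizeof(float)).wait();

    // oneDPL's parallel sort, executed on the device behind `q`.
    oneapi::dpl::sort(oneapi::dpl::execution::make_device_policy(q),
                      keys, keys + theta.size());

    q.memcpy(theta.data(), keys, theta.size() * sizeof(float)).wait();
    sycl::free(keys, q);
    return 0;
}
```

A zip iterator over (key, index) pairs would presumably give the same key/index sort on SYCL as on the CUDA side.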
I'm on board! I was worried about adding oneDPL as part of this PR at first, but it's a small PR. So, why not?
And finally having an up-to-date version of oneDPL shall allow us to complete the full-chain on SYCL as well. 😄
So yeah, please rebase to the current main, and let's have it. 😉
I checked the performance again and I found that this PR makes the MT throughput worse.
Did you check the MT performance earlier? Because I could imagine situations where the ST performance would improve, but the MT performance would decrease. In any case, as long as you have an idea what to do about it, I'll just let you propose an update. 😉
I used the MT performance for the previous attempt as well. Well, let's see…
I think I was sane. For the previous attempt, I configured CMake with `CMAKE_CUDA_ARCHITECTURES=86`.

@krasznaa Do you know what's going on here? Should I not use the CUDA architecture number for the CMake configuration? The automatically configured architecture number is 52.
I might be misunderstanding you, but what you found is that Thrust's sorting speed / efficiency is very optimization dependent, no? 😕 When we sort the tracks in an extra step, we need the total runtime of the sorting step to be shorter than the amount by which the other steps speed up because of the sorting. I don't think that our own code would change its speed all that much based on which exact architecture we build for. So it must be the sorting that is way faster with SM 8.6 compared to SM 5.2.

The SM 5.2 value does not have any big significance. That's the minimum value that modern CUDA versions support without a warning. Normally I'd say that we should build the code for a couple of different architectures in parallel. The reason I didn't pursue this yet is our large memory usage as-is. 🤔 But yeah, as long as the build system issues are sorted out, we could eventually go for https://cmake.org/cmake/help/latest/prop_tgt/CUDA_ARCHITECTURES.html

What we could do for now: if we take the hit of raising our minimum CMake version to 3.24, we could set `CMAKE_CUDA_ARCHITECTURES=native`.
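(Editorial aside: one way to see what an SM 5.2 build actually does on a newer card is to compare the device's compute capability with the binary/PTX versions the runtime reports for a kernel; a `binaryVersion` that differs from the PTX version indicates the driver JIT-compiled the PTX. A minimal standalone sketch, not part of this PR:)

```cpp
// Standalone diagnostic sketch: print the device's compute capability
// together with the binary / PTX versions reported for a kernel, to see
// what an SM 5.2 build actually runs as on a newer card.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_kernel() {}

int main() {
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, 0);
    std::printf("device compute capability: %d.%d\n", prop.major, prop.minor);

    cudaFuncAttributes attr{};
    cudaFuncGetAttributes(&attr, dummy_kernel);
    // ptxVersion is the virtual architecture the PTX was generated for;
    // binaryVersion is the architecture of the binary the runtime uses
    // (it differs from the PTX version when the driver had to JIT).
    std::printf("kernel ptx version: %d, binary version: %d\n",
                attr.ptxVersion, attr.binaryVersion);
    return 0;
}
```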
To be clear, the numbers in the table are the event throughput of `traccc_throughput_mt_cuda`.
Probably right, but I am not sure if the odd behavior of the table numbers is fully dependent on the Thrust sort speed. As you can see, the behavior of the first and second column is totally opposite. If it really came down to the Thrust sort speed, they would behave consistently. Most of all, the sort function is called only once in the KF (unlike the CKF, where it is called more than 10 times).
Probably not. The first column is way slower with SM 8.6. (But as I mentioned, I doubt that the Thrust sort is the dominating factor, considering the huge gap.)
Ahh... so the throughput tests are slower with SM 8.6? 😕 That I didn't expect indeed. (And hence didn't even check the numbers, as this would seem like a very unlikely outcome.) This would be worth investigating in more detail, with some profiling. But then, do I understand correctly that for now we do nothing? Since the current setup, with SM 5.2, apparently gives optimal performance in your tests. 🤔
Yup, doing nothing at the moment is my suggestion.
Based on #706
UPDATE
Now the KF also uses the sorting method, but with a different criterion: the number of measurements ($$n$$) of the track.

The problem with the $$\theta$$ method is that branching can still be an issue if some of the tracks in a warp end too early unexpectedly (or continue with too many measurements). We can prevent this by sorting the tracks w.r.t. the number of measurements.

One can still argue that the different step sizes between the planes can be problematic, and I think that is a valid point 🤔 It is not easy to know which one of $$\theta$$ or $$n$$ is better until one checks the results. At least the ODD data says that sorting with $$n$$ is beneficial.

Of course, we can try to improve this further by experimenting with different values or by combining them.
The main purpose of this PR is to sort the track parameters based on $$\theta$$, to reduce the branch divergence. (Similar to the binning method that Markus mentioned in the parallelization meeting.)

As it turned out that sorting the track parameters themselves is quite expensive, I decided to make a small vector holding $$\theta$$ and the vector indices, which is sorted instead. In the kernel, the track parameters are accessed through the indices of this small vector. A sketch of this trick is shown below.
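(Editorial aside: a minimal sketch of this key/index sort, assuming a Thrust-based implementation and hypothetical type and function names — not traccc's actual interfaces.)

```cpp
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <thrust/transform.h>

// Hypothetical, simplified track parameter record.
struct track_params { float theta, phi, qop; };

// Extract the sort key (theta here; the UPDATE swaps this for the
// measurement count of the track).
struct theta_of {
    __device__ float operator()(const track_params& p) const { return p.theta; }
};

__global__ void fit_kernel(const track_params* params,
                           const unsigned int* indices, unsigned int n) {
    const unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) { return; }
    // Neighbouring threads now handle tracks with similar theta, which
    // reduces the branch divergence within a warp.
    const track_params p = params[indices[i]];
    (void)p;  // ... run the fit on p ...
}

void run_sorted_fit(thrust::device_vector<track_params>& params) {
    const unsigned int n = params.size();

    // Small auxiliary vectors: the sort keys and the permutation indices.
    thrust::device_vector<float> keys(n);
    thrust::device_vector<unsigned int> indices(n);
    thrust::transform(params.begin(), params.end(), keys.begin(), theta_of{});
    thrust::sequence(indices.begin(), indices.end());

    // Sort only the (key, index) pairs; the heavy parameter vector stays put.
    thrust::sort_by_key(keys.begin(), keys.end(), indices.begin());

    fit_kernel<<<(n + 255u) / 256u, 256u>>>(
        thrust::raw_pointer_cast(params.data()),
        thrust::raw_pointer_cast(indices.data()), n);
}
```

The point of the design is that `sort_by_key` only moves the small (float, index) pairs, while the full parameter records are merely read through the permutation inside the kernel.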
Following are the benchmark results with `traccc_throughput_mt_cuda` on ODD data and `traccc_benchmark_cuda` with the toy geometry.

Command for mt_cuda:

./bin/traccc_throughput_mt_cuda --detector-file=geometries/odd/odd-detray_geometry_detray.json --digitization-file=geometries/odd/odd-digi-geometric-config.json --grid-file=geometries/odd/odd-detray_surface_grids_detray.json --use-detray-detector --input-directory=../../../../../../../bld6/data/traccc/geant4_ttbar_mu100/ --cpu-threads 2

Command for the cuda toy benchmark:

./bin/traccc_benchmark_cuda
Device: NVIDIA RTX A6000