
Access track parameters in a sorted order for KF #689

Merged — 2 commits into acts-project:main on Oct 1, 2024

Conversation

@beomki-yeo beomki-yeo commented Aug 23, 2024

Based on #706

UPDATE

Now the KF also uses the sorting method, but with a different sorting key: the number of measurements ($$n$$) of the track.

The problem with the $$\theta$$ method is that branching can still be an issue if some of the tracks in a warp end unexpectedly early (or continue with too many measurements). We can prevent this by sorting the tracks w.r.t. the number of measurements.

One can still argue that the different step sizes between the planes can be problematic, and I think that is a valid point 🤔
It is not easy to know which of $$\theta$$ or $$n$$ is better until one checks the results. At least the ODD data says that sorting with $$n$$ is beneficial.

|  | Sort CKF and KF (PR) | Sort CKF | No sort (main) |
| --- | --- | --- | --- |
| mt cuda [Hz] | 19.01 | 17.85 | 15.0808 |

Of course, we can try to improve this further by experimenting with different sorting keys or combining them.


The main purpose of the PR is sorting the track parameters based on $$\theta$$, to reduce the branching divergence. (Similar to the binning method that Markus mentioned in the parallelization meeting.)

As it turned out, sorting the track parameters themselves is quite expensive, so I decided to make a small vector holding $$\theta$$ and the vector indices, which is sorted instead. In the kernel, the track parameters are accessed through the indices of this small vector.
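On the device this would presumably be a Thrust sort of a key/index pair; here is a minimal host-side C++ sketch of the idea (the helper name and values are mine, not from the PR — the key could be $$\theta$$ or the measurement count $$n$$):

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Sort a small index vector by a per-track key (e.g. theta, or the number
// of measurements), leaving the heavy track-parameter container untouched.
std::vector<std::size_t> sorted_indices(const std::vector<float>& key) {
    std::vector<std::size_t> idx(key.size());
    std::iota(idx.begin(), idx.end(), 0);  // 0, 1, 2, ...
    std::sort(idx.begin(), idx.end(),
              [&key](std::size_t a, std::size_t b) { return key[a] < key[b]; });
    return idx;
}
```

A kernel thread `i` then reads the track parameters at `idx[i]`, so adjacent threads in a warp process tracks with similar keys and hopefully take similar branches.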

Following are the benchmark results with traccc_throughput_mt_cuda (ODD data) and traccc_benchmark_cuda (toy geometry).

Command for mt_cuda:
./bin/traccc_throughput_mt_cuda --detector-file=geometries/odd/odd-detray_geometry_detray.json --digitization-file=geometries/odd/odd-digi-geometric-config.json --grid-file=geometries/odd/odd-detray_surface_grids_detray.json --use-detray-detector --input-directory=../../../../../../../bld6/data/traccc/geant4_ttbar_mu100/ --cpu-threads 2

Command for cuda toy benchmark:
./bin/traccc_benchmark_cuda

Device: NVIDIA RTX A6000

(screenshot of benchmark results)

@beomki-yeo beomki-yeo marked this pull request as draft August 23, 2024 07:37
@beomki-yeo beomki-yeo changed the title from "WIP: Sort track parameters for every CKF step" to "Sort track parameters for every CKF step" Sep 19, 2024
@beomki-yeo beomki-yeo marked this pull request as ready for review September 19, 2024 16:03
@beomki-yeo beomki-yeo force-pushed the sort-track-params-2 branch 2 times, most recently from ae033ba to 5c71fb6, September 19, 2024 20:49
@beomki-yeo (Contributor, Author):

One caveat is that this PR finds slightly fewer track parameters than main with the default config (3313582 → 3307886). The difference gets much smaller with a large enough buffer for track candidates:

Command: ./bin/traccc_throughput_mt_cuda --detector-file=geometries/odd/odd-detray_geometry_detray.json --digitization-file=geometries/odd/odd-digi-geometric-config.json --grid-file=geometries/odd/odd-detray_surface_grids_detray.json --use-detray-detector --input-directory=../../../../../../../bld6/data/traccc/geant4_ttbar_mu200/ --cpu-threads 2 --max-num-branches-per-surface=100 --max-num-branches-per-seed=100 --nmax-per-seed=200

PR:

Reconstructed track parameters: 3438856
Time totals:
                  File reading  637 ms
            Warm-up processing  710 ms
              Event processing  5891 ms
Throughput:
            Warm-up processing  71.0011 ms/event, 14.0843 events/s
              Event processing  58.9137 ms/event, 16.974 events/s

main:

Reconstructed track parameters: 3438862
Time totals:
                  File reading  630 ms
            Warm-up processing  837 ms
              Event processing  6760 ms
Throughput:
            Warm-up processing  83.7662 ms/event, 11.938 events/s
              Event processing  67.6091 ms/event, 14.7909 events/s

@beomki-yeo beomki-yeo force-pushed the sort-track-params-2 branch 4 times, most recently from 1e245a2 to 8680ae5, September 20, 2024 18:38
@beomki-yeo beomki-yeo changed the title from "Sort track parameters for every CKF step" to "Access track parameters in a sorted order" Sep 20, 2024
@beomki-yeo beomki-yeo changed the title from "Access track parameters in a sorted order" to "Access track parameters in a sorted order for KF" Sep 20, 2024
@beomki-yeo beomki-yeo marked this pull request as draft September 25, 2024 11:15
@beomki-yeo beomki-yeo force-pushed the sort-track-params-2 branch 5 times, most recently from 74cd6aa to 23ce5c3, September 25, 2024 16:37
@beomki-yeo beomki-yeo marked this pull request as ready for review September 25, 2024 16:44
@beomki-yeo (Contributor, Author):

The KF sort is done. Since I need to use the sort function in the SYCL KF, I included oneDPL in this PR. I guess this is OK as uxlfoundation/oneDPL#1060 is resolved (@krasznaa). But I actually did not run the SYCL KF, because the example executables are not compiling at the moment (#655).

Following is the performance of this PR

This PR

Reconstructed track parameters: 3299852
Time totals:
                  File reading  635 ms
            Warm-up processing  578 ms
              Event processing  4864 ms
Throughput:
            Warm-up processing  57.807 ms/event, 17.2989 events/s
              Event processing  48.641 ms/event, 20.5588 events/s

Main

Reconstructed track parameters: 3299932
Time totals:
                  File reading  622 ms
            Warm-up processing  606 ms
              Event processing  5285 ms
Throughput:
            Warm-up processing  60.6916 ms/event, 16.4767 events/s
              Event processing  52.8595 ms/event, 18.9181 events/s

@beomki-yeo beomki-yeo force-pushed the sort-track-params-2 branch 2 times, most recently from c064ea9 to b1b8ee4, September 25, 2024 19:21
@beomki-yeo beomki-yeo mentioned this pull request Sep 25, 2024
@krasznaa (Member) left a comment:

I'm on board! I was worried about adding oneDPL as part of this PR at first, but it's a small PR. So, why not?

And finally having an up-to-date version of oneDPL shall allow us to complete the full-chain on SYCL as well. 😄

So yeah, please rebase to the current main, and let's have it. 😉

@beomki-yeo beomki-yeo merged commit 2080638 into acts-project:main Oct 1, 2024
24 checks passed
@beomki-yeo (Contributor, Author):

I checked the performance again and found that this PR makes mt_cuda slower... No idea why I got the opposite result in the previous attempt.
Let me work on a fix.

@krasznaa (Member) commented Oct 1, 2024

Did you check the MT performance earlier? Because I could imagine situations where the ST performance would improve, but the MT performance would decrease.

In any case, as long as you have an idea what to do about it, I'll just let you propose an update. 😉

@beomki-yeo
Copy link
Contributor Author

I used the MT performance for the previous attempt as well. Well, let's see...

@beomki-yeo (Contributor, Author) commented Oct 1, 2024

I think I was sane. For the previous attempt, I configured CMake with -DCMAKE_CUDA_ARCHITECTURES=86 for the A6000 (it seems to be the correct number according to here), and this produces literally the opposite outcome compared to the configuration without it. With the architecture number set, the performance increases for $$n$$ sorting but drops for $$\theta$$. The behaviors are reversed with the other configuration...

A6000, Release build:

| Sort key | $$\lvert\theta - \pi/2\rvert$$ | Number of measurements ($$n$$) | globalIndex | No sort (Commit: 9b4c18d) |
| --- | --- | --- | --- | --- |
| mt_cuda (w/ cmake arch num: 86) [Hz] | 16.69 | 20.87 | 18.5 | 19.47 |
| mt_cuda (w/ cmake arch num: 52) [Hz] | 21.3 | 20.19 | 20.3 | 20.787 |

@krasznaa Do you know what's going on here? Should I not use the CUDA architecture number for the CMake configuration? The automatically configured architecture number is 52.

@beomki-yeo (Contributor, Author):

The mt_cuda command is:

./bin/traccc_throughput_mt_cuda --detector-file=geometries/odd/odd-detray_geometry_detray.json --digitization-file=geometries/odd/odd-digi-geometric-config.json --grid-file=geometries/odd/odd-detray_surface_grids_detray.json --use-detray-detector --input-directory=../../../../../../../bld6/data/traccc/geant4_ttbar_mu200/ --cpu-threads 2

@krasznaa (Member) commented Oct 1, 2024

I might be misunderstanding you, but what you found is that Thrust's sorting speed / efficiency is very optimization dependent, no? 😕

When we sort the tracks in an extra step, we need the total runtime of the sorting step to be shorter than the amount by which the other steps speed up because of the sorting. I don't think that our own code would change its speed all too much because of which exact architecture we build for. So it must be the sorting that is way faster with SM 8.6 compared to SM 5.2.
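The break-even condition in the paragraph above can be written out (the notation is mine, not from the code): with $$t_{\text{sort}}$$ the runtime of the extra sorting step and $$t_i$$ the runtime of each downstream step,

$$t_{\text{sort}} < \sum_i \left( t_i^{\text{unsorted}} - t_i^{\text{sorted}} \right)$$

i.e. the sort is a net win only if the divergence reduction saves more time than the sort itself costs.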

The SM 5.2 value does not have any big significance. That's the minimum value that modern CUDA versions support without a warning. Normally I'd say that we should build the code for a couple of different architectures in parallel. The reason I didn't pursue this yet is because of our large memory usage as-is. 🤔 But yeah, as long as build system issues are sorted out, we could eventually go for CMAKE_CUDA_ARCHITECTURES=all-major in our build. See:

https://cmake.org/cmake/help/latest/prop_tgt/CUDA_ARCHITECTURES.html

What we could do for now: If we take the hit of raising our minimum CMake version to 3.24, we could set CMAKE_CUDA_ARCHITECTURES=native in our build. 🤔 That will make the code pretty unportable, but at this stage of development it may still be the best setup to have.
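For reference, hypothetical configure invocations for the two options discussed above (the flag is documented by CMake; paths are illustrative):

```shell
# Requires CMake >= 3.24; builds for whatever GPU is present at configure time.
cmake -S . -B build -DCMAKE_CUDA_ARCHITECTURES=native

# Alternatively, pin the architecture explicitly (8.6 for an RTX A6000).
cmake -S . -B build -DCMAKE_CUDA_ARCHITECTURES=86
```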

@beomki-yeo (Contributor, Author) commented Oct 1, 2024

To be clear, the numbers in the table are the event throughput of mt_cuda, not the sorting speed.

I might be misunderstanding you, but what you found is that Thrust's sorting speed / efficiency is very optimization dependent, no? 😕

Probably right, but I am not sure if the odd behavior of the table numbers depends entirely on the Thrust sort speed. As you can see, the behaviors of the first and second columns are completely opposite. If it really came down to the Thrust sort speed, they would behave consistently. Above all, the sort function is called only once in the KF (unlike the CKF, where it is called more than 10 times).

So it must be the sorting that is way faster with SM 8.6 compared to SM 5.2.

Probably not. The first column is way slower with SM 8.6. (But as I mentioned, I doubt that the Thrust sort is the dominating factor, considering the huge gap.)

Using native sounds good, though I am keen to use the number that shows the peak performance for the specific device (yeah, it is 5.2 for the A6000 🤔...).

- I added one more column, which sorts by globalIndex (so there is nothing to sort, as the key container is already sorted...). We can see that the performance drops noticeably compared to the last column. Looking at this, just calling thrust::sort is not super cheap.

@krasznaa (Member) commented Oct 1, 2024

Ahh... so the throughput tests are slower with SM 8.6? 😕 That I didn't expect indeed. (And hence didn't even check the numbers, as this would seem like a very unlikely outcome.)

This would be worth investigating in more detail, with some profiling.

But then, do I understand correctly that for now we do nothing? Since the current setup, with SM 5.2, gives optimal performance in your tests apparently. 🤔

@beomki-yeo (Contributor, Author) commented Oct 1, 2024

Yup, doing nothing at the moment is my suggestion. (And I will probably make a PR sorting with $$\theta - \pi/2$$.) More profiling will definitely be needed.
