
Access track parameters in a sorted order for KF #689

Merged — 2 commits into acts-project:main on Oct 1, 2024

Conversation

@beomki-yeo beomki-yeo commented Aug 23, 2024

Based on #706

UPDATE

Now the KF also uses the sorting method, but with a different sorting key: the number of measurements ($$n$$) of the track.

The problem with the $$\theta$$ method is that branching can still be an issue if some of the tracks in a warp end unexpectedly early (or continue with too many measurements). We can prevent this by sorting the tracks w.r.t. the number of measurements.

One can still argue that the different step sizes between the planes can be problematic, and I think that is a valid point 🤔
It is not easy to know which of $$\theta$$ or $$n$$ is better until one checks the results. At least the ODD data says that sorting with $$n$$ is beneficial.

|  | Sort CKF and KF (PR) | Sort CKF | No sort (main) |
| --- | --- | --- | --- |
| mt cuda [Hz] | 19.01 | 17.85 | 15.0808 |

Of course, we can try to improve this further by experimenting with different sorting keys or combining them.


The main purpose of the PR is sorting the track parameters based on $$\theta$$, to reduce the branching divergence. (Similar to the binning method that Markus mentioned in the parallelization meeting.)

As it turned out, sorting the track parameters themselves is quite expensive, so I decided to make a small vector holding $$\theta$$ and the vector indices, which is sorted instead. In the kernel, the track parameters are accessed through the indices of this small vector.
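On the device this would presumably be a Thrust sort of a key/index pair; here is a minimal host-side C++ sketch of the idea (the helper name and values are mine, not from the PR — the key could be $$\theta$$ or the measurement count $$n$$):

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Sort a small index vector by a per-track key (e.g. theta, or the number
// of measurements), leaving the heavy track-parameter container untouched.
std::vector<std::size_t> sorted_indices(const std::vector<float>& key) {
    std::vector<std::size_t> idx(key.size());
    std::iota(idx.begin(), idx.end(), 0);  // 0, 1, 2, ...
    std::sort(idx.begin(), idx.end(),
              [&key](std::size_t a, std::size_t b) { return key[a] < key[b]; });
    return idx;
}
```

A kernel thread `i` then reads the track parameters at `idx[i]`, so adjacent threads in a warp process tracks with similar keys and hopefully take similar branches.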

Following are the benchmark results with traccc_throughput_mt_cuda (ODD data) and traccc_benchmark_cuda (toy geometry).

Command for mt_cuda:
./bin/traccc_throughput_mt_cuda --detector-file=geometries/odd/odd-detray_geometry_detray.json --digitization-file=geometries/odd/odd-digi-geometric-config.json --grid-file=geometries/odd/odd-detray_surface_grids_detray.json --use-detray-detector --input-directory=../../../../../../../bld6/data/traccc/geant4_ttbar_mu100/ --cpu-threads 2

Command for cuda toy benchmark:
./bin/traccc_benchmark_cuda

Device: NVIDIA RTX A6000

(screenshot of benchmark results)

@beomki-yeo beomki-yeo marked this pull request as draft August 23, 2024 07:37
@beomki-yeo beomki-yeo changed the title from "WIP: Sort track parameters for every CKF step" to "Sort track parameters for every CKF step" Sep 19, 2024
@beomki-yeo beomki-yeo marked this pull request as ready for review September 19, 2024 16:03
@beomki-yeo beomki-yeo force-pushed the sort-track-params-2 branch 2 times, most recently from ae033ba to 5c71fb6, September 19, 2024 20:49
@beomki-yeo (Contributor, Author):

One caveat is that this PR finds slightly fewer track parameters than main with the default config (3313582 → 3307886). The difference gets much smaller with a large enough buffer for track candidates:

Command: ./bin/traccc_throughput_mt_cuda --detector-file=geometries/odd/odd-detray_geometry_detray.json --digitization-file=geometries/odd/odd-digi-geometric-config.json --grid-file=geometries/odd/odd-detray_surface_grids_detray.json --use-detray-detector --input-directory=../../../../../../../bld6/data/traccc/geant4_ttbar_mu200/ --cpu-threads 2 --max-num-branches-per-surface=100 --max-num-branches-per-seed=100 --nmax-per-seed=200

PR:

Reconstructed track parameters: 3438856
Time totals:
                  File reading  637 ms
            Warm-up processing  710 ms
              Event processing  5891 ms
Throughput:
            Warm-up processing  71.0011 ms/event, 14.0843 events/s
              Event processing  58.9137 ms/event, 16.974 events/s

main:

Reconstructed track parameters: 3438862
Time totals:
                  File reading  630 ms
            Warm-up processing  837 ms
              Event processing  6760 ms
Throughput:
            Warm-up processing  83.7662 ms/event, 11.938 events/s
              Event processing  67.6091 ms/event, 14.7909 events/s

@beomki-yeo beomki-yeo force-pushed the sort-track-params-2 branch 4 times, most recently from 1e245a2 to 8680ae5, September 20, 2024 18:38
@beomki-yeo beomki-yeo changed the title from "Sort track parameters for every CKF step" to "Access track parameters in a sorted order" Sep 20, 2024
@beomki-yeo beomki-yeo changed the title from "Access track parameters in a sorted order" to "Access track parameters in a sorted order for KF" Sep 20, 2024
@beomki-yeo beomki-yeo marked this pull request as draft September 25, 2024 11:15
@beomki-yeo beomki-yeo force-pushed the sort-track-params-2 branch 5 times, most recently from 74cd6aa to 23ce5c3, September 25, 2024 16:37
@beomki-yeo beomki-yeo marked this pull request as ready for review September 25, 2024 16:44
@beomki-yeo (Contributor, Author):

The KF sort is done. Since I need to use the sort function in the SYCL KF, I included oneDPL in this PR. I guess this is OK as uxlfoundation/oneDPL#1060 is resolved (@krasznaa). But I actually did not run the SYCL KF, because the example executables are not compiling at the moment (#655).

Following is the performance of this PR

This PR

Reconstructed track parameters: 3299852
Time totals:
                  File reading  635 ms
            Warm-up processing  578 ms
              Event processing  4864 ms
Throughput:
            Warm-up processing  57.807 ms/event, 17.2989 events/s
              Event processing  48.641 ms/event, 20.5588 events/s

Main

Reconstructed track parameters: 3299932
Time totals:
                  File reading  622 ms
            Warm-up processing  606 ms
              Event processing  5285 ms
Throughput:
            Warm-up processing  60.6916 ms/event, 16.4767 events/s
              Event processing  52.8595 ms/event, 18.9181 events/s

@beomki-yeo beomki-yeo force-pushed the sort-track-params-2 branch 2 times, most recently from c064ea9 to b1b8ee4, September 25, 2024 19:21
@beomki-yeo beomki-yeo mentioned this pull request Sep 25, 2024
@krasznaa (Member) left a comment:

I'm on board! I was worried about adding oneDPL as part of this PR at first, but it's a small PR. So, why not?

And finally having an up-to-date version of oneDPL shall allow us to complete the full-chain on SYCL as well. 😄

So yeah, please rebase to the current main, and let's have it. 😉

@beomki-yeo beomki-yeo merged commit 2080638 into acts-project:main Oct 1, 2024
24 checks passed
@beomki-yeo (Contributor, Author):

I checked the performance again and found that this PR makes mt_cuda slower... No idea why I got the opposite result in the previous attempt.
Let me work on a fix.

@krasznaa (Member) commented Oct 1, 2024

Did you check the MT performance earlier? Because I could imagine situations where the ST performance would improve, but the MT performance would decrease.

In any case, as long as you have an idea what to do about it, I'll just let you propose an update. 😉

@beomki-yeo
Copy link
Contributor Author

I used the MT performance for the previous attempt as well. Well, let's see...

@beomki-yeo (Contributor, Author) commented Oct 1, 2024

I think I was sane. For the previous attempt, I configured CMake with -DCMAKE_CUDA_ARCHITECTURES=86 for the A6000 (it seems to be the correct number according to here), and this produces literally the opposite outcome compared to the configuration without it. With the architecture number set, the performance increases for $$n$$ sorting but drops for $$\theta$$. The behaviors are reversed with the other configuration...

A6000, Release build:

| Sort key | $$\lvert\theta - \pi/2\rvert$$ | Number of measurements ($$n$$) | globalIndex | No sort (Commit: 9b4c18d) |
| --- | --- | --- | --- | --- |
| mt_cuda (w/ cmake arch num: 86) [Hz] | 16.69 | 20.87 | 18.5 | 19.47 |
| mt_cuda (w/ cmake arch num: 52) [Hz] | 21.3 | 20.19 | 20.3 | 20.787 |

@krasznaa Do you know what's going on here? Should I not use the CUDA architecture number for the CMake configuration? The automatically configured architecture number is 52.

@beomki-yeo (Contributor, Author):

The mt_cuda command is:

./bin/traccc_throughput_mt_cuda --detector-file=geometries/odd/odd-detray_geometry_detray.json --digitization-file=geometries/odd/odd-digi-geometric-config.json --grid-file=geometries/odd/odd-detray_surface_grids_detray.json --use-detray-detector --input-directory=../../../../../../../bld6/data/traccc/geant4_ttbar_mu200/ --cpu-threads 2

@krasznaa (Member) commented Oct 1, 2024

I might be misunderstanding you, but what you found is that Thrust's sorting speed / efficiency is very optimization dependent, no? 😕

When we sort the tracks in an extra step, we need the total runtime of the sorting step to be shorter than the amount by which the other steps speed up because of the sorting. I don't think that our own code would change its speed all too much because of which exact architecture we build for. So it must be the sorting that is way faster with SM 8.6 compared to SM 5.2.
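The break-even condition in the paragraph above can be written out (the notation is mine, not from the code): with $$t_{\text{sort}}$$ the runtime of the extra sorting step and $$t_i$$ the runtime of each downstream step,

$$t_{\text{sort}} < \sum_i \left( t_i^{\text{unsorted}} - t_i^{\text{sorted}} \right)$$

i.e. the sort is a net win only if the divergence reduction saves more time than the sort itself costs.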

The SM 5.2 value does not have any big significance. That's the minimum value that modern CUDA versions support without a warning. Normally I'd say that we should build the code for a couple of different architectures in parallel. The reason I didn't pursue this yet is because of our large memory usage as-is. 🤔 But yeah, as long as build system issues are sorted out, we could eventually go for CMAKE_CUDA_ARCHITECTURES=all-major in our build. See:

https://cmake.org/cmake/help/latest/prop_tgt/CUDA_ARCHITECTURES.html

What we could do for now: If we take the hit of raising our minimum CMake version to 3.24, we could set CMAKE_CUDA_ARCHITECTURES=native in our build. 🤔 That will make the code pretty unportable, but at this stage of development it may still be the best setup to have.
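For reference, hypothetical configure invocations for the two options discussed above (the flag is documented by CMake; paths are illustrative):

```shell
# Requires CMake >= 3.24; builds for whatever GPU is present at configure time.
cmake -S . -B build -DCMAKE_CUDA_ARCHITECTURES=native

# Alternatively, pin the architecture explicitly (8.6 for an RTX A6000).
cmake -S . -B build -DCMAKE_CUDA_ARCHITECTURES=86
```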

@beomki-yeo (Contributor, Author) commented Oct 1, 2024

To be clear, the numbers in the table are the event throughput of mt_cuda, not the sorting speed.

I might be misunderstanding you, but what you found is that Thrust's sorting speed / efficiency is very optimization dependent, no? 😕

Probably right, but I am not sure if the odd behavior of the table numbers depends entirely on the Thrust sort speed. As you can see, the behaviors of the first and second columns are completely opposite. If it really came down to the Thrust sort speed, they would behave consistently. Above all, the sort function is called only once in the KF (unlike the CKF, where it is called more than 10 times).

So it must be the sorting that is way faster with SM 8.6 compared to SM 5.2.

Probably not. The first column is way slower with SM 8.6. (But as I mentioned, I doubt that the Thrust sort is the dominating factor, considering the huge gap.)

Using native sounds good, though I am keen to use the number that shows the peak performance for the specific device (yeah, it is 5.2 for the A6000 🤔...).

- I added one more column, which sorts by globalIndex (so there is nothing to sort, as the key container is already sorted...). We can see that the performance drops noticeably compared to the last column. Looking at this, just calling thrust::sort is not super cheap.

@krasznaa (Member) commented Oct 1, 2024

Ahh... so the throughput tests are slower with SM 8.6? 😕 That I didn't expect indeed. (And hence didn't even check the numbers, as this would seem like a very unlikely outcome.)

This would be worth investigating in more detail, with some profiling.

But then, do I understand correctly that for now we do nothing? Since the current setup, with SM 5.2, gives optimal performance in your tests apparently. 🤔

@beomki-yeo (Contributor, Author) commented Oct 1, 2024

Yup, doing nothing at the moment is my suggestion. (And I will probably make a PR sorting with $$\theta - \pi/2$$.) More profiling will definitely be needed.
