
Cell EDM Rewrite, main branch (2024.09.18.) #712

Merged: 17 commits, Oct 4, 2024

Conversation

@krasznaa (Member)

This is the next monster PR: it exchanges traccc::cell_collection_types and traccc::cluster_container_types for SoA versions.

To cut right to the chase: it doesn't bring any performance improvement. 😦 This EDM change of course only affects clusterization, which is already one of the fastest things that we run. If anything, I see an O(1%) performance drop during the TML $\mu$=200 throughput measurements with this update applied. 🤔

On my RTX3080 I get the following with the current main branch:

[bash][Legolas]:traccc > ./build-orig/bin/traccc_throughput_st_cuda --input-directory=tml_full/ttbar_mu200/ --input-events=10 --cold-run-events=100 --processed-events=1000
...
Using CUDA device: NVIDIA GeForce RTX 3080 [id: 0, bus: 1, device: 0]
Reconstructed track parameters: 0
Time totals:
                  File reading  5475 ms
            Warm-up processing  359 ms
              Event processing  2814 ms
Throughput:
            Warm-up processing  3.59883 ms/event, 277.868 events/s
              Event processing  2.81497 ms/event, 355.244 events/s
[bash][Legolas]:traccc >

While this PR produces the following:

[bash][Legolas]:traccc > ./out/build/cuda/bin/traccc_throughput_st_cuda --input-directory=tml_full/ttbar_mu200/ --input-events=10 --cold-run-events=100 --processed-events=1000
...
Using CUDA device: NVIDIA GeForce RTX 3080 [id: 0, bus: 1, device: 0]
Reconstructed track parameters: 0
Time totals:
                  File reading  5471 ms
            Warm-up processing  305 ms
              Event processing  2830 ms
Throughput:
            Warm-up processing  3.0542 ms/event, 327.418 events/s
              Event processing  2.83068 ms/event, 353.272 events/s
[bash][Legolas]:traccc >

(There is variation on these numbers, but the "new" code is always just a little slower. 😦)

About the code:

  • I chose traccc::edm::silicon_cell_collection and traccc::edm::silicon_cluster_collection as the name of these containers. But I'm not too fond of these names either. So I'm open to suggestions.
  • Most of the changes in the PR are pretty trivial, since the new containers follow the naming of the current traccc::cell closely.
  • The larger changes are:
    • The clusterization functions shared between the CPU and GPU algorithms now have slightly different interfaces.
    • Some logic changes in the "mapping code", where I had to make larger modifications. Keep in mind that the fundamental setup of that code does not work well with SoA containers; the more of our EDM we turn into SoAs, the harder that code will become to maintain. 🤔
    • The traccc::edm::silicon_cluster_collection type, only used in the host code, is now a jagged vector of cell indices. As such, traccc::host::measurement_creation_algorithm had to change its interface slightly.
  • As @stephenswat will have heard from me in recent days, I modified all of the "sanity code" that he wrote earlier to work on either AoS or SoA containers. I'm fairly happy with what I did there, but I'm very open to suggestions on that code as well.

We'll have to do some profiling, but I suspect that the small performance drop comes from the fact that the PR's code always reads the cell data from global memory whenever it needs it. Just loading some of the information into local registers in a couple of places will hopefully take us back to the previous performance. I just didn't want to complicate the code even further in this PR. 🤔

This PR also closes #691.

@krasznaa krasznaa added cuda Changes related to CUDA sycl Changes related to SYCL cpu Changes related to CPU code edm Changes to the data model kokkos Changes related to Kokkos alpaka Changes related to Alpaka labels Sep 21, 2024
@krasznaa krasznaa removed the kokkos Changes related to Kokkos label Sep 21, 2024
@stephenswat (Member) left a comment:

I'm generally on board, but in its current state this makes the code much less readable and generic; I'd strongly recommend looking into a proxy object for this, as it would make the code more readable and require far fewer code line changes.

See, e.g. https://github.com/acts-project/acts/blob/main/Examples/Framework/include/ActsExamples/EventData/Measurement.hpp and https://github.com/acts-project/traccc/blob/main/core/include/traccc/utils/array_wrapper.hpp for inspiration.

@stephenswat (Member)

As mentioned, please increase the minimum vecmem version to 1.8.0.

@beomki-yeo (Contributor)

Hmm, are you going to go through this for every EDM class?

@beomki-yeo (Contributor)

This will also have lots of conflicts with #692 😿

@stephenswat (Member)

This will also have lots of conflicts with #692 😿

The good news is that when the proxy objects are implemented, most of the code will remain unchanged.

@krasznaa krasznaa marked this pull request as draft September 24, 2024 14:50
@krasznaa (Member, Author)

This will also have lots of conflicts with #692 😿

The good news is that when the proxy objects are implemented, most of the code will remain unchanged.

Some further developments are indeed underway...

@krasznaa (Member, Author)

All of you, hold onto your hats. 😄 If/once we settle on acts-project/vecmem#296, these are the types of updates that we will need to do to switch from the current AoS to a new SoA EDM:

[image: screenshot of the type of code changes required]

At the same time I looked at profiles of the throughput application a little as well. This was very educational. As it turns out, the small slowdown is not due to the kernels. It seems to be due to the code spending a little more time on memory copies. 🤔

That's not great news, as apparently the vecmem::edm code is not quite as efficient wrt. CPU usage as I hoped. But at least the SoA layout doesn't seem to have much of an impact on clusterization after all. (Remember, even with the current AoS layout, since traccc::cell is tiny, the memory access pattern of clusterization is pretty efficient already.)

Comment on lines +51 to +54:

    +        out.push_back(v2);
         }
     } else if (tid == 0) {
    -    out[atomicAdd(out_size, 1u)] = projection(in.at(tid));
    +    out.push_back(projection(in.at(tid)));
@krasznaa (Member, Author) commented:

@stephenswat take note that I changed this code a little. Since you were literally implementing a resizable 1D vector by hand previously, I decided to just use the vecmem types here, which were designed for exactly this use case. 🤔

@krasznaa krasznaa force-pushed the CellEDMRewrite-main-20240918 branch from ca8dd63 to adaecb6 Compare September 30, 2024 16:28
@krasznaa krasznaa marked this pull request as ready for review September 30, 2024 16:28
@krasznaa krasznaa force-pushed the CellEDMRewrite-main-20240918 branch 2 times, most recently from aabcb8e to 1999c5b Compare September 30, 2024 18:35
@krasznaa (Member, Author)

The good news is that once the code starts working on all platforms with all compilers, this latest version finally delivers on the performance front. 😄

[bash][Legolas]:traccc > ./out/build/cuda/bin/traccc_throughput_st_cuda --input-directory=tml_full/ttbar_mu200/ --input-events=10 --cold-run-events=100 --processed-events=1000

Running Single-threaded CUDA GPU throughput tests

>>> Detector Options <<<
  Detector file       : tml_detector/trackml-detector.csv
  Material file       : 
  Surface grid file   : 
  Use detray::detector: no
  Digitization file   : tml_detector/default-geometric-config-generic.json
>>> Input Data Options <<<
  Input data format             : csv
  Input directory               : tml_full/ttbar_mu200/
  Number of input events        : 10
  Number of input events to skip: 0
>>> Clusterization Options <<<
  Threads per partition:      256
  Target cells per thread:    8
  Max cells per thread:       16
  Scratch space size mult.:   256
>>> Track Seeding Options <<<
  None
>>> Track Finding Options <<<
  Max number of branches per seed: 10
  Max number of branches per surface: 10
  Track candidates range   : 3:100
  Minimum step length for the next surface: 0.5 [mm] 
  Maximum step counts for the next surface: 100
  Maximum Chi2             : 30
  Maximum branches per step: 10
  Maximum number of skipped steps per candidates: 3
  PDG Number: 13
>>> Track Propagation Options <<<
Navigation
----------------------------
  Min. mask tolerance   : 1e-05 [mm]
  Max. mask tolerance   : 1 [mm]
  Mask tolerance scalor : 0.05
  Path tolerance        : 1 [um]
  Overstep tolerance    : -100 [um]
  Search window         : 0 x 0

Parameter Transport
----------------------------
  Min. Stepsize         : 0.0001 [mm]
  Runge-Kutta tolerance : 0.0001 [mm]
  Max. step updates     : 10000
  Stepsize  constraint  : 3.40282e+38 [mm]
  Path limit            : 5 [m]
  Use Bethe energy loss : true
  Do cov. transport     : true
  Use eloss gradient    : false
  Use B-field gradient  : false


>>> Throughput Measurement Options <<<
  Cold run event(s) : 100
  Processed event(s): 1000
  Log file          : 

WARNING: @traccc::io::csv::read_cells: 251 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/tml_full/ttbar_mu200/event000000000-cells.csv
WARNING: @traccc::io::csv::read_cells: 305 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/tml_full/ttbar_mu200/event000000001-cells.csv
WARNING: @traccc::io::csv::read_cells: 176 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/tml_full/ttbar_mu200/event000000002-cells.csv
WARNING: @traccc::io::csv::read_cells: 200 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/tml_full/ttbar_mu200/event000000003-cells.csv
WARNING: @traccc::io::csv::read_cells: 224 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/tml_full/ttbar_mu200/event000000004-cells.csv
WARNING: @traccc::io::csv::read_cells: 170 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/tml_full/ttbar_mu200/event000000005-cells.csv
WARNING: @traccc::io::csv::read_cells: 321 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/tml_full/ttbar_mu200/event000000006-cells.csv
WARNING: @traccc::io::csv::read_cells: 322 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/tml_full/ttbar_mu200/event000000007-cells.csv
WARNING: @traccc::io::csv::read_cells: 222 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/tml_full/ttbar_mu200/event000000008-cells.csv
WARNING: @traccc::io::csv::read_cells: 118 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/tml_full/ttbar_mu200/event000000009-cells.csv
Using CUDA device: NVIDIA GeForce RTX 3080 [id: 0, bus: 1, device: 0]
Reconstructed track parameters: 0
Time totals:
                  File reading  4968 ms
            Warm-up processing  358 ms
              Event processing  2715 ms
Throughput:
            Warm-up processing  3.58551 ms/event, 278.9 events/s
              Event processing  2.71537 ms/event, 368.274 events/s
[bash][Legolas]:traccc >

Though I am a little afraid that this may be artificial, since the previous result was on x86_64-ubuntu2204-gcc11-opt, while these latest numbers are on x86_64-ubuntu2404-gcc13-opt. (I upgraded my home PC during the weekend... 😛) Still, at least the hardware is the same... 🤔

@krasznaa krasznaa force-pushed the CellEDMRewrite-main-20240918 branch from ccb8eb3 to df40fb2 Compare October 1, 2024 14:12
@stephenswat (Member) left a comment:

It's getting there. 👍

@krasznaa (Member, Author) commented Oct 2, 2024:

Quality Gate failed

Failed conditions:
  • 2 New Bugs (required ≤ 0)
  • C Reliability Rating on New Code (required ≥ A)

See analysis details on SonarCloud

Huhh... 🤔 What's your take on these errors @stephenswat?

@stephenswat (Member)

Huhh... 🤔 What's your take on these errors @stephenswat?

SonarCloud actually makes a really valid point here about constraining universal references; I'd suggest we go ahead and implement them.

@krasznaa (Member, Author) commented Oct 2, 2024:

Huhh... 🤔 What's your take on these errors @stephenswat?

SonarCloud actually makes a really valid point here about constraining universal references; I'd suggest we go ahead and implement them.

As long as you have a concrete idea of how to go about it, I'm happy to let you propose the improvement. 😉

@krasznaa krasznaa force-pushed the CellEDMRewrite-main-20240918 branch from df40fb2 to b65600e Compare October 3, 2024 14:32
@stephenswat (Member)

Okay, I guess we need to get vecmem 1.10.0 and then we can put this in, right?

Commit messages:

  • Updated the algorithms in traccc::core to use the new types. Updated the remaining clients that were still using these headers by mistake.
  • Allowing the code to revert to a very similar setup that it had with the old AoS cell collection. Also introduced the usage of detector description proxy objects in a few places.
  • Updated the sanity functions once again, to bring them back to a setup closer to how they are at the moment in the main branch. Also synchronized the CUDA and SYCL implementations of the sanity functions a little.
  • Introduced some additional functions into edm::silicon_cell to make it a bit easier to handle such objects/containers.
  • Fixed a long-standing mistake in traccc::device::ccl_core, which was brought to light with these latest changes.
  • The aggregate_cluster code uses the same cell over and over again, so using a proxy formalism makes sense here after all. After switching to vecmem-1.10.0, to make this possible.
@krasznaa krasznaa force-pushed the CellEDMRewrite-main-20240918 branch from b65600e to bf7ac0d Compare October 4, 2024 10:03
sonarqubecloud (bot) commented Oct 4, 2024:

Quality Gate failed

Failed conditions:
  • C Reliability Rating on New Code (required ≥ A)
  • 2 New Bugs (required ≤ 0)

See analysis details on SonarCloud

@stephenswat (Member) left a comment:

Okay, I think this can go in now.

@krasznaa krasznaa merged commit 14d4882 into acts-project:main Oct 4, 2024
22 of 23 checks passed
@krasznaa krasznaa deleted the CellEDMRewrite-main-20240918 branch October 4, 2024 12:00
Successfully merging this pull request may close these issues.

Remove module_link and cell_module