
Cell EDM Rewrite, main branch (2024.09.18.) #712

Merged: 17 commits, Oct 4, 2024

Conversation

@krasznaa (Member)

This is the next monster PR: it exchanges traccc::cell_collection_types and traccc::cluster_container_types for SoA versions.

To cut right to the chase: it doesn't bring any performance improvement. 😦 This EDM change of course only affects clusterization, which is already one of the fastest things that we run. If anything, I see an O(1%) performance drop during the TML $\mu$=200 throughput measurements with this update applied. 🤔

On my RTX3080 I get the following with the current main branch:

[bash][Legolas]:traccc > ./build-orig/bin/traccc_throughput_st_cuda --input-directory=tml_full/ttbar_mu200/ --input-events=10 --cold-run-events=100 --processed-events=1000
...
Using CUDA device: NVIDIA GeForce RTX 3080 [id: 0, bus: 1, device: 0]
Reconstructed track parameters: 0
Time totals:
                  File reading  5475 ms
            Warm-up processing  359 ms
              Event processing  2814 ms
Throughput:
            Warm-up processing  3.59883 ms/event, 277.868 events/s
              Event processing  2.81497 ms/event, 355.244 events/s
[bash][Legolas]:traccc >

While this PR produces the following:

[bash][Legolas]:traccc > ./out/build/cuda/bin/traccc_throughput_st_cuda --input-directory=tml_full/ttbar_mu200/ --input-events=10 --cold-run-events=100 --processed-events=1000
...
Using CUDA device: NVIDIA GeForce RTX 3080 [id: 0, bus: 1, device: 0]
Reconstructed track parameters: 0
Time totals:
                  File reading  5471 ms
            Warm-up processing  305 ms
              Event processing  2830 ms
Throughput:
            Warm-up processing  3.0542 ms/event, 327.418 events/s
              Event processing  2.83068 ms/event, 353.272 events/s
[bash][Legolas]:traccc >

(There is variation on these numbers, but the "new" code is always just a little slower. 😦)

About the code:

  • I chose traccc::edm::silicon_cell_collection and traccc::edm::silicon_cluster_collection as the name of these containers. But I'm not too fond of these names either. So I'm open to suggestions.
  • Most of the changes in the PR are pretty trivial, since the new containers follow the naming of the current traccc::cell closely.
  • The larger changes are:
    • The clusterization functions shared between the CPU and GPU algorithms now have slightly different interfaces.
    • Some logic changes in the "mapping code", where I had to make larger modifications. Keep in mind that the fundamental setup of that code does not work well with SoA containers; the more of our EDM we turn into SoAs, the harder that code will become to maintain. 🤔
    • The traccc::edm::silicon_cluster_collection type, only used in the host code, is now a jagged vector of cell indices. As such, traccc::host::measurement_creation_algorithm had to change its interface slightly.
  • As @stephenswat will have heard from me in recent days, I modified all of the "sanity code" that he wrote earlier to work on either AoS or SoA containers. I'm fairly happy with what I did there, but I'm very open to suggestions on that code as well.

We'll have to do some profiling, but I suspect that the small performance drop comes from the fact that the PR's code always reads the cell data from global memory whenever it needs it. Just loading some of the information into local registers in a couple of places will hopefully take us back to the previous performance. I just didn't want to complicate the code even further in this PR. 🤔

This PR also closes #691.

@krasznaa krasznaa added cuda Changes related to CUDA sycl Changes related to SYCL cpu Changes related to CPU code edm Changes to the data model kokkos Changes related to Kokkos alpaka Changes related to Alpaka labels Sep 21, 2024
@krasznaa krasznaa removed the kokkos Changes related to Kokkos label Sep 21, 2024
@stephenswat (Member) left a comment:

I'm generally on board, but in its current state this makes the code much less readable and generic; I'd strongly recommend looking into a proxy object for this, as it would make the code more readable and require far fewer code line changes.

See, e.g. https://github.com/acts-project/acts/blob/main/Examples/Framework/include/ActsExamples/EventData/Measurement.hpp and https://github.com/acts-project/traccc/blob/main/core/include/traccc/utils/array_wrapper.hpp for inspiration.

@stephenswat (Member)

As mentioned, please increase the minimum vecmem version to 1.8.0.

@beomki-yeo (Contributor)

Hmm, are you going to go through this for every EDM class?

@beomki-yeo (Contributor)

This will also have lots of conflicts with #692 😿

@stephenswat (Member)

This will also have lots of conflicts with #692 😿

The good news is that when the proxy objects are implemented, most of the code will remain unchanged.

@krasznaa krasznaa marked this pull request as draft September 24, 2024 14:50
@krasznaa (Member, Author)

This will also have lots of conflicts with #692 😿

The good news is that when the proxy objects are implemented, most of the code will remain unchanged.

Some further developments are indeed underway...

@krasznaa (Member, Author)

All of you, hold onto your hats. 😄 If/once we settle on acts-project/vecmem#296, these are the types of updates that we will need to do to switch from the current AoS to a new SoA EDM:

[image: screenshot of the type of code changes required]

At the same time I looked at profiles of the throughput application a little as well. This was very educational. As it turns out, the small slowdown is not due to the kernels. It seems to be due to the code spending a little more time on memory copies. 🤔

That's not great news, as apparently the vecmem::edm code is not quite as efficient wrt. CPU usage as I hoped. But at least the SoA layout doesn't seem to have much of an impact on clusterization after all. (Remember, even with the current AoS layout, since traccc::cell is tiny, the memory access pattern of clusterization is pretty efficient already.)

Comment on lines +51 to +54:

    +        out.push_back(v2);
         }
     } else if (tid == 0) {
    -    out[atomicAdd(out_size, 1u)] = projection(in.at(tid));
    +    out.push_back(projection(in.at(tid)));
@krasznaa (Member, Author) commented:

@stephenswat take note that I changed this code a little. Since you were literally implementing a resizable 1D vector by hand previously, I decided to just use the vecmem types here, which were designed for exactly this use case. 🤔

@krasznaa krasznaa force-pushed the CellEDMRewrite-main-20240918 branch from ca8dd63 to adaecb6 Compare September 30, 2024 16:28
@krasznaa krasznaa marked this pull request as ready for review September 30, 2024 16:28
@krasznaa krasznaa force-pushed the CellEDMRewrite-main-20240918 branch 2 times, most recently from aabcb8e to 1999c5b Compare September 30, 2024 18:35
@krasznaa (Member, Author)

The good news is that once the code starts working on all platforms with all compilers, this latest version finally delivers on the performance front. 😄

[bash][Legolas]:traccc > ./out/build/cuda/bin/traccc_throughput_st_cuda --input-directory=tml_full/ttbar_mu200/ --input-events=10 --cold-run-events=100 --processed-events=1000

Running Single-threaded CUDA GPU throughput tests

>>> Detector Options <<<
  Detector file       : tml_detector/trackml-detector.csv
  Material file       : 
  Surface grid file   : 
  Use detray::detector: no
  Digitization file   : tml_detector/default-geometric-config-generic.json
>>> Input Data Options <<<
  Input data format             : csv
  Input directory               : tml_full/ttbar_mu200/
  Number of input events        : 10
  Number of input events to skip: 0
>>> Clusterization Options <<<
  Threads per partition:      256
  Target cells per thread:    8
  Max cells per thread:       16
  Scratch space size mult.:   256
>>> Track Seeding Options <<<
  None
>>> Track Finding Options <<<
  Max number of branches per seed: 10
  Max number of branches per surface: 10
  Track candidates range   : 3:100
  Minimum step length for the next surface: 0.5 [mm] 
  Maximum step counts for the next surface: 100
  Maximum Chi2             : 30
  Maximum branches per step: 10
  Maximum number of skipped steps per candidates: 3
  PDG Number: 13
>>> Track Propagation Options <<<
Navigation
----------------------------
  Min. mask tolerance   : 1e-05 [mm]
  Max. mask tolerance   : 1 [mm]
  Mask tolerance scalor : 0.05
  Path tolerance        : 1 [um]
  Overstep tolerance    : -100 [um]
  Search window         : 0 x 0

Parameter Transport
----------------------------
  Min. Stepsize         : 0.0001 [mm]
  Runge-Kutta tolerance : 0.0001 [mm]
  Max. step updates     : 10000
  Stepsize  constraint  : 3.40282e+38 [mm]
  Path limit            : 5 [m]
  Use Bethe energy loss : true
  Do cov. transport     : true
  Use eloss gradient    : false
  Use B-field gradient  : false


>>> Throughput Measurement Options <<<
  Cold run event(s) : 100
  Processed event(s): 1000
  Log file          : 

WARNING: @traccc::io::csv::read_cells: 251 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/tml_full/ttbar_mu200/event000000000-cells.csv
WARNING: @traccc::io::csv::read_cells: 305 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/tml_full/ttbar_mu200/event000000001-cells.csv
WARNING: @traccc::io::csv::read_cells: 176 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/tml_full/ttbar_mu200/event000000002-cells.csv
WARNING: @traccc::io::csv::read_cells: 200 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/tml_full/ttbar_mu200/event000000003-cells.csv
WARNING: @traccc::io::csv::read_cells: 224 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/tml_full/ttbar_mu200/event000000004-cells.csv
WARNING: @traccc::io::csv::read_cells: 170 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/tml_full/ttbar_mu200/event000000005-cells.csv
WARNING: @traccc::io::csv::read_cells: 321 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/tml_full/ttbar_mu200/event000000006-cells.csv
WARNING: @traccc::io::csv::read_cells: 322 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/tml_full/ttbar_mu200/event000000007-cells.csv
WARNING: @traccc::io::csv::read_cells: 222 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/tml_full/ttbar_mu200/event000000008-cells.csv
WARNING: @traccc::io::csv::read_cells: 118 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/tml_full/ttbar_mu200/event000000009-cells.csv
Using CUDA device: NVIDIA GeForce RTX 3080 [id: 0, bus: 1, device: 0]
Reconstructed track parameters: 0
Time totals:
                  File reading  4968 ms
            Warm-up processing  358 ms
              Event processing  2715 ms
Throughput:
            Warm-up processing  3.58551 ms/event, 278.9 events/s
              Event processing  2.71537 ms/event, 368.274 events/s
[bash][Legolas]:traccc >

Though I am a little afraid that this may be artificial, since the previous result was on x86_64-ubuntu2204-gcc11-opt, while these latest numbers are on x86_64-ubuntu2404-gcc13-opt. (I upgraded my home PC during the weekend... 😛) Still, at least the hardware is the same... 🤔

@krasznaa krasznaa force-pushed the CellEDMRewrite-main-20240918 branch from ccb8eb3 to df40fb2 Compare October 1, 2024 14:12
@stephenswat (Member) left a comment:

It's getting there. 👍

@krasznaa (Member, Author) commented Oct 2, 2024:

Quality Gate failed

Failed conditions:
  • 2 New Bugs (required ≤ 0)
  • C Reliability Rating on New Code (required ≥ A)

See analysis details on SonarCloud

Huhh... 🤔 What's your take on these errors @stephenswat?

@stephenswat (Member)

Huhh... 🤔 What's your take on these errors @stephenswat?

SonarCloud actually makes a really valid point here about constraining universal references; I'd suggest we go ahead and implement them.

@krasznaa (Member, Author) commented Oct 2, 2024:

Huhh... 🤔 What's your take on these errors @stephenswat?

SonarCloud actually makes a really valid point here about constraining universal references; I'd suggest we go ahead and implement them.

As long as you have a concrete idea of how to go about it, I'm happy to let you propose the improvement. 😉

@krasznaa krasznaa force-pushed the CellEDMRewrite-main-20240918 branch from df40fb2 to b65600e Compare October 3, 2024 14:32
@stephenswat (Member)

Okay, I guess we need to get vecmem 1.10.0 and then we can put this in, right?

Commit messages:

  • Updated the algorithms in traccc::core to use the new types. Updated the remaining clients that were still using these headers by mistake.
  • Allowing the code to revert to a very similar setup that it had with the old AoS cell collection. Also introduced the usage of detector description proxy objects in a few places.
  • Updated the sanity functions once again, to bring them back to a setup closer to how they are at the moment in the main branch. Also synchronized the CUDA and SYCL implementations of the sanity functions a little.
  • Introduced some additional functions into edm::silicon_cell to make it a bit easier to handle such objects/containers.
  • Fixed a long-standing mistake in traccc::device::ccl_core, which was brought to light with these latest changes.
  • The aggregate_cluster code uses the same cell over and over again, so using a proxy formalism makes sense here after all. After switching to vecmem-1.10.0, to make this possible.
@krasznaa krasznaa force-pushed the CellEDMRewrite-main-20240918 branch from b65600e to bf7ac0d Compare October 4, 2024 10:03
sonarqubecloud (bot) commented Oct 4, 2024:

Quality Gate failed

Failed conditions:
  • C Reliability Rating on New Code (required ≥ A)
  • 2 New Bugs (required ≤ 0)

See analysis details on SonarCloud

@stephenswat (Member) left a comment:

Okay, I think this can go in now.

@krasznaa krasznaa merged commit 14d4882 into acts-project:main Oct 4, 2024
22 of 23 checks passed
@krasznaa krasznaa deleted the CellEDMRewrite-main-20240918 branch October 4, 2024 12:00
Successfully merging this pull request may close these issues.

Remove module_link and cell_module