Alpaka: Clustering + Throughput #558
Currently compiles, but the barrier code needs updating.
It looks like Alpaka always wants a WorkDiv of at least 1, whereas some of the CUDA code is fine with it being zero. For now, just clamp it to one, but investigate if any further changes are needed.
Currently, there doesn't seem to be a way to get that information from Alpaka, so just re-use the existing CUDA code here for now, as it's just for user info.
This allows the ST example to run to completion, but may not actually be the underlying cause, as the MT example still fails.
This was only needed when the size of the extern was static.
Disabled the throughput building for now, to focus on the first two examples.
Assuming this is the only remaining issue, can fix this after.
@krasznaa, @stephenswat, @beomki-yeo could we have a quick review? There aren't many changes outside the alpaka code.
Can we merge `qualifiers.hpp` and `hints.hpp` into a single file with the name `directives.hpp`?
Nah, we can think of the file naming later.
I'm very happy with how you added a bunch of `->ignore()` and `->wait()` calls on the various copy operations. However, since it's super easy to miss some of them, did you try to compile the code with `-DVECMEM_FAIL_ON_ASYNC_ERRORS=TRUE`? 🤔
https://github.com/acts-project/vecmem/blob/main/cmake/vecmem-options.cmake#L58-L60
I added that feature specifically to help with making sure that our code would only wait where it needs to. (Without that flag, if the user forgets to either wait for an event or ignore it, the code will wait for the event. Just to be safe.)
```cpp
struct CCLKernel {

    template <typename TAcc>
    ALPAKA_FN_ACC void operator()(
        TAcc const& acc, const cell_collection_types::const_view cells_view,
        const cell_module_collection_types::const_view modules_view,
        const device::details::index_t max_cells_per_partition,
        const device::details::index_t target_cells_per_partition,
        measurement_collection_types::view measurements_view,
        vecmem::data::vector_view<unsigned int> cell_links) const {
```
Generally, this setup is fine with me. I'm just thinking out loud here.
- Would it not be nicer / more symmetric to call this something like `traccc::alpaka::kernels::ccl_kernel`? That's the sort of naming that we went with in `traccc::cuda`.
- I think Alpaka allows such structs to carry member variables as well. Or is it not flexible enough for that? 🤔 I was just wondering, since you pass an actual object instance to `alpaka::exec` and not just the type of the struct, could some of these (many) arguments be delegated to member variables? Technically, the "configuration variables" would seem like appropriate members, while the data views indeed look more understandable as operator arguments. 🤔
> Would it not be nicer / more symmetric to call this something like `traccc::alpaka::kernels::ccl_kernel`? That's the sort of naming that we went with in `traccc::cuda`.
I'd originally tried to split them up, so they were obviously distinct, but having them mirror the CUDA / actual kernel names could be nicer too. Regardless, I'll look at fixing that up in the follow-up finding / fitting PR, just so I can change them all in one go.
> I think Alpaka allows such structs to carry member variables as well. Or is it not flexible enough for that? 🤔 I was just wondering, since you pass an actual object instance to `alpaka::exec` and not just the type of the struct, could some of these (many) arguments be delegated to member variables? Technically, the "configuration variables" would seem like appropriate members, while the data views indeed look more understandable as operator arguments. 🤔
Possibly! I think the style shown here matches the Alpaka-recommended way at least, but it may be better for our needs to swap... I can do some testing soon and see, and when I look at renaming the kernels, I can see if swapping lots of args to member variables makes sense.
```cpp
// traccc::finding_performance_writer find_performance_writer(
//     traccc::finding_performance_writer::config{});
// traccc::fitting_performance_writer fit_performance_writer(
//     traccc::fitting_performance_writer::config{});
```
Indeed, let's not add a whole bunch of commented lines. 🤔 You should absolutely keep this code in a commit in your own repository. But I'd prefer not to put it in like this into the main repo.
Looks good, but can we clarify the thing where we are ignoring events without synchronizing them?
Also, could you perhaps implement the CCA test harness in Alpaka? That would be useful for testing, but it can also be done in a future PR.
device/alpaka/include/traccc/alpaka/clusterization/clusterization_algorithm.hpp
```cpp
///
output_type operator()(
    const measurement_collection_types::const_view& measurements_view,
    const cell_module_collection_types::const_view& modules_view)
```
Note that you are racing #627 here, but Attila is on a holiday so this can go in first.
device/alpaka/src/clusterization/spacepoint_formation_algorithm.cpp
```cpp
// Create the result buffer.
spacepoint_collection_types::buffer spacepoints(num_measurements,
                                                m_mr.main);
m_copy.get().setup(spacepoints)->ignore();
```
Does Alpaka use stream ordering? I.e. do we need to wait for this before launching the kernel?
Potentially... I don't believe there is any stream ordering (or at least the vecmem operations aren't part of any alpaka queue).
```diff
@@ -359,7 +360,7 @@ seed_finding::output_type seed_finding::operator()(
     seed_collection_types::buffer seed_buffer(
         pBufHost_counter->m_nTriplets, m_mr.main,
         vecmem::data::buffer_type::resizable);
-    m_copy.setup(seed_buffer);
+    m_copy.setup(seed_buffer)->ignore();
```
Same as above, do we need to wait on this queue before launching any kernels?
```cpp
#ifdef ALPAKA_ACC_GPU_CUDA_ENABLED
    /// Device memory resource
    vecmem::cuda::device_memory_resource m_device_mr;
    /// Memory copy object
    vecmem::cuda::copy m_copy;
#elif ALPAKA_ACC_GPU_HIP_ENABLED
    /// Device memory resource
    vecmem::hip::device_memory_resource m_device_mr;
    /// Memory copy object
    vecmem::hip::copy m_copy;
#else
    /// Device memory resource
    vecmem::memory_resource& m_device_mr;
    /// Memory copy object
    vecmem::copy m_copy;
#endif
```
This is really not ideal 😢 but we can fix it later.
I think it is becoming more and more obvious that a `vecmem::alpaka`, or something similar that deals with this complication once rather than every time we import and use vecmem, is likely a better solution.
Yep, will follow up after this HLT workshop to check that.
We've been speaking recently about how nice it would be to get the Alpaka work tested more thoroughly, so will follow up with this after this PR!
Alright! The test harness is designed to be platform independent, so hopefully supporting Alpaka won't be more than, say, 50 LOC. We will see.
That said @CrossR I think it would be excellent to get this in even if it is not entirely perfect yet; is it okay with you if I merge this?
Just debugging the [...] The alpaka code is self-contained enough, outside of the pragma change, that there shouldn't be any other interactions, so the few missing bits here can be added as part of follow-up PRs (plus I can get my recent work on track finding and fitting in, and get the Alpaka code closer to matching the CUDA status).
@CrossR can you ping me when the issue is resolved?
@stephenswat I've disabled the [...]
Slightly delayed due to some other work and then needing to get my branch back up to date... but this covers basically all the current Alpaka work, outside of the start of my track finding testing and @StewMH's HIP work.
This includes some of the clusterisation work, as well as the throughput examples. In some of the examples there is still some commented-out code, which is basically the parts from the CUDA examples that don't have an equivalent in Alpaka yet. Not sure if you'd prefer it moved, or left?
I'll also highlight some of the bits that I think people will have opinions on below.
Once there has been a brief look, I can look at squashing this down to something more reasonable.
Still ongoing (though it sounded like not a blocker in the previous meeting):