
Alpaka: Clustering + Throughput #558

Merged: 47 commits, Jul 4, 2024

Conversation

@CrossR (Contributor) commented Apr 26, 2024:

Slightly delayed due to some other work, and then needing to get my branch back up to date... but this covers basically all the current Alpaka work, outside of the start of my track-finding testing and @StewMH's HIP work.

This includes some of the clusterisation work, as well as the throughput examples. Some of the examples still contain commented-out code, which is basically the parts of the CUDA examples that don't have an Alpaka equivalent yet. Not sure if you'd prefer it moved, or left?

I'll also highlight some of the bits that I think people will have opinions on below.

Once there has been a brief look, I can look at squashing this down to something more reasonable.

Still ongoing (though it sounded like this is not a blocker, per the previous meeting):

  • Debugging / verifying the work with a CUDA back end. Everything runs, but I want to verify that the memory issues I was hitting are gone, and ensure the results match the native CUDA. Right now, I sometimes get results that don't directly match the CUDA output for the same input (and a fixed seed), so I think I still have some form of issue.

CrossR added 30 commits (starting August 1, 2023 10:25). Selected commit messages:

  • Currently compiles, but the barrier code needs updating.
  • It looks like Alpaka always wants a WorkDiv of at least 1, whereas some of the CUDA code is fine with it being zero. For now, just clamp it to one, but investigate whether any further changes are needed.
  • Currently, there doesn't seem to be a way to get that information from Alpaka, so just re-use the existing CUDA code here for now, as it's just for user info.
  • This allows the ST example to run to completion, but may not actually be the underlying cause, as the MT example still fails.
  • This was only needed when the size of the extern was static.
  • Disabled the throughput building for now, to focus on the first two examples.
  • @CrossR force-pushed the alpaka_clustering branch 2 times, most recently from a2f85b4 to 0ef9cff (June 3, 2024 14:26)
  • @CrossR force-pushed the alpaka_clustering branch 2 times, most recently from da479a7 to 2169086 (June 4, 2024 08:25)
  • Commit: "Assuming this is the only remaining issue, can fix this after."
  • @CrossR force-pushed the alpaka_clustering branch from 2169086 to 8dfb5b9 (June 4, 2024 10:57)
  • @CrossR force-pushed the alpaka_clustering branch from 23713da to 4da1b6d (June 4, 2024 11:38)
@StewMH (Contributor) commented Jun 4, 2024:

@krasznaa, @stephenswat, @beomki-yeo could we have a quick review? There aren't many changes outside the alpaka code.

@beomki-yeo (Contributor) left a comment:

Can we merge qualifiers.hpp and hints.hpp into a single file with the name of directives.hpp?

@beomki-yeo (Contributor) left a comment:

Nah, we can think about the file naming later.

@krasznaa (Member) left a comment:

I'm very happy with how you added a bunch of ->ignore() and ->wait() calls on the various copy operations. However, since it's super easy to miss some of them, did you try to compile the code with -DVECMEM_FAIL_ON_ASYNC_ERRORS=TRUE? 🤔

https://github.com/acts-project/vecmem/blob/main/cmake/vecmem-options.cmake#L58-L60

I added that feature specifically to help with making sure that our code would only wait where it needs to. (Without that flag, if the user forgets to either wait for an event or ignore it, the code will wait for the event. Just to be safe.)

Comment on lines +22 to +30:

```cpp
struct CCLKernel {
    template <typename TAcc>
    ALPAKA_FN_ACC void operator()(
        TAcc const& acc, const cell_collection_types::const_view cells_view,
        const cell_module_collection_types::const_view modules_view,
        const device::details::index_t max_cells_per_partition,
        const device::details::index_t target_cells_per_partition,
        measurement_collection_types::view measurements_view,
        vecmem::data::vector_view<unsigned int> cell_links) const {
```
(Member) commented:

Generally, this setup is fine with me. I'm just thinking out loud here.

  • Would it not be nicer / more symmetric to call this something like traccc::alpaka::kernels::ccl_kernel? That's the sort of naming that we went with in traccc::cuda.
  • I think Alpaka allows such structs to carry member variables as well. Or is it not flexible enough for that? 🤔 I was just wondering, since you pass an actual object instance to alpaka::exec and not just the type of the struct, could some of these (many) arguments be delegated to member variables? Technically, the "configuration variables" would seem like appropriate members, while the data views indeed look more understandable as operator arguments. 🤔

CrossR (Contributor, author) replied:

> Would it not be nicer / more symmetric to call this something like traccc::alpaka::kernels::ccl_kernel? That's the sort of naming that we went with in traccc::cuda.

I'd originally tried to split them up, so they were obviously distinct, but having them mirror the CUDA / actual kernel names could be nicer too. Regardless, I'll look at fixing that up in the follow-up finding/fitting PR, just so I can change them all in one go.

> I think Alpaka allows such structs to carry member variables as well. Or is it not flexible enough for that? 🤔 I was just wondering, since you pass an actual object instance to alpaka::exec and not just the type of the struct, could some of these (many) arguments be delegated to member variables? Technically, the "configuration variables" would seem like appropriate members, while the data views indeed look more understandable as operator arguments. 🤔

Possibly! I think the style shown here matches the Alpaka-recommended way, at least, but it may be better for our needs to swap. I can do some testing soon and see; when I look at renaming the kernels, I can check whether moving lots of arguments to member variables makes sense.

Comment on lines 120 to 123:

```cpp
// traccc::finding_performance_writer find_performance_writer(
//     traccc::finding_performance_writer::config{});
// traccc::fitting_performance_writer fit_performance_writer(
//     traccc::fitting_performance_writer::config{});
```
(Member) commented:

Indeed, let's not add a whole bunch of commented lines. 🤔 You should absolutely keep this code in a commit in your own repository. But I'd prefer not to put it in like this into the main repo.

@stephenswat (Member) left a comment:

Looks good, but can we clarify the thing where we are ignoring events without synchronizing them?

Also, could you perhaps implement the CCA test harness in Alpaka? That would be useful for testing; but can also be done in a future PR.

```cpp
///
output_type operator()(
    const measurement_collection_types::const_view& measurements_view,
    const cell_module_collection_types::const_view& modules_view)
```
(Member) commented:

Note that you are racing #627 here, but Attila is on a holiday so this can go in first.

```cpp
// Create the result buffer.
spacepoint_collection_types::buffer spacepoints(num_measurements,
                                                m_mr.main);
m_copy.get().setup(spacepoints)->ignore();
```
(Member) commented:

Does Alpaka use stream ordering? I.e. do we need to wait for this before launching the kernel?

CrossR (Contributor, author) replied:

Potentially... I don't believe there is any stream ordering (or, at least, the vecmem operations aren't part of any Alpaka queue).

```diff
@@ -359,7 +360,7 @@ seed_finding::output_type seed_finding::operator()(
     seed_collection_types::buffer seed_buffer(
         pBufHost_counter->m_nTriplets, m_mr.main,
         vecmem::data::buffer_type::resizable);
-    m_copy.setup(seed_buffer);
+    m_copy.setup(seed_buffer)->ignore();
```
(Member) commented:

Same as above, do we need to wait on this queue before launching any kernels?

Comment on lines +88 to +103:

```cpp
#ifdef ALPAKA_ACC_GPU_CUDA_ENABLED
/// Device memory resource
vecmem::cuda::device_memory_resource m_device_mr;
/// Memory copy object
vecmem::cuda::copy m_copy;
#elif ALPAKA_ACC_GPU_HIP_ENABLED
/// Device memory resource
vecmem::hip::device_memory_resource m_device_mr;
/// Memory copy object
vecmem::hip::copy m_copy;
#else
/// Device memory resource
vecmem::memory_resource& m_device_mr;
/// Memory copy object
vecmem::copy m_copy;
#endif
```
(Member) commented:

This is really not ideal 😢 but we can fix it later.

CrossR (Contributor, author) replied:

I think it is becoming more and more obvious that a vecmem::alpaka (or something similar) that deals with this complication once, rather than every time we import and use vecmem, is likely the better solution.

@CrossR (Contributor, author) commented Jul 3, 2024:

> Looks good, but can we clarify the thing where we are ignoring events without synchronizing them?

Yep, will follow up after this HLT workshop to check that.

> Also, could you perhaps implement the CCA test harness in Alpaka? That would be useful for testing; but can also be done in a future PR.

We've been speaking recently about how nice it would be to get the Alpaka work tested more thoroughly, so will follow up with this after this PR!

@stephenswat (Member) commented:

> We've been speaking recently about how nice it would be to get the Alpaka work tested more thoroughly, so will follow up with this after this PR!

Alright! The test harness is designed to be platform independent, so hopefully supporting Alpaka won't be more than, say, 50 LOC. We will see.

@stephenswat (Member) commented:

That said @CrossR I think it would be excellent to get this in even if it is not entirely perfect yet; is it okay for you if I merge this?

@CrossR (Contributor, author) commented Jul 3, 2024:

Just debugging the pragma issue with GCC in the CI, then should be good to go.

The Alpaka code is self-contained enough, outside of the pragma change, that there shouldn't be any other interactions, so the few missing bits here can be added in follow-up PRs (plus I can get my recent work on track finding and fitting in, and get the Alpaka code closer to matching the CUDA status).

@stephenswat (Member) commented:

@CrossR can you ping me when the issue is resolved?

@CrossR (Contributor, author) commented Jul 4, 2024:

@stephenswat I've disabled the pragma in GCC for now (since, annoyingly, that compiler was the problematic one in the first place), so this should be good to merge.

@stephenswat stephenswat merged commit 694581c into acts-project:main Jul 4, 2024
24 checks passed
5 participants