Alpaka: Clustering + Throughput #558
Currently compiles, but the barrier code needs updating.
It looks like Alpaka always wants a WorkDiv of at least 1, whereas some of the CUDA code is fine with it being zero. For now, just clamp it to one, but investigate if any further changes are needed.
Currently, there doesn't seem to be a way to get that information from Alpaka, so just re-use the existing CUDA code here for now, as it's just for user info.
This allows the ST example to run to completion, but may not actually be the underlying cause, as the MT example still fails.
This was only needed when the size of the extern was static.
Disabled the throughput building for now, to focus on the first two examples.
Assuming this is the only remaining issue, can fix this after.
@krasznaa, @stephenswat, @beomki-yeo could we have a quick review? There aren't many changes outside the alpaka code.
Can we merge `qualifiers.hpp` and `hints.hpp` into a single file with the name `directives.hpp`?
Nah, we can think of the file naming later.
I'm very happy with how you added a bunch of `->ignore()` and `->wait()` calls on the various copy operations. However, since it's super easy to miss some of them, did you try to compile the code with `-DVECMEM_FAIL_ON_ASYNC_ERRORS=TRUE`? 🤔
https://github.com/acts-project/vecmem/blob/main/cmake/vecmem-options.cmake#L58-L60
I added that feature specifically to help with making sure that our code would only wait where it needs to. (Without that flag, if the user forgets to either wait for an event or ignore it, the code will wait for the event. Just to be safe.)
```cpp
struct CCLKernel {

    template <typename TAcc>
    ALPAKA_FN_ACC void operator()(
        TAcc const& acc, const cell_collection_types::const_view cells_view,
        const cell_module_collection_types::const_view modules_view,
        const device::details::index_t max_cells_per_partition,
        const device::details::index_t target_cells_per_partition,
        measurement_collection_types::view measurements_view,
        vecmem::data::vector_view<unsigned int> cell_links) const {
```
Generally, this setup is fine with me. I'm just thinking out loud here.
- Would it not be nicer / more symmetric to call this something like `traccc::alpaka::kernels::ccl_kernel`? That's the sort of naming that we went with in `traccc::cuda`.
- I think Alpaka allows such structs to carry member variables as well. Or is it not flexible enough for that? 🤔 I was just wondering, since you pass an actual object instance to `alpaka::exec` and not just the type of the struct, could some of these (many) arguments be delegated to member variables? Technically, the "configuration variables" would seem like appropriate members, while the data views indeed look more understandable as operator arguments. 🤔
> Would it not be nicer / more symmetric to call this something like `traccc::alpaka::kernels::ccl_kernel`? That's the sort of naming that we went with in `traccc::cuda`.
I'd originally tried to split them up, so they were obviously distinct, but having them mirror the CUDA / actual kernel names could be nicer too. Regardless, I'll look at fixing that up in the follow-up finding / fitting PR, just so I can change them all in one go.
> I think Alpaka allows such structs to carry member variables as well. Or is it not flexible enough for that? 🤔 I was just wondering, since you pass an actual object instance to `alpaka::exec` and not just the type of the struct, could some of these (many) arguments be delegated to member variables? Technically, the "configuration variables" would seem like appropriate members, while the data views indeed look more understandable as operator arguments. 🤔
Possibly! I think the style shown here matches the Alpaka-recommended way at least, but it may be better for our needs to swap... I can do some testing soon and see, and when I look at renaming the kernels, I can see if swapping lots of args to member variables makes sense.
```cpp
// traccc::finding_performance_writer find_performance_writer(
//     traccc::finding_performance_writer::config{});
// traccc::fitting_performance_writer fit_performance_writer(
//     traccc::fitting_performance_writer::config{});
```
Indeed, let's not add a whole bunch of commented lines. 🤔 You should absolutely keep this code in a commit in your own repository. But I'd prefer not to put it in like this into the main repo.
Looks good, but can we clarify the thing where we are ignoring events without synchronizing them?
Also, could you perhaps implement the CCA test harness in Alpaka? That would be useful for testing, but it can also be done in a future PR.
device/alpaka/include/traccc/alpaka/clusterization/clusterization_algorithm.hpp
```cpp
///
output_type operator()(
    const measurement_collection_types::const_view& measurements_view,
    const cell_module_collection_types::const_view& modules_view)
```
Note that you are racing #627 here, but Attila is on a holiday so this can go in first.
device/alpaka/src/clusterization/spacepoint_formation_algorithm.cpp
```cpp
// Create the result buffer.
spacepoint_collection_types::buffer spacepoints(num_measurements,
                                                m_mr.main);
m_copy.get().setup(spacepoints)->ignore();
```
Does Alpaka use stream ordering? I.e. do we need to wait for this before launching the kernel?
Potentially... I don't believe there is any stream ordering (or at least the vecmem operations aren't part of any alpaka queue).
```diff
@@ -359,7 +360,7 @@ seed_finding::output_type seed_finding::operator()(
     seed_collection_types::buffer seed_buffer(
         pBufHost_counter->m_nTriplets, m_mr.main,
         vecmem::data::buffer_type::resizable);
-    m_copy.setup(seed_buffer);
+    m_copy.setup(seed_buffer)->ignore();
```
Same as above, do we need to wait on this queue before launching any kernels?
```cpp
#ifdef ALPAKA_ACC_GPU_CUDA_ENABLED
    /// Device memory resource
    vecmem::cuda::device_memory_resource m_device_mr;
    /// Memory copy object
    vecmem::cuda::copy m_copy;
#elif ALPAKA_ACC_GPU_HIP_ENABLED
    /// Device memory resource
    vecmem::hip::device_memory_resource m_device_mr;
    /// Memory copy object
    vecmem::hip::copy m_copy;
#else
    /// Device memory resource
    vecmem::memory_resource& m_device_mr;
    /// Memory copy object
    vecmem::copy m_copy;
#endif
```
This is really not ideal 😢 but we can fix it later.
I think it is becoming more and more obvious that a `vecmem::alpaka`, or something similar that deals with this complication once rather than every time we import and use vecmem, is likely a better solution.
Yep, will follow up after this HLT workshop to check that.
We've been speaking recently about how nice it would be to get the Alpaka work tested more thoroughly, so will follow up with this after this PR!
Alright! The test harness is designed to be platform independent, so hopefully supporting Alpaka won't be more than, say, 50 LOC. We will see.
That said @CrossR I think it would be excellent to get this in even if it is not entirely perfect yet; is it okay with you if I merge this?
Just debugging the [...] The alpaka code is self-contained enough, outside of the pragma change, that there shouldn't be any other interactions, so the few missing bits here can be added as part of follow-up PRs (plus I can get my recent work on track finding and fitting in, and get the Alpaka code closer to matching the CUDA status).
@CrossR can you ping me when the issue is resolved?
@stephenswat I've disabled the [...]
Slightly delayed due to some other work and then needing to get my branch back up to date... but this covers basically all the current Alpaka work, outside of the start of my track finding testing and @StewMH's HIP work.
This includes some of the clusterisation work, as well as the throughput examples. In some of the examples there is still some commented-out code, which is basically the parts from the CUDA examples that don't have an equivalent in Alpaka yet. Not sure if you'd prefer it moved, or left?
I'll also highlight some of the bits that I think people will have opinions on below.
Once there has been a brief look, I can look at squashing this down to something more reasonable.
Still ongoing (though it sounded like not a blocker in the previous meeting):