-
We could have one discussion track for every one of your bullets!
This point alone is not trivial to solve. We cannot know what's happening with the receiver, so dealing with layouts will be difficult, especially for non-SPMD codes.
All your concerns are genuine, and we must soon clarify what relationship we want between execution space instances and MPI communicators.
-
We did experiment with a merged … If the user wants to advance program execution conditioned on an operation request (aka the handle from an async MPI call), they will likely need to either poll the …
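For reference, a minimal sketch of what polling such a request handle could look like at the plain MPI level; this assumes a raw MPI_Request and does not reflect the wrapper's own handle type:

```cpp
// Sketch only: advance other work while polling an asynchronous request.
// Assumes a plain MPI_Request; the wrapper's own handle type is not shown.
#include <mpi.h>

void poll_until_done(MPI_Request& req) {
  int done = 0;
  while (!done) {
    MPI_Test(&req, &done, MPI_STATUS_IGNORE);  // non-blocking completion check
    // ... other useful work can be interleaved here while the message is in flight ...
  }
}
```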
-
Initial thoughts
Definitely want support for this - I see it as a main selling point for the initial MPI wrapper.
I think we should expose some kind of trait to allow users to customize handling of non-contiguous data. We would also use the system internally and provide a few implementations. Very bare-bones versions of this exist right now: one uses an intermediate view plus deep_copy, and another constructs an MPI datatype. It would be cool to make the trait interface expressive enough to support send-by-chunk too. This is also the extension point for folks who want to do research with special hardware or software: they can implement a custom specialization of the trait. Maybe we could allow the default to be overridden globally somehow. A rough sketch of what such a trait could look like is below.
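The name `PackTraits`, its members, and the rank-1 restriction in this sketch are all assumptions for illustration, not the actual interface; only the deep_copy-based default is shown:

```cpp
// Hypothetical customization point for non-contiguous data (assumed names).
#include <Kokkos_Core.hpp>

template <typename View, typename Enable = void>
struct PackTraits {
  // Contiguous scratch view in the same memory space as the source (rank-1 for brevity).
  using Staging = Kokkos::View<typename View::non_const_value_type*,
                               typename View::memory_space>;

  static Staging pack(const View& v) {
    Staging tmp("pack_scratch", v.extent(0));
    Kokkos::deep_copy(tmp, v);  // gather the (possibly strided) data into contiguous storage
    return tmp;
  }
};

// A user (or the library itself) could specialize PackTraits to build an MPI
// derived datatype instead, send in chunks, or target special hardware.
```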
send from
I think we should support this. The default would be
This may be "backend" specific; I don't know how we can detect at compile time that a send and a recv are incompatible, though. We could error out at run time or have a fallback path.
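One possible reading of a run-time fallback path, sketched under the assumption that the incompatibility is MPI not being able to take the buffer directly: stage it through a host mirror first. The flag, the helper name, and the contiguous rank-1 double view are all assumptions; real detection would be backend- and library-specific:

```cpp
// Hypothetical fallback: stage device data through a host mirror if it cannot
// be handed to MPI directly. Assumes a contiguous, rank-1 View of double.
#include <Kokkos_Core.hpp>
#include <mpi.h>

template <typename View>
void send_with_fallback(const View& v, int dest, int tag, MPI_Comm comm, bool direct_ok) {
  if (direct_ok) {
    MPI_Send(v.data(), static_cast<int>(v.size()), MPI_DOUBLE, dest, tag, comm);
  } else {
    auto host = Kokkos::create_mirror_view_and_copy(Kokkos::HostSpace{}, v);
    MPI_Send(host.data(), static_cast<int>(host.size()), MPI_DOUBLE, dest, tag, comm);
  }
}
```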
I think we can strive to support this. May need to be integrated with non-contiguous handling. I'm unfamiliar with the consequences of accessor stuff at the moment.
I think we should keep allocations alive (lifetime errors for our users can be hard to debug). Currently type-erased copies of views are stashed in the
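A bare sketch of that keep-alive idea, just to make the type erasure concrete; the `Req` class and `keep_until_wait` are assumed names, not the actual implementation:

```cpp
// Sketch only: stash type-erased copies of views on the request so the
// allocations outlive the asynchronous operation.
#include <Kokkos_Core.hpp>
#include <memory>
#include <mpi.h>
#include <vector>

class Req {
  MPI_Request req_ = MPI_REQUEST_NULL;
  std::vector<std::shared_ptr<void>> held_;  // type-erased view copies

 public:
  template <typename View>
  void keep_until_wait(const View& v) {
    // Copying the view handle bumps its reference count; erasing the type lets
    // one container hold views of any type, layout, or memory space.
    held_.push_back(std::make_shared<View>(v));
  }

  void wait() {
    MPI_Wait(&req_, MPI_STATUS_IGNORE);
    held_.clear();  // safe to release the allocations now
  }
};
```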
I envisioned the semantics as: the operation is ordered w.r.t. the provided instance. However, fencing that instance does not guarantee that the communication finished:
```cpp
Kokkos::Cuda space;
auto Req = irecv(space, ...); // calls MPI_Irecv, but doesn't actually use the space
space.fence();                // may or may not have some effect on communication
Req.wait();                   // implicitly use (and implicitly fence) space
```
Maybe our APIs should only accept a space argument at the point where something is (possibly) inserted into the space:
```cpp
Kokkos::Cuda space;
auto Req = irecv(...); // no space work done since space is not an argument
space.fence();         // obviously unrelated to communication
Req.wait(space);       // explicitly use (and implicitly fence) space
```
-
Some thoughts I shared with Jan on the question of what the limitations of the interface should be for now, or at a minimum in which order we should prioritize things. These are just some random things that came to mind: