Support for shared memory provider of libfabric in SST + simple support for manual data progress via threading #3964
base: master
Conversation
I can take a look at enabling a manual progress thread, or possibly using EVPath tools to tie progress to FDs if supported by CXI, but realistically I have one week before I disappear for two weeks, and given the other things on my plate the odds of this happening before January are unfortunately small. WRT the future work notes, yes, supporting different data planes between different ranks is probably impractical given how SST is architected. It would have to be a single data plane that supported both transport mechanisms, which is still a lot of work, but fits the way data planes integrate into SST. I've also long had in mind an extension to the data access mechanisms that might reduce the copy overheads for RDMA and shared memory data access, but it involves several changes from the BP5Deserializer, through the engine and down to the data plane, so it has remained on the to-do list for a long time. But it's something to re-examine at some point.
No problem, this is not urgent.
Thank you for the info. My main motivation in posting these ideas was to get a rough estimate of how viable they are to implement. It does not surprise me very much that lots of work would be required.
It seems that the current thread-based implementation runs into a libfabric bug, fixed by ofiwg/libfabric#9644. The bug means that calls to
Ref. the meeting with the Maestro team: the fabric should not be bombarded with too many requests at once. The batch size is currently hardcoded as 10.
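A minimal sketch of the batching idea (illustrative only; `ReadBatchSize`, `ReadRequest`, and the surrounding function are hypothetical names, not the PR's actual code): post at most a small batch of fi_read() operations, then drain the completion queue before issuing the next batch.

```c++
#include <rdma/fabric.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>
#include <rdma/fi_rma.h>

static const size_t ReadBatchSize = 10; // hardcoded in the PR, see comment above

struct ReadRequest
{
    void *Dest;          // local destination buffer
    size_t Length;       // bytes to fetch
    void *LocalDesc;     // local memory descriptor
    fi_addr_t Source;    // writer address
    uint64_t RemoteAddr; // remote (virtual) address
    uint64_t RemoteKey;  // remote MR key
};

static int PostReadsInBatches(struct fid_ep *ep, struct fid_cq *cq, struct ReadRequest *reqs,
                              size_t count)
{
    size_t issued = 0;
    while (issued < count)
    {
        size_t batch = (count - issued < ReadBatchSize) ? (count - issued) : ReadBatchSize;
        // post one batch of RMA reads
        for (size_t i = 0; i < batch; i++)
        {
            struct ReadRequest *r = &reqs[issued + i];
            ssize_t rc;
            do
            {
                rc = fi_read(ep, r->Dest, r->Length, r->LocalDesc, r->Source, r->RemoteAddr,
                             r->RemoteKey, r /* completion context */);
            } while (rc == -FI_EAGAIN); // transmit queue full: retry
            if (rc != 0)
                return (int)rc;
        }
        // drain the completions for this batch before flooding the fabric further
        size_t completed = 0;
        while (completed < batch)
        {
            struct fi_cq_data_entry cqe;
            ssize_t n = fi_cq_read(cq, &cqe, 1);
            if (n > 0)
                completed += (size_t)n;
            else if (n != -FI_EAGAIN)
                return (int)n; // real error (fi_cq_readerr() handling omitted)
        }
        issued += batch;
    }
    return 0;
}
```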
Todo: rather than doing this, initialize endpoints only on demand.
I think that this is now mostly ready for review. I tested this on our local system today; I still need to test on Frontier, since I forgot my hardware key today. To summarize:

1. Batch processing

If we decide to merge it, then this should be configurable as an engine parameter; the batch size is currently hardcoded as a constant.

2. libfabric DP

Additionally I noticed that some fabrics (such as psm3 on our local system) don't support
Example from the output of

3. UCX DP

This also adds a progress thread capability similar to the libfabric DP. I found no way of automatically figuring out whether a worker needs manual progress, so it is turned off by default and must be enabled via SST_UCX_PROGRESS_THREAD. Additionally, this fixes a forgotten call to

For shared memory communication, the following environment variables must be set:

```
export SST_UCX_PROGRESS_THREAD=yes   # enable the manual progress thread
export UCX_TLS=shm                   # restricts UCX to shared memory
export UCX_POSIX_USE_PROC_LINK=n     # workaround for some bug in UCX
```

When I tested this on multiple nodes (data exchange via SST only intra-node), I noticed that the UCX DP reader side will contact all writers upon initialization in

I did not yet document anything except for inline comments.
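As a rough illustration of the progress-thread idea for the UCX DP (not the PR's actual code; the function and variable names here are made up), a background thread can simply keep driving the UCP worker whenever the transport offers no automatic progress:

```c++
#include <ucp/api/ucp.h>

#include <atomic>
#include <thread>

// Sketch: keep driving a UCP worker from a background thread so that
// shared-memory transfers make progress without explicit polling elsewhere.
static void UcxProgressLoop(ucp_worker_h Worker, std::atomic<bool> &Shutdown)
{
    while (!Shutdown.load())
    {
        // ucp_worker_progress() returns the number of events processed;
        // keep calling it while there is pending work, then yield briefly.
        while (ucp_worker_progress(Worker) != 0)
        {
        }
        std::this_thread::yield();
    }
}

// Usage sketch: std::thread Progress(UcxProgressLoop, Worker, std::ref(Shutdown));
```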
Thanks Franz! I'm up to my neck in other things at the moment, but will look at this when I can...
No problem, take the time you need. |
This commit does this, but I've not added it to the PR yet. I can either do that or submit it as a follow-up.
@@ -798,6 +1027,11 @@ static int get_cxi_auth_key_from_env(CP_Services Svcs, void *CP_Stream, struct _
    char const *slingshot_devices = getenv("SLINGSHOT_DEVICES");
    char const *preferred_device = get_preferred_domain(Params);

    if ((!preferred_device && strncmp("cxi", preferred_device, 3) != 0) || !slingshot_devices)
-    if ((!preferred_device && strncmp("cxi", preferred_device, 3) != 0) || !slingshot_devices)
+    if ((!preferred_device || strncmp("cxi", preferred_device, 3) != 0) || !slingshot_devices)
I need to check this logic again; currently this is wrong.
Should be fixed with ba7d6c1
Background: For some of our intended SST workflows, we exchange data only within nodes. Using the system networks is an unnecessary detour in that case. Since libfabric implements shared memory, this is a somewhat low-hanging fruit to enable truly in-memory staging workflows with SST.
Necessary changes:

The FI_MR_BASIC registration mode prints an error, but interestingly it still works anyway. This PR still replaces FI_MR_BASIC with the equivalent FI_MR_VIRT_ADDR | FI_MR_ALLOCATED | FI_MR_PROV_KEY | FI_MR_LOCAL.
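For context, a hedged sketch of what corresponding fi_getinfo() hints could look like (the variable names and the exact capability bits are illustrative, not copied from the PR): the query is restricted to the shm provider and uses the individual mr_mode bits instead of the deprecated FI_MR_BASIC.

```c++
#include <rdma/fabric.h>

#include <string.h>

// Sketch: query libfabric for the shared-memory provider with the
// FI_MR_BASIC-equivalent mr_mode bits mentioned above.
static struct fi_info *QueryShmProvider(void)
{
    struct fi_info *hints = fi_allocinfo();
    hints->caps = FI_MSG | FI_RMA | FI_READ | FI_REMOTE_READ;
    hints->mode = FI_CONTEXT;
    // equivalent of FI_MR_BASIC, expressed as individual registration-mode bits
    hints->domain_attr->mr_mode =
        FI_MR_VIRT_ADDR | FI_MR_ALLOCATED | FI_MR_PROV_KEY | FI_MR_LOCAL;
    // restrict the query to libfabric's shared-memory provider
    hints->fabric_attr->prov_name = strdup("shm");

    struct fi_info *info = NULL;
    if (fi_getinfo(FI_VERSION(1, 5), NULL, NULL, 0, hints, &info) != 0)
        info = NULL; // no usable shm provider on this system
    fi_freeinfo(hints);
    return info;
}
```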
The manual data progress has turned out to be somewhat annoying. My idea was to spawn a thread that regularly pokes the provider, but this approach does not work well with anything less than a busy loop.
Every call to fi_read() by the Reader requires an accompanying call to fi_cq_read() by the Writer. fi_read() will fail with EAGAIN until the writer has acknowledged the load request. It seems that (at least with my current approach) this requires a ping-pong sort of protocol: I tried decreasing latencies by processing fi_read() as well as fi_cq_read() in batches, and it made no difference; the provider only processes one load request at a time. In consequence, the current implementation has extreme latencies, since it puts the progress thread to sleep before poking the provider again.

@eisenhauer mentioned in a video call that the control plane of SST implements a potential alternative approach based on file-descriptor polling.
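To make the approach concrete, here is a hedged sketch of such a writer-side progress thread (illustrative names, not the PR's actual code): it repeatedly polls the completion queue via fi_cq_read() so that a provider running with manual progress advances at all, sleeping when nothing is pending; as described above, that sleep is exactly what introduces the latency.

```c++
#include <rdma/fabric.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>

#include <atomic>
#include <chrono>
#include <thread>

// Sketch: poke the provider from a background thread so that remote
// fi_read() requests issued by the reader are eventually served.
static void ProgressLoop(struct fid_cq *cq, std::atomic<bool> &Shutdown)
{
    while (!Shutdown.load())
    {
        struct fi_cq_data_entry Entry;
        ssize_t rc = fi_cq_read(cq, &Entry, 1);
        if (rc == -FI_EAGAIN)
        {
            // Nothing pending: sleeping keeps CPU usage low, but each sleep
            // delays the next load request (the latency problem noted above).
            std::this_thread::sleep_for(std::chrono::microseconds(100));
        }
        // rc > 0: a completion was consumed and the provider made progress.
        // Real code would also handle errors via fi_cq_readerr().
    }
}
```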
A further benefit of implementing manual progress:
One of the most common issues with a badly configured installation of the libfabric data plane is hangs. Having support in SST for manual progress might fix this in some instances.
I have observed that this PR also "unlocks" the tcp provider, which previously did not work.
Future potential / ideas:

Both of the following points are probably an immense amount of work. These are just some ideas that I wanted to put out here on how SST might be used to implement zero-overhead staging workflows:

- Supporting different data planes between different ranks (e.g. shared memory within a node, a network data plane across nodes). I imagine that this is probably not an easy change to implement, since the control plane would need to deal with two data planes at once.
- Supporting void Engine::Get<T>(Variable<T>, T**) const; for data from the same node (currently used by the Inline Engine).

TODO: