Support for shared memory provider of libfabric in SST + simple support for manual data progress via threading #3964

Open · wants to merge 51 commits into master

Conversation

@franzpoeschel (Contributor) commented Dec 12, 2023

Background: For some of our intended SST workflows, we exchange data only within nodes. Using the system networks is an unnecessary detour in that case. Since libfabric implements shared memory, this is a somewhat low-hanging fruit to enable truly in-memory staging workflows with SST.

Necessary changes:

  • The most important change is that the shm provider requires manual data progress. This is hence also a follow-up to "Adapt libfabric dataplane of SST to Cray CXI provider" #3672: the CXI provider supported by that PR also technically requires manual data progress, but effectively works fine without it.
  • The FI_MR_BASIC registration mode prints an error, but interestingly still works anyway. This PR nevertheless replaces FI_MR_BASIC with the equivalent FI_MR_VIRT_ADDR | FI_MR_ALLOCATED | FI_MR_PROV_KEY | FI_MR_LOCAL (see the sketch after this list).
  • Some subtleties in address handling
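
A minimal sketch of how these pieces appear at the libfabric API level (illustrative names, not the PR's actual code): the explicit MR mode bits go into the fi_getinfo() hints, and the provider's data_progress attribute tells us whether manual progress is needed.

#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <stdio.h>

/* Hedged sketch: request the explicit MR mode bits instead of FI_MR_BASIC and
 * detect whether the selected provider (e.g. shm) requires manual progress. */
static struct fi_info *query_provider(void)
{
    struct fi_info *hints = fi_allocinfo(), *info = NULL;
    hints->caps = FI_MSG | FI_RMA;
    hints->domain_attr->mr_mode =
        FI_MR_VIRT_ADDR | FI_MR_ALLOCATED | FI_MR_PROV_KEY | FI_MR_LOCAL;

    if (fi_getinfo(FI_VERSION(1, 11), NULL, NULL, 0, hints, &info) == 0 &&
        info->domain_attr->data_progress == FI_PROGRESS_MANUAL)
    {
        /* the application (or a dedicated thread) must drive progress itself */
        fprintf(stderr, "provider %s requires manual data progress\n",
                info->fabric_attr->prov_name);
    }
    fi_freeinfo(hints);
    return info;
}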

The manual data progress has turned out to be somewhat annoying. My idea was to spawn a thread that regularly pokes the provider, but this approach does not work well with anything short of a busy loop.
Every call to fi_read() by the Reader requires an accompanying call to fi_cq_read() by the Writer. fi_read() will fail with EAGAIN until the writer has acknowledged the load request. It seems that (at least with my current approach) this requires a ping-pong sort of protocol: I tried decreasing latencies by processing fi_read() as well as fi_cq_read() in batches, and it made no difference; the provider only processes one load request at a time. In consequence, the current implementation has extreme latencies, since it puts the progress thread to sleep before poking the provider again.
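
To illustrate the ping-pong: on the reader side the RMA read has to be retried while the provider reports -FI_EAGAIN, and each retry can only succeed once the writer has serviced its own completion queue. A rough sketch with assumed handles (the endpoint, CQ, addresses and keys are placeholders, and the CQ is assumed to use FI_CQ_FORMAT_DATA):

#include <rdma/fabric.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>
#include <rdma/fi_rma.h>

/* Reader side (illustrative sketch, not the PR's code): retry the RMA read
 * while the provider returns -FI_EAGAIN, polling the local CQ in between so
 * that the provider can make progress. The writer meanwhile has to call
 * fi_cq_read()/fi_cq_sread() on its own CQ, otherwise this loop never ends. */
static ssize_t read_with_retry(struct fid_ep *ep, struct fid_cq *cq, void *buf,
                               size_t len, void *desc, fi_addr_t writer,
                               uint64_t raddr, uint64_t rkey)
{
    ssize_t rc;
    do
    {
        rc = fi_read(ep, buf, len, desc, writer, raddr, rkey, NULL);
        if (rc == -FI_EAGAIN)
        {
            struct fi_cq_data_entry e;
            fi_cq_read(cq, &e, 1); /* drive local progress */
        }
    } while (rc == -FI_EAGAIN);
    return rc;
}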

@eisenhauer mentioned in a video call that the control plane of SST implements a potential alternative approach based on file-descriptor polling.

A further benefit of implementing manual progress:
One of the most common symptoms of a badly configured installation of the libfabric dataplane is a hang. Having support in SST for manual progress might fix this in some instances.
I have observed that this PR also "unlocks" the tcp provider, which previously did not work.

Future potential / ideas:

Both of the following points probably mean an immense amount of work. These are just some ideas that I wanted to put out here on how SST might be used to implement zero-overhead staging workflows:

  • This might be used to introduce a notion of memory hierarchy into SST. Local data can be loaded via shared memory, while remote data is sent via the network. This might immensely decrease the load on the network for large-scale application runs.
    I imagine that this is probably not an easy change to implement, since the control plane would need to deal with two data planes at once.
  • Not sure if this is possible with libfabric's shared memory provider, but it might be possible to use the zero-copy Engine-Get call void Engine::Get<T>(Variable<T>, T**) const; for data from the same node (currently used by the Inline Engine).

TODO:

  • Lazy connecting of endpoints (endpoints might not be reachable in shm settings)
  • Parameterization: batch reading, threaded reading in UCX
  • Threaded reading in libfabric: make it depend on the PROGRESS_MANUAL parameter
  • No threading on the reader end

@eisenhauer (Member)

I can take a look at enabling a manual progress thread, or possibly using EVPath tools to tie progress to FDs if supported by CXI, but realistically I have one week before I disappear for two weeks, and given the other things on my plate the odds of this happening before January are unfortunately small.

WRT the future work notes, yes, supporting different data planes between different ranks is probably impractical given how SST is architected. It would have to be a single data plane that supported both transport mechanisms, which is still a lot of work, but fits the way dataplanes integrate into SST. I've also long had in mind an extension to the data access mechanisms that might reduce the copy overheads for RDMA and shared memory data access, but it involves several changes from the BP5Deserializer, through the engine and down to the data plane, so it has remained on the to-do list for a long time. But it's something to re-examine at some point.

@franzpoeschel (Contributor, Author)

I can take a look at enabling a manual progress thread, or possibly using EVPath tools to tie progress to FDs if supported by CXI, but realistically I have one week before I disappear for two weeks, and given the other things on my plate the odds of this happening before January are unfortunately small.

No problem, this is not urgent.
It turns out that the solution was simpler than I had expected. Instead of running fi_cq_read() (non-blocking) on the thread every five seconds, I now run fi_cq_sread() (blocking) on the thread with a timeout of five seconds. The shm provider becomes much more responsive with this change. With this, I expect that the control-plane-enabled solution might not be needed any longer.
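
For reference, the difference is roughly the following (sketch with an assumed CQ handle, not the PR's actual implementation; timeout in milliseconds):

#include <rdma/fi_eq.h>
#include <unistd.h>

/* Old approach (sketch): poll, then sleep -- completions can sit unserviced
 * for up to five seconds, hence the extreme latencies mentioned above. */
static void progress_poll_and_sleep(struct fid_cq *cq)
{
    struct fi_cq_data_entry e;
    fi_cq_read(cq, &e, 1);
    sleep(5);
}

/* New approach (sketch): block inside the provider for up to five seconds,
 * returning as soon as a completion arrives. */
static void progress_blocking(struct fid_cq *cq)
{
    struct fi_cq_data_entry e;
    fi_cq_sread(cq, &e, 1, /* cond */ NULL, /* timeout ms */ 5000);
}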

WRT the future work notes, yes, supporting different data planes between different ranks is probably impractical given how SST is architected. It would have to be a single data plane that supported both transport mechanisms, which is still a lot of work, but fits the way dataplanes integrate into SST. I've also long had in mind an extension to the data access mechanisms that might reduce the copy overheads for RDMA and shared memory data access, but it involves several changes from the BP5Deserializer, through the engine and down to the data plane, so it has remained on the to-do list for a long time. But it's something to re-examine at some point.

Thank you for the info. My main motivation in posting these ideas was to get a rough estimation of how viable these are to implement. It does not surprise me very much that lots of work would be required.

@franzpoeschel (Contributor, Author) commented Dec 15, 2023

It seems that the current thread-based implementation runs into a libfabric bug, fixed by ofiwg/libfabric#9644. The bug means that a call to fi_cq_sread() that the progress thread might still make at the end of the simulation will not return, so finalizing the dataplane hangs.
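
Independent of the upstream fix, one common way to get a thread out of a blocking fi_cq_sread() during finalization is fi_cq_signal(); a hedged sketch (not necessarily what this PR does):

#include <rdma/fi_eq.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Sketch of a shutdown path: set a stop flag, then wake any thread blocked in
 * fi_cq_sread(); the woken call returns -FI_EAGAIN and the loop can exit. */
static atomic_bool stop_progress;

static void stop_progress_thread(struct fid_cq *cq)
{
    atomic_store(&stop_progress, true);
    fi_cq_signal(cq); /* unblocks fi_cq_sread()/fi_cq_sreadfrom() */
}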

Ref. the meeting with the Maestro team: The fabric should not be
bombarded with too many requests at once. Batch size currently hardcoded
as 10.
Todo: Better than doing this, initialize endpoints on demand only
@franzpoeschel (Contributor, Author) commented Jul 26, 2024

@eisenhauer

I think that this is now mostly ready for review. I tested this on our local system today; I still need to test on Frontier (I forgot my hardware key today).

To summarize:

1. Batch processing
Implemented in engine/sst/SstReader.(c|h|t)pp. I'll leave it up to you whether you want to merge this; it can also be reverted, as it is orthogonal to the other changes.
Instead of pushing all operations to the data plane at once and then waiting for them, this enqueues and waits for them in batches.
This was necessary for full-scale runs on Frontier.

If we decide to merge it, then this should be configurable as an engine parameter; the batch size is currently hardcoded as a constant.
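
The pattern is roughly the following; enqueue_read(), wait_read() and the two typedefs are hypothetical placeholders (not ADIOS2 API), and the batch size of 10 is the hardcoded constant mentioned in the commit note above:

#include <stddef.h>

/* Hedged sketch of the batching pattern: post a bounded number of read
 * requests to the data plane, wait for them, then continue with the next
 * batch instead of enqueueing everything at once. */
typedef struct read_request read_request_t;   /* placeholder types */
typedef struct read_handle *read_handle_t;
read_handle_t enqueue_read(read_request_t *); /* hypothetical */
void wait_read(read_handle_t);                /* hypothetical */

enum { BATCH_SIZE = 10 }; /* currently a hardcoded constant */

void process_reads_in_batches(read_request_t *requests, read_handle_t *handles,
                              size_t num_reads)
{
    for (size_t begin = 0; begin < num_reads; begin += BATCH_SIZE)
    {
        size_t end =
            begin + BATCH_SIZE < num_reads ? begin + BATCH_SIZE : num_reads;
        for (size_t i = begin; i < end; ++i)
            handles[i] = enqueue_read(&requests[i]); /* post one batch */
        for (size_t i = begin; i < end; ++i)
            wait_read(handles[i]); /* drain it before posting the next */
    }
}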

2. libfabric DP
This adds:

  1. A progress thread that is automatically launched at the writer side if the engine indicates manual data progress. It can be turned on or off manually (also for the reader side) using export FABRIC_PROGRESS_THREAD=yes.
  2. On the reader side, fi_cq_sread() is called at appropriate places when the fabric requires manual progress.
  3. An environment variable FABRIC_PROVIDER to specify the fabric module to select. Together with the old FABRIC_IFACE, which specifies the domain, this allows for a very precise fabric selection. (The same domain can be implemented by several fabrics, leading to easy-to-break setups.)
  4. With this, support for the shm fabric for shared memory communication. FABRIC_IFACE=shm must be specified explicitly; the fabric is not selected automatically, since it does not allow cross-node communication.
  5. Some small fixes: More error-tolerant address handling and retrieval of CXI credentials.

Additionally, I noticed that some fabrics (such as psm3 on our local system) don't support FI_MR_PROV_KEY, meaning that in fi_mr_reg() the requested_key parameter must not be 0, but something else chosen by the caller. Those fabrics currently don't work, but would be relatively trivial to support by just counting keys starting from 1... Maybe as a follow-up PR.
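
A possible sketch of that follow-up (not part of this PR; names are illustrative): keep a counter and pass it as requested_key whenever the provider does not offer FI_MR_PROV_KEY.

#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <stdint.h>

/* Sketch only: when FI_MR_PROV_KEY is not in the negotiated mr_mode, the
 * application must supply unique, non-zero requested_key values itself. */
static uint64_t next_mr_key = 1; /* count keys starting from 1 */

static int register_buffer(struct fid_domain *domain, void *buf, size_t len,
                           uint64_t mr_mode, struct fid_mr **mr)
{
    uint64_t key = (mr_mode & FI_MR_PROV_KEY) ? 0 : next_mr_key++;
    return fi_mr_reg(domain, buf, len, FI_REMOTE_READ | FI_REMOTE_WRITE,
                     /* offset */ 0, key, /* flags */ 0, mr, NULL);
}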

Example from the output of fi_info:

provider: sockets
    fabric: fe80::/64
    domain: eno1
    version: 117.0
    type: FI_EP_MSG
    protocol: FI_PROTO_SOCK_TCP
provider: tcp
    fabric: fe80::/64
    domain: eno1
    version: 117.0
    type: FI_EP_MSG
    protocol: FI_PROTO_SOCK_TCP

FABRIC_IFACE is the same for both configurations above, namely eno1, but FABRIC_PROVIDER is set to either sockets or tcp.
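
At the libfabric API level this selection roughly amounts to filling both attributes in the fi_getinfo() hints (sketch with assumed environment-variable handling, not the PR's actual code):

#include <rdma/fabric.h>
#include <stdlib.h>
#include <string.h>

/* Sketch: FABRIC_PROVIDER picks the provider (e.g. "tcp", "sockets", "shm"),
 * FABRIC_IFACE picks the domain (e.g. "eno1"). The strings are duplicated
 * because fi_freeinfo() releases the attribute strings inside the hints. */
static struct fi_info *select_fabric(void)
{
    struct fi_info *hints = fi_allocinfo(), *info = NULL;
    const char *provider = getenv("FABRIC_PROVIDER");
    const char *iface = getenv("FABRIC_IFACE");

    if (provider)
        hints->fabric_attr->prov_name = strdup(provider);
    if (iface)
        hints->domain_attr->name = strdup(iface);

    fi_getinfo(FI_VERSION(1, 11), NULL, NULL, 0, hints, &info);
    fi_freeinfo(hints);
    return info; /* NULL if nothing matched */
}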

3. UCX DP

This also adds a progress thread capability similar to the libfabric DP. I found no way of automatically figuring out whether a worker needs manual progress, so it is turned off by default and must be enabled via export SST_UCX_PROGRESS_THREAD=yes. This is only implemented on the writer side, as it leads to errors in the reader, possibly due to interfering requests from the main thread and the progress thread.
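
A minimal sketch of such a writer-side progress thread (illustrative only, not the PR's code; if other threads use the worker concurrently, it presumably needs to be created with multi-thread support):

#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <ucp/api/ucp.h>

/* Illustrative sketch: repeatedly drive the UCX worker until asked to stop. */
typedef struct
{
    ucp_worker_h worker;
    atomic_bool stop;
} progress_ctx_t;

static void *ucx_progress_loop(void *arg)
{
    progress_ctx_t *ctx = arg;
    while (!atomic_load(&ctx->stop))
    {
        /* returns the number of progressed events; 0 means nothing to do */
        if (ucp_worker_progress(ctx->worker) == 0)
            sched_yield(); /* avoid a pure busy loop */
    }
    return NULL;
}

static pthread_t launch_progress_thread(progress_ctx_t *ctx)
{
    pthread_t thread;
    pthread_create(&thread, NULL, ucx_progress_loop, ctx);
    return thread;
}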

Additionally, this fixes a forgotten call to ucp_rkey_destroy().

For shared memory communication, the following environment variables must be set:

export SST_UCX_PROGRESS_THREAD=yes
export UCX_TLS=shm # restricts UCX to shared memory
export UCX_POSIX_USE_PROC_LINK=n # workaround for some bug in UCX

When I tested this on multiple nodes (data exchange via SST only intra-node), I noticed that the UCX DP reader side will contact all writers upon initialization in UcxProvideWriterDataToReader(). I don't know if this is intentional (seems like a potential scaling issue), but this obviously cannot work in such shared memory setups. So I changed that function to ignore a connection failure at that point and print a warning instead.

I have not documented anything yet except for inline comments.

@eisenhauer (Member)

Thanks Franz! I'm up to my neck in other things at the moment, but will look at this when I can...

@franzpoeschel changed the title from "[WIP] Support for shared memory provider of libfabric in SST + simple support for manual data progress via threading" to "Support for shared memory provider of libfabric in SST + simple support for manual data progress via threading" on Jul 29, 2024
@franzpoeschel (Contributor, Author)

No problem, take the time you need.
I just tested this in a two-node job on Frontier: RDMA via libfabric still works, and shared memory now additionally works via a self-compiled UCX installation. (Since the system libfabric does not have the shm provider, shared memory via libfabric cannot be used there.)

@franzpoeschel (Contributor, Author)

Additionally, I noticed that some fabrics (such as psm3 on our local system) don't support FI_MR_PROV_KEY, meaning that in fi_mr_reg() the requested_key parameter must not be 0, but something else chosen by the caller. Those fabrics currently don't work, but would be relatively trivial to support by just counting keys starting from 1... Maybe as a follow-up PR.

This commit does this, but I've not added it to the PR yet. I can either do that or submit it as a follow-up.

@@ -798,6 +1027,11 @@ static int get_cxi_auth_key_from_env(CP_Services Svcs, void *CP_Stream, struct _
char const *slingshot_devices = getenv("SLINGSHOT_DEVICES");
char const *preferred_device = get_preferred_domain(Params);

if ((!preferred_device && strncmp("cxi", preferred_device, 3) != 0) || !slingshot_devices)
@franzpoeschel (Contributor, Author)

Suggested change
if ((!preferred_device && strncmp("cxi", preferred_device, 3) != 0) || !slingshot_devices)
if ((!preferred_device || strncmp("cxi", preferred_device, 3) != 0) || !slingshot_devices)

I need to check this logic again; currently it is wrong.

@franzpoeschel (Contributor, Author)

Should be fixed with ba7d6c1
