btl/uct: add support for using another memory domain to form connections #12822

Open
wants to merge 2 commits into
base: main


hjelmn
Member

@hjelmn hjelmn commented Sep 23, 2024

The UCT BTL looks for a connect-to-iface transport in each memory domain and uses it to form connections for connect-to-endpoint transports. For example, with ib the BTL picks the UD transport to set up RC. While dedicated connection transports are available (RDMACM), I chose UD (etc.) to support networks that do not necessarily provide a connection transport.

I am currently working on improving support for Open MPI on a RoCEv2 system that does not support UD (yet). This breaks the assumption that a connect-to-iface transport will always be available in every memory domain. To fix this, the change updates the detection logic to also locate a suitable alternate transport for making connections (tcp by default). If a memory domain does not have a suitable connection transport, the alternate is used instead. This has been tested on our broken-UD system and works well.

If a connection-only transport is not needed, the extra transport module is destroyed and the connection transport from within the memory domain is used.
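
The selection rule described above can be illustrated with a minimal, self-contained sketch. This is not the code from this PR and does not use the UCT API: the `transport_t` and `memory_domain_t` types and the `find_conn_tl` and `select_conn_tl` functions are hypothetical names invented for illustration, and each transport is reduced to a single connect-to-iface flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a transport (not a UCT type). */
typedef struct {
    const char *name;       /* e.g. "rc_verbs", "ud_verbs", "tcp" */
    bool connect_to_iface;  /* true if usable to bootstrap connections */
} transport_t;

/* Hypothetical model of a memory domain and the transports it offers. */
typedef struct {
    const char *name;
    const transport_t *tls;
    size_t tl_count;
} memory_domain_t;

/* Return the first connect-to-iface transport in the domain, or NULL. */
static const transport_t *find_conn_tl(const memory_domain_t *md)
{
    assert(md != NULL);
    for (size_t i = 0; i < md->tl_count; ++i) {
        if (md->tls[i].connect_to_iface) {
            return &md->tls[i];
        }
    }
    return NULL;
}

/*
 * Choose the connection transport for md. Prefer one from md itself and
 * fall back to the alternate domain (assumed here to be tcp) only when md
 * has none. *used_alternate tells the caller whether the alternate was
 * needed; when it is false the caller can release the alternate module.
 */
static const transport_t *select_conn_tl(const memory_domain_t *md,
                                         const memory_domain_t *alternate,
                                         bool *used_alternate)
{
    const transport_t *tl = find_conn_tl(md);

    *used_alternate = false;
    if (tl == NULL && alternate != NULL) {
        tl = find_conn_tl(alternate);
        *used_alternate = (tl != NULL);
    }
    return tl;
}
```

In this model, a domain offering only a connect-to-endpoint transport would resolve to the alternate, while a domain that also offers UD would resolve to UD and report that the alternate is unused.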

…ctions

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
…_connecting_endpoints_by_utilizing_tcp_on_systems_without_working_ud
@hjelmn
Member Author

hjelmn commented Oct 4, 2024

I may abandon this for now. We determined what was wrong with ud_verbs on iRDMA, so we can now use btl/uct without any modification. This change does contain some other useful cleanup, so I will probably open PRs for those changes.
