btl/ofi: Disable EFA provider in versions earlier than libfabric 1.12.0 #7973

Merged
merged 2 commits into from
Aug 12, 2020

wckzhang
Contributor

EFA incorrectly implements FI_DELIVERY_COMPLETE in earlier libfabric
versions. While FI_DELIVERY_COMPLETE would be advertised by the
provider, completions would return too early by not accounting for
bounce buffers on the receive side. This would cause the BTL
to receive early completions that lead to correctness issues.

This is not an issue in the mtl/ofi as it does not require
FI_DELIVERY_COMPLETE.

Signed-off-by: William Zhang <wilzhang@amazon.com>

@wckzhang requested a review from hppritcha on July 30, 2020, 02:32
@bwbarrett
Member

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@wckzhang
Contributor Author

Tweaking the logic to also avoid any combination of layered providers.

@wckzhang
Contributor Author

Switched to using !strncasecmp(info->fabric_attr->prov_name, "efa", 3) instead of a direct strcmp. The prefix match also excludes provider combinations such as efa;ofi_rxd.
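For readers following along, the prefix match can be sketched roughly as below. This is a minimal standalone illustration, not the code from the commit: the helper name prov_name_starts_with_efa is hypothetical, and it takes a plain string rather than a struct fi_info so that it builds without the libfabric headers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <strings.h>  /* strncasecmp (POSIX) */

/*
 * Hypothetical helper (not part of Open MPI): returns true when the
 * provider name begins with "efa", compared case-insensitively.
 * Only the first three characters are examined, so layered names
 * such as "efa;ofi_rxd" match as well.
 */
static bool prov_name_starts_with_efa(const char *prov_name)
{
    return prov_name != NULL && strncasecmp(prov_name, "efa", 3) == 0;
}
```

Note that a three-character prefix match would also accept any other provider whose name happens to start with "efa", and it would not catch a layered name in which efa is not the first component.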

@bwbarrett
Member

Re-adding the WIP label. We are having an internal discussion about utility provider behaviors.

The ofi_rxm provider is dependent upon the underlying hardware for its
implementation of FI_DELIVERY_COMPLETE. Since this can lead to early
completions, we disable the provider to avoid correctness issues.

This is not an issue in the mtl/ofi as it does not require
FI_DELIVERY_COMPLETE.

Signed-off-by: William Zhang <wilzhang@amazon.com>
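
One way to picture excluding a utility provider such as ofi_rxm is a check over the ';'-separated components of the provider name. The sketch below is an assumption for illustration only: the helper prov_name_has_component is hypothetical, the ';' separator is taken from the efa;ofi_rxd example earlier in this thread, and the actual commit may detect ofi_rxm differently.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>   /* strlen, strcspn */
#include <strings.h>  /* strncasecmp (POSIX) */

/*
 * Hypothetical helper (not part of Open MPI): returns true when one of
 * the ';'-separated components of prov_name equals `component`,
 * compared case-insensitively.
 */
static bool prov_name_has_component(const char *prov_name, const char *component)
{
    if (prov_name == NULL || component == NULL) {
        return false;
    }

    size_t want = strlen(component);
    const char *p = prov_name;

    while (*p != '\0') {
        /* Length of the current component, up to the next ';' or the end. */
        size_t len = strcspn(p, ";");

        if (len == want && strncasecmp(p, component, want) == 0) {
            return true;
        }

        p += len;
        if (*p == ';') {
            p++;  /* skip the separator */
        }
    }
    return false;
}
```

Comparing whole components rather than prefixes avoids accidental matches on names that merely start with the same characters, at the cost of a slightly longer helper.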
@bwbarrett merged commit f3832c1 into open-mpi:master on Aug 12, 2020