prov/ofi_rxm Not working, Need core provider, skipping ofi_rxm #9820
Comments
This is not a bug. You need to use verbs;ofi_rxm, as shown in your fi_info output.
Thanks Chien. I had also tried verbs;ofi_rxm. fi_info works with it, but fi_pingpong fails (it looks for ofi_rxm at the end instead of verbs;ofi_rxm):
$ FI_PROVIDER="verbs;ofi_rxm" FI_LOG_LEVEL=Debug fi_pingpong
Thank you.
By default, fi_pingpong uses FI_EP_DGRAM. Try
With fi_pingpong -e rdm, and also with -e rdm -p verbs, the output is:
libfabric:4117195:1708032872::core:core:fi_param_get_():373 variable perf_cntr=
verbs supports msg endpoints (you would need You can run
From your fi_info output and log, I'm guessing you do not have IPoIB set up. fi_pingpong requires either an IPv4 or an IPv6 address. After you have that configured, use verbs;ofi_rxm with -e rdm; that should work for you.
Describe the bug
I'm trying to run an application with FI_PROVIDER=ofi_rxm, but it fails to find ofi_rxm. I tried the same with fi_pingpong to check whether the problem was in the application, but fi_pingpong also fails.
Then I tried fi_getinfo, which returns -61 (No data available) when FI_PROVIDER=ofi_rxm, even though ofi_rxm appears in the output of plain fi_info.
To Reproduce
Steps to reproduce the behavior:
FI_PROVIDER=ofi_rxm FI_LOG_LEVEL=Debug fi_info
Expected behavior
Should print the same output as fi_info.
Output
$ FI_PROVIDER=ofi_rxm FI_LOG_LEVEL=Debug fi_info
libfabric:4082300:1708029534::core:core:fi_param_get_():373 variable perf_cntr=
libfabric:4082300:1708029534::core:core:fi_param_get_():373 variable hook=
libfabric:4082300:1708029534::core:core:fi_param_get_():373 variable hmem=
libfabric:4082300:1708029534::core:core:ofi_hmem_init():607 Hmem iface FI_HMEM_CUDA not supported
libfabric:4082300:1708029534::core:core:ofi_hmem_init():607 Hmem iface FI_HMEM_ROCR not supported
libfabric:4082300:1708029534::core:core:ofi_hmem_init():607 Hmem iface FI_HMEM_ZE not supported
libfabric:4082300:1708029534::core:core:ofi_hmem_init():607 Hmem iface FI_HMEM_NEURON not supported
libfabric:4082300:1708029534::core:core:ofi_hmem_init():607 Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:4082300:1708029534::core:core:fi_param_get_():373 variable hmem_disable_p2p=
libfabric:4082300:1708029534::core:core:fi_param_get_():373 variable mr_cache_max_size=
libfabric:4082300:1708029534::core:core:fi_param_get_():373 variable mr_cache_max_count=
libfabric:4082300:1708029534::core:core:fi_param_get_():373 variable mr_cache_monitor=
libfabric:4082300:1708029534::core:core:fi_param_get_():373 variable mr_cuda_cache_monitor_enabled=
libfabric:4082300:1708029534::core:core:fi_param_get_():373 variable mr_rocr_cache_monitor_enabled=
libfabric:4082300:1708029534::core:core:fi_param_get_():373 variable mr_ze_cache_monitor_enabled=
libfabric:4082300:1708029534::core:mr:ofi_default_cache_size():79 default cache size=526983472
libfabric:4082300:1708029534::core:core:fi_param_get_():382 read string var provider=ofi_rxm
libfabric:4082300:1708029534::core:core:fi_param_get_():373 variable universe_size=
libfabric:4082300:1708029534::core:core:fi_param_get_():373 variable av_remove_cleanup=
libfabric:4082300:1708029534::core:core:fi_param_get_():373 variable offload_coll_provider=
libfabric:4082300:1708029534::core:core:fi_param_get_():382 read string var provider_path=/storage/usersb/jalcaraz/spack/opt/spack/linux-rhel8-zen/gcc-8.5.0/libfabric-1.20.1-oqyfclnlyosaabt3jzrbkrrzj6q4cirf/lib/
libfabric:4082300:1708029534::ofi_rxm:core:fi_param_get_():373 variable enable_passthru=
libfabric:4082300:1708029534::ofi_rxm:core:fi_param_get_():373 variable buffer_size=
libfabric:4082300:1708029534::ofi_rxm:core:fi_param_get_():373 variable tx_size=
libfabric:4082300:1708029534::ofi_rxm:core:fi_param_get_():373 variable rx_size=
libfabric:4082300:1708029534::ofi_rxm:core:fi_param_get_():373 variable msg_tx_size=
libfabric:4082300:1708029534::ofi_rxm:core:fi_param_get_():373 variable msg_rx_size=
libfabric:4082300:1708029534::ofi_rxm:core:fi_param_get_():373 variable cm_progress_interval=
libfabric:4082300:1708029534::ofi_rxm:core:fi_param_get_():373 variable cq_eq_fairness=
libfabric:4082300:1708029534::ofi_rxm:core:fi_param_get_():373 variable data_auto_progress=
libfabric:4082300:1708029534::ofi_rxm:core:fi_param_get_():373 variable use_rndv_write=
libfabric:4082300:1708029534::ofi_rxm:core:fi_param_get_():373 variable def_wait_obj=
libfabric:4082300:1708029534::ofi_rxm:core:fi_param_get_():373 variable def_tcp_wait_obj=
libfabric:4082300:1708029534::core:core:ofi_register_provider():506 registering provider: ofi_rxm (120.10)
libfabric:4082300:1708029534::core:core:ofi_register_provider():506 registering provider: verbs (120.10)
libfabric:4082300:1708029534::core:core:ofi_register_provider():533 "verbs" filtered by provider include/exclude list, skipping
libfabric:4082300:1708029534::core:core:ofi_register_provider():506 registering provider: ofi_hook_perf (120.10)
libfabric:4082300:1708029534::core:core:ofi_register_provider():506 registering provider: ofi_hook_trace (120.10)
libfabric:4082300:1708029534::core:core:ofi_register_provider():506 registering provider: ofi_hook_debug (120.10)
libfabric:4082300:1708029534::core:core:fi_param_get_():373 variable hmem=
libfabric:4082300:1708029534::core:core:ofi_hmem_init():607 Hmem iface FI_HMEM_CUDA not supported
libfabric:4082300:1708029534::core:core:ofi_hmem_init():607 Hmem iface FI_HMEM_ROCR not supported
libfabric:4082300:1708029534::core:core:ofi_hmem_init():607 Hmem iface FI_HMEM_ZE not supported
libfabric:4082300:1708029534::core:core:ofi_hmem_init():607 Hmem iface FI_HMEM_NEURON not supported
libfabric:4082300:1708029534::core:core:ofi_hmem_init():607 Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:4082300:1708029534::core:core:fi_param_get_():373 variable hmem_disable_p2p=
libfabric:4082300:1708029534::core:core:ofi_register_provider():506 registering provider: ofi_hook_hmem (120.10)
libfabric:4082300:1708029534::core:core:ofi_register_provider():506 registering provider: ofi_hook_dmabuf_peer_mem (120.10)
libfabric:4082300:1708029534::core:core:ofi_register_provider():506 registering provider: ofi_hook_noop (120.10)
libfabric:4082300:1708029534::core:core:ofi_register_provider():506 registering provider: off_coll (120.10)
libfabric:4082300:1708029534::ofi_rxm:core:fi_param_get_():373 variable use_srx=
libfabric:4082300:1708029534:ofi_rxm:core:core:ofi_layering_ok():1194 Need core provider, skipping ofi_rxm
libfabric:4082300:1708029534::ofi_rxm:core:fi_param_get_():373 variable use_srx=
libfabric:4082300:1708029534:ofi_rxm:core:core:ofi_layering_ok():1194 Need core provider, skipping ofi_rxm
libfabric:4082300:1708029534::ofi_rxm:core:fi_param_get_():373 variable use_srx=
libfabric:4082300:1708029534:ofi_rxm:core:core:ofi_layering_ok():1194 Need core provider, skipping ofi_rxm
libfabric:4082300:1708029534::ofi_rxm:core:fi_param_get_():373 variable use_srx=
libfabric:4082300:1708029534:ofi_rxm:core:core:ofi_layering_ok():1194 Need core provider, skipping ofi_rxm
libfabric:4082300:1708029534::core:core:fi_getinfo_():1304 fi_getinfo: provider ofi_rxm returned -61 (No data available)
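The three lines that matter in the debug log above are easy to miss among the fi_param_get_ noise. As a rough sketch (not part of libfabric), the Python below pulls them out of a captured log. The regular expressions are assumptions based only on the message wording seen in this log, and the names `diagnose` and `LOG` are hypothetical.

```python
import re

# Three lines copied from the debug log above; the real log is much longer.
LOG = '''\
libfabric:4082300:1708029534::core:core:ofi_register_provider():533 "verbs" filtered by provider include/exclude list, skipping
libfabric:4082300:1708029534:ofi_rxm:core:core:ofi_layering_ok():1194 Need core provider, skipping ofi_rxm
libfabric:4082300:1708029534::core:core:fi_getinfo_():1304 fi_getinfo: provider ofi_rxm returned -61 (No data available)
'''

# Patterns are guesses that match the message text in this log only.
FILTERED = re.compile(r'"([^"]+)" filtered by provider include/exclude list')
NEEDS_CORE = re.compile(r'Need core provider, skipping (\S+)')
RETURNED = re.compile(r'provider (\S+) returned (-?\d+) \(([^)]*)\)')


def diagnose(log_text):
    """Collect filtered providers, skipped utility providers, and error codes."""
    findings = {"filtered": [], "needs_core": [], "errors": []}
    for line in log_text.splitlines():
        m = FILTERED.search(line)
        if m and m.group(1) not in findings["filtered"]:
            findings["filtered"].append(m.group(1))
        m = NEEDS_CORE.search(line)
        if m and m.group(1) not in findings["needs_core"]:
            findings["needs_core"].append(m.group(1))
        m = RETURNED.search(line)
        if m:
            entry = (m.group(1), int(m.group(2)), m.group(3))
            if entry not in findings["errors"]:
                findings["errors"].append(entry)
    return findings


findings = diagnose(LOG)
print(findings)
```

Read together, the findings tell the story of this issue: verbs was excluded by the FI_PROVIDER filter, so ofi_rxm had no core provider to layer over and fi_getinfo returned -61.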
$ fi_info
provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_0
version: 120.10
type: FI_EP_MSG
protocol: FI_PROTO_RDMA_CM_IB_RC
provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_0
version: 120.10
type: FI_EP_MSG
protocol: FI_PROTO_RDMA_CM_IB_RC
provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_0-xrc
version: 120.10
type: FI_EP_MSG
protocol: FI_PROTO_RDMA_CM_IB_XRC
provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_0-xrc
version: 120.10
type: FI_EP_MSG
protocol: FI_PROTO_RDMA_CM_IB_XRC
provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_0-dgram
version: 120.10
type: FI_EP_DGRAM
protocol: FI_PROTO_IB_UD
provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_0-dgram
version: 120.10
type: FI_EP_DGRAM
protocol: FI_PROTO_IB_UD
provider: verbs;ofi_rxm
fabric: IB-0xfe80000000000000
domain: mlx5_0
version: 120.10
type: FI_EP_RDM
protocol: FI_PROTO_RXM
provider: verbs;ofi_rxm
fabric: IB-0xfe80000000000000
domain: mlx5_0
version: 120.10
type: FI_EP_RDM
protocol: FI_PROTO_RXM
provider: verbs;ofi_rxm
fabric: IB-0xfe80000000000000
domain: mlx5_0
version: 120.10
type: FI_EP_RDM
protocol: FI_PROTO_RXM
provider: verbs;ofi_rxm
fabric: IB-0xfe80000000000000
domain: mlx5_0
version: 120.10
type: FI_EP_RDM
protocol: FI_PROTO_RXM
provider: verbs;ofi_rxm
fabric: IB-0xfe80000000000000
domain: mlx5_0-xrc
version: 120.10
type: FI_EP_RDM
protocol: FI_PROTO_RXM
provider: verbs;ofi_rxm
fabric: IB-0xfe80000000000000
domain: mlx5_0-xrc
version: 120.10
type: FI_EP_RDM
protocol: FI_PROTO_RXM
provider: verbs;ofi_rxm
fabric: IB-0xfe80000000000000
domain: mlx5_0
version: 120.10
type: FI_EP_RDM
protocol: FI_PROTO_RXM
provider: verbs;ofi_rxm
fabric: IB-0xfe80000000000000
domain: mlx5_0
version: 120.10
type: FI_EP_RDM
protocol: FI_PROTO_RXM
provider: verbs;ofi_rxm
fabric: IB-0xfe80000000000000
domain: mlx5_0
version: 120.10
type: FI_EP_RDM
protocol: FI_PROTO_RXM
provider: verbs;ofi_rxm
fabric: IB-0xfe80000000000000
domain: mlx5_0
version: 120.10
type: FI_EP_RDM
protocol: FI_PROTO_RXM
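The fi_info listing above is long, but the useful part is which provider string offers which endpoint type. A minimal Python sketch for summarising it follows, assuming the output has been captured as text in the `key: value` form shown above. `SAMPLE` is a trimmed excerpt of that listing, and `endpoint_types_by_provider` is a hypothetical helper name.

```python
from collections import defaultdict

# Trimmed excerpt of the fi_info output above (one record per endpoint type).
SAMPLE = """\
provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_0
version: 120.10
type: FI_EP_MSG
protocol: FI_PROTO_RDMA_CM_IB_RC
provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_0-dgram
version: 120.10
type: FI_EP_DGRAM
protocol: FI_PROTO_IB_UD
provider: verbs;ofi_rxm
fabric: IB-0xfe80000000000000
domain: mlx5_0
version: 120.10
type: FI_EP_RDM
protocol: FI_PROTO_RXM
"""


def endpoint_types_by_provider(text):
    """Map each provider string to the set of endpoint types it reports."""
    result = defaultdict(set)
    current = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "key: value" line
        key, value = key.strip(), value.strip()
        if key == "provider":
            current = value  # a "provider" line starts a new record
        elif key == "type" and current is not None:
            result[current].add(value)
    return dict(result)


summary = endpoint_types_by_provider(SAMPLE)
for provider, types in sorted(summary.items()):
    print(provider, sorted(types))
```

In this listing FI_EP_RDM is reported only under the layered name verbs;ofi_rxm, and there is no standalone ofi_rxm entry, which is why FI_PROVIDER=ofi_rxm on its own matches nothing.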
Environment:
Red Hat Enterprise Linux 8.8 (Ootpa)