prov/efa: fi_info crash in a system with mlnx but no efa device #7805

Closed
ghost opened this issue Jun 6, 2022 · 7 comments

ghost commented Jun 6, 2022

Describe the bug
libfabric source at commit 54edc09, configured with debug and valgrind.

Running fi_info on a system with a Mellanox device but no EFA device produced this output:

fi_info
double free or corruption (!prev)
Aborted (core dumped)

Here is the gdb stack trace:
(gdb) bt
#0 0x00007ffff5e1437f in raise () from //lib64/libc.so.6
#1 0x00007ffff5dfedb5 in abort () from //lib64/libc.so.6
#2 0x00007ffff5e574e7 in __libc_message () from //lib64/libc.so.6
#3 0x00007ffff5e5e5ec in malloc_printerr () from //lib64/libc.so.6
#4 0x00007ffff5e6039c in _int_free () from //lib64/libc.so.6
#5 0x00007ffff4cfd3f2 in mlx5_free_context (ibctx=0x676190) at providers/mlx5/mlx5.c:1407
#6 0x00007ffff6bf08b5 in _ibv_close_device_1_1 (context=) at libibverbs/device.c:384
#7 0x00007ffff77b3ca7 in efa_device_destruct (device=0x66fb20) at prov/efa/src/efa_device.c:180
#8 0x00007ffff77b3ecd in efa_device_list_finalize () at prov/efa/src/efa_device.c:254
#9 0x00007ffff77b3e54 in efa_device_list_initialize () at prov/efa/src/efa_device.c:237
#10 0x00007ffff77c28ea in efa_prov_initialize () at prov/efa/src/efa_fabric.c:269
#11 0x00007ffff77c92dd in fi_efa_ini () at prov/efa/src/rxr/rxr_prov.c:111
#12 0x00007ffff770d6ff in fi_ini () at src/fabric.c:856
#13 0x00007ffff770e093 in fi_getinfo (version=65552, node=0x0, service=0x0, flags=0, hints=0x0, info=0x7fffffffd740) at src/fabric.c:1101
#14 0x0000000000401cc0 in run (hints=0x0, node=0x0, port=0x0, flags=0) at util/info.c:324
#15 0x0000000000402110 in main (argc=1, argv=0x7fffffffd888) at util/info.c:448

To Reproduce
Build libfabric from source at commit 54edc09, configured with debug and valgrind, and run fi_info on a system with a Mellanox device but no EFA device. Any verbs-capable device other than EFA will probably reproduce it.

Expected behavior
fi_info to display info and not crash

Output
see description.

Environment:
Reproduced on RHEL 8.2 and 8.5 with Mellanox and rdma-core installed.


ghost added the bug label Jun 6, 2022

j-xiong commented Jun 6, 2022

@ofiwg/aws-efa-team

wzamazon self-assigned this Jun 6, 2022

wzamazon commented Jun 6, 2022

looking into it.

wzamazon commented Jun 6, 2022

#7806 should fix the issue. @chien-intel would you please try this patch?

ghost commented Jun 6, 2022

PR #7806 fixed this issue. Feel free to close it after the PR is merged.

wzamazon commented Jun 6, 2022

Thank you! Will merge after CI finishes.

wzamazon added a commit to wzamazon/libfabric-1 that referenced this issue Jun 7, 2022
This patch added a unit test for the error handling of function
efa_device_construct(), this is to reproduce the GitHub issue:

ofiwg#7805

Signed-off-by: Wei Zhang <wzam@amazon.com>

wzamazon commented Jun 7, 2022

PR merged. I also checked that this issue only applies to the main branch, so no backport is needed.

Closing ...

ghost commented Jun 7, 2022

thank you.

ghost closed this as completed Jun 7, 2022
jtamzn pushed a commit to jtamzn/libfabric that referenced this issue Oct 19, 2022
This patch added a unit test for the error handling of function
efa_device_construct(), this is to reproduce the GitHub issue:

ofiwg#7805

Signed-off-by: Wei Zhang <wzam@amazon.com>