prov/verbs;ofi_rxm: rxm_handle_error():793<warn> fi_eq_readerr: err: Connection refused (111), prov_err: Unknown error -8 (-8) #7880
The driver is always involved when creating/destroying a QP with a verbs-capable device.
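For reference, here is a minimal standalone sketch (not DAOS or libfabric code; the peer address and port are placeholders) of the librdmacm calls that sit behind every verbs connection. Each QP create/destroy goes through libibverbs into the kernel driver, which matches the ioctl() at the top of the stack trace quoted in the issue description.

```c
/* Hedged sketch: illustrates the librdmacm QP life cycle, not the DAOS code path. */
#include <stdio.h>
#include <stdlib.h>
#include <rdma/rdma_cma.h>

int main(void)
{
    struct rdma_addrinfo hints = { .ai_port_space = RDMA_PS_TCP }, *res;
    struct ibv_qp_init_attr qp_attr = {
        .cap = { .max_send_wr = 16, .max_recv_wr = 16,
                 .max_send_sge = 1,  .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,
    };
    struct rdma_cm_id *id;

    /* Placeholder peer; resolution fails if this is not an RDMA-capable address. */
    if (rdma_getaddrinfo("192.168.0.1", "7471", &hints, &res)) {
        perror("rdma_getaddrinfo");
        return EXIT_FAILURE;
    }

    /* Allocates the rdma_cm_id and its QP: already a trip into the kernel. */
    if (rdma_create_ep(&id, res, NULL, &qp_attr)) {
        perror("rdma_create_ep");
        rdma_freeaddrinfo(res);
        return EXIT_FAILURE;
    }

    /* Tears the QP down again: rdma_destroy_qp -> ibv_cmd_destroy_qp -> ioctl,
     * the same frames the server stack dumps keep showing. */
    rdma_destroy_ep(id);
    rdma_freeaddrinfo(res);
    return EXIT_SUCCESS;
}
```

Built with something like `cc qp_sketch.c -lrdmacm`; the point is only that every connection setup/teardown carries a kernel round-trip, so doing it tens of thousands of times in a loop is expensive.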
Hello @chien-intel, and first of all thanks for your quick reply and ideas! Unfortunately, that did not produce any log/trace in dmesg during the critical phase :-( and you will find the output attached. The IO500 compute job we use to reproduce this ran on 32 nodes with 64 tasks/node, connected to a single DAOS server node with 2 server instances/engines and 20 targets/engine (one dedicated HDR interface per engine), leading to 80K concurrent connections (40K on each HDR). This seems consistent with the number of QPs being reported for each IB interface.
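For reference, those figures work out as follows (assuming one connection per client task per target, which is what the numbers imply): 32 nodes × 64 tasks/node = 2048 client tasks and 2 engines × 20 targets/engine = 40 targets, so 2048 × 40 = 81,920 ≈ 80K connections in total, and 2048 × 20 = 40,960 ≈ 40K per HDR interface.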
Can you double-check your system setup and make sure all nodes are using the same version of the software?
Oops, I need to apologize: the unwound stack I added comes from our first investigations, when running with libfabric 1.14.0 :-( Sorry about that; I will get a new version of it and attach it here soon.
So the new stack (i.e. with libfabric-1.15.1) of the DAOS engine threads most likely involved in the performance degradation now looks like this:
Is the stack trace from the client or the server?
The stack is from the server side.
Sorry for the delay. ulimit output from the server side; the same values are displayed for each DAOS server/engine:
ulimit output from the client side (I have checked that, as expected, all clients run with the same environment) always looks like this for each process/task:
But there are no "create_qp" messages in dmesg nor in /var/log/messages, on either the server or the client side.
What you said at the end is interesting. If the server is trying to send messages to clients after(?) all clients have exited, then of course the server would get the reject. What is the server side trying to do? What was the last successful operation, and what was the first failure on the client side before it exited?
But the reject messages had already been looping for a long time while client tasks still had a while left to run.
Well, the stack should tell you.
When I say the clients have exited, I mean a graceful exit, not an exit due to some error/exception. The problem here is a performance issue caused by the overhead of the unexplained behaviour described above.
I'm standing in for Bruno, who is on vacation. The cluster on which this is reproduced/debugged has meanwhile been upgraded to MLNX_OFED_LINUX-5.6-2.0.9.0 and daos-2.1.104, which includes libfabric-1.15.1-2.el8.x86_64. No change in behavior. Can we work on an action plan for how to further debug this performance issue? It is severely limiting the scalability of the DAOS storage solution in terms of the number of client tasks (by at least 10x), and we need to get to the root cause.
Thanks @chien-intel for the constructive call. Action plan:
This issue is stale because it has been open 360 days with no activity. Remove stale label or comment, otherwise it will be closed in 7 days.
Well, how can I do that?
You just did, by adding a comment.
We have not observed this issue with the latest version of DAOS; closing.
Describe the bug
We experience a huge performance slow-down when running the mdtest-hard-read phase of IO500 with 32 client nodes and 64 tasks/node against a single DAOS server.
There are tons (billions) of occurrences of the following libfabric log-message sequence during the same period of time:
libfabric:1522427:1657212692::ofi_rxm:ep_ctrl:rxm_handle_error():793 fi_eq_readerr: err: Connection refused (111), prov_err: Unknown error -8 (-8)
libfabric:1522427:1657212692::ofi_rxm:ep_ctrl:rxm_process_reject():543 Processing reject for handle: 0x7f19d4984880
libfabric:1522427:1657212692::ofi_rxm:ep_ctrl:rxm_process_reject():565 closing conn 0x7f19d4984880, reason 1
Taking some server stack dumps on the fly always shows threads with the following stack/context:
#0 0x00007fc6270ad62b in ioctl () from /lib64/libc.so.6
#1 0x00007fc620d5325a in execute_ioctl () from /lib64/libibverbs.so.1
#2 0x00007fc620d525bf in _execute_ioctl_fallback () from /lib64/libibverbs.so.1
#3 0x00007fc620d54dbf in ibv_cmd_destroy_qp () from /lib64/libibverbs.so.1
#4 0x00007fc5f41fbe59 in mlx5_destroy_qp () from /usr/lib64/libibverbs/libmlx5-rdmav34.so
#5 0x00007fc620b336a1 in rdma_destroy_qp () from /lib64/librdmacm.so.1
#6 0x00007fc620b35dd4 in rdma_destroy_ep () from /lib64/librdmacm.so.1
#7 0x00007fc6215e9d8c in vrb_ep_close (fid=0x7fbee113c420) at prov/verbs/src/verbs_ep.c:513
#8 0x00007fc62160b083 in fi_close (fid=<optimized out>) at ./include/rdma/fabric.h:603
#9 rxm_close_conn (conn=<optimized out>) at prov/rxm/src/rxm_conn.c:88
#10 rxm_close_conn (conn=0x7fc3e507ed38) at prov/rxm/src/rxm_conn.c:58
#11 0x00007fc62160c01d in rxm_process_reject (entry=0x47e3340, entry=0x47e3340, conn=<optimized out>) at prov/rxm/src/rxm_conn.c:446
#12 rxm_handle_error (ep=ep@entry=0x7fc218055fe0) at prov/rxm/src/rxm_conn.c:660
#13 0x00007fc62160c3a0 in rxm_conn_progress (ep=ep@entry=0x7fc218055fe0) at prov/rxm/src/rxm_conn.c:703
#14 0x00007fc62160c465 in rxm_get_conn (ep=ep@entry=0x7fc218055fe0, addr=addr@entry=1176, conn=conn@entry=0x47e34a8) at prov/rxm/src/rxm_conn.c:393
#15 0x00007fc62161160d in rxm_ep_tsend (ep_fid=0x7fc218055fe0, buf=<optimized out>, len=<optimized out>, desc=<optimized out>, dest_addr=1176, tag=409959, context=0x7fc2185c13f8) at prov/rxm/src/rxm_ep.c:2120
#16 0x00007fc625bb7aec in na_ofi_progress () from /lib64/libna.so.2
#17 0x00007fc625baee73 in NA_Progress () from /lib64/libna.so.2
#18 0x00007fc625fe597e in hg_core_progress_na () from /lib64/libmercury.so.2
#19 0x00007fc625fe7f53 in hg_core_progress () from /lib64/libmercury.so.2
#20 0x00007fc625fed05b in HG_Core_progress () from /lib64/libmercury.so.2
#21 0x00007fc625fdf193 in HG_Progress () from /lib64/libmercury.so.2
#22 0x00007fc628a41616 in crt_hg_progress (hg_ctx=hg_ctx@entry=0x7fc218049138, timeout=timeout@entry=0) at src/cart/crt_hg.c:1285
#23 0x00007fc628a01607 in crt_progress (crt_ctx=0x7fc218049120, timeout=0) at src/cart/crt_context.c:1472
#24 0x000000000043c8fd in dss_srv_handler (arg=0x46f2a30) at src/engine/srv.c:486
#25 0x00007fc627c72ece in ABTD_ythread_func_wrapper (p_arg=0x47e3ce0) at arch/abtd_ythread.c:21
#26 0x00007fc627c73071 in make_fcontext () from /lib64/libabt.so.1
This confirms the messages being logged and also explains the performance cost, since the kernel is involved in reaching the IB adapter and destroying the QPs.
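For context on where the "fi_eq_readerr: err: Connection refused (111)" lines come from: a rejected connection request is reported through the endpoint's event queue and read back as an error entry. The following is a hedged sketch, not the rxm provider's actual code, and it assumes an already-opened event queue `eq`:

```c
/* Hedged sketch: read and decode one EQ error entry, the path that produces
 * the "fi_eq_readerr: err: Connection refused (111), prov_err: ..." lines. */
#include <stdio.h>
#include <rdma/fabric.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>

void drain_one_eq_error(struct fid_eq *eq)
{
    struct fi_eq_err_entry err_entry = { 0 };
    char prov_msg[128];
    ssize_t ret;

    ret = fi_eq_readerr(eq, &err_entry, 0);
    if (ret < 0)
        return; /* typically -FI_EAGAIN: no error entry pending */

    /* err is a positive fabric errno (111 == FI_ECONNREFUSED in the logs);
     * prov_errno is provider specific (-8 in the logs above). */
    fprintf(stderr, "fi_eq_readerr: err: %s (%d), prov_err: %s (%d)\n",
            fi_strerror(err_entry.err), err_entry.err,
            fi_eq_strerror(eq, err_entry.prov_errno, err_entry.err_data,
                           prov_msg, sizeof(prov_msg)),
            err_entry.prov_errno);
}
```

The rxm provider then maps such a reject onto closing the underlying verbs connection, which is the rdma_destroy_ep()/rdma_destroy_qp() sequence visible in frames #5 through #9 of the stack above.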
To Reproduce
Steps to reproduce the behavior:
This is 100% reproducible with the configuration indicated above.
libfabric-1.15.1 is being used with MOFED 5.5.
Expected behavior
If needed, a clear and concise description of what you expected to happen.
Output
If applicable, add output to help explain your problem. (e.g. backtrace, debug logs)
Environment:
OS (if not Linux), provider, endpoint type, etc.
Additional context
Add any other context about the problem here.