I encountered a major issue as soon as I upgraded the Linux kernel
of the Debian GNU/Linux box where OpenSM runs (in order to manage the InfiniBand
network of an HPC cluster).
Basically, with Linux kernel version 6.10.11, everything works.
As soon as I reboot with a more recent Linux kernel (6.11.2, 6.11.4, 6.11.5), I find the following in the logs:
138670 [F0D43740] 0x03 -> OpenSM 3.3.23
138934 [F0D43740] 0x80 -> OpenSM 3.3.23
140409 [F0D43740] 0x02 -> osm_vendor_init: 1000 pending umads specified
140628 [F0D43740] 0x80 -> Entering DISCOVERING state
140711 [F0D43740] 0x02 -> osm_vendor_bind: Mgmt class 0x81 binding to port GUID 0x9c63c00300033240
165028 [F0D43740] 0x01 -> osm_vendor_bind: ERR 5426: Unable to register class 129 version 1
165157 [F0D43740] 0x01 -> osm_sm_mad_ctrl_bind: ERR 3118: Vendor specific bind failed
165162 [F0D43740] 0x01 -> osm_sm_bind: ERR 2E10: SM MAD Controller bind failed (IB_ERROR)
165173 [F0D43740] 0x01 -> perfmgr_mad_unbind: ERR 5405: No previous bind
165176 [F0D43740] 0x01 -> osm_congestion_control_shutdown: ERR C108: No previous bind
165217 [F0D43740] 0x01 -> osm_sa_mad_ctrl_unbind: ERR 1A11: No previous bind
165630 [F0D43740] 0x80 -> Exiting SM
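To make the failure chain easier to read, here is a minimal sketch that pulls the `ERR` entries out of a log excerpt like the one above. The line shape it expects is inferred only from these sample lines, and the names `extract_errors` and `ERR_RE` are hypothetical helpers for illustration, not part of OpenSM. It also shows that class 129 in the first error is the same management class as the 0x81 in the preceding bind line.

```python
import re

# Sample lines copied from the log excerpt above.
LOG = """\
140711 [F0D43740] 0x02 -> osm_vendor_bind: Mgmt class 0x81 binding to port GUID 0x9c63c00300033240
165028 [F0D43740] 0x01 -> osm_vendor_bind: ERR 5426: Unable to register class 129 version 1
165157 [F0D43740] 0x01 -> osm_sm_mad_ctrl_bind: ERR 3118: Vendor specific bind failed
165162 [F0D43740] 0x01 -> osm_sm_bind: ERR 2E10: SM MAD Controller bind failed (IB_ERROR)
"""

# Assumed line shape (inferred from the samples, not from OpenSM documentation):
#   "<time> [<thread>] <level> -> <function>: ERR <code>: <message>"
ERR_RE = re.compile(r"-> (?P<func>\w+): ERR (?P<code>[0-9A-F]{4}): (?P<msg>.*)$")


def extract_errors(text):
    """Return (function, code, message) for every ERR line, in log order."""
    errors = []
    for line in text.splitlines():
        match = ERR_RE.search(line)
        if match:
            errors.append((match["func"], match["code"], match["msg"]))
    return errors


errors = extract_errors(LOG)

# The first error in log order is the one the later failures follow from.
first_func, first_code, first_msg = errors[0]
print(f"first failure: {first_func} ERR {first_code}: {first_msg}")

# The bind line reports the class in hex, the error line in decimal.
bind_class = int("0x81", 16)
print(f"class in bind line as decimal: {bind_class}")
```

Read this way, the first failure is the registration of the management class in `osm_vendor_bind`; the bind errors that follow it are reported after that registration fails.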
OpenSM fails to start and the InfiniBand network does not work:
# ps aux | grep opens[m]
# ibnodes
ibwarn: [1795] mad_rpc_open_port: client_register for mgmt 1 failed
./libibnetdisc/ibnetdisc.c:798; can't open MAD port ((null):0)
/usr/sbin/ibnetdiscover: iberror: failed: discover failed
ibwarn: [1800] mad_rpc_open_port: client_register for mgmt 1 failed
./libibnetdisc/ibnetdisc.c:798; can't open MAD port ((null):0)
/usr/sbin/ibnetdiscover: iberror: failed: discover failed
If I reboot with the previous Linux kernel version, everything
works again.
I cannot understand what's going on.
Is there any important change in the Linux kernel that OpenSM needs
to adapt to?
Or is this a bug in the newer Linux kernel version (that needs to
be fixed there)?
Please note that the other cluster nodes can run with the newest
version (6.11.x) of the Linux kernel and connect to the InfiniBand
network, as long as the node which runs OpenSM is using the 6.10.11
Linux kernel version. Hence, it does not seem that the Linux kernel
version 6.11.x broke its support for (mlx5) InfiniBand networks:
it's just that OpenSM 3.3.23 and Linux kernel 6.11.x don't seem to work together...
For more information, please see the bug report on the Debian BTS.
Please investigate this issue.
Thanks for any help you may provide!