OpenSM fails to start after Linux kernel upgrade to 6.11.2 #37

Open
frx-wintermute opened this issue Nov 7, 2024 · 0 comments

I ran into a major issue as soon as I upgraded the Linux kernel
on the Debian GNU/Linux box that runs OpenSM (which manages the InfiniBand
network of an HPC cluster).

Basically, with Linux kernel version 6.10.11, everything works.

As soon as I reboot with a more recent Linux kernel (6.11.2, 6.11.4, 6.11.5), I find the following in the logs:

  138670 [F0D43740] 0x03 -> OpenSM 3.3.23
  138934 [F0D43740] 0x80 -> OpenSM 3.3.23
  140409 [F0D43740] 0x02 -> osm_vendor_init: 1000 pending umads specified
  140628 [F0D43740] 0x80 -> Entering DISCOVERING state
  140711 [F0D43740] 0x02 -> osm_vendor_bind: Mgmt class 0x81 binding to port GUID 0x9c63c00300033240
  165028 [F0D43740] 0x01 -> osm_vendor_bind: ERR 5426: Unable to register class 129 version 1
  165157 [F0D43740] 0x01 -> osm_sm_mad_ctrl_bind: ERR 3118: Vendor specific bind failed
  165162 [F0D43740] 0x01 -> osm_sm_bind: ERR 2E10: SM MAD Controller bind failed (IB_ERROR)
  165173 [F0D43740] 0x01 -> perfmgr_mad_unbind: ERR 5405: No previous bind
  165176 [F0D43740] 0x01 -> osm_congestion_control_shutdown: ERR C108: No previous bind
  165217 [F0D43740] 0x01 -> osm_sa_mad_ctrl_unbind: ERR 1A11: No previous bind
  165630 [F0D43740] 0x80 -> Exiting SM
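
To make the failure chain in the excerpt above easier to scan, here is a minimal Python sketch that pulls out only the `ERR` entries. It assumes every entry has the layout `<timestamp> [<thread id>] 0x<level> -> <message>` seen in the excerpt; the function name `extract_errors` and the `SAMPLE` constant are made up for illustration and are not part of OpenSM.

```python
import re

# Assumed entry layout, inferred from the excerpt only:
#   <timestamp> [<thread id>] 0x<level> -> <message>
LINE_RE = re.compile(
    r"^\s*(\d+)\s+\[([0-9A-Fa-f]+)\]\s+0x([0-9A-Fa-f]{2})\s+->\s+(.*)$"
)
# Error messages in the excerpt look like "<function>: ERR <code>: <text>".
ERR_RE = re.compile(r"^(\w+): ERR ([0-9A-Fa-f]{4}): (.*)$")


def extract_errors(log_text):
    """Return one dict per ERR entry, in the order they appear in the log."""
    errors = []
    for line in log_text.splitlines():
        entry = LINE_RE.match(line)
        if entry is None:
            continue  # not a log entry in the assumed layout
        err = ERR_RE.match(entry.group(4))
        if err is None:
            continue  # informational entry, no ERR code
        errors.append(
            {
                "timestamp": int(entry.group(1)),
                "level": entry.group(3),
                "function": err.group(1),
                "code": err.group(2),
                "message": err.group(3),
            }
        )
    return errors


# Three entries copied from the excerpt above.
SAMPLE = """\
  140711 [F0D43740] 0x02 -> osm_vendor_bind: Mgmt class 0x81 binding to port GUID 0x9c63c00300033240
  165028 [F0D43740] 0x01 -> osm_vendor_bind: ERR 5426: Unable to register class 129 version 1
  165157 [F0D43740] 0x01 -> osm_sm_mad_ctrl_bind: ERR 3118: Vendor specific bind failed
"""

if __name__ == "__main__":
    for e in extract_errors(SAMPLE):
        print(e["code"], e["function"], e["message"])
```

The first entry returned is the earliest error in the log. In the excerpt that is `ERR 5426` from `osm_vendor_bind`, and the later errors appear to follow from it.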

OpenSM fails to start and the InfiniBand network does not work:

# ps aux | grep opens[m]

# ibnodes
ibwarn: [1795] mad_rpc_open_port: client_register for mgmt 1 failed
./libibnetdisc/ibnetdisc.c:798; can't open MAD port ((null):0)
/usr/sbin/ibnetdiscover: iberror: failed: discover failed
ibwarn: [1800] mad_rpc_open_port: client_register for mgmt 1 failed
./libibnetdisc/ibnetdisc.c:798; can't open MAD port ((null):0)
/usr/sbin/ibnetdiscover: iberror: failed: discover failed
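
Since both OpenSM and `ibnodes` fail when registering with the MAD layer, one thing that may be worth comparing between the two kernels is whether the user MAD devices are exposed at all. The Python sketch below lists them. It assumes the usual `/sys/class/infiniband_mad/umadN/{ibdev,port}` sysfs layout, and the helper name `list_umad_devices` is made up for illustration. It only shows which devices are present; it does not explain why registration fails.

```python
from pathlib import Path


def list_umad_devices(sysfs_root="/sys/class/infiniband_mad"):
    """Return (umad name, IB device, port number) for each umad entry found.

    The default path is an assumption about the sysfs layout; pass another
    root if the system differs.
    """
    root = Path(sysfs_root)
    found = []
    if not root.is_dir():
        return found  # class directory missing: no umad devices visible
    for entry in sorted(root.glob("umad*")):
        ibdev = entry / "ibdev"
        port = entry / "port"
        if ibdev.is_file() and port.is_file():
            found.append(
                (entry.name, ibdev.read_text().strip(), int(port.read_text().strip()))
            )
    return found


if __name__ == "__main__":
    devices = list_umad_devices()
    if not devices:
        print("no umad devices found")
    for name, ibdev, port in devices:
        print(f"{name}: {ibdev} port {port}")
```

If the list is the same under 6.10.11 and 6.11.x, the devices are being created either way and the problem is more likely in the registration step itself.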

If I reboot with the previous Linux kernel version, everything
works again.

I cannot understand what's going on.

Is there an important change in the Linux kernel that OpenSM needs
to adapt to?
Or is this a bug in the newer Linux kernel versions that needs to
be fixed there?

Please note that the other cluster nodes can run the newest
Linux kernel (6.11.x) and still connect to the InfiniBand
network, as long as the node that runs OpenSM uses Linux kernel
6.10.11. Hence, Linux kernel 6.11.x does not seem to have broken
support for (mlx5) InfiniBand networks in general:
it's just that OpenSM 3.3.23 and Linux kernel 6.11.x don't seem to work together...

For more information, please see the bug report on the Debian BTS.

Please investigate this issue.
Thanks for any help you may provide!
