[BUG] v3.11.8: adding one IB NIC works, but a second IB NIC cannot be added #21704

Closed
khw934 opened this issue Nov 26, 2024 · 11 comments

khw934 commented Nov 26, 2024

What happened:
On v3.11.8, two IB NICs cannot be added to a VM; adding a single IB NIC works.
Environment:

  • OS (e.g. cat /etc/os-release):

root@hnrsjia-node:~# cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
root@hnrsjia-node:~#

  • Kernel (e.g. uname -a):

root@hnrsjia-node:~# uname -a
Linux hnrsjia-node 5.15.0-25-generic #25-Ubuntu SMP Wed Mar 30 15:54:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
root@hnrsjia-node:~#

  • Host: (e.g. dmidecode | egrep -i 'manufacturer|product' |sort -u)

root@hnrsjia-node:~# dmidecode | egrep -i 'manufacturer|product' |sort -u
Manufacturer: 3C0A
Manufacturer: DELTA
Manufacturer: Giga Computing
Manufacturer: Intel(R) Corporation
Manufacturer: Micron
Memory Subsystem Controller Manufacturer ID: Unknown
Memory Subsystem Controller Product ID: Unknown
Module Manufacturer ID: Bank 1, Hex 0x2C
Module Product ID: 0x2C00
Product Name: G593-SD2-AAX1-000
Product Name: MSB3-G40-000
idProduct: 0xffb0
root@hnrsjia-node:~#

  • Service Version (e.g. kubectl exec -n onecloud $(kubectl get pods -n onecloud | grep climc | awk '{print $1}') -- climc version-list):
0bda281f3ae019990c3fca59527d565 17656fd1ba0a860c239bee9a19b78ad f2abba66b386718a8ecf26430d20c83
khw934 added the bug label Nov 26, 2024
@wanyaoqi
Member

@khw934 Please describe your operations and environment in detail, and export a copy of the host log.

khw934 commented Nov 27, 2024

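@khw934 Please describe your operations and environment in detail, and export a copy of the host log.

The compute node is a GPU node with 8 GPUs and 8 IB NICs. For the VMs created so far, assigning one IB NIC via SR-IOV works fine. When assigning 2, an error is reported during the configuration sync.

Following this document, https://www.cloudpods.org/docs/guides/onpremise/vminstance/passthrough/sriov/, the IB NICs have already been virtualized successfully.
[image]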
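
For anyone reproducing this, one way to double-check that the IB cards really expose virtual functions is to read the standard Linux sysfs attributes `sriov_numvfs` and `sriov_totalvfs` under each InfiniBand device. The sketch below is only an illustration under that assumption: the function name `list_sriov_devices` is hypothetical, it is not part of Cloudpods, and it says nothing about how the host agent itself detects VFs.

```python
from pathlib import Path


def list_sriov_devices(sysfs_root="/sys"):
    """Return {ib_device_name: (numvfs, totalvfs)} for SR-IOV capable IB devices.

    Hypothetical helper for illustration only. Devices without the
    sriov_* attributes (for example the VFs themselves) are skipped.
    """
    result = {}
    ib_class = Path(sysfs_root) / "class" / "infiniband"
    if not ib_class.is_dir():
        return result
    for dev in sorted(ib_class.iterdir()):
        num_path = dev / "device" / "sriov_numvfs"
        total_path = dev / "device" / "sriov_totalvfs"
        if not (num_path.is_file() and total_path.is_file()):
            continue
        result[dev.name] = (
            int(num_path.read_text().strip()),
            int(total_path.read_text().strip()),
        )
    return result


if __name__ == "__main__":
    for name, (num, total) in list_sriov_devices().items():
        print(f"{name}: {num}/{total} VFs enabled")
```

A physical function that reports 0 enabled VFs has not been virtualized yet, whatever the BIOS or firmware settings say.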

The error log when adding the IB NIC is as follows:
[image]

iberror-host.log

@wanyaoqi
Member

It looks like there is a problem with the logic for adding an IB NIC while the VM is powered on.

khw934 commented Nov 27, 2024

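It looks like there is a problem with the logic for adding an IB NIC while the VM is powered on.

So it has to be added with the VM shut down?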

@wanyaoqi
Member

@khw934 Try adding it with the VM shut down first. IB NICs have not been tested much, so some scenarios may have problems.

khw934 commented Nov 27, 2024

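@khw934 Try adding it with the VM shut down first. IB NICs have not been tested much, so some scenarios may have problems.

Adding it with the VM shut down works.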

@wanyaoqi
Member

Mellanox/iproute2#1

@wanyaoqi
Member

It seems you need to install the ip package that ships in the OFED bundle.

khw934 commented Nov 27, 2024

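It seems you need to install the ip package that ships in the OFED bundle.

Is this a problem in the code, or is a component missing from the installation? What should I do for now?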

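@wanyaoqi
Member

It is a problem with the ip command. You can install the iproute package from the mlnx driver bundle and try again; look for the iproute package inside the mlnx driver bundle you built.
https://docs.nvidia.com/networking/display/mlnxofedv561033/known+issues
[image]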
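
Before and after installing that package, it can help to confirm which `ip` binary is first on `PATH` and which iproute2 version it reports. The following is a minimal sketch, assuming the host resolves `ip` through `PATH` (the thread does not confirm this). The helper names `parse_iproute2_version` and `describe_ip_binary` are hypothetical; `ip -V` is the standard iproute2 version flag. The version string alone does not prove the binary came from the mlnx bundle, it only makes a before/after comparison possible.

```python
import re
import shutil
import subprocess


def parse_iproute2_version(output):
    """Extract the version token from `ip -V` output, or None if not recognized."""
    match = re.search(r"iproute2-(\S+?)(?:,|\s|$)", output)
    return match.group(1) if match else None


def describe_ip_binary():
    """Return (path, version) for the first `ip` found on PATH.

    Both values are None when no `ip` binary is found.
    """
    path = shutil.which("ip")
    if path is None:
        return None, None
    proc = subprocess.run([path, "-V"], capture_output=True, text=True, check=False)
    # Combine stdout and stderr in case the banner goes to either stream.
    return path, parse_iproute2_version(proc.stdout + proc.stderr)


if __name__ == "__main__":
    ip_path, ip_version = describe_ip_binary()
    print(f"ip binary: {ip_path}, iproute2 version: {ip_version}")
```

If the path or version does not change after the installation, the stock distribution binary is probably still the one being picked up.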

github-actions bot commented Jan 4, 2025

If you do not provide feedback for more than 37 days, we will close the issue and you can either reopen it or submit a new issue.


github-actions bot closed this as completed Jan 4, 2025