Thanks. I noted that post and looked for the corresponding setting in my BIOS (Z690 mainboard). I found a 4GB MMIO option. By default it is tied to Resizable BAR, and it can only be disabled by disabling Resizable BAR as well. I tried that, but it did not help: the failing direction still fails and the other direction still works. I hope someone else can share a solution or some insight. Thanks anyway.
Hello,
I am testing my two P100s in two nodes with two cx555 NICs.
The test succeeds in one direction but fails in the other.
Success
./ib_write_bw --use_cuda=0 -a 10.10.10.11
./ib_write_bw -d mlx5_0 --use_cuda=0 -a
Fail
./ib_write_bw --use_cuda=0 -a
ethernet_read_keys: Couldn't read remote address
Unable to read to socket/rdma_cm
Failed to exchange data between server and clients
./ib_write_bw -d mlx5_0 --use_cuda=0 -a 10.10.10.10
Completion with error at client
Failed status 4: wr_id 0 syndrom 0x51
scnt=128, ccnt=0
Failed to complete run_iter_bw function successfully
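For reference, the number in "Failed status 4" is the work-completion status from libibverbs, and value 4 in the `ibv_wc_status` enum is `IBV_WC_LOC_PROT_ERR` (local protection error). The sketch below decodes the first few values. The helper names `wc_status_name` and `decode_failed_status` are hypothetical and not part of perftest, and the vendor syndrome (0x51) is deliberately left undecoded here.

```shell
# Map a numeric completion status (as printed in "Failed status N") to the
# matching ibv_wc_status name from <infiniband/verbs.h>.
# wc_status_name is a hypothetical helper, not part of perftest.
wc_status_name() {
  case "$1" in
    0) echo IBV_WC_SUCCESS ;;
    1) echo IBV_WC_LOC_LEN_ERR ;;
    2) echo IBV_WC_LOC_QP_OP_ERR ;;
    3) echo IBV_WC_LOC_EEC_OP_ERR ;;
    4) echo IBV_WC_LOC_PROT_ERR ;;
    5) echo IBV_WC_WR_FLUSH_ERR ;;
    *) echo "unknown ($1)" ;;
  esac
}

# Pull the status number out of a perftest error line and decode it.
decode_failed_status() {
  status=$(printf '%s\n' "$1" | sed -n 's/.*Failed status \([0-9][0-9]*\).*/\1/p')
  wc_status_name "$status"
}

decode_failed_status "Failed status 4: wr_id 0 syndrom 0x51"
```

Only statuses 0 to 5 are listed; the full enum in verbs.h has more entries.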
Bandwidth tests between the two cx555 NICs themselves work well.
Driver and Kernel:
Both cx555 NICs have the same driver and firmware.
Both P100s have the same driver but different VBIOS versions.
I am not using the NVIDIA open-source kernel modules, since the P100 is not supported, but I don't think the kernel modules are the problem; otherwise, why would one direction still work?
IOMMU status on each node:
10.10.10.11
sudo dmesg | grep -i dmar
[ 0.173076] DMAR: IOMMU disabled
sudo dmesg | grep -i iommu
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.0-41-generic root=UUID=44a5d7a3-4f19-4106-8a8c-66301c2c9d14 ro intel_iommu=off quiet splash vt.handoff=7
[ 0.173010] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-6.8.0-41-generic root=UUID=44a5d7a3-4f19-4106-8a8c-66301c2c9d14 ro intel_iommu=off quiet splash vt.handoff=7
[ 0.173076] DMAR: IOMMU disabled
[ 2.245922] iommu: Default domain type: Translated
[ 2.245922] iommu: DMA domain TLB invalidation policy: lazy mode
10.10.10.10
sudo dmesg | grep -i dmar
No output
sudo dmesg | grep -i iommu
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.0-41-generic root=UUID=6e849e25-4931-4c06-8684-bb553962f200 ro amd_iommu=off quiet splash vt.handoff=7
[ 0.030879] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-6.8.0-41-generic root=UUID=6e849e25-4931-4c06-8684-bb553962f200 ro amd_iommu=off quiet splash vt.handoff=7
[ 1.861879] iommu: Default domain type: Translated
[ 1.861879] iommu: DMA domain TLB invalidation policy: lazy mode
I have disabled the IOMMU on the kernel command line of both nodes (intel_iommu=off and amd_iommu=off respectively), but the dmesg output differs.
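Part of the difference is likely just naming: DMAR is the ACPI table used by Intel VT-d, while the AMD IOMMU driver logs under the "AMD-Vi" prefix, so an empty `grep -i dmar` on the AMD node is expected. A rough sketch for summarizing the state the same way on both nodes follows. `iommu_state` is a hypothetical helper, and the match strings are assumptions based on the usual Intel and AMD driver log prefixes.

```shell
# Classify IOMMU state from kernel log text read on stdin.
# iommu_state is a hypothetical helper; the strings it matches are
# assumptions about the usual Intel (DMAR) and AMD (AMD-Vi) log prefixes.
iommu_state() {
  log=$(cat)
  if printf '%s\n' "$log" | grep -q 'DMAR: IOMMU disabled'; then
    echo "intel: disabled"
  elif printf '%s\n' "$log" | grep -q 'DMAR: IOMMU enabled'; then
    echo "intel: enabled"
  elif printf '%s\n' "$log" | grep -q 'AMD-Vi'; then
    echo "amd: AMD-Vi messages present"
  elif printf '%s\n' "$log" | grep -Eq '(intel|amd)_iommu=off'; then
    echo "off requested on command line, no vendor messages"
  else
    echo "no IOMMU messages"
  fi
}
```

On each node it could be run as `sudo dmesg | iommu_state`. The generic "iommu: Default domain type" lines appear on both nodes and are intentionally not used for the classification.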
What could be causing this issue, and how can I dig deeper to find the cause and a solution?
Thanks