[btl_openib_component.c:3644:handle_wc REMOTE ACCESS ERROR #5047


Closed
pryancsiro opened this issue Apr 10, 2018 · 15 comments

@pryancsiro

Since we installed 2.1.3 we have been seeing errors with one-sided calls (MPI_Get in this case) when the number of ranks on each node is not the same (though not in all configurations). I can reproduce the issue with the Intel MPI Benchmarks:

$ mpirun -np 10 -npernode 4 ./IMB-RMA All_get_all -npmin 10

[[16715,1],8][btl_openib_component.c:3529:handle_wc] from c016 to: c009 error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 70ec78 opcode 2  vendor error 136 qp_idx 3
[[16715,1],0][btl_openib_component.c:3529:handle_wc] from c008 to: c009 error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 5487a8 opcode 2  vendor error 136 qp_idx 3
[[16715,1],9][btl_openib_component.c:3529:handle_wc] from c016 to: c009 error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 548028 opcode 2  vendor error 136 qp_idx 3
[[16715,1],1][btl_openib_component.c:3529:handle_wc] from c008 to: c009 error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 54a208 opcode 2  vendor error 136 qp_idx 3
[[16715,1],2][btl_openib_component.c:3529:handle_wc] from c008 to: c009 error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 548a28 opcode 2  vendor error 136 qp_idx 3
[[16715,1],3][btl_openib_component.c:3529:handle_wc] from c008 to: c009 error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 54a4a8 opcode -1  vendor error 136 qp_idx 3
[[16715,1],7][btl_openib_component.c:3529:handle_wc] from c009 to: c008 error polling LP CQ with status WORK REQUEST FLUSHED ERROR status number 5 for wr_id 54a208 opcode -1  vendor error 249 qp_idx 3

I've also tested this against 3.0.1, 3.1.0rc3, and the master branch and we still have the same issue. 1.8.8 works OK.

If I set -mca osc ^rdma it runs; it also runs if I remove fetching atomics with -mca btl_openib_flags send,put,get,need-ack,need-csum,hetero-rdma.

Perhaps this is a duplicate of #4529, though we don't see the system call error.
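For convenience, the reproduction and the two workarounds described above collected as launch commands (a sketch; the binary path and rank counts are the ones from this report, so adjust for your cluster):

```shell
# Reproduce: 10 ranks spread unevenly across nodes (4 + 4 + 2)
mpirun -np 10 -npernode 4 ./IMB-RMA All_get_all -npmin 10

# Workaround 1: disable the rdma one-sided component
mpirun -np 10 -npernode 4 --mca osc ^rdma ./IMB-RMA All_get_all -npmin 10

# Workaround 2: drop fetching atomics from the openib BTL flags
mpirun -np 10 -npernode 4 \
    --mca btl_openib_flags send,put,get,need-ack,need-csum,hetero-rdma \
    ./IMB-RMA All_get_all -npmin 10
```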

@jsquyres
Member

@hjelmn Could you have a look?

@hjelmn
Member

hjelmn commented Apr 10, 2018

Known issue in the v2.x series that I thought we fixed. Please try v3.0.1.

@hjelmn
Member

hjelmn commented Apr 10, 2018

Hmm. Or you already did. Odd.

@hjelmn
Member

hjelmn commented Apr 10, 2018

Try a nightly master tarball.

@hjelmn
Member

hjelmn commented Apr 10, 2018

I think I can actually test this again. We have a mlx5 system now.

@hjelmn
Member

hjelmn commented Apr 10, 2018

Definitely fixed on master.

@hjelmn
Member

hjelmn commented Apr 10, 2018

What interconnect are you using? I cannot reproduce this failure on a mlx5 system:

#----------------------------------------------------------------
# Benchmarking All_get_all 
# #processes = 10 
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.17         0.20         0.20
            1         1000         3.54         6.13         4.48
            2         1000         3.94         6.23         4.73
            4         1000         3.96         6.30         4.76
            8         1000         3.88         6.18         4.69
           16         1000         3.93         6.22         4.72
           32         1000         4.00         6.67         4.92
           64         1000         4.05         8.74         4.96
          128         1000         4.33         4.92         4.52
          256         1000         4.49         5.16         4.87
          512         1000         6.48         8.14         7.34
         1024         1000         7.50        11.04         8.48
         2048         1000        11.68        13.26        12.21
         4096         1000        19.04        19.78        19.44
         8192         1000        29.43        87.50        43.94
        16384         1000        70.20        71.37        71.01
        32768         1000       124.57       127.46       126.97
        65536          640       247.87       253.14       251.56
       131072          320       501.17       511.13       509.45
       262144          160      1038.54      1048.44      1045.57
       524288           80      2053.14      2062.29      2058.58
      1048576           40      4219.49      4247.98      4237.42
      2097152           20      8316.02      8407.93      8374.42
      4194304           10     16951.13     17119.45     17085.86


# All processes entering MPI_Finalize

Verified it is using btl/openib and osc/rdma.

@hjelmn
Member

hjelmn commented Apr 10, 2018

Never mind. I think I got a related failure with -n 3. Looking into it.

@hjelmn
Member

hjelmn commented Apr 10, 2018

Nope, it was just a missing commit from master that caused that failure.

I can't get this error with any configuration on this system. Will try others.

@pryancsiro
Author

It is a Mellanox ConnectX-3 system (the 1st system below). I've tested the nightly master tarball on 4 systems and get the same issue, though with some extra errors on the ConnectX-4 systems.

The 1st and 2nd systems are our clusters; the 3rd and 4th systems are external.

  • 1st system (in original post)
$ ofed_info -s
MLNX_OFED_LINUX-4.2-1.2.0.0:
$
$ lspci | grep -i mell
04:00.0 Infiniband controller: Mellanox Technologies MT27500 Family [ConnectX-3]
$
$ ibv_devices
    device                 node GUID
    ------              ----------------
    mlx4_0              e41d2d0300a31ea0
$
  • 2nd system
$ ofed_info -s
MLNX_OFED_LINUX-4.2-1.2.0.0:
$
$ /sbin/lspci | grep -i mell
05:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
$
$ ibv_devices
    device                 node GUID
    ------              ----------------
    mlx5_0              248a070300ad4786
$
  • 3rd system
$ ofed_info -s
MLNX_OFED_LINUX-4.2-1.0.0.0:
$
$ /sbin/lspci | grep -i mell
06:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
$
$ ibv_devices
    device                 node GUID
    ------              ----------------
    mlx5_0              248a0703008cab12
$
  • 4th system
$ ofed_info -s
MLNX_OFED_LINUX-4.2-1.0.0.0:
$
$ /sbin/lspci | grep -i mell
b0:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
$
$ ibv_devices
    device                 node GUID
    ------              ----------------
    mlx4_0              0002c903001ac160
$

On the 2nd system we get the error:

mlx5: b109: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
0000000c 00000000 00000000 00000000
00000000 00008813 10003fbb 000002d2
[[38911,1],8][btl_openib_component.c:3644:handle_wc] from b109 to: unknown error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 9a4668 opcode 2  vendor error 136 qp_idx 3
mlx5: b021: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
0000001c 00000000 00000000 00000000
00000000 00008813 10000cae 000031d2
[[38911,1],0][btl_openib_component.c:3644:handle_wc] from b021 to: unknown error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 74d178 opcode 2  vendor error 136 qp_idx 3
mlx5: b109: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000010 00000000 00000000 00000000
00000000 00008813 10003fc6 000169d2
[[38911,1],9][btl_openib_component.c:3644:handle_wc] from b109 to: unknown error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 74cf58 opcode 2  vendor error 136 qp_idx 3
mlx5: b021: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000008 00000000 00000000 00000000
00000000 00008813 10000c7d 0000c5d2
[[38911,1],3][btl_openib_component.c:3644:handle_wc] from b021 to: unknown error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 74d528 opcode 2  vendor error 136 qp_idx 3
mlx5: b021: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000010 00000000 00000000 00000000
00000000 00008813 10000c96 000009d2
[[38911,1],1][btl_openib_component.c:3644:handle_wc] from b021 to: unknown error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 74d258 opcode 2  vendor error 136 qp_idx 3
[[38911,1],7][btl_openib_component.c:3644:handle_wc] from b022 to: unknown error polling LP CQ with status WORK REQUEST FLUSHED ERROR status number 5 for wr_id 74d2b8 opcode 2  vendor error 244 qp_idx 3

On the 3rd system:

mlx5: r3767: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
0000000c 00000000 00000000 00000000
00000000 00008813 100001cf 000042d2
[[7064,1],8][btl_openib_component.c:3644:handle_wc] from r3767 to: unknown error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 1588828 opcode 2  vendor error 136 qp_idx 3
mlx5: r3767: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000010 00000000 00000000 00000000
00000000 00008813 100001d7 000146d2
[[7064,1],9][btl_openib_component.c:3644:handle_wc] from r3767 to: unknown error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 1a9bcc8 opcode 2  vendor error 136 qp_idx 3
[[7064,1],7][btl_openib_component.c:3644:handle_wc] from r3727 to: unknown error polling LP CQ with status WORK REQUEST FLUSHED ERROR status number 5 for wr_id 1355068 opcode 2  vendor error 244 qp_idx 3
mlx5: r3722: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
0000001c 00000000 00000000 00000000
00000000 00008813 100093c1 0000c1d2
[[7064,1],0][btl_openib_component.c:3644:handle_wc] from r3722 to: unknown error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 2494bd8 opcode 2  vendor error 136 qp_idx 3
mlx5: r3722: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000010 00000000 00000000 00000000
00000000 00008813 100093b1 0000b1d2
[[7064,1],1][btl_openib_component.c:3644:handle_wc] from r3722 to: unknown error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 225ed48 opcode 2  vendor error 136 qp_idx 3
mlx5: r3722: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000008 00000000 00000000 00000000
00000000 00008813 1000938f 0000a9d2
[[7064,1],3][btl_openib_component.c:3644:handle_wc] from r3722 to: unknown error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id f6e0d8 opcode 2  vendor error 136 qp_idx 3

On the 4th system:

[[49099,1],0][btl_openib_component.c:3644:handle_wc] from r54 to: unknown error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 14e1e38 opcode 2  vendor error 136 qp_idx 3
[[49099,1],8][btl_openib_component.c:3644:handle_wc] from r1004 to: unknown error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 2081e98 opcode 2  vendor error 136 qp_idx 3
[[49099,1],9][btl_openib_component.c:3644:handle_wc] from r1004 to: unknown error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 16b7ee8 opcode 2  vendor error 136 qp_idx 3
[[49099,1],1][btl_openib_component.c:3644:handle_wc] from r54 to: unknown error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 1d82fb8 opcode 2  vendor error 136 qp_idx 3
[[49099,1],3][btl_openib_component.c:3644:handle_wc] from r54 to: unknown error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 2715228 opcode -1  vendor error 136 qp_idx 3
[[49099,1],7][btl_openib_component.c:3644:handle_wc] from r59 to: unknown error polling LP CQ with status WORK REQUEST FLUSHED ERROR status number 5 for wr_id 1e3c958 opcode -1  vendor error 249 qp_idx 3

@hjelmn
Member

hjelmn commented Apr 12, 2018

Interesting. I will continue trying to reproduce this problem. Maybe the issue happens when there are 3 nodes (I could only get 2 the other day).

@pryancsiro
Author

I needed a minimum of 3 nodes, with 4, 4 & 2 ranks per node, though 4, 3 & 3 per node also reproduced it. Other combinations at higher node/rank counts produced this as well.
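The uneven layouts above can also be forced explicitly with Open MPI's --host slot syntax, rather than relying on -npernode filling nodes unevenly (host names n1..n3 are hypothetical placeholders):

```shell
# 4 + 4 + 2 ranks across three nodes
mpirun -np 10 --host n1:4,n2:4,n3:2 ./IMB-RMA All_get_all -npmin 10

# 4 + 3 + 3 ranks across three nodes
mpirun -np 10 --host n1:4,n2:3,n3:3 ./IMB-RMA All_get_all -npmin 10
```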

@cniethammer
Contributor

I can reproducibly get the same error running the one-sided tests of the mpi_test_suite on our system with ConnectX-3, on two nodes with 40 processes.

@jsquyres
Member

jsquyres commented Jul 9, 2018

This issue is likely about to become moot:

  1. The vendor-recommended way forward for Mellanox hardware is to use the UCX PML.
  2. In Open MPI v4.0.0, the openib BTL will only work with RoCE and iWARP by default (and it is possible that in Open MPI v5.0.0, the openib BTL will disappear altogether).

@jsquyres
Member

There's been no activity on this for months; I'm going to mark this as wontfix.

The Way Forward is to use the UCX PML.
