[btl_openib_component.c:3644:handle_wc REMOTE ACCESS ERROR #5047


Closed
pryancsiro opened this issue Apr 10, 2018 · 15 comments

@pryancsiro

Since we installed 2.1.3 we have been seeing errors with one-sided calls (MPI_Get in this case) when the number of ranks on each node is not the same (though not in all configurations). I can reproduce the issue with the Intel MPI Benchmarks:

$ mpirun -np 10 -npernode 4 ./IMB-RMA All_get_all -npmin 10

[[16715,1],8][btl_openib_component.c:3529:handle_wc] from c016 to: c009 error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 70ec78 opcode 2  vendor error 136 qp_idx 3
[[16715,1],0][btl_openib_component.c:3529:handle_wc] from c008 to: c009 error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 5487a8 opcode 2  vendor error 136 qp_idx 3
[[16715,1],9][btl_openib_component.c:3529:handle_wc] from c016 to: c009 error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 548028 opcode 2  vendor error 136 qp_idx 3
[[16715,1],1][btl_openib_component.c:3529:handle_wc] from c008 to: c009 error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 54a208 opcode 2  vendor error 136 qp_idx 3
[[16715,1],2][btl_openib_component.c:3529:handle_wc] from c008 to: c009 error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 548a28 opcode 2  vendor error 136 qp_idx 3
[[16715,1],3][btl_openib_component.c:3529:handle_wc] from c008 to: c009 error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 54a4a8 opcode -1  vendor error 136 qp_idx 3
[[16715,1],7][btl_openib_component.c:3529:handle_wc] from c009 to: c008 error polling LP CQ with status WORK REQUEST FLUSHED ERROR status number 5 for wr_id 54a208 opcode -1  vendor error 249 qp_idx 3

I've also tested this against 3.0.1, 3.1.0rc3, and the master branch and we still have the same issue. 1.8.8 works OK.

If I set -mca osc ^rdma it runs; it also runs if I remove fetching atomics with -mca btl_openib_flags send,put,get,need-ack,need-csum,hetero-rdma.

Perhaps this is a duplicate of #4529, though we don't see the system call error.
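For convenience, the reproduction and the two workarounds described above collected as launch commands (a sketch; the binary path and rank counts are the ones from this report, so adjust for your cluster):

```shell
# Reproduce: 10 ranks spread unevenly across nodes (4 + 4 + 2)
mpirun -np 10 -npernode 4 ./IMB-RMA All_get_all -npmin 10

# Workaround 1: disable the rdma one-sided component
mpirun -np 10 -npernode 4 --mca osc ^rdma ./IMB-RMA All_get_all -npmin 10

# Workaround 2: drop fetching atomics from the openib BTL flags
mpirun -np 10 -npernode 4 \
    --mca btl_openib_flags send,put,get,need-ack,need-csum,hetero-rdma \
    ./IMB-RMA All_get_all -npmin 10
```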

@jsquyres
Member

@hjelmn Could you have a look?

@hjelmn
Member

hjelmn commented Apr 10, 2018

Known issue in the v2.x series that I thought we fixed. Please try v3.0.1.

@hjelmn
Member

hjelmn commented Apr 10, 2018

Hmm. Or you already did. Odd.

@hjelmn
Member

hjelmn commented Apr 10, 2018

Try a nightly master tarball.

@hjelmn
Member

hjelmn commented Apr 10, 2018

I think I can actually test this again. We have a mlx5 system now.

@hjelmn
Member

hjelmn commented Apr 10, 2018

Definitely fixed on master.

@hjelmn
Member

hjelmn commented Apr 10, 2018

What interconnect are you using? I cannot reproduce this failure on a mlx5 system:

#----------------------------------------------------------------
# Benchmarking All_get_all 
# #processes = 10 
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.17         0.20         0.20
            1         1000         3.54         6.13         4.48
            2         1000         3.94         6.23         4.73
            4         1000         3.96         6.30         4.76
            8         1000         3.88         6.18         4.69
           16         1000         3.93         6.22         4.72
           32         1000         4.00         6.67         4.92
           64         1000         4.05         8.74         4.96
          128         1000         4.33         4.92         4.52
          256         1000         4.49         5.16         4.87
          512         1000         6.48         8.14         7.34
         1024         1000         7.50        11.04         8.48
         2048         1000        11.68        13.26        12.21
         4096         1000        19.04        19.78        19.44
         8192         1000        29.43        87.50        43.94
        16384         1000        70.20        71.37        71.01
        32768         1000       124.57       127.46       126.97
        65536          640       247.87       253.14       251.56
       131072          320       501.17       511.13       509.45
       262144          160      1038.54      1048.44      1045.57
       524288           80      2053.14      2062.29      2058.58
      1048576           40      4219.49      4247.98      4237.42
      2097152           20      8316.02      8407.93      8374.42
      4194304           10     16951.13     17119.45     17085.86


# All processes entering MPI_Finalize

Verified it is using btl/openib and osc/rdma.

@hjelmn
Member

hjelmn commented Apr 10, 2018

Never mind. I think I got a related failure with -n 3. Looking into it.

@hjelmn
Member

hjelmn commented Apr 10, 2018

Nope, it was just a missing commit from master that caused that failure.

I can't get this error with any configuration on this system. Will try others.

@pryancsiro
Author

It is a Mellanox ConnectX-3 system (the 1st system below). I've tested the nightly master tarball on 4 systems and get the same issue, though with some extra errors on the ConnectX-4 systems.

The 1st and 2nd systems are our clusters; the 3rd and 4th systems are external.

  • 1st system (in original post)
$ ofed_info -s
MLNX_OFED_LINUX-4.2-1.2.0.0:
$
$ lspci | grep -i mell
04:00.0 Infiniband controller: Mellanox Technologies MT27500 Family [ConnectX-3]
$
$ ibv_devices
    device                 node GUID
    ------              ----------------
    mlx4_0              e41d2d0300a31ea0
$
  • 2nd system
$ ofed_info -s
MLNX_OFED_LINUX-4.2-1.2.0.0:
$
$ /sbin/lspci | grep -i mell
05:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
$
$ ibv_devices
    device                 node GUID
    ------              ----------------
    mlx5_0              248a070300ad4786
$
  • 3rd system
$ ofed_info -s
MLNX_OFED_LINUX-4.2-1.0.0.0:
$
$ /sbin/lspci | grep -i mell
06:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
$
$ ibv_devices
    device                 node GUID
    ------              ----------------
    mlx5_0              248a0703008cab12
$
  • 4th system
$ ofed_info -s
MLNX_OFED_LINUX-4.2-1.0.0.0:
$
$ /sbin/lspci | grep -i mell
b0:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
$
$ ibv_devices
    device                 node GUID
    ------              ----------------
    mlx4_0              0002c903001ac160
$

On the 2nd system we get the error:

mlx5: b109: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
0000000c 00000000 00000000 00000000
00000000 00008813 10003fbb 000002d2
[[38911,1],8][btl_openib_component.c:3644:handle_wc] from b109 to: unknown error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 9a4668 opcode 2  vendor error 136 qp_idx 3
mlx5: b021: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
0000001c 00000000 00000000 00000000
00000000 00008813 10000cae 000031d2
[[38911,1],0][btl_openib_component.c:3644:handle_wc] from b021 to: unknown error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 74d178 opcode 2  vendor error 136 qp_idx 3
mlx5: b109: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000010 00000000 00000000 00000000
00000000 00008813 10003fc6 000169d2
[[38911,1],9][btl_openib_component.c:3644:handle_wc] from b109 to: unknown error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 74cf58 opcode 2  vendor error 136 qp_idx 3
mlx5: b021: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000008 00000000 00000000 00000000
00000000 00008813 10000c7d 0000c5d2
[[38911,1],3][btl_openib_component.c:3644:handle_wc] from b021 to: unknown error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 74d528 opcode 2  vendor error 136 qp_idx 3
mlx5: b021: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000010 00000000 00000000 00000000
00000000 00008813 10000c96 000009d2
[[38911,1],1][btl_openib_component.c:3644:handle_wc] from b021 to: unknown error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 74d258 opcode 2  vendor error 136 qp_idx 3
[[38911,1],7][btl_openib_component.c:3644:handle_wc] from b022 to: unknown error polling LP CQ with status WORK REQUEST FLUSHED ERROR status number 5 for wr_id 74d2b8 opcode 2  vendor error 244 qp_idx 3

On the 3rd system:

mlx5: r3767: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
0000000c 00000000 00000000 00000000
00000000 00008813 100001cf 000042d2
[[7064,1],8][btl_openib_component.c:3644:handle_wc] from r3767 to: unknown error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 1588828 opcode 2  vendor error 136 qp_idx 3
mlx5: r3767: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000010 00000000 00000000 00000000
00000000 00008813 100001d7 000146d2
[[7064,1],9][btl_openib_component.c:3644:handle_wc] from r3767 to: unknown error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 1a9bcc8 opcode 2  vendor error 136 qp_idx 3
[[7064,1],7][btl_openib_component.c:3644:handle_wc] from r3727 to: unknown error polling LP CQ with status WORK REQUEST FLUSHED ERROR status number 5 for wr_id 1355068 opcode 2  vendor error 244 qp_idx 3
mlx5: r3722: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
0000001c 00000000 00000000 00000000
00000000 00008813 100093c1 0000c1d2
[[7064,1],0][btl_openib_component.c:3644:handle_wc] from r3722 to: unknown error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 2494bd8 opcode 2  vendor error 136 qp_idx 3
mlx5: r3722: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000010 00000000 00000000 00000000
00000000 00008813 100093b1 0000b1d2
[[7064,1],1][btl_openib_component.c:3644:handle_wc] from r3722 to: unknown error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 225ed48 opcode 2  vendor error 136 qp_idx 3
mlx5: r3722: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000008 00000000 00000000 00000000
00000000 00008813 1000938f 0000a9d2
[[7064,1],3][btl_openib_component.c:3644:handle_wc] from r3722 to: unknown error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id f6e0d8 opcode 2  vendor error 136 qp_idx 3

On the 4th system:

[[49099,1],0][btl_openib_component.c:3644:handle_wc] from r54 to: unknown error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 14e1e38 opcode 2  vendor error 136 qp_idx 3
[[49099,1],8][btl_openib_component.c:3644:handle_wc] from r1004 to: unknown error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 2081e98 opcode 2  vendor error 136 qp_idx 3
[[49099,1],9][btl_openib_component.c:3644:handle_wc] from r1004 to: unknown error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 16b7ee8 opcode 2  vendor error 136 qp_idx 3
[[49099,1],1][btl_openib_component.c:3644:handle_wc] from r54 to: unknown error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 1d82fb8 opcode 2  vendor error 136 qp_idx 3
[[49099,1],3][btl_openib_component.c:3644:handle_wc] from r54 to: unknown error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 2715228 opcode -1  vendor error 136 qp_idx 3
[[49099,1],7][btl_openib_component.c:3644:handle_wc] from r59 to: unknown error polling LP CQ with status WORK REQUEST FLUSHED ERROR status number 5 for wr_id 1e3c958 opcode -1  vendor error 249 qp_idx 3

@hjelmn
Member

hjelmn commented Apr 12, 2018

Interesting. I will continue trying to reproduce this problem. Maybe the issue happens when there are 3 nodes (I could only get 2 the other day).

@pryancsiro
Author

I needed a minimum of 3 nodes, with 4, 4 & 2 ranks per node, though 4, 3 & 3 per node also reproduced it. Other combinations at higher node/rank counts produced this as well.
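The uneven layouts above can also be forced explicitly with Open MPI's --host slot syntax, rather than relying on -npernode filling nodes unevenly (host names n1..n3 are hypothetical placeholders):

```shell
# 4 + 4 + 2 ranks across three nodes
mpirun -np 10 --host n1:4,n2:4,n3:2 ./IMB-RMA All_get_all -npmin 10

# 4 + 3 + 3 ranks across three nodes
mpirun -np 10 --host n1:4,n2:3,n3:3 ./IMB-RMA All_get_all -npmin 10
```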

@cniethammer
Contributor

I can reproducibly get the same error running the one-sided tests of the mpi_test_suite on our system with ConnectX-3, on two nodes with 40 processes.

@jsquyres
Member

jsquyres commented Jul 9, 2018

This issue is likely about to become moot:

  1. The vendor-recommended way forward for Mellanox hardware is to use the UCX PML.
  2. In Open MPI v4.0.0, the openib BTL will only work with RoCE and iWARP by default (and it is possible that in Open MPI v5.0.0, the openib BTL will disappear altogether).

@jsquyres
Member

There's been no activity on this for months; I'm going to mark this as wontfix.

The Way Forward is to use the UCX PML.
