openmpi hang observed with RoCE transport and P_Write_Indv test #3378

Closed
sbasavapatna opened this issue Apr 19, 2017 · 19 comments

@sbasavapatna

sbasavapatna commented Apr 19, 2017

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

This problem has been observed with Open MPI versions v2.0.1, v2.1.0, and the nightly build v2.1.1a1.

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Downloaded the source tarballs from the links below, then built and installed them:

https://www.open-mpi.org/software/ompi/v2.0/downloads/openmpi-2.0.1.tar.gz
https://www.open-mpi.org/software/ompi/v2.1/downloads/openmpi-2.1.0.tar.gz
https://www.open-mpi.org/nightly/v2.x/openmpi-v2.x-201704120416-9d7e7a8.tar.gz

Please describe the system on which you are running

  • Operating system/version: RHEL7.3
  • Computer hardware: Dell PowerEdge R710
  • Network type: 10G-Ethernet/RoCE

Details of the problem

I'm seeing this issue with Open MPI versions v2.0.1, v2.1.0, and v2.1.1a1.

The setup uses 2 nodes with 1 process on each node, and the test case
is P_Write_Indv. The problem occurs when the test reaches the 4 MB message
size in NON-AGGREGATE mode; the test just hangs at that point.
Here is the command being used, followed by the
output log and the stack trace (from gdb):

/usr/local/mpi/openmpi/bin/mpirun -np 2 -hostfile hostfile -mca btl
self,sm,openib -mca btl_openib_receive_queues P,65536,256,192,128 -mca
btl_openib_cpc_include rdmacm -mca orte_base_help_aggregate 0
--allow-run-as-root --bind-to none --map-by node
/usr/local/imb/openmpi/IMB-IO -msglog 21:22 -include P_Write_Indv -time 300

#-----------------------------------------------------------------------------
# Benchmarking P_Write_Indv
# #processes = 1
# ( 1 additional process waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
#
#    MODE: AGGREGATE
#
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0         1000         0.56         0.56         0.56         0.00
      2097152           20     31662.31     31662.31     31662.31        63.17
      4194304           10     64159.89     64159.89     64159.89        62.34

#-----------------------------------------------------------------------------
# Benchmarking P_Write_Indv
# #processes = 1
# ( 1 additional process waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
#
#    MODE: NON-AGGREGATE
#
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0          100       570.14       570.14       570.14         0.00
      2097152           20     55007.33     55007.33     55007.33        36.36
      4194304           10     85838.17     85838.17     85838.17        46.60


#1  0x00007f08bf5af2a3 in poll_device () from /usr/local/mpi/openmpi/lib/libopen-pal.so.20
#2  0x00007f08bf5afe15 in btl_openib_component_progress () from /usr/local/mpi/openmpi/lib/libopen-pal.so.20
#3  0x00007f08bf55b89c in opal_progress () from /usr/local/mpi/openmpi/lib/libopen-pal.so.20
#4  0x00007f08c0294a55 in ompi_request_default_wait_all () from /usr/local/mpi/openmpi/lib/libmpi.so.20
#5  0x00007f08c02da295 in ompi_coll_base_allreduce_intra_recursivedoubling () from /usr/local/mpi/openmpi/lib/libmpi.so.20
#6  0x00007f08c034b399 in mca_io_ompio_set_view_internal () from /usr/local/mpi/openmpi/lib/libmpi.so.20
#7  0x00007f08c034b76c in mca_io_ompio_file_set_view () from /usr/local/mpi/openmpi/lib/libmpi.so.20
#8  0x00007f08c02cc00f in PMPI_File_set_view () from /usr/local/mpi/openmpi/lib/libmpi.so.20
#9  0x000000000040e4fd in IMB_init_transfer ()
#10 0x0000000000405f5c in IMB_init_buffers_iter ()
#11 0x0000000000402d25 in main ()
(gdb)

I did some analysis; please see the details below.

The MPI threads on the Root node and the Peer node are hung trying to
get completions on the RQ and SQ. There are no new completions, as per the
data in the RQ/SQ and the corresponding CQs in the device. Here is the sequence
as per the queue status observed in the device (while the threads are stuck
waiting for new completions):

       Root-node (RQ)                              Peer-node (SQ)

                0 <------------------------------------ 0
                          .................
                          17 SEND-Inline WRs
                          (17 RQ-CQEs seen)
                          .................
               16 <------------------------------------ 16

               17 <------------------------------------ 17
                          1 RDMA-WRITE-Inline+Signaled
                          (1 SQ-CQE generated)

               18 <------------------------------------ 18
                          .................
                          19 RDMA-WRITE-Inlines
                          .................
               36 <------------------------------------ 36
As shown in the diagram above, here is the sequence of events (Work
Requests and Completions) that occurred between the Root node and the
Peer node:

  1. Peer node posts 17 Send WRs with Inline flag set
  2. Root node receives all these 17 pkts in its RQ
  3. 17 CQEs are generated in the Root node in its RCQ
  4. Peer node posts an RDMA-WRITE WR with Inline and Signaled flag bits set
  5. Operation completes on the SQ and a CQE is generated in the SCQ
  6. There's no CQE on the Root node since it is an RDMA-WRITE operation
  7. Peer node posts 19 RDMA-WRITE WRs with Inline flag, but no Signaled flag
  8. No CQEs on the Peer node, since they are not Signaled
  9. No CQEs on the Root node, since they are RDMA-WRITEs

At this point, the Root node is polling on its RCQ for new completions
and there aren't any, since all SENDs are already received and CQEs seen.

Similarly, the Peer node is polling on its SCQ for new completions
and there aren't any, since the 19 RDMA-WRITEs are not signaled.

An exactly analogous condition exists in the reverse direction too.
That is, the Root node issues a bunch of SENDs to the Peer node, followed
by some RDMA-WRITEs. The Peer node gets CQEs for the SENDs and looks
for more CQEs, but there won't be any, since the subsequent operations are
all RDMA-WRITEs. The Root node itself is polling on its SCQ and won't
find any new completions, since there are no more signaled WRs.

So the two nodes are now in a hung state, polling on both the SCQ and the RCQ,
while no operations are pending that could generate new CQEs.
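
To make the stall concrete, here is a minimal illustrative sketch of the polling
loop both sides are effectively stuck in. This is plain libibverbs, not the actual
btl_openib progress code, and the drain_cq helper name is made up:

    /* Illustrative only: each side keeps draining its CQ, but no signaled
     * work requests remain anywhere that could ever produce a new CQE. */
    #include <infiniband/verbs.h>

    static int drain_cq(struct ibv_cq *cq)
    {
        struct ibv_wc wc[32];
        int n, total = 0;

        /* ibv_poll_cq() returns the number of completions found; 0 if none. */
        while ((n = ibv_poll_cq(cq, 32, wc)) > 0)
            total += n;

        return (n < 0) ? n : total;
    }

    /* Root node: drain_cq(recv_cq) keeps returning 0, because the peer's
     * remaining work requests are RDMA-WRITEs, which never generate RQ CQEs.
     * Peer node: drain_cq(send_cq) keeps returning 0, because those
     * RDMA-WRITEs were posted without IBV_SEND_SIGNALED. */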

Thanks,
-Harsha

EDIT: Added some markdown notation for verbatim and bullet lists

@irulz

irulz commented Apr 19, 2017

@edgargabriel Can you please take a look at this? Jeff Squyres referred me to you.

@jsquyres
Member

Hey @edgargabriel, I suggested that @sbasavapatna ping you about this because I'm not familiar with this IMB test, but it appears to be in the IO suite. Is this a corner case in OMPI-IO?

@sbasavapatna Can you run this case with ROMIO and see if it passes? E.g., something like mpirun ... --mca io romio ... IMB-IO ...?
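
For example, something like the following, adapted from the original command line above (note that in the v2.x series the ROMIO component is typically named romio314 rather than romio, so check the io components that ompi_info lists and adjust accordingly):

    /usr/local/mpi/openmpi/bin/mpirun -np 2 -hostfile hostfile \
        -mca btl self,sm,openib -mca btl_openib_receive_queues P,65536,256,192,128 \
        -mca btl_openib_cpc_include rdmacm -mca io romio314 \
        --allow-run-as-root --bind-to none --map-by node \
        /usr/local/imb/openmpi/IMB-IO -msglog 21:22 -include P_Write_Indv -time 300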

@edgargabriel
Member

That should not be the case; I have used the IMB test suite a lot. But I will double check. That being said, I do not have access to a 10GE network, so I will have to use an InfiniBand cluster for that. Just out of curiosity, is setting all of these openib parameters truly necessary? I remember running into issues when the various queues in openib were 'mis-configured', but that was independent of IO.

@sbasavapatna
Author

@edgargabriel, those parameters are used by UNH in their plugfest and so we are using them in our tests.

@ggouaillardet
Contributor

@sbasavapatna at first, I would try without setting the openib parameters and see if it helps
(at least we will know whether we are troubleshooting Open MPI or the UNH plugfest settings).
Also, can you try with mpirun --mca btl ^openib ... just to make sure btl/tcp is working fine?
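
For reference, a full ^openib invocation might look like this (adapted from the original command line above; excluding openib leaves btl/tcp, sm and self available):

    /usr/local/mpi/openmpi/bin/mpirun -np 2 -hostfile hostfile \
        -mca btl ^openib -mca orte_base_help_aggregate 0 \
        --allow-run-as-root --bind-to none --map-by node \
        /usr/local/imb/openmpi/IMB-IO -msglog 21:22 -include P_Write_Indv -time 300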

Then, in ompi/mca/coll/base/coll_base_allreduce.c, you can replace

        /* Exchange the data */
        ret = MCA_PML_CALL(irecv(tmprecv, count, dtype, remote,
                                 MCA_COLL_BASE_TAG_ALLREDUCE, comm, &reqs[0]));
        if (MPI_SUCCESS != ret) { line = __LINE__; goto error_hndl; }
        ret = MCA_PML_CALL(isend(tmpsend, count, dtype, remote,
                                 MCA_COLL_BASE_TAG_ALLREDUCE,
                                 MCA_PML_BASE_SEND_STANDARD, comm, &reqs[1]));
        if (MPI_SUCCESS != ret) { line = __LINE__; goto error_hndl; }
        ret = ompi_request_wait_all(2, reqs, MPI_STATUSES_IGNORE);
        if (MPI_SUCCESS != ret) { line = __LINE__; goto error_hndl; }

with

        /* Exchange the data */
        ret = MCA_PML_CALL(irecv(tmprecv, count, dtype, remote,
                                 MCA_COLL_BASE_TAG_ALLREDUCE, comm, &reqs[0]));
        if (MPI_SUCCESS != ret) { line = __LINE__; goto error_hndl; }
        ret = MCA_PML_CALL(send(tmpsend, count, dtype, remote,
                                 MCA_COLL_BASE_TAG_ALLREDUCE,
                                 MCA_PML_BASE_SEND_STANDARD, comm));
        if (MPI_SUCCESS != ret) { line = __LINE__; goto error_hndl; }
        ret = ompi_request_wait(reqs, MPI_STATUS_IGNORE);
        if (MPI_SUCCESS != ret) { line = __LINE__; goto error_hndl; }

and see if it helps too.

@edgargabriel
Member

edgargabriel commented Apr 19, 2017

Unfortunately, I cannot reproduce the problem on my clusters. I tried two different InfiniBand networks (DDR, QDR), as well as a local workstation, and the test passes and finishes in all cases. (I used master for this test.)

gabriel@crill:~/ParallelIO/imb/src> mpirun --mca btl openib,sm,self --mca btl_openib_receive_queues
 P,65536,256,192,128   -mca btl_openib_cpc_include rdmacm  -np 2 ./IMB-IO P_write_indv 
-msglog 21:22 -time 300
 benchmarks to run P_write_indv 
#------------------------------------------------------------
#    Intel (R) MPI Benchmarks 4.1, MPI-IO part   
#------------------------------------------------------------
# Date                  : Wed Apr 19 09:08:21 2017
# Machine               : x86_64
# System                : Linux
# Release               : 3.11.10-21-desktop
# Version               : #1 SMP PREEMPT Mon Jul 21 15:28:46 UTC 2014 (9a9565d)
# MPI Version           : 3.1
# MPI Thread Environment: 

# New default behavior from Version 3.2 on:

# the number of iterations per message size is cut down 
# dynamically when a certain run time (per message size sample) 
# is expected to be exceeded. Time limit is defined by variable 
# "SECS_PER_SAMPLE" (=> IMB_settings.h) 
# or through the flag => -time 


# Calling sequence was: 

# ./IMB-IO P_write_indv -msglog 21:22 -time 300

# Minimum io portion in bytes:   0
# Maximum io portion in bytes:   4194304
#
#
#

# List of Benchmarks to run:

# P_Write_Indv

#-----------------------------------------------------------------------------
# Benchmarking P_Write_Indv 
# #processes = 1 
# ( 1 additional process waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
#
#    MODE: AGGREGATE 
#
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0          100         0.07         0.07         0.07         0.00
      2097152          100     54547.89     54547.89     54547.89        36.67
      4194304          100    107778.24    107778.24    107778.24        37.11

#-----------------------------------------------------------------------------
# Benchmarking P_Write_Indv 
# #processes = 1 
# ( 1 additional process waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
#
#    MODE: NON-AGGREGATE 
#
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0          100         1.01         1.01         1.01         0.00
      2097152          100    113486.17    113486.17    113486.17        17.62
      4194304          100    177328.34    177328.34    177328.34        22.56

#-----------------------------------------------------------------------------
# Benchmarking P_Write_Indv 
# #processes = 2 
#-----------------------------------------------------------------------------
#
#    MODE: AGGREGATE 
#
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0          100         0.23         0.24         0.24         0.00
      2097152          100     65482.65     65482.87     65482.76        30.54
      4194304          100    103413.03    103413.12    103413.07        38.68

#-----------------------------------------------------------------------------
# Benchmarking P_Write_Indv 
# #processes = 2 
#-----------------------------------------------------------------------------
#
#    MODE: NON-AGGREGATE 
#
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0          100         8.96         8.98         8.97         0.00
      2097152          100    107234.89    107234.91    107234.90        18.65
      4194304          100    179246.29    179246.70    179246.50        22.32


# All processes entering MPI_Finalize

@edgargabriel
Member

One other comment (not sure whether it is important): based on your output, I am fairly sure that the 4MB test in non-aggregate mode has finished (since you have the output line), and that where it is actually hanging is the 0-byte case in aggregate mode for 2 processes, which is basically the next test.

@ggouaillardet
Contributor

@sbasavapatna also, what if you skip the test with only one MPI process ?

@ggouaillardet
Contributor

@edgargabriel another odd thing is that the header (#processes = 2) is not displayed in the initial logs;
it is likely sitting in an unflushed buffer, so I am not sure where this is hanging...
the stack trace shows IMB_init_transfer().
@sbasavapatna which filesystem are you running on?

@sbasavapatna
Author

@ggouaillardet, dropping the btl_openib params didn't help.

@edgargabriel, your comment about the number of processes seems right; i.e., when I tried with 1 process (-np 1), it doesn't hang and goes on to the next test cases: S_Write_Indv in aggregate mode, S_Write_Indv in non-aggregate mode, and so on.
The filesystem in use is xfs.

@sbasavapatna
Author

@edgargabriel, could you please try this in a RoCE setup if you have one?

@edgargabriel
Member

I don't have access to a RoCE setup unfortunately. The test on my local workstation also used xfs.

But just to make sure I understand correctly: you use xfs, but still two nodes, or just one node? And if it's two nodes, how did you set up the xfs file system to make it accessible from both nodes? MPI I/O requires that all nodes have access to exactly the same file system.

@sbasavapatna
Author

It's a 2-node setup. Both nodes are running the same OS (RHEL7.3) with an xfs file system. When you said the file system has to be accessible from both nodes, did you mean it has to be a shared file system, like NFS?

@edgargabriel
Member

Yes, it is not sufficient to have an xfs file system on both nodes under the same name. Process zero has to have access to exactly the same files as process 1, which is not the case if you do not have a shared file system.

@sbasavapatna
Author

OK, my setup is using xfs on both nodes with the same directory structure, but not shared. I'll change it to use NFS and let you know.

@sbasavapatna
Author

@edgargabriel, with an NFS mount the test is running fine. Our QA team has raised this issue against the RoCE driver. I'll check with them whether this is also how they are running the test (i.e., without a shared filesystem), confirm that the problem is resolved in their setup too with NFS, and then close this bug. Thanks for your help. I'll update this bug by COB tomorrow.

@edgargabriel
Member

@sbasavapatna glad we found the problem! I am still trying to understand conceptually where the hang comes from in that scenario, but I unfortunately don't have the time right now to dive into a debugging session for that.

@jsquyres
Member

Sounds like we should close this issue, then.

Is this something we can catch in OMPIO at runtime, perchance (to realize that a filesystem isn't shared)? I'm guessing the answer is "no".

@edgargabriel
Member

There is probably a simple test that could be done, where one process creates a file and all other processes check whether they can see it. The trouble is that it could (a) be fairly expensive, and (b) be a bit delicate, because it might take a while until the file becomes visible to other nodes. But we could think about adding a check along those lines, at least in debug mode.
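
A rough sketch of what such a check might look like (purely illustrative, not OMPIO code; the function name, marker file name, and retry policy are all made up, and the retry loop is exactly the delicate part mentioned above):

    /* Illustrative sketch of a shared-filesystem visibility check (not OMPIO code).
     * Rank 0 creates a marker file; the other ranks poll briefly to see whether
     * it becomes visible to them. */
    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>

    static int fs_looks_shared(MPI_Comm comm, const char *dir)
    {
        char path[4096];
        int rank, visible = 0;

        MPI_Comm_rank(comm, &rank);
        snprintf(path, sizeof(path), "%s/.fs_probe_check", dir);

        if (0 == rank) {
            FILE *f = fopen(path, "w");        /* create the marker file */
            if (NULL != f) fclose(f);
        }
        MPI_Barrier(comm);                     /* marker now exists on rank 0's filesystem */

        if (0 != rank) {
            /* Retry a few times: visibility may lag on NFS-like filesystems. */
            for (int i = 0; i < 10 && !visible; i++) {
                visible = (0 == access(path, F_OK));
                if (!visible) usleep(100000);  /* wait 100 ms before retrying */
            }
        } else {
            visible = 1;
        }

        /* The filesystem only counts as shared if every rank saw the marker. */
        MPI_Allreduce(MPI_IN_PLACE, &visible, 1, MPI_INT, MPI_LAND, comm);

        if (0 == rank) unlink(path);           /* clean up the marker file */
        return visible;
    }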
