Vader BTL crashes #473
Comments
Trying to wrap my head around this stack trace. It should not be possible. The fast box check reads from a pointer in a shared memory segment owned by the sender. The sender's fast box data pointer is initialized by the sender before it is sent to the receiver. This rules out the possibility that the sending process ran out of room in its shared memory segment. The only thing I can think of is memory corruption, and by luck it is working with sm.
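A minimal, hedged sketch of the fast-box handshake described above may help make the reasoning concrete. The names (shm_fbox_t, fbox_send, fbox_poll) and the layout are hypothetical simplifications, not the actual vader source; the properties that matter for the argument are that the fast box lives in a shared-memory segment owned by the sender, and that the sender writes the payload before publishing the header.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified fast box: a header word followed by the in-line
 * payload, all of it living in the sender's shared-memory segment. */
typedef struct shm_fbox_t {
    volatile uint64_t hdr;     /* sequence/tag/size packed by the sender */
    unsigned char     data[];  /* payload, written before hdr is updated */
} shm_fbox_t;

/* Sender: copy the payload into its own segment, then publish the header.
 * (A real implementation also needs a write barrier between the two.) */
static void fbox_send(shm_fbox_t *box, uint64_t hdr, const void *buf, size_t len)
{
    memcpy(box->data, buf, len);  /* payload first ...                    */
    box->hdr = hdr;               /* ... then the header makes it visible */
}

/* Receiver: poll the header that lives in the *sender's* segment.  On a
 * 64-bit target this is a single load; on i686 it may be split in two. */
static int fbox_poll(const shm_fbox_t *box, uint64_t expected_hdr)
{
    return box->hdr == expected_hdr;
}
```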
Even more suspicious. The line number of the crash is not the first read from either the fast box or the endpoint. This is not likely Vader's fault.
@opoplawski Can you post the program here that caused the problem?
Note that the stack trace is from one of the hang conditions where each process spins in opal_condition_wait(). Other times it will trigger an MPI error. I'm afraid the test program is fairly involved. It's the nc_test4/tst_nc4perf.c (https://github.com/Unidata/netcdf-c/blob/master/nc_test4/tst_nc4perf.c) test program in netcdf 4.3.3.1 that makes use of HDF5 IO - in this case it's hdf5 1.8.14. This is running on Fedora Rawhide which is using gcc 5.0.0. So far I've only seen it on i686.
I should also note that I'm not entirely sure if this is a regression or not (or when it started). I've seen odd behavior for quite a while with netcdf's MPI tests.
Ah, ok. If the trace is not a crash that makes more sense. I will take a look and see if I can figure out why that test is getting stuck.
Can't reproduce on master (same vader revision) on SLES11 with btl_vader_single_copy_mechanism set to either xpmem or none. tst_nc4perf runs to completion with 2, 4, and 10 ranks running on a single node. I used netcdf master with hdf5 1.8.14, gcc 4.8.2. This could be a romio bug. The version in 1.8.4 lags behind trunk. Can you try running with -mca io ompio?
Another alternative would be to run with master.
That's worse: $ mpirun -np 4 -mca io ompio ./openmpi/nc_test4/tst_nc4perf
(gdb) list *0x91a4
Is this test publicly available? I can have a look at it to see what is going on. Thanks.
If this was the 1.8 series of Open MPI, the ompio module there does not have all the fixes necessary to pass the hdf5 test suite. That work was done last summer, and it includes too many changes to ompio for it to be feasible to backport to the 1.8 series.
@edgargabriel See earlier comments for what test and openmpi version this is. Some more details: The HDF5 call seems to be from ./hdf5-1.8.14/src/H5FDmpio.c:1091:
Okay, so probably not worth worrying about the 1.8 ompio failure. Looks like the original issue may be fixed in master, but we have no idea what the fix may have been?
Can you verify that it indeed works for you with master? Just because I can't reproduce doesn't mean it is fixed :). Once we know whether it is fixed we can start the discussion about whether we should back-port the romio fixes on master to 1.8. The fix will likely be among those changes.
I'm trying to build the Fedora package with ompi master, but running into issue #475
Any luck with master?
Still compiling deps. Ran into issue #478 as well - disabled tests for now.
I can still reproduce the hang with the current dev snapshot. This may be triggered by gcc 5.0.0 as well.
That is my guess. I am trying to install a gcc 5 snapshot build to test this theory. Keep in mind that gcc 5.0 is still technically in beta so there is a good chance this is a gcc bug.
I've found another package that is entering a hang loop on Fedora rawhide i686: elpa. Haven't looked at it in detail yet, but this does seem to be a problem with more than just netcdf.
Yup, looks like the same cycle in opal_progress().
I've updated to 1.8.4-134-g9ad2aa8 and applied the atomic patch. It does not appear to affect this problem. I'm also seeing a (probably) similar failure on armv7hl.
Found the problem. Vader assumes 64-bit load/store in the fast box code. With gcc 4.8 this doesn't seem to cause any issues, but with gcc 5.0 there is a data race between the process setting the fast box header and the process reading it. This causes the receiver to read an incomplete message, leading to a hang or a crash. I am putting together a fix for the data race for master and will PR it to 1.8.
On 32-bit architectures loads/stores of fast box headers may take multiple instructions. This can lead to a data race between the sender/receiver when reading/writing the sequence number. This can lead to a situation where the receiver could process incomplete data. To fix the issue this commit re-orders the fast box header to put the sequence number and the tag in the same 32-bits to ensure they are always loaded/stored together. Fixes open-mpi#473 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
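As an illustration of the layout change the commit message describes, here is a hedged sketch; the field names and widths are illustrative, not the literal Open MPI definitions, and real code would also need the appropriate memory barriers. The point is that once the sequence number and tag share one 32-bit word, the sender can publish them with a single 32-bit store, which cannot tear on i686 or armv7hl.

```c
#include <stdint.h>

/* Hypothetical fast box header: seq and tag deliberately share the first
 * 32-bit word so they are always loaded/stored together. */
typedef union fbox_hdr_t {
    struct {
        uint16_t seq;      /* sequence number       */
        uint16_t tag;      /* message tag           */
        uint32_t size;     /* payload size in bytes */
    } fields;
    struct {
        uint32_t seq_tag;  /* seq + tag as one word */
        uint32_t size;
    } words;
} fbox_hdr_t;

/* Sender: fill in size first, then publish seq and tag with one 32-bit
 * store, so the receiver can never observe a half-written pair. */
static void fbox_publish(volatile fbox_hdr_t *hdr,
                         uint16_t seq, uint16_t tag, uint32_t size)
{
    fbox_hdr_t tmp;
    tmp.fields.seq = seq;
    tmp.fields.tag = tag;

    hdr->fields.size   = size;               /* size can land first         */
    hdr->words.seq_tag = tmp.words.seq_tag;  /* single store publishes both */
}

/* Receiver: one 32-bit load yields a (seq, tag) pair that is consistent. */
static int fbox_check(const volatile fbox_hdr_t *hdr, uint16_t expected_seq,
                      uint16_t *tag_out)
{
    fbox_hdr_t tmp;
    tmp.words.seq_tag = hdr->words.seq_tag;  /* single 32-bit load          */
    if (tmp.fields.seq != expected_seq) {
        return 0;                            /* no new message yet          */
    }
    *tag_out = tmp.fields.tag;               /* matches the seq just read   */
    return 1;
}
```

On a 64-bit target the whole header can still be moved in one access; the reordering only matters where the compiler may split a 64-bit header access into two instructions, which is the gcc 5 behavior on 32-bit architectures observed above.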
@opoplawski #503 should fix the issue. I was able to reproduce crashes/hangs with Ubuntu 14.04 i386 with gcc 5.0-032215.
It does indeed appear to fix it for me, thanks! Now I just need to track down the armv7hl issue...
Closing, since @opoplawski confirms that it's fixed.
On 32-bit architectures loads/stores of fast box headers may take multiple instructions. This can lead to a data race between the sender/receiver when reading/writing the sequence number. This can lead to a situation where the receiver could process incomplete data. To fix the issue this commit re-orders the fast box header to put the sequence number and the tag in the same 32-bits to ensure they are always loaded/stored together. master commit open-mpi/ompi@17b80a9 Fixes open-mpi#473 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Per the thread starting here: http://www.open-mpi.org/community/lists/devel/2015/03/17131.php
@opoplawski is seeing crashes in the Open MPI test suite in openmpi-1.8.4-99-20150228 (Feb 28 nightly tarball) with the vader BTL. If he disables the vader BTL, the crashes go away:
@hjelmn Can you have a look?