OSHMEM yoda spml failures: need to update to BTL v3.0 #2028
I also see these kinds of errors:
Am I running these tests wrong? In most cases, the tests are run with 32 procs across 2 nodes (16 cores each); each node has 128GB RAM.
@artpol84 @jladd-mlnx @igor-ivanov Any advice here?
I'll do some testing; I'd like to see if yoda is busted over other BTLs, like gni.
Since OpenSHMEM 1.3 compliance is going to be one of the major features of the 2.1 release, we want oshmem to work on as many system configs as possible. Marking this as a blocker.
@alex-mikheev, could you please comment?
@hppritcha @jsquyres We don't maintain Yoda at all, and it's very likely that many of the tests fail because (some of) the BTLs can't make asynchronous progress or do true one-sided RDMA. MXM and UCX have all of these features. I would suggest we replace the Yoda SPML entirely and move to UCX. It supports multiple transports and eliminates the BTL mess altogether. UCX will be in a near-GA state come late October. We also proposed rolling UCX into OMPI some time ago; perhaps this provides further motivation to do so.
One of the requirements for OSHMEM to come into the Open MPI code base was that it needs to be able to handle all network types. AFAIK, UCX does not handle all network types (e.g., Portals, usNIC). As such, Yoda needs to be fixed before v2.1.0 can be released.
I don't know if it's possible to fix it for the TCP BTL; we have no knowledge here. This is not a regression.
How does this look on 1.10.3?
I did some spot checking on a Cray XE and get "registration errors" for what looks to be the BSS export portion if I try to use 8 or more PEs:
I don't think this is a BTL-specific issue. EDIT: added verbatim output.
@hppritcha, can you try on 1.10.3? We don't have access to a Cray.
I've not tried to use the 1.10.x series on Cray in forever and am not sure how to configure for it. But I did a little more digging. Actually, the registration error I'm seeing is specific to GNI and memory registration limitations on the XE system. If I use the tcp BTL, I'm not seeing the registration error. I did some more testing using fewer PEs so there is sufficient GART space to register the tests' BSS. I did more checking in the openshmem-release-1.0d/feature_tests/C directory. There definitely is a bug with shmem_collect(32/64). I also saw a similar segfault for collect32_performance.x using both the tcp and ugni BTLs. I'll try on the UH system later this week with the 1.10 release.
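(As a back-of-the-envelope illustration: the 270532608-byte per-PE memheap figure quoted later in this thread is exactly 258 MiB, so the amount of memory that must be registered on a node grows quickly with the PE count. The short C loop below just tabulates that; whether the total exceeds the GART aperture depends on the node configuration.)

```c
#include <stdio.h>

int main(void)
{
    /* Per-PE memheap segment size reported by mca_memheap_base_alloc_init(). */
    const double seg_bytes = 270532608.0;   /* exactly 258 MiB */

    for (int pes = 2; pes <= 16; pes *= 2) {
        printf("%2d PEs/node -> %7.1f MiB of memheap to register\n",
               pes, pes * seg_bytes / (1024.0 * 1024.0));
    }
    return 0;
}
```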
@jsquyres I'm noticing that your setup is hitting an error different from mine:
@hppritcha The "no userspace device-specific driver" warnings can be ignored. It means libibverbs didn't find a driver for my device (which is actually expected).
After discussions with MLNX, there is no guarantee that BTLs that don't support true one-sided operations will be able to run OpenSHMEM tests successfully. There will probably be a subset of tests that may work with, for example, the tcp BTL, but others likely will not. I think we should document in the README for 2.1 which BTLs we think can support the yoda spml.
Actually, it complains about not being able to register the memheap:
[nid00060:07656] Error spml_yoda.c:439 - mca_spml_yoda_register() ugni: failed to register source memory: addr: 0xff000000, size: 270532608
(In reply to Howard Pritchard's earlier comment: "I did some spot checking on a Cray XE and get 'registration errors' for what looks to be the BSS export portion if I try to use 8 or more PEs: [nid00060:07656] base/memheap_base_alloc.c:38 - mca_memheap_base_alloc_init() Memheap alloc memory: 270532608 byte(s), 1 segments by method: 2 ... I don't think this is a BTL specific issue.")
FWIW: on v1.10, most oshmem tests pass on my TCP-only cluster. The ones that fail are of the following form:
or
So it looks to me like v2.x has some bug fixes that didn't go back into v1.10, but has some new problems as well.
FWIW: on 1.10.3 AND 2.0.1 nightly, all of the Houston OSHMEM feature tests complete successfully with Yoda using the TCP, SM, openib, and vader BTLs on up to 16 processes. At this time, I can't reproduce your results @jsquyres.
Here you go: nearly 2000 failures on 2.0.1 with MTT:
Can you try with 32 processes across at least 2 machines? That's what I'm running.
So, I see part of the issue. It seems someone isn't fragmenting correctly. I'm not sure if it's Yoda or the TCP BTL, but given that Yoda has been virtually untouched for three years and there have been significant changes to the BTL structure between 1.10 and 2.0, I'm inclined to point my sniffer at the BTL. It's dying if the message can't fit into one BTL frag. If I set:
Then I make it until the test gets to the 500K message size and hit the OOM error again. This flow works in 1.10.3. I'll keep digging.
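(As an aside: a minimal sketch of the kind of chunking an SPML needs to do when a message exceeds one BTL fragment. The bounded-put helper spml_put_chunk() and the put_limit argument are hypothetical stand-ins, not the actual yoda or BTL API.)

```c
#include <stddef.h>

/* Hypothetical helper: issue a single bounded put of at most 'len' bytes. */
void spml_put_chunk(void *dst, const void *src, size_t len, int pe);

/* Split a large put into pieces no bigger than the BTL's per-operation
 * limit instead of handing the whole message to the BTL at once. */
static void put_fragmented(void *dst, const void *src, size_t size,
                           int pe, size_t put_limit)
{
    char *d = (char *) dst;
    const char *s = (const char *) src;

    while (size > 0) {
        size_t chunk = (size < put_limit) ? size : put_limit;
        spml_put_chunk(d, s, chunk, pe);
        d += chunk;
        s += chunk;
        size -= chunk;
    }
}
```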
Yep. Frag size is garbage. It's
I guess when the BTLs moved into OPAL, this field went by the wayside? BTL gurus, what's the correct way to get this info now? |
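(For readers landing here: as noted further down in this thread, BTL 3.0 carries the per-put maximum on the module itself as btl_put_limit. A hedged sketch of clamping against it, assuming the field name matches what is in opal/mca/btl/btl.h:)

```c
#include <stddef.h>
#include "opal/mca/btl/btl.h"   /* mca_btl_base_module_t */

/* BTL 3.0 exposes the per-put maximum on the module (btl_put_limit),
 * so an SPML can clamp its fragment size to it. */
static size_t yoda_put_frag_size(mca_btl_base_module_t *btl, size_t requested)
{
    size_t limit = btl->btl_put_limit;   /* a size_t; see the discussion below */
    return (requested < limit) ? requested : limit;
}
```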
Something is weird here; if I use the
I honestly did the minimum necessary to translate yoda from BTL 2.0 -> BTL 3.0. Looks like more work is needed to finish the job.
Not on my priority list at all. Do not assign to me.
I can give pointers on how BTL 3.0 works if needed, but I really will have no time beyond that.
@hjelmn You touched it. You need to test it, Nathan. Offending commit:
btl_put_limit is a size_t and, in the snippet of code above, frag_size is a uint32_t. There is a clear mismatch that can lead to unexpected fragmentation. @jladd-mlnx, can you print btl_put_limit instead of *frag_size?
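(A standalone sketch of the truncation being described, not the yoda code itself; the 4 GiB value is just an example:)

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* A put limit that doesn't fit in 32 bits gets silently truncated
     * when copied into a 32-bit fragment size. */
    size_t   btl_put_limit = (size_t) 4 * 1024 * 1024 * 1024;  /* 4 GiB (LP64) */
    uint32_t frag_size     = (uint32_t) btl_put_limit;         /* truncates to 0 */

    /* Print the size_t itself (%zu), not the truncated copy. */
    printf("btl_put_limit = %zu, frag_size = %u\n", btl_put_limit, frag_size);
    return 0;
}
```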
Please update the code to be BTL 3.0 compliant. I am generally available to answer questions on the BTL interface M-F 9-4 MDT, except federal holidays.
FWIW, I added OSHMEM testing to the v2.x branch -- just in case the mempool updates on master are causing issues: https://mtt.open-mpi.org/index.php?do_redir=2354 Short version: I'm seeing similar issues on the v2.x branch:
Per lots of discussion on the 2016-09-20 and 2016-09-13 weekly teleconfs, assigning this issue to Mellanox. |
Fixed the shmem OOM error which is referenced on open-mpi#2028 Signed-off-by: Boris Karasev <karasev.b@gmail.com>
Fixed the shmem OOM error which is referenced on open-mpi#2028 Signed-off-by: Boris Karasev <karasev.b@gmail.com> (cherry picked from commit 68b5acd)
Even with @karasevb's 68b5acd, I'm getting segv's when running with tcp,vader,self:
I get a lot of failures in the OpenSHMEM test suite like this that all seem to have the same signature: the |
Thank you, we will check! |
@jsquyres, is this on master or 2.x? Even without the patch, I have no issues running the
If, however, I try with master (after rebuilding the benchmark), I get:
And the performance is actually significantly improved over the numbers I collected with 1.10.2. |
@jladd-mlnx This is on master. Here's how I configured Open MPI:
Copying a bunch of your params, here's how I ran that individual test (although many more fail in the same way):
vic20 is a 10G ethernet interface. Looking at the corefile that was emitted from the above run, it shows the same symptom: |
I confirm that on v2.0.x and v2.x, these initial tests seem to work fine with vader,tcp,self. Now that those fixes are merged into these branches, let's see how it does tonight on MTT. |
FWIW, I see a bunch of ptmalloc messages like this in the oshmem tests (in the v2.0.x branch):
We'll see more after MTT runs tonight. |
I think the fixes for this particular issue are now done; I'm still seeing some OSHMEM failures in MTT testing, but let's open up a new issue to track those (i.e., they seem to be unrelated to the BTL 3.0 updates).
Cisco just added OSHMEM testing to its MTT 2 weeks ago (at the Dallas engineering meeting).
We're seeing a large failure rate on v2.x with OSHMEM testing using TCP,vader,self. For example: https://mtt.open-mpi.org/index.php?do_redir=2347
This shows 1,624 failures and 6,546 passes. I.e., a nearly 20% failure rate. 😱
Many of the failures show this kind of error message:
Does anyone know what this means?
@artpol84 @jladd-mlnx