
OSHMEM yoda spml failures: need to update to BTL v3.0 #2028

Closed
jsquyres opened this issue Aug 29, 2016 · 41 comments

@jsquyres
Member

Cisco just added OSHMEM testing to its MTT 2 weeks ago (at the Dallas engineering meeting).

We're seeing a large failure rate on v2.x with OSHMEM testing using TCP,vader,self. For example: https://mtt.open-mpi.org/index.php?do_redir=2347

This shows 1,624 failures and 6,546 passes. I.e., a nearly 20% failure rate. 😱

Many of the failures show this kind of error message:

[mpi006:31821] Error base/memheap_base_mkey.c:162 - memheap_attach_segment() tr_id: 1 key 54ba0015 attach failed: errno = 12

Does anyone know what this means?

@artpol84 @jladd-mlnx

@jsquyres jsquyres added the bug label Aug 29, 2016
@jsquyres jsquyres added this to the v2.0.2 milestone Aug 29, 2016
@jsquyres
Member Author

I also see these kinds of errors:

[mpi006:18246] Error spml_yoda.c:822 - mca_spml_yoda_put_internal() src=0xff0000d8 nfrags = 1 frag_size=-1
[mpi006:18246] Error spml_yoda.c:823 - mca_spml_yoda_put_internal() shmem OOM error need 262144 bytes
--------------------------------------------------------------------------
'Put' operation failed. Unable to allocate buffer, need 262144 bytes.
Try increasing 'spml_yoda_bml_alloc_threshold' value or setting it to '0' to
force waiting for all puts completion.

  spml_yoda_bml_alloc_threshold: 3
---------------------------------------------------------------------------

@jsquyres
Member Author

Am I running these tests wrong? In most cases, the tests are run with 32 procs across 2 nodes (16 cores each); each node has 128GB RAM.

@jsquyres
Member Author

@artpol84 @jladd-mlnx @igor-ivanov Any advice here?

@hppritcha
Member

I'll do some testing; I'd like to see if yoda is busted over other BTLs, like gni.

@hppritcha
Member

Since OpenSHMEM 1.3 compliance is going to be one of the major features of the 2.1 release, we want oshmem working on as many system configs as possible. Marking this as a blocker.

@mike-dubman
Member

@alex-mikheev - could you please comment?

@jladd-mlnx
Member

jladd-mlnx commented Aug 31, 2016

@hppritcha @jsquyres We don't maintain Yoda at all, and it's very likely that many of the tests fail because (some of) the BTLs can't make asynchronous progress or do true one-sided RDMA. MXM and UCX have all of these features. I would suggest we replace the Yoda SPML entirely and move to UCX. It supports multiple transports and eliminates the BTL mess altogether. UCX will be in a near-GA state come late October. We also proposed rolling UCX into OMPI some time ago; perhaps this provides further motivation to do so.

@jsquyres
Member Author

One of the requirements for OSHMEM to come into the Open MPI code base was that it needs to be able to handle all network types. AFAIK, UCX does not handle all network types (e.g., portals, usNIC). As such, Yoda needs to be fixed before v2.1.0 can be released.

@jladd-mlnx
Member

I don't know if it's possible to fix it for TCP BTL. We have no knowledge here. This is not a regression.

@jladd-mlnx
Member

How does this look on 1.10.3?

@hppritcha
Member

hppritcha commented Aug 31, 2016

I did some spot checking on a Cray XE and get "registration errors" for what looks to be the BSS export portion if I try to use 8 or more PEs:

[nid00060:07656] base/memheap_base_alloc.c:38 - mca_memheap_base_alloc_init() Memheap alloc memory: 270532608 byte(s), 1 segments by method: 2
[nid00060:07656] base/memheap_base_static.c:209 - _load_segments() add: 00601000-00602000 rw-p 00001000 00:0e 25175784                           /var/opt/cray/alps/spool/18132313/barrier_performance.x
[nid00060:07656] base/memheap_base_static.c:75 - mca_memheap_base_static_init() Memheap static memory: 160 byte(s), 2 segments
[nid00060:07656] base/memheap_base_register.c:39 - mca_memheap_base_reg() register seg#00: 0x0xff000000 - 0x0x10f200000 270532608 bytes type=0x2 id=0x39E38006
[nid00060:07656] Error spml_yoda.c:439 - mca_spml_yoda_register() ugni: failed to register source memory: addr: 0xff000000, size: 270532608
[nid00060:07656] Error base/memheap_base_register.c:131 - _reg_segment() Failed to register segment

I don't think this is a BTL specific issue.

EDIT: Added verbatim

@jladd-mlnx
Member

@hppritcha, can you try on 1.10.3? We don't have access to a Cray.

@hppritcha
Member

I haven't tried to use the 1.10.x series on Cray in forever, and I'm not sure how to configure it for that platform.

But I did a little more digging. Actually, the registration error I'm seeing is specific to GNI and memory registration limitations on the XE system. If I use the tcp BTL, I'm not seeing the memheap_attach_segment issues.

I did some more testing using four PEs so that there is sufficient GART space to register the tests' BSS.

I did more checking in the openshmem-release-1.0d/feature_tests/C directory.
Some of the tests passed, but I got failures (segfaults) with test_shmem_lock.x, test_shmem_get_shmalloc.x, test_shmem_get_globals.x, and test_shmem_collects.x.

There definitely is a bug with shmem_collect(32/64). I also saw a similar segfault for collect32_performance.x using both the tcp and ugni BTLs.

I'll try on the UH system later this week with the 1.10 release.

@hppritcha
Member

hppritcha commented Aug 31, 2016

@jsquyres I'm noticing that your setup is hitting an error different from mine:

libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
[mpi017:19558] Error base/memheap_base_mkey.c:162 - memheap_attach_segment() tr_id: 1 key 76d2800d attach failed: errno = 12
[mpi017:19560] Error base/memheap_base_mkey.c:162 - memheap_attach_segment() tr_id: 1 key 76d2800d attach failed: errno = 12
[mpi017:19580] Error base/memheap_base_mkey.c:162 - memheap_attach_segment() tr_id: 1 key 76d2000f attach failed: errno = 12
[mpi017:19563] Error base/memheap_base_mkey.c:162 - memheap_attach_segment() tr_id: 1 key 76d2800d attach failed: errno = 12

@jsquyres
Member Author

@hppritcha The "no userspace device-specific driver" warnings can be ignored. It means libibverbs didn't find a driver for my device (which is actually expected).

@hppritcha
Member

After discussions with MLNX, there is no guarantee that BTLs that don't support true one-sided operations will be able to run OpenSHMEM tests successfully. There will probably be a subset of tests that work with, for example, the tcp BTL, but others likely will not. I think we should document in the README for 2.1 which BTLs we think can support the yoda spml.

@alex-mikheev
Contributor

Actually it complains about not being able to register the memheap:

[nid00060:07656] Error spml_yoda.c:439 - mca_spml_yoda_register() ugni: failed to register source memory: addr: 0xff000000, size: 270532608
[nid00060:07656] Error base/memheap_base_register.c:131 - _reg_segment() Failed to register segment
Maybe the memheap fixed base address 0xff000000 is not good on Cray?

@rhc54
Contributor

rhc54 commented Sep 1, 2016

FWIW: on v1.10, most oshmem tests pass on my TCP-only cluster. The ones that fail are of the following form:

[rhc001:172439] Error spml_yoda.c:1062 - mca_spml_yoda_get() pe=15: 0x7ffce6514588 is not address of shared variable
--------------------------------------------------------------------------
SHMEM_ABORT was invoked on rank 0 (pid 172439, host=rhc001) with errorcode -6.
--------------------------------------------------------------------------

or

mpirun noticed that process rank 11 with PID 176049 on node rhc001 exited on signal 8 (Floating point exception).

So it looks to me like v2.x has some bug fixes that didn't go back into v1.10, but has some new problems as well.

@jladd-mlnx
Member

FWIW: on 1.10.3 AND 2.0.1 nightly - all of the Houston OSHMEM feature tests complete successfully with Yoda using TCP, SM, openIB, and vader BTLs on up to 16 processes. At this time, I can't reproduce your results @jsquyres.

@rhc54
Contributor

rhc54 commented Sep 2, 2016

Here you go - nearly 2000 failures on 2.0.1 with MTT:

https://mtt.open-mpi.org/index.php?do_redir=2350

@jsquyres
Member Author

jsquyres commented Sep 2, 2016

Can you try with 32 processes across at least 2 machines? That's what I'm running.

@jladd-mlnx
Member

jladd-mlnx commented Sep 2, 2016

So, I see part of the issue. It seems someone isn't fragmenting correctly. I'm not sure if it's Yoda or the TCP BTL, but given that Yoda has been virtually untouched for three years and there have been significant changes to the BTL structure between 1.10 and 2.0, I'm inclined to point my sniffer at the BTL.

It's dying if the message can't fit into one BTL frag. If I set:

-mca btl_tcp_max_send_size 262144

then I make it until the test gets to the 500K message size and hit the OOM error again. This flow works in 1.10.3. I'll keep digging.

@jladd-mlnx
Member

Yep. Frag size is garbage. It's -1.

static inline void calc_nfrags_put (mca_bml_base_btl_t* bml_btl,
                                    size_t size,
                                    unsigned int *frag_size,
                                    int *nfrags,
                                    int use_send)
{
    ...
    else {
        /* put path: the fragment size comes straight from the BTL's put limit */
        *frag_size = bml_btl->btl->btl_put_limit;
        /* debug print: this is where the garbage (-1) value shows up */
        fprintf(stderr,"YODA Frag size %d\n",*frag_size);
    }
    /* number of fragments needed to cover 'size' bytes */
    *nfrags = 1 + (size - 1) / (*frag_size);
}
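As a quick arithmetic check of what that debug output implies — assuming (as suggested later in this thread) that *frag_size ends up holding 0xFFFFFFFF, i.e., a truncated or over-large put limit — %d prints that value as -1 and the nfrags formula collapses to 1, which matches the "nfrags = 1 frag_size=-1" line in the original error. A standalone sketch (the 262144-byte size is taken from the OOM message above; the 0xFFFFFFFF value is an assumption):

/* Hedged arithmetic check: assumes the bad *frag_size is 0xFFFFFFFF
 * (a truncated or over-large put limit), which %d prints as -1. */
#include <limits.h>
#include <stdio.h>

int main(void)
{
    unsigned int frag_size = UINT_MAX; /* hypothesized garbage value         */
    size_t size = 262144;              /* "need 262144 bytes" from the error */
    int nfrags = (int) (1 + (size - 1) / frag_size);

    printf("frag_size via %%d: %d\n", (int) frag_size); /* prints -1 */
    printf("nfrags: %d\n", nfrags);                     /* prints 1  */
    return 0;
}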

@jladd-mlnx
Member

I guess when the BTLs moved into OPAL, this field went by the wayside? BTL gurus, what's the correct way to get this info now?

@jladd-mlnx
Member

jladd-mlnx commented Sep 2, 2016

Something is weird here; if I use the send path instead of put, then all is good. Hmmm. Git blame says the naughty line was added:

16ae7d97 (Nathan Hjelm 2015-01-08 13:04:58 -0700 122) *frag_size = bml_btl->btl->btl_put_limit;

@hjelmn
Member

hjelmn commented Sep 2, 2016

I honestly did the minimum necessary to translate yoda from btl 2.0 -> btl 3.0. Looks like more work is needed to finish the job.

@hjelmn
Member

hjelmn commented Sep 2, 2016

Not on my priority list at all. Do not assign to me.

@hjelmn
Member

hjelmn commented Sep 2, 2016

I can give pointers on how BTL 3.0 works if needed, but I really will have no time beyond that.

@jladd-mlnx
Member

@hjelmn You touched it. You need to test it, Nathan.

Offending commit

commit 16ae7d97d10769ef930cc1c4cee0911b5ff2363c
Author: Nathan Hjelm <hjelmn@lanl.gov>
Date:   Thu Jan 8 13:04:58 2015 -0700

    spml/yoda: update for BTL 3.0 interface

    This commit make spml/yoda compatible with BTL 3.0. This is meant as a
    starting point only. More work will be needed to make optimial use of
    the new interface.

    Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>

@bosilca
Member

bosilca commented Sep 3, 2016

btl_put_limit is a size_t, and in the snippet of code above frag_size is a uint32_t. There is a clear mismatch that can lead to unexpected fragmentation. @jladd-mlnx, can you print btl_put_limit instead of *frag_size?
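For illustration, a minimal standalone sketch of the kind of fix this width mismatch suggests: carry the put limit in a size_t so the value of btl_put_limit is never truncated into a 32-bit frag_size, and guard the division against a zero limit. The two structs are stand-ins for the real OMPI BTL/BML module types (only btl_put_limit is reproduced), and the zero-limit fallback is an assumption, not the actual fix:

#include <stddef.h>
#include <stdio.h>

/* Minimal stand-ins for the real OMPI BTL/BML structures. */
typedef struct { size_t btl_put_limit; } btl_module_sketch_t;
typedef struct { btl_module_sketch_t *btl; } bml_btl_sketch_t;

static inline void calc_nfrags_put_sketch(bml_btl_sketch_t *bml_btl,
                                          size_t size,
                                          size_t *frag_size, /* size_t, not unsigned int */
                                          int *nfrags)
{
    size_t limit = bml_btl->btl->btl_put_limit;

    if (0 == limit) {
        /* assumption: a BTL that reports no put limit gets a single fragment */
        *frag_size = size;
        *nfrags = 1;
        return;
    }

    *frag_size = limit;                        /* no truncation to 32 bits         */
    *nfrags = (int) (1 + (size - 1) / limit);  /* same formula as the yoda snippet */
}

int main(void)
{
    btl_module_sketch_t btl = { .btl_put_limit = 262144 };
    bml_btl_sketch_t bml_btl = { .btl = &btl };
    size_t frag_size = 0;
    int nfrags = 0;

    calc_nfrags_put_sketch(&bml_btl, 500 * 1024, &frag_size, &nfrags);
    printf("frag_size=%zu nfrags=%d\n", frag_size, nfrags); /* 262144, 2 */
    return 0;
}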

@hjelmn
Member

hjelmn commented Sep 3, 2016

Please update the code to be btl 3.0 compliant. I am generally available to answer questions on the btl interface M-F 9-4 MDT except federal holidays.

@jsquyres
Member Author

jsquyres commented Sep 6, 2016

FWIW, I added OSHMEM testing to the v2.x branch -- just in case the mempool updates on master are causing issues: https://mtt.open-mpi.org/index.php?do_redir=2354

Short version: I'm seeing similar issues on the v2.x branch:

[mpi003:18195] Error base/memheap_base_mkey.c:162 - memheap_attach_segment() tr_id: 1 key af70000 attach failed: errno = 12

@jsquyres
Member Author

Per lots of discussion on the 2016-09-20 and 2016-09-13 weekly teleconfs, assigning this issue to Mellanox.

@jsquyres jsquyres modified the milestones: v2.1.0, v2.0.2 Sep 20, 2016
@jsquyres jsquyres changed the title Lots of OSHMEM attach errors OSHMEM yoda spml failures: need to update to BTL v3.0 Sep 20, 2016
@jladd-mlnx jladd-mlnx assigned karasevb and unassigned jladd-mlnx Oct 1, 2016
karasevb added a commit to karasevb/ompi that referenced this issue Nov 2, 2016
Fixed the shmem OOM error which is referenced on open-mpi#2028

Signed-off-by: Boris Karasev <karasev.b@gmail.com>
karasevb added a commit to karasevb/ompi that referenced this issue Nov 3, 2016
Fixed the shmem OOM error which is referenced on open-mpi#2028

Signed-off-by: Boris Karasev <karasev.b@gmail.com>
karasevb added a commit to karasevb/ompi that referenced this issue Nov 3, 2016
Fixed the shmem OOM error which is referenced on open-mpi#2028

Signed-off-by: Boris Karasev <karasev.b@gmail.com>

(cherry picked from commit 68b5acd)
karasevb added a commit to karasevb/ompi that referenced this issue Nov 3, 2016
Fixed the shmem OOM error which is referenced on open-mpi#2028

Signed-off-by: Boris Karasev <karasev.b@gmail.com>
(cherry picked from commit 68b5acd)
karasevb added a commit to karasevb/ompi that referenced this issue Nov 3, 2016
Fixed the shmem OOM error which is referenced on open-mpi#2028

Signed-off-by: Boris Karasev <karasev.b@gmail.com>
(cherry picked from commit 68b5acd)
@jsquyres
Member Author

jsquyres commented Nov 3, 2016

Even with @karasevb's 68b5acd, I'm getting segv's when running with tcp,vader,self:

$ mpirun --mca btl tcp,vader,self -np 32 performance_tests/micro_benchmarks/collect32_performance.x
[ivy05:11891] *** Process received signal ***
[ivy05:11891] Signal: Segmentation fault (11)
[ivy05:11891] Signal code: Address not mapped (1)
[ivy05:11891] Failing at address: 0x18
[ivy05:11891] [ 0] /lib64/libpthread.so.0[0x37b220f710]
[ivy05:11891] [ 1] /home/jsquyres/bogus/lib/openmpi/mca_spml_yoda.so(mca_spml_yoda_get+0x65e)[0x2aaac1c76b61]
[ivy05:11891] [ 2] /home/jsquyres/bogus/lib/openmpi/mca_scoll_basic.so(+0x543a)[0x2aaac1e8043a]
[ivy05:11891] [ 3] /home/jsquyres/bogus/lib/openmpi/mca_scoll_basic.so(mca_scoll_basic_collect+0x1b9)[0x2aaac1e7ec4c]
[ivy05:11891] [ 4] /home/jsquyres/bogus/lib/openmpi/mca_scoll_mpi.so(mca_scoll_mpi_collect+0x2e2)[0x2aaac2088b21]
[ivy05:11891] [ 5] /home/jsquyres/bogus/lib/liboshmem.so.0(+0x3740f)[0x2aaaaaae540f]
[ivy05:11891] [ 6] /home/jsquyres/bogus/lib/liboshmem.so.0(shmem_collect32+0x25a)[0x2aaaaaae5673]
[ivy05:11891] [ 7] ./performance_tests/micro_benchmarks/collect32_performance.x[0x400b4f]
[ivy05:11891] [ 8] /lib64/libc.so.6(__libc_start_main+0xfd)[0x37b1e1ed1d]
[ivy05:11891] [ 9] ./performance_tests/micro_benchmarks/collect32_performance.x[0x400859]
[ivy05:11891] *** End of error message ***
$ gdb performance_tests/micro_benchmarks/collect32_performance.x core.collect32_perfo-ivy05-11891
(gdb) bt
#0  0x00002aaac1c76b61 in mca_spml_yoda_get (src_addr=0x6012b0 <pSyncB>, 
    size=8, dst_addr=0x7fffffffba68, src=1) at spml_yoda.c:1170
#1  0x00002aaac1e8043a in _algorithm_central_collector (group=0x81d1e0, 
    target=0xff0000f8, source=0xff0000d8, nlong=16, pSync=0x6012b0 <pSyncB>)
    at scoll_basic_collect.c:570
#2  0x00002aaac1e7ec4c in mca_scoll_basic_collect (group=0x81d1e0, 
    target=0xff0000f8, source=0xff0000d8, nlong=16, pSync=0x6012b0 <pSyncB>, 
    nlong_type=false, alg=-1) at scoll_basic_collect.c:119
#3  0x00002aaac2088b21 in mca_scoll_mpi_collect (group=0x81d1e0, 
    target=0xff0000f8, source=0xff0000d8, nlong=16, pSync=0x6012b0 <pSyncB>, 
    nlong_type=false, alg=-1) at scoll_mpi_ops.c:145
#4  0x00002aaaaaae540f in _shmem_collect (target=0xff0000f8, 
    source=0xff0000d8, nbytes=16, PE_start=0, logPE_stride=0, PE_size=32, 
    pSync=0x6012b0 <pSyncB>, array_type=false) at pshmem_collect.c:87
#5  0x00002aaaaaae5673 in pshmem_collect32 (target=0xff0000f8, 
    source=0xff0000d8, nelems=4, PE_start=0, logPE_stride=0, PE_size=32, 
    pSync=0x6012b0 <pSyncB>) at pshmem_collect.c:113
#6  0x0000000000400b4f in main () at collect32_performance.c:90
(gdb) up
#0  0x00002aaac1c76b61 in mca_spml_yoda_get (src_addr=0x6012b0 <pSyncB>, 
    size=8, dst_addr=0x7fffffffba68, src=1) at spml_yoda.c:1170
1170                    local_handle = ((mca_spml_yoda_context_t*)l_mkey->spml_context)->registration;
(gdb) p l_mkey
$2 = (sshmem_mkey_t *) 0x0
(gdb) 

@jsquyres
Member Author

jsquyres commented Nov 3, 2016

I get a lot of failures in the OpenSHMEM test suite like this that all seem to have the same signature: the l_mkey is NULL.
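For what it's worth, a hedged standalone sketch of the missing NULL check the backtrace points at (spml_yoda.c:1170 dereferences l_mkey->spml_context without checking that a local mkey was found). The struct names below are stand-ins for the real sshmem_mkey_t and mca_spml_yoda_context_t seen in the gdb session, and returning an error instead of dereferencing is an assumption about the right recovery, not the actual fix:

#include <stdio.h>

/* Minimal stand-ins for the types visible in the gdb session above. */
typedef struct { void *registration; } yoda_context_sketch_t;
typedef struct { void *spml_context; } sshmem_mkey_sketch_t;

static int get_local_registration(sshmem_mkey_sketch_t *l_mkey, void **local_handle)
{
    if (NULL == l_mkey || NULL == l_mkey->spml_context) {
        /* the case the backtrace shows: no local mkey for the source address */
        fprintf(stderr, "no local mkey/registration for this address\n");
        return -1;
    }

    *local_handle = ((yoda_context_sketch_t *) l_mkey->spml_context)->registration;
    return 0;
}

int main(void)
{
    void *handle = NULL;

    /* simulate the failing case from the backtrace: l_mkey is NULL */
    if (0 != get_local_registration(NULL, &handle)) {
        fprintf(stderr, "caught the missing mkey instead of segfaulting\n");
    }
    return 0;
}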

@artpol84
Contributor

artpol84 commented Nov 3, 2016

Thank you, we will check!

@jladd-mlnx
Member

@jsquyres , is this on master or 2.x? Even without the patch, I have no issues running the collect32_performance.x benchmark between two nodes with the following command line on the 2.x branch

$mpirun -np 32 --map-by node --mca pml ob1 --mca btl self,vader,tcp --mca spml yoda  -mca btl_tcp_if_include ib0 ./performance_tests/micro_benchmarks/collect32_performance.x
Time required to collect 512 bytes of data, with 32 PEs is 53 microseconds

If, however, I try with master (after rebuilding the benchmark), I get:

$mpirun -np 32 --map-by node --mca pml ob1 --mca btl self,vader,tcp --mca spml yoda  -mca coll_hcoll_enable 0 -mca btl_tcp_if_include ib0 ./performance_tests/micro_benchmarks/collect32_performance.x
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 747 on
node clx-orion-121 exiting improperly. There are three reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
orte_create_session_dirs is set to false. In this case, the run-time cannot
detect that the abort call was an abnormal termination. Hence, the only
error message you will receive is this one.

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).

You can avoid this message by specifying -quiet on the mpirun command line.
--------------------------------------------------------------------------

And the performance is actually significantly improved over the numbers I collected with 1.10.2.

@jsquyres
Member Author

jsquyres commented Nov 3, 2016

@jladd-mlnx This is on master. Here's how I configured Open MPI:

./configure --prefix=/home/jsquyres/bogus --with-usnic --with-libfabric=/home/jsquyres/libfabric-current/install --enable-mpirun-prefix-by-default --enable-debug --enable-mem-debug --enable-mem-profile --enable-mpi-fortran --enable-debug --enable-mem-debug --enable-picky

Copying a bunch of your params, here's how I ran that individual test (although many more fail in the same way):

$ mpirun \
    --mca spml yoda \
    --map-by node \
    --host ivy05,ivy06 \
    -np 32 \
    --mca btl_tcp_if_include vic20 \
    --mca btl tcp,vader,self \
    ./performance_tests/micro_benchmarks/collect32_performance.x

vic20 is a 10G ethernet interface.

Looking at the corefile that was emitted from the above run, it shows the same symptom: l_mkey is NULL.

@jsquyres
Member Author

jsquyres commented Nov 3, 2016

I confirm that on v2.0.x and v2.x, these initial tests seem to work fine with vader,tcp,self. Now that those fixes are merged into these branches, let's see how it does tonight on MTT.

@jsquyres
Member Author

jsquyres commented Nov 3, 2016

FWIW, I see a bunch of ptmalloc messages like this in the oshmem tests (in the v2.0.x branch):

$ mpirun -np 32 --mca btl tcp,vader,self ./feature_tests/Fortran/broadcast/test_shmem_broadcast_03_real8.x
PTMALLOC: USAGE ERROR DETECTED: m=0x2aaac15ba720 ptr=0xff0000c0
PTMALLOC: USAGE ERROR DETECTED: m=0x2aaac15ba720 ptr=0xff0000f0  
PTMALLOC: USAGE ERROR DETECTED: m=0x2aaac15ba720 ptr=0xff0000c0
PTMALLOC: USAGE ERROR DETECTED: m=0x2aaac15ba720 ptr=0xff0000f0  
PTMALLOC: USAGE ERROR DETECTED: m=0x2aaac15ba720 ptr=0xff0000c0
PTMALLOC: USAGE ERROR DETECTED: m=0x2aaac15ba720 ptr=0xff0000f0
PTMALLOC: USAGE ERROR DETECTED: m=0x2aaac15ba720 ptr=0xff0000c0
PTMALLOC: USAGE ERROR DETECTED: m=0x2aaac15ba720 ptr=0xff0000f0
PTMALLOC: USAGE ERROR DETECTED: m=0x2aaac15ba720 ptr=0xff0000c0
...
 test_shmem_broadcast8_03: Failed

We'll see more after MTT runs tonight.

@jsquyres
Member Author

I think the fixes for this particular issue are now done; I'm still seeing some OSHMEM failures in MTT testing, but let's open up a new issue to track those (i.e., they seem to be distinct from the BTL 3.0 updates).
