Resizing array on locale results in segmentation fault on uGNI with Hugepages #13611

Closed
LouisJenkinsCS opened this issue Aug 1, 2019 · 33 comments · Fixed by #13656

Comments

@LouisJenkinsCS
Member

on Locales[1] {
    var arr : [0..-1] int;
    for i in 1..1024 * 1024 * 1024 {
        arr.push_back(i);
    }
}

Stack Trace

ATP Stack walkback for Rank 1 starting:
  [empty]@0xffffffffffffffff
  qthread_wrapper@0x200b7601
  chapel_wrapper@0x2003cb91
  fork_call_wrapper_blocking@0x200450df
  wrapon_fn_chpl14@multilocalePushback.chpl:1
  on_fn_chpl14@ChapelArray.chpl:2200
  on_fn16.isra.304@ChapelArray.chpl:3300
  _delete_arr@ChapelDistribution.chpl:954
  dsiDestroyArr2@chpl-mem-impl.h:69
  chpl_je_huge_dalloc@0x200a067f

This doesn't seem to happen on my Infiniband cluster.

@mppf mppf added the user issue label Aug 1, 2019
@LouisJenkinsCS
Member Author

I'd like to note that I've come across this problem quite a few times before, and each time I assumed it was due to running out of memory, so it's not a blocking/gating issue, but I hope it gets resolved.

@mppf
Member

mppf commented Aug 1, 2019

It seems that arr.push_back can use ~2x the memory eventually required, for one thing. I was surprised how slow this ran on my system, and it used ~20GB of RAM.

@LouisJenkinsCS
Member Author

LouisJenkinsCS commented Aug 1, 2019

Right, but on Swan I have 128 GB of RAM available.

Edit

I should note that in my tests, the failure occurred long before physical memory was exhausted. Also, on 1 locale it does not result in a segmentation fault (i.e., it runs fine on one locale).

@LouisJenkinsCS
Member Author

I printed the memory used every million iterations (one push_back per iteration). The final line was:

Push Back #1073741824: 12412734912

That is about 12 GB of memory, out of 128 GB available. Not OOM.
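
As a rough check on those numbers, the sketch below compares the reported usage with the size of the data itself. It is plain arithmetic written in Python, not Chapel. It assumes Chapel's default `int` is 64 bits (8 bytes) and reads "128GB" as 128 GiB; the variable names are illustrative only.

```python
# Back-of-the-envelope check of the memory figures quoted above.
BYTES_PER_INT = 8                    # assumption: default Chapel int is 64-bit
NUM_ELEMENTS = 1024 * 1024 * 1024    # iterations in the reproducer
REPORTED_BYTES = 12412734912         # memoryUsed() value from the final line
NODE_BYTES = 128 * 1024**3           # assumption: "128GB" read as 128 GiB

payload_bytes = BYTES_PER_INT * NUM_ELEMENTS
overhead_ratio = REPORTED_BYTES / payload_bytes
node_fraction = REPORTED_BYTES / NODE_BYTES

print(f"payload:  {payload_bytes / 1024**3:.2f} GiB")
print(f"reported: {REPORTED_BYTES / 1024**3:.2f} GiB")
print(f"reported / payload: {overhead_ratio:.2f}")
print(f"share of node memory: {node_fraction:.1%}")
```

Under those assumptions the reported figure is roughly 1.4 times the 8 GiB payload and less than a tenth of the node's memory, which is consistent with this not being an out-of-memory condition.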

@bradcray
Member

bradcray commented Aug 1, 2019

Bringing @gbtitus and @ronawho into the loop on this due to their expertise with uGNI and its memory usage: If OOM were the issue should we be getting a nicer message than this? Any thoughts about how to diagnose what's going on here or who should own it?

Louis, since push_back on arrays is about to be deprecated, I'm curious whether you see the same behavior with the new list type?

@gbtitus
Member

gbtitus commented Aug 1, 2019

If we actually had an OOM situation on XC and it happened while we were touching registered memory into existence then yes, we'd get an explicit "out of memory" error message. Definitely not a segfault. Reproducing the problem in-house and getting a core dump from the segfault would probably tell us a lot about what's happening.

@LouisJenkinsCS
Member Author

@bradcray Unfortunately, I require a few things that list cannot provide: extremely efficient bulk transfers (#13583), parallel iteration, etc. For the time being, I can rely on my own push-back vector implementation. Also, FWIW, I see list as a more specialized data structure that replaces a linked list (an unrolled linked list with exponentially growing nodes) rather than a std::vector equivalent like I'd use here.

@LouisJenkinsCS
Member Author

(Should mention this was tested on release 1.19; if it was some subtle bug fixed upstream, that's fine too)

@bradcray
Member

bradcray commented Aug 1, 2019

> unfortunately I require a few things that list cannot provide me.

That's fine but I'd still be curious whether your six-line program above, if written using list, would result in the same error. If not, it suggests an error in the array code. If so, it seems more likely to be something deeper...

@LouisJenkinsCS
Member Author

Just as an update: I'm building chapel/master right now, I'll update as soon as it finishes.

@LouisJenkinsCS LouisJenkinsCS changed the title from "Resizing array on locale other than 0 results in segmentation fault on uGNI" to "Resizing array on locale results in segmentation fault on uGNI" Aug 2, 2019
@LouisJenkinsCS
Member Author

So list works just fine.

use Memory;
use Lists;

on Locales[1] {
    var l = new list(int);
    for i in 1..1024 * 1024 * 1024 {
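        l.append(i);
        // Parenthesize the modulus: '%' and '*' bind equally, so the original
        // 'i % 1024 * 1024' fired every 1024 iterations, not every 1024 * 1024.
        if (i % (1024 * 1024)) == 0 then writeln("Push Back #", i, ": ", memoryUsed());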
    }
}

@bradcray

@LouisJenkinsCS
Member Author

Push-back vector works fine...

use Memory;


on Locales[1] {
    var dom = {0..-1};
    var arr : [dom] int;
    var sz : int;
    var cap : int;
    for i in 1..1024 * 1024 * 1024 {
        if sz == cap {
            var oldCap = cap;
            cap = round(cap * 1.5) : int;
            if oldCap == cap then cap += 1;
            dom = {0..#cap};
        }
    
        arr[sz] = i;
        sz += 1;
        // Parenthesize the modulus so the report fires every 1024 * 1024 iterations.
        if (i % (1024 * 1024)) == 0 then writeln("Push Back #", i, ": ", memoryUsed());
    }
}

So the issue is with push_back, but deprecating it without fixing it seems like it would hide the problem. I'm wondering what's actually going wrong.

@daviditen
Member

Can you please share the output from module list and printchplenv?

@LouisJenkinsCS
Member Author

module list

Currently Loaded Modulefiles:
  1) modules/3.2.11.2                                   14) gcc/8.3.0
  2) alps/6.6.43-6.0.7.0_26.4__ga796da3.ari             15) craype-broadwell
  3) nodestat/2.3.85-6.0.7.0_32.1__gc6218bb.ari         16) craype-network-aries
  4) sdb/3.3.775-6.0.7.0_32.3__gb339c00.ari             17) craype/2.6.0
  5) udreg/2.3.2-6.0.7.0_33.18__g5196236.ari            18) totalview-support/1.2.0.138
  6) ugni/6.0.14.0-6.0.7.0_23.1__gea11d3d.ari           19) totalview/2019.0.4
  7) gni-headers/5.0.12.0-6.0.7.0_24.1__g3b1768f.ari    20) cray-libsci/19.06.1
  8) dmapp/7.1.1-6.0.7.0_34.3__g5a674e0.ari             21) pmi/5.0.14
  9) xpmem/2.2.15-6.0.7.1_5.8__g7549d06.ari             22) atp/2.1.3
 10) llm/21.3.530-6.0.7.0_39.1__g3b4230e.ari            23) rca/2.2.18-6.0.7.0_33.3__g2aa4f39.ari
 11) nodehealth/5.6.13-6.0.7.0_61.2__ge9c9532.ari       24) perftools-base/7.1.0
 12) system-config/3.5.2786-6.0.7.0_42.1__gc54785a.ari  25) PrgEnv-gnu/6.0.5
 13) Base-opts/2.4.135-6.0.7.0_38.1__g718f891.ari       26) craype-hugepages16M

$CHPL_HOME/util/printchplenv.bash

machine info: Linux swan 4.4.103-6.38_4.0.151-cray_ari_s #1 SMP Mon Sep 17 13:39:59 UTC 2018 (e3ad914) x86_64
CHPL_HOME: /lus/scratch/p02405/chapel-master *
script location: /lus/scratch/p02405/chapel-master/util/chplenv
CHPL_TARGET_PLATFORM: cray-xc
CHPL_TARGET_COMPILER: cray-prgenv-gnu
CHPL_TARGET_ARCH: x86_64
CHPL_TARGET_CPU: broadwell
CHPL_LOCALE_MODEL: flat
CHPL_COMM: ugni
CHPL_TASKS: qthreads
CHPL_LAUNCHER: pbs-aprun *
CHPL_TIMERS: generic
CHPL_UNWIND: none *
CHPL_MEM: jemalloc
CHPL_ATOMICS: cstdlib
  CHPL_NETWORK_ATOMICS: ugni
CHPL_GMP: gmp
CHPL_HWLOC: hwloc
CHPL_REGEXP: re2
CHPL_LLVM: llvm *
CHPL_AUX_FILESYS: none

@daviditen
Member

I didn't have craype-hugepages16M loaded at first and the program in the first comment was passing. After loading craype-hugepages16M and rebuilding the program I see the same crash. The last function in the stack trace is chpl_je_huge_dalloc which I suspect is only called when using hugepages.

@LouisJenkinsCS
Member Author

Glad that you were able to reproduce it. Does this confirm that it has to do with CHPL_COMM=ugni? I know that jemalloc has some hooks that are called on allocation and deallocation having to do with hugepages and the dynamic heap, but I'm not sure if CHPL_COMM=gasnet has the same problem.

@LouisJenkinsCS LouisJenkinsCS changed the title from "Resizing array on locale results in segmentation fault on uGNI" to "Resizing array on locale results in segmentation fault on uGNI with Hugepages" Aug 2, 2019
@LouisJenkinsCS
Member Author

LouisJenkinsCS commented Aug 2, 2019

I should note that I came across the same thing when I used CHPL_MEM=cstdlib.

ATP Stack walkback for Rank 1 starting:
  [empty]@0xffffffffffffffff
  qthread_wrapper@0x2008a151
  chapel_wrapper@0x20042886
  fork_call_wrapper_blocking@0x2004888f
  wrapon_fn_chpl14@multilocalePushback.chpl:1
  on_fn_chpl14@ChapelArray.chpl:2344
  _local_on_fn13.isra.381@ChapelDistribution.chpl:991
  dsiDestroyArr2@chpl-mem-sys.h:53
  __GI___libc_free@0x2aaaac314963

Unfortunately, even though I built with CHPL_DEVELOPER=1, it still doesn't show line numbers in the stack trace.

Anyway, this indicates a bad free (possibly a double free), which is rather serious.

@daviditen
Member

> Does this confirm that it has to do with CHPL_COMM=ugni?

I don't think I have enough evidence yet to say that, but it would be good to have one of our runtime specialists take a look.

@ronawho
Contributor

ronawho commented Aug 4, 2019

Simpler/faster reproducer. Run with 2 locales and 16M hugepages:

var arr : [0..-1] int;
for i in 1..8*1024*1024 do arr.push_back(i);

I'm pretty sure we're calling _ddata_free with the number of elements instead of the backing size, which is always wrong, but only happens to trip up ugni.

@daviditen could you take a look at the code around:

const size = blk(1) * dom.dsiDim(1).length;
_ddata_free(data, size);

(If the wrong size is passed to _ddata_free, ugni is unable to tell that it allocated the memory from hugepages, so we end up calling chpl_mem_free() on memory that was acquired with get_huge_pages(). In other words, we mix allocator calls.)
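
To make the failure mode concrete, here is a toy model in Python. It is not the ugni comm layer: the `ModelRuntime` class, its methods, and the rule that a hugepage region is recognized only when both address and size match are all assumptions made for illustration.

```python
class ModelRuntime:
    """Toy model: hugepage regions are recognized only by (address, size)."""

    def __init__(self):
        self.huge_regions = {}      # address -> size in bytes (hypothetical registry)
        self.heap_blocks = set()    # addresses owned by the ordinary heap
        self._next_addr = 0x1000

    def _fresh_addr(self):
        addr = self._next_addr
        self._next_addr += 0x1000
        return addr

    def alloc_huge(self, nbytes):
        addr = self._fresh_addr()
        self.huge_regions[addr] = nbytes
        return addr

    def alloc_heap(self, nbytes):
        addr = self._fresh_addr()
        self.heap_blocks.add(addr)
        return addr

    def free(self, addr, nbytes):
        # The hugepage path is taken only when both address and size match.
        if self.huge_regions.get(addr) == nbytes:
            del self.huge_regions[addr]
            return "hugepage free"
        # Otherwise fall through to the ordinary heap.
        if addr in self.heap_blocks:
            self.heap_blocks.remove(addr)
            return "heap free"
        # Stand-in for the crash: the heap was handed memory it never owned.
        raise RuntimeError("heap free of foreign pointer")


rt = ModelRuntime()
capacity_bytes = 12 * 8   # buffer was allocated for 12 elements
size_bytes = 8 * 8        # the user-level array holds only 8
buf = rt.alloc_huge(capacity_bytes)
try:
    rt.free(buf, size_bytes)          # wrong size: falls through to the heap
    outcome = "no error"
except RuntimeError as err:
    outcome = str(err)
print(outcome)
print(rt.free(buf, capacity_bytes))   # right size: hugepage path
```

In this model, a configuration with a single allocator would accept either size, which fits the observation that the wrong size is always a bug but only some configurations crash on it.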

@LouisJenkinsCS
Member Author

Somewhat related: I mentioned another potential hazard on Gitter about blk = copy.blk when dsiReallocate is called, which may result in another bad _ddata_free call.

@gbtitus
Member

gbtitus commented Aug 5, 2019

> I'm pretty sure we're calling _ddata_free with the number of elements instead of the backing size, which is always wrong, but only happens to trip up ugni.

@ronawho @daviditen Is that not correct? The value being passed there is the same thing we pass to _ddata_allocate() when we acquire the space. Both _ddata_allocate() and _ddata_free() do the multiplication by the sizeof the array element themselves.

@ronawho set me right; I hadn't realized that the resize in effect (re)allocates the array space at a different size without reflecting that size in the metadata that _ddata_free() relies on.

@ronawho
Contributor

ronawho commented Aug 5, 2019

I don't think it's the same. Array-as-vec grows with reallocateArray, which calls dsiReallocate to reserve space for dataAllocRange (capacity). That dsiDestroyArr is passing the domain size, not capacity:

// 'dataAllocRange' is used by the array-vector operations (e.g. push_back,
// pop_back, insert, remove) to allow growing or shrinking the data
// buffer in a doubling/halving style. If it is used, it will be the
// actual size of the 'data' buffer, while 'dom' represents the size of
// the user-level array.
var dataAllocRange: range(idxType);
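
The gap between the two quantities in that comment can be sketched with a small Python model. The function name `push_back_n` and the growth policy are hypothetical; the growth factor is a parameter because the factor used by the real array code is not assumed here.

```python
import math


def push_back_n(n_pushes, growth_factor=2.0):
    """Return (size, capacity) after n_pushes appends to an empty vector.

    'size' plays the role of the user-level domain, 'capacity' the role of
    dataAllocRange (the length of the buffer that was actually allocated).
    """
    size = 0
    capacity = 0
    for _ in range(n_pushes):
        if size == capacity:
            # Grow geometrically, always by at least one slot.
            capacity = max(capacity + 1, math.ceil(capacity * growth_factor))
        size += 1
    return size, capacity


size, capacity = push_back_n(5)
print(size, capacity)
```

Whenever the last growth step overshoots, `capacity` is larger than `size`, so a free that reports `size` describes a smaller buffer than the one that was allocated.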

I think we want something like:

diff --git a/modules/internal/DefaultRectangular.chpl b/modules/internal/DefaultRectangular.chpl
index 3f2c953edb..6f8fd6b83c 100644
--- a/modules/internal/DefaultRectangular.chpl
+++ b/modules/internal/DefaultRectangular.chpl
@@ -1073,10 +1073,12 @@ module DefaultRectangular {
             if numElts == 0 then
               numElts = dom.dsiNumIndices;
             dsiDestroyDataHelper(data, numElts);
+            _ddata_free(data, numElts);
           }
+        } else {
+          const size = blk(1) * dom.dsiDim(1).length;
+          _ddata_free(data, size);
         }
-        const size = blk(1) * dom.dsiDim(1).length;
-        _ddata_free(data, size);
       }
     }

but I'm not very familiar with this code, so I'd prefer to defer to david for the fix.

@daviditen
Member

@ronawho, I think your fix looks good. I have it going through some testing now.

@bradcray
Member

bradcray commented Aug 5, 2019

For this class of errors, I find I'm wondering about the following: What would the level of effort be to introduce a little opt-in shim that would check that sizes passed to free calls matched those sent to alloc/realloc calls? Of course, users wouldn't know to use it, but we could make it one of those things we check after a memory error is reported, as with valgrind; or we could run a sweep of the test suite with it (similar to, or as part of --verify or one of the other sanity checking nightly jobs).

@ronawho
Contributor

ronawho commented Aug 5, 2019

Hmm, that's interesting. I think that's something we could add to memTracking pretty easily: we already store the size in the memTracking hashtable, so I think we just need a sized version of chpl_memhook_free_pre to do our check against. I think it's worth forking a new issue for.
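
A minimal sketch of such a check, written in Python as a model rather than as runtime code. The class `SizeCheckingShim` and its `on_alloc`, `on_realloc`, and `on_free` hooks are hypothetical names; they are not the memTracking or `chpl_memhook_*` interfaces.

```python
class SizeMismatchError(Exception):
    """Raised when a sized free disagrees with the recorded allocation."""


class SizeCheckingShim:
    def __init__(self):
        self._sizes = {}   # address -> size recorded at alloc/realloc time

    def on_alloc(self, addr, nbytes):
        self._sizes[addr] = nbytes

    def on_realloc(self, old_addr, new_addr, nbytes):
        # The block may move; forget the old address and record the new size.
        self._sizes.pop(old_addr, None)
        self._sizes[new_addr] = nbytes

    def on_free(self, addr, nbytes):
        recorded = self._sizes.pop(addr, None)
        if recorded is None:
            raise SizeMismatchError(f"free of untracked address {addr:#x}")
        if recorded != nbytes:
            raise SizeMismatchError(
                f"free({addr:#x}) passed {nbytes} bytes, "
                f"allocation recorded {recorded}"
            )


shim = SizeCheckingShim()
shim.on_alloc(0x1000, 64)
shim.on_realloc(0x1000, 0x2000, 96)   # buffer grew and moved
try:
    shim.on_free(0x2000, 64)          # stale size from before the resize
    message = "ok"
except SizeMismatchError as err:
    message = str(err)
print(message)
```

Because the check needs only the size already stored per allocation, it could stay opt-in and run during a test sweep rather than in normal builds.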

@daviditen
Member

@ronawho I had to slightly change your patch to avoid some leaks, but after that testing came back clean.

daviditen added a commit that referenced this issue Aug 6, 2019
Fix array-as-vec bug

[reviewed and suggested by @ronawho]

When freeing an array-as-vec we were passing the size of the user-level array
instead of the size of the allocated array. This led to a crash. Implement
@ronawho's fix (slightly modified) and add @LouisJenkinsCS's test to the test system.

Resolves #13611