-
Notifications
You must be signed in to change notification settings - Fork 425
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Resizing array on locale results in segmentation fault on uGNI with Hugepages #13611
Comments
I'd like to state that I've come across this problem quite a few times before, and each time I thought it was due to running out-of-memory, so its not a blocking/gating issue, but I hope it gets resolved. |
It seems that arr.push_back can use ~2x the memory eventually required, for one thing. I was surprised how slow this ran on my system, and it used ~20GB of RAM. |
Right, but on Swan I have 128GBs of RAM available. Edit I should note that in my tests, it ran out of memory far before it ran out of actual physical memory to use. Also on 1 locale, it does not result in a segmentation fault (I.E it runs fine on one locale). |
After printing out the memory used after every million iterations (one
That would be 12GBs of memory. 12 out of 128GBs. Not OOM. |
Bringing @gbtitus and @ronawho into the loop on this due to their expertise with uGNI and its memory usage: If OOM were the issue should we be getting a nicer message than this? Any thoughts about how to diagnose what's going on here or who should own it? Louis, since push_back on arrays is about to be deprecated, I'm curious whether you see the same behavior with the new list type? |
If we actually had an OOM situation on XC and it happened while we were touching registered memory into existence then yes, we'd get an explicit "out of memory" error message. Definitely not a segfault. Reproducing the problem in-house and getting a core dump from the segfault would probably tell us a lot about what's happening. |
@bradcray unfortunately I require a few things that |
(Should mention this was tested on release 1.19; if it was some subtle bug fixed upstream, that's fine too) |
That's fine but I'd still be curious whether your six-line program above, if written using list, would result in the same error. If not, it suggests an error in the array code. If so, it seems more likely to be something deeper... |
Just as an update: I'm building chapel/master right now, I'll update as soon as it finishes. |
So use Memory;
use Lists;
on Locales[1] {
var l = new list(int);
for i in 1..1024 * 1024 * 1024 {
l.append(i);
if (i % 1024 * 1024) == 0 then writeln("Push Back #", i, ": ", memoryUsed());
}
} |
Push-back vector works fine... use Memory;
on Locales[1] {
var dom = {0..-1};
var arr : [dom] int;
var sz : int;
var cap : int;
for i in 1..1024 * 1024 * 1024 {
if sz == cap {
var oldCap = cap;
cap = round(cap * 1.5) : int;
if oldCap == cap then cap += 1;
dom = {0..#cap};
}
arr[sz] = i;
sz += 1;
if (i % 1024 * 1024) == 0 then writeln("Push Back #", i, ": ", memoryUsed());
}
} So issue is with |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Can you please share the output from |
|
I didn't have |
Glad that you were able to reproduce it. Does this confirm that it has to do with |
I should note that I came across the same thing when I used
Unfortunately, even though I built with Anyway this indicates a bad free (possibly a double free), which is rather serious. |
I don't think I have enough evidence yet to say that, but it would be good to have one of our runtime specialists take a look. |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Simpler/faster reproducer. Run with 2 locales and 16M hugepages: var arr : [0..-1] int;
for i in 1..8*1024*1024 do arr.push_back(i); I'm pretty sure we're calling _ddata_free with the number of elements instead of the backing size, which is always wrong, but only happens to trip up ugni. @daviditen could you take a look at the code around: chapel/modules/internal/DefaultRectangular.chpl Lines 1078 to 1079 in db3049a
(if the wrong size is passed to _ddata_free, ugni is unable to tell that it allocated memory from hugepages, so we end up calling |
Somewhat related but I mentioned another potential hazard on gitter about |
@ronawho set me right; I wasn't realizing that the resize in effect caused us to (re)allocate the array space to a different size but not reflect that different size in the metadata the |
I don't think it's the same. Array-as-vec grows with reallocateArray, which calls dsiReallocate to reserve space for chapel/modules/internal/DefaultRectangular.chpl Lines 1027 to 1032 in db3049a
I think we want something like: diff --git a/modules/internal/DefaultRectangular.chpl b/modules/internal/DefaultRectangular.chpl
index 3f2c953edb..6f8fd6b83c 100644
--- a/modules/internal/DefaultRectangular.chpl
+++ b/modules/internal/DefaultRectangular.chpl
@@ -1073,10 +1073,12 @@ module DefaultRectangular {
if numElts == 0 then
numElts = dom.dsiNumIndices;
dsiDestroyDataHelper(data, numElts);
+ _ddata_free(data, numElts);
}
+ } else {
+ const size = blk(1) * dom.dsiDim(1).length;
+ _ddata_free(data, size);
}
- const size = blk(1) * dom.dsiDim(1).length;
- _ddata_free(data, size);
}
}
but I'm not very familiar with this code, so I'd prefer to defer to david for the fix. |
@ronawho, I think your fix looks good. I have it going through some testing now. |
For this class of errors, I find I'm wondering about the following: What would the level of effort be to introduce a little opt-in shim that would check that sizes passed to free calls matched those sent to alloc/realloc calls? Of course, users wouldn't know to use it, but we could make it one of those things we check after a memory error is reported, as with valgrind; or we could run a sweep of the test suite with it (similar to, or as part of |
Hmm, that's interesting. I think that's something we could add to memTracking pretty easily -- We already store the size in the memTracking hashtable, I think we just need a sized version of |
@ronawho I had to slightly change your patch to avoid some leaks, but after that testing came back clean. |
Fix array-as-vec bug [reviewed and suggested by @ronawho] When freeing an array-as-vec we were passing the size of the user-level array instead of the size of the allocated array. This lead to a crash. Implement @ronawho's fix (slightly modified) and add @LouisJenkinsCS's test to the test system. Resolves #13611
Stack Trace
This doesn't seem to happen on my Infiniband cluster.
The text was updated successfully, but these errors were encountered: