
Conversation

@bennet0496 (Contributor) commented Jul 29, 2021:

Closes #5077

After first finding that the amount of memory allocated for the TMP buffer seemed too low (0x13c0000 vs. 80*UINT16_MAX*sizeof(int) = 0x13ffec0) and was followed by a protected zero page with setDTthreads(threads = 80), it turned out that in L717-L718 the allocated memory was actually only 79*UINT16_MAX*sizeof(int) = 0x13bfec4, because getDTthreads() was called with throttle = true.

But later, when the parallel threads were started, the number of threads to start was determined in L1235-L1238 with throttle = false, causing the segfault at L995 for threads above the throttled threshold when they tried to access memory past the end of the TMP buffer.

Starting only the throttled number of threads fixes that. It also seems to have fixed the random double free or corruption errors, as they currently no longer occur for 80 threads like before.

Unfortunately, it seems quite difficult to find a reasonably small dataset and thread count to write a test for, as several conditions must hold for this error to occur. The requirements I have found so far are listed below (a minimal sketch of the resulting access pattern follows the list):

  • the dataset must be groupable into enough groups that parallel execution occurs (L1195-L1243)
  • the groups need to have the correct size (<65535?) for the parallel region L1235-L1239 to run
  • groups, or slices of a group (?), must be larger than 256 and smaller than 65535 for L936-L1036 (containing the critical L995) to run
  • the number of configured threads must be larger than the throttled number of threads (the ceiling of the number of groups divided by 1024, R_DATATABLE_THROTTLE), so that less memory is allocated than used
  • the group handled by a thread whose thread number is higher than the ceiling of the number of groups divided by 1024 must have enough elements for the out-of-bounds access to land on the next protected page
  • the throttled number of threads needs to be large enough that the malloc() in L718 internally uses mmap() instead of the default brk()/sbrk(), so that the protected page is not moved by new heap allocations
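
To make the failure mode concrete, here is the minimal sketch referred to above. It is not forder.c itself but a self-contained stand-in for the pattern this PR describes: the buffer is sized for the throttled thread count while the parallel team is sized unthrottled, so the last threads' slices start past the end of the allocation (the 79 vs. 80 values are taken from the report):

```c
#include <omp.h>
#include <stdint.h>
#include <stdlib.h>

enum { THROTTLED_NTH = 79, UNTHROTTLED_NTH = 80 };  // stand-ins from the report

int main(void) {
  // as in L717-L718: one UINT16_MAX*sizeof(int) slice per *throttled* thread
  int *TMP = malloc((size_t)THROTTLED_NTH * UINT16_MAX * sizeof(int));
  if (!TMP) return 1;
  // as in L1235-L1238: but the team is sized with throttle=false
  #pragma omp parallel num_threads(UNTHROTTLED_NTH)
  {
    // as at L995: each thread indexes its own slice; for thread numbers
    // >= THROTTLED_NTH the slice lies past the end of TMP, and if the
    // next page is protected (mmap()'d allocation) this write segfaults
    int *my_tmp = TMP + (size_t)omp_get_thread_num() * UINT16_MAX;
    my_tmp[0] = 0;
  }
  free(TMP);
  return 0;
}
```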

In other words, it seems to be a very specific edge case, and the scientist who created the dataset that causes the crash should maybe start to gamble :)

…rting the amount of thread, memory was allocated for.

L717-L718 allocated memory for a batch-throttled number of threads in the TMP buffer, but L1235-L1238 started an unthrottled number of threads, causing the segfault at L995 for threads over the throttled threshold.
@mattdowle (Member) commented:

Thanks for all this great detail. LOL at the last sentence.
I just had a quick look at the proposed fix so far, and I can see why it works: applying the throttle reduces the number of threads. But false seems like the correct value there, since the loop is through ngrp:

```c
#pragma omp parallel for schedule(dynamic) num_threads(getDTthreads(ngrp, false))
for (int i=0; i<ngrp; i++) {
```

The comments in getDTthreads for throttle are:

```c
// throttle==true  : a number of iterations per thread (DTthrottle) is applied before a second thread is utilized
// throttle==false : parallel region is already pre-chunked such as in fread; e.g. two batches intended for two threads
```

I'm thinking it's the true on L717 that needs to be false?
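
For reference, a minimal sketch of what the two throttle modes amount to, assuming a configured limit DTthreads and a throttle of 1024 iterations per thread (R_DATATABLE_THROTTLE); a simplification for illustration, not the exact code in openmp-utils.c:

```c
#include <stdbool.h>
#include <stdint.h>

static int DTthreads  = 80;    // configured limit, e.g. via setDTthreads()
static int DTthrottle = 1024;  // iterations per thread before a second thread is used

static int getDTthreads_sketch(int64_t n, bool throttle) {
  if (n < 1) return 1;
  int64_t ans = throttle ? 1 + (n - 1) / DTthrottle  // i.e. ceil(n / DTthrottle)
                         : n;                        // pre-chunked: at most one thread per chunk
  return ans < DTthreads ? (int)ans : DTthreads;     // never exceed the configured limit
}
```

So with throttle=true the answer grows by one thread per 1024 iterations, while with throttle=false it is capped only by the configured limit.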

@mattdowle mattdowle added this to the 1.14.1 milestone Jul 29, 2021
@mattdowle mattdowle added the bug label Jul 29, 2021
@mattdowle (Member) commented:

This grep finds this one, and 4 others to review: all getDTthreads(..., true) calls not on #pragma lines:

```
$ grep -n "getDTthreads(.*true" *.c --exclude=openmp-utils.c --exclude=init.c | grep -v pragma
forder.c:717:  nth = getDTthreads(nrow, true);  // this nth is relied on in cleanup()
forder.c:1039:  int batchSize = MIN(UINT16_MAX, 1+my_n/getDTthreads(my_n, true));  // (my_n-1)/nBatch + 1;   //UINT16_MAX == 65535
fsort.c:116:  int nth = getDTthreads(xlength(x), true);
gsumm.c:82:  nBatch = MIN((nrow+1)/2, getDTthreads(nrow, true)*2);  // *2 to reduce last-thread-home. TODO: experiment. The higher this is though, the bigger is counts[]
subset.c:14:  int nth = getDTthreads(n, /*throttle=*/true);   // not const for Solaris, #4638
```

I was thinking that all getDTthreads() calls not on #pragma lines should be enforced to be false; i.e. the 5 calls above. But then, for a large number of threads such as the 80 in issue #5077, working memory would be allocated for the 80 threads which would be wasteful when throttling.

Maybe the problem in general is that we call getDTthreads() again on #pragma lines, where we should instead use nth where there is one, to ensure the throttle=true|false choice matches the allocation.
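
Concretely, that suggestion would amount to something like the following sketch (illustrative only: MIN stands for data.table's macro, and the surrounding function is hypothetical):

```c
#include <stdint.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

// nth is the value returned by getDTthreads(nrow, true) that sized the
// TMP allocation at L717; reusing it here means the parallel team can
// never outgrow the allocation, unlike num_threads(getDTthreads(ngrp, false))
void groups_loop(int nth, int ngrp) {
  #pragma omp parallel for schedule(dynamic) num_threads(MIN(nth, ngrp))
  for (int i = 0; i < ngrp; i++) {
    // ... process group i, using this thread's slice of TMP ...
  }
}
```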

@bennet0496 (Contributor, Author) commented:

> LOL at the last sentence.

I mean, from what I've seen, it looks pretty unlikely to me to end up with a dataset fulfilling all the requirements to hit this error. And there is a saying where I'm from that when you manage to do something very unlikely by pure chance, then you have enough luck to gamble, or more specifically to play the lottery (even though mathematically this would make no sense). But that's beside the point :)

> But false seems like the correct value there since the loop is through ngrp
> [...]
> I'm thinking it's the true on L717 that needs to be false?

To be honest, I'm not completely sure which would be the correct value in either place. The code is quite hard to read, and it took me a good week to understand it well enough to find this error. Also, throttling the parallel region to fix the problem seemed least intrusive to me, as throttling seemed to be applied for a good reason in the first place.

> I was thinking that all getDTthreads() calls not on #pragma lines should be enforced to be false; i.e. the 5 calls above. But then, for a large number of threads such as the 80 in issue #5077, working memory would be allocated for the 80 threads which would be wasteful when throttling.

Fair enough; allocating the upper bound of what could possibly be used will definitely also solve the problem. It looks like the overhead would be about 256 kB per additional thread, which is arguably not optimal but probably also not too bad.

> Maybe the problem in general is that we call getDTthreads() again on #pragma lines, where we should instead use nth where there is one, to ensure the throttle=true|false choice matches the allocation.

I agree that it would probably be best to reuse the value where possible and not calculate it again, to prevent mismatching options and therefore mismatching calculations. I will take a look at that. Maybe using the minimum of nth and the value calculated on the pragma lines would also be an option, as the bases on which the values are calculated seem to be different, in case at some point it is desirable to start even fewer threads than anticipated during the memory calculation.

@bennet0496 (Contributor, Author) commented:

Okay, so I traced back the values and found that:

[image: variable dependency diagram]

Now, for the first call of radix_r(), directly after the memory allocation, my_n = nrow. This means for the blue box (with a sufficiently large number of configured threads):

[equation image]

And for the yellow box:

[equation image]

While during the allocation of TMP, with a sufficiently large number of configured threads, nth is calculated as:

nth = 1 + (nrow - 1) / 1024  (integer division, i.e. ⌈nrow / 1024⌉, assuming the configured thread count does not cap it)

So it looks like one more thread is usually started than memory was allocated for. When the additional conditions to reach L995 are met, the buffer overflow occurs.

It also looks like this global TMP is used only by the code around L995, so it seems to be unused overhead some (or most?) of the time anyway. So it probably also wouldn't hurt to allocate the additional 256 kB per thread for this buffer, to be safe that no overflow occurs, instead of throttling the number of threads to start.
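
For concreteness, the sizes implied by the numbers reported above (assuming a 4-byte int, as in the original report):

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
  size_t slice = (size_t)UINT16_MAX * sizeof(int);            // 262,140 bytes, ~256 kB per thread
  printf("allocated, 79 threads: %#zx bytes\n", 79 * slice);  // 0x13bfec4
  printf("needed,    80 threads: %#zx bytes\n", 80 * slice);  // 0x13ffec0
  return 0;
}
```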

bennet0496 and others added 2 commits July 30, 2021 15:34
…t start too many threads where the buffer might be used. Also renaming the global TMP as such. see Rdatatable#5077 and Rdatatable#5087
@bennet0496 bennet0496 marked this pull request as ready for review August 1, 2021 18:23
@mattdowle (Member) commented:

Have invited you to be a project member so, among other things, you can create branches in the main project in future. The invite should be a button you need to click to accept in GitHub, on either your projects or profile page. Have also added you to the contributors list. Thanks again!

remove reference to radix_r as users won't know what that internal function is
for consistency with other news items:
just link to issue (those interested can get to the PR from the issue)
just your name in news item. folk can get to your profile by clicking the link to the issue
…es are static global in this file (shared between functions just in this file) so adding a prefix to just one is inconsistent
src/forder.c Outdated
```c
// all groups are <=65535 and radix_r() will handle each one single-threaded. Therefore, this time
// it does make sense to start a parallel team and there will be no nestedness here either.

int threads_to_run = MIN(nth, getDTthreads(ngrp, false)); // the parallel regions below might reach code
```
@mattdowle (Member) commented Aug 3, 2021:

Calling getDTthreads() again here still leaves the same door open that this value of throttle needs to be the same as the value used to allocate on line 718. We can use nth on the #pragma line, avoid the new variable, avoid the extra call to getDTthreads() and reduce from 3 places to 2 places. Will do...

@bennet0496 (Contributor, Author) commented:

I don't think calling it again here, with the MIN limiting it to the value of nth, is necessarily the same problem. But thinking about it again and seeing the last commits you added, I'd say it was redundant: getDTthreads() with throttle=false basically only limits to the number of configured threads, which is already done by nth. So if the configured threads are the limiting factor, that is already reflected in nth, and there is no need to apply it again; ngrp can then only play a role here if ngrp < threads, but in that case ngrp < nth would also be true. So doing it like you did with MIN(nth, ngrp) is more efficient, as it saves the function call.
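
A quick numeric check of that argument, using hypothetical values (80 configured threads, 65536 rows and 65536 groups):

```c
#include <assert.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

int main(void) {
  int DTthreads = 80, ngrp = 65536;
  int nth = MIN(DTthreads, (65536 + 1023) / 1024);    // getDTthreads(nrow, true)  == 64
  int unthrottled = MIN(DTthreads, ngrp);             // getDTthreads(ngrp, false) == 80
  assert(MIN(nth, unthrottled) == MIN(nth, ngrp));    // both 64: the extra call is redundant
  return 0;
}
```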

src/forder.c Outdated
```c
static int **gs_thread=NULL; // each thread has a private buffer which gets flushed to the final gs appropriately
static int *gs_thread_alloc=NULL;
static int *gs_thread_n=NULL;
static int *global_TMP=NULL; // UINT16_MAX*sizeof(int) for each thread; used by counting sort in radix_r()
```
@bennet0496 (Contributor, Author) commented:

I renamed the variable because it made it easier to distinguish from the other block-local TMP(s?) during debugging. But yes, for consistency I probably should have renamed it back after I was done.

@bennet0496 (Contributor, Author) left a comment:

This was the wrong one of the two comment buttons, I guess. I'm sorry.

NEWS.md Outdated
28. `dplyr::arrange(DT)` uses `vctrs::vec_slice` which retains `data.table`'s class but uses C to bypass `[` method dispatch and does not adjust `data.table`'s attributes containing the index row numbers, [#5042](https://github.com/Rdatatable/data.table/issues/5042). `data.table`'s long-standing `.internal.selfref` mechanism to detect such operations by other packages was not being checked by `data.table` when using indexes, causing `data.table` filters and joins to use invalid indexes and return incorrect results after a `dplyr::arrange(DT)`. Thanks to @Waldi73 for reporting; @avimallu, @tlapak, @MichaelChirico, @jangorecki and @hadley for investigating and suggestions; and @mattdowle for the PR. The intended way to use `data.table` is `data.table::setkey(DT, col1, col2, ...)` which reorders `DT` by reference in parallel, sets the primary key for automatic use by subsequent `data.table` queries, and permits rowname-like usage such as `DT["foo",]` which returns the now-contiguous-in-memory block of rows where the first column of `DT`'s key contains `"foo"`. Multi-column-rownames (i.e. a primary key of more than one column) can be looked up using `DT[.("foo",20210728L), ]`. Using `==` in `i` is also optimized to use the key or indices, if you prefer using column names explicitly and `==`. An alternative to `setkey(DT)` is returning a new ordered result using `DT[order(col1, col2, ...), ]`.

```diff
-29. A segfault could occur when grouping or sorting in some circumstances with a large number of threads such as 80, [#5077](https://github.com/Rdatatable/data.table/issues/5077). Thanks to Bennet Becker for reporting, debugging at C level, and fixing.
+29. A segfault could occur when grouping or sorting in some circumstances with a large number of threads such as 80 on a server with 256 logical cores, [#5077](https://github.com/Rdatatable/data.table/issues/5077). Thanks to Bennet Becker for reporting, debugging at C level, and fixing.
```
@bennet0496 (Contributor, Author) commented:

When mentioning the processor used, I think it would be better to use 128 threads as the example, as this would be the default value on such a machine. The other values mentioned (such as 80 or 96) were manually configured to initially find the error threshold. And I don't know how common it is for people to manually set the number of threads to be used.

@mattdowle (Member) commented:

Thanks, I see. I'm going to try and create a test which reliably fails under ASAN (the commands are in CRAN_Release.cmd). I suspect it's not failing very often currently because there's space after the end of TMP within the last page that's effectively ok to write to (or something like that). Turning on ASAN should trap that in more cases given a suitable test. I never like using the words "in some circumstances" in the news items because I sense the reader thinking "well, what circumstances, and might my circumstances be affected?". So once the test is ready I'll go back and replace that phrase and the number of threads and cores with more detail.

@bennet0496 (Contributor, Author) commented:

Yes, with ASAN it should not be too hard to find a reliably failing example. The examples I have on my work PC might even produce the error with this, but I will not be able to access them until tomorrow noon (CEST). But I think even this example required at least 6 threads or something like that, and I don't know what is possible in the test environment or what the limits are.

And I wouldn't call it "effectively ok to write to" but rather "possible to write to without the environment noticing", as buffer overflows are always inherently problematic: the overflow may overwrite data belonging to other parts of the program, or the overflowed data may be unexpectedly overwritten by other parts of the program, especially if the data is small enough for an ordinary heap allocation (brk()/sbrk() rather than mmap()) to occur.

@mattdowle (Member) commented:

Yes that is better wording. I do know that.

@mattdowle (Member) commented:

This is enough to segfault and doesn't need ASAN:

```r
DT = data.table(grp=sample(65536))  # >=65536 necessary
setDTthreads(throttle=nrow(DT))     # increase throttle to reduce threads to 1 for this nrow
DT[, .N, by=grp]                    # segfault
setkey(DT, grp)                     # segfault
```

@bennet0496 (Contributor, Author) commented:

Oh, OK, I didn't know (or find) that the throttle value is adjustable like this. I tried to find an example with the default value of 1024, which made it much harder to find a small example that would even remotely reach the protected page.

@jangorecki jangorecki added this to the 1.15.0 milestone Oct 29, 2023
Successfully merging this pull request may close: [.data.table crashes with segfault while grouping with more than 79 threads (#5077)