
Conversation

@bennet0496 (Contributor) commented Jul 29, 2021:

Closes #5077

After first finding that the amount of memory allocated for the TMP buffer seemed too low (0x13c0000 vs. 80*UINT16_MAX*sizeof(int) = 0x13ffec0) and was followed by a protected zero page with setDTthreads(threads = 80), it turned out that in L717-L718 the allocated memory was actually only 79*UINT16_MAX*sizeof(int) = 0x13bfec4, because getDTthreads() was called with throttle = true.

But later, when the parallel threads were started, the number of threads to start was determined in L1235-L1238 with throttle = false, causing the segfault at L995 for threads above the throttled threshold when they tried to access memory past the end of the TMP buffer.

Starting only the throttled number of threads fixes that. It also seems to have fixed the random double free or corruption errors, as they currently no longer occur for 80 threads like before.

Unfortunately, it seems quite difficult to find a reasonably small dataset and thread count to write a test for, as several conditions must hold for this error to occur. The requirements I have found so far are listed below (a minimal sketch of the resulting access pattern follows the list):

  • the dataset must be groupable into enough groups that parallel execution occurs (L1195-L1243)
  • the groups need to have the correct size (<65535?) for the parallel region L1235-L1239 to run
  • groups, or slices of a group (?), must be larger than 256 and smaller than 65535 for L936-L1036 (containing the critical L995) to run
  • the number of configured threads must be larger than the throttled number of threads (the ceiling of the number of groups divided by 1024, R_DATATABLE_THROTTLE), so that less memory is allocated than used
  • the group handled by a thread whose thread number is higher than the ceiling of the number of groups divided by 1024 must have enough elements for the out-of-bounds access to land on the next protected page
  • the throttled number of threads needs to be large enough that the malloc() in L718 internally uses mmap() instead of the default brk()/sbrk(), so that the protected page is not moved by new heap allocations
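
To make the failure mode concrete, here is the minimal sketch referred to above. It is not forder.c itself but a self-contained stand-in for the pattern this PR describes: the buffer is sized for the throttled thread count while the parallel team is sized unthrottled, so the last threads' slices start past the end of the allocation (the 79 vs. 80 values are taken from the report):

```c
#include <omp.h>
#include <stdint.h>
#include <stdlib.h>

enum { THROTTLED_NTH = 79, UNTHROTTLED_NTH = 80 };  // stand-ins from the report

int main(void) {
  // as in L717-L718: one UINT16_MAX*sizeof(int) slice per *throttled* thread
  int *TMP = malloc((size_t)THROTTLED_NTH * UINT16_MAX * sizeof(int));
  if (!TMP) return 1;
  // as in L1235-L1238: but the team is sized with throttle=false
  #pragma omp parallel num_threads(UNTHROTTLED_NTH)
  {
    // as at L995: each thread indexes its own slice; for thread numbers
    // >= THROTTLED_NTH the slice lies past the end of TMP, and if the
    // next page is protected (mmap()'d allocation) this write segfaults
    int *my_tmp = TMP + (size_t)omp_get_thread_num() * UINT16_MAX;
    my_tmp[0] = 0;
  }
  free(TMP);
  return 0;
}
```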

In other words, it seems to be a very specific edge case, and the scientist who created the dataset that causes the crash should maybe start to gamble :)

…rting the amount of thread, memory was allocated for.

L717-L718 allocated memory for a batch-throttled number of threads in the TMP buffer, but L1235-L1238 started an unthrottled number of threads, causing the segfault at L995 for threads over the throttled threshold.
@mattdowle (Member) commented:

Thanks for all this great detail. LOL at the last sentence.
I just had a quick look at the proposed fix so far, and I can see why it works: applying the throttle reduces the number of threads. But false seems like the correct value there, since the loop is through ngrp:

```c
#pragma omp parallel for schedule(dynamic) num_threads(getDTthreads(ngrp, false))
for (int i=0; i<ngrp; i++) {
```

The comments in getDTthreads for throttle are:

```c
// throttle==true  : a number of iterations per thread (DTthrottle) is applied before a second thread is utilized
// throttle==false : parallel region is already pre-chunked such as in fread; e.g. two batches intended for two threads
```

I'm thinking it's the true on L717 that needs to be false?
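
For reference, a minimal sketch of what the two throttle modes amount to, assuming a configured limit DTthreads and a throttle of 1024 iterations per thread (R_DATATABLE_THROTTLE); a simplification for illustration, not the exact code in openmp-utils.c:

```c
#include <stdbool.h>
#include <stdint.h>

static int DTthreads  = 80;    // configured limit, e.g. via setDTthreads()
static int DTthrottle = 1024;  // iterations per thread before a second thread is used

static int getDTthreads_sketch(int64_t n, bool throttle) {
  if (n < 1) return 1;
  int64_t ans = throttle ? 1 + (n - 1) / DTthrottle  // i.e. ceil(n / DTthrottle)
                         : n;                        // pre-chunked: at most one thread per chunk
  return ans < DTthreads ? (int)ans : DTthreads;     // never exceed the configured limit
}
```

So with throttle=true the answer grows by one thread per 1024 iterations, while with throttle=false it is capped only by the configured limit.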

@mattdowle mattdowle added this to the 1.14.1 milestone Jul 29, 2021
@mattdowle mattdowle added the bug label Jul 29, 2021
@mattdowle (Member) commented:

This grep finds this one, and 4 others to review: all getDTthreads(..., true) calls not on #pragma lines:

```
$ grep -n "getDTthreads(.*true" *.c --exclude=openmp-utils.c --exclude=init.c | grep -v pragma
forder.c:717:  nth = getDTthreads(nrow, true);  // this nth is relied on in cleanup()
forder.c:1039:  int batchSize = MIN(UINT16_MAX, 1+my_n/getDTthreads(my_n, true));  // (my_n-1)/nBatch + 1;   //UINT16_MAX == 65535
fsort.c:116:  int nth = getDTthreads(xlength(x), true);
gsumm.c:82:  nBatch = MIN((nrow+1)/2, getDTthreads(nrow, true)*2);  // *2 to reduce last-thread-home. TODO: experiment. The higher this is though, the bigger is counts[]
subset.c:14:  int nth = getDTthreads(n, /*throttle=*/true);   // not const for Solaris, #4638
```

I was thinking that all getDTthreads() calls not on #pragma lines should be enforced to be false; i.e. the 5 calls above. But then, for a large number of threads such as the 80 in issue #5077, working memory would be allocated for the 80 threads which would be wasteful when throttling.

Maybe the problem in general is that we call getDTthreads() again on #pragma lines, where we should instead use nth where there is one, to ensure the throttle=true|false choice matches the allocation.
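
Concretely, that suggestion would amount to something like the following sketch (illustrative only: MIN stands for data.table's macro, and the surrounding function is hypothetical):

```c
#include <stdint.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

// nth is the value returned by getDTthreads(nrow, true) that sized the
// TMP allocation at L717; reusing it here means the parallel team can
// never outgrow the allocation, unlike num_threads(getDTthreads(ngrp, false))
void groups_loop(int nth, int ngrp) {
  #pragma omp parallel for schedule(dynamic) num_threads(MIN(nth, ngrp))
  for (int i = 0; i < ngrp; i++) {
    // ... process group i, using this thread's slice of TMP ...
  }
}
```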

@bennet0496 (Contributor, Author) commented:

> LOL at the last sentence.

I mean, from what I've seen, it looks pretty unlikely to me to end up with a dataset fulfilling all the requirements to hit this error. And there is a saying where I'm from that when you manage to do something very unlikely by pure chance, then you have enough luck to gamble, or more specifically to play the lottery (even though mathematically this would make no sense). But that's beside the point :)

> But false seems like the correct value there since the loop is through ngrp
> [...]
> I'm thinking it's the true on L717 that needs to be false?

To be honest, I'm not completely sure which would be the correct value in either place. The code is quite hard to read, and it took me a good week to understand it well enough to find this error. Also, throttling the parallel region to fix the problem seemed least intrusive to me, as throttling seemed to be applied for a good reason in the first place.

> I was thinking that all getDTthreads() calls not on #pragma lines should be enforced to be false; i.e. the 5 calls above. But then, for a large number of threads such as the 80 in issue #5077, working memory would be allocated for the 80 threads which would be wasteful when throttling.

Fair enough; allocating the upper bound of what could possibly be used will definitely also solve the problem. It looks like the overhead would be about 256 kB per additional thread, which is arguably not optimal but probably also not too bad.

> Maybe the problem in general is that we call getDTthreads() again on #pragma lines, where we should instead use nth where there is one, to ensure the throttle=true|false choice matches the allocation.

I agree that it would probably be best to reuse the value where possible and not calculate it again, to prevent mismatching options and therefore mismatching calculations. I will take a look at that. Maybe using the minimum of nth and the value calculated on the pragma lines would also be an option, as the bases on which the values are calculated seem to be different, in case at some point it is desirable to start even fewer threads than anticipated during the memory calculation.

@bennet0496 (Contributor, Author) commented:

Okay, so I traced back the values and found that:

[image: variable dependency diagram]

Now, for the first call of radix_r(), directly after the memory allocation, my_n = nrow. This means for the blue box (with a sufficiently large number of configured threads):

[equation image]

And for the yellow box:

[equation image]

While during the allocation of TMP, with a sufficiently large number of configured threads, nth is calculated as:

nth = 1 + (nrow - 1) / 1024  (integer division, i.e. ⌈nrow / 1024⌉, assuming the configured thread count does not cap it)

So it looks like one more thread is usually started than memory was allocated for. When the additional conditions to reach L995 are met, the buffer overflow occurs.

It also looks like this global TMP is used only by the code around L995, so it seems to be unused overhead some (or most?) of the time anyway. So it probably also wouldn't hurt to allocate the additional 256 kB per thread for this buffer, to be safe that no overflow occurs, instead of throttling the number of threads to start.
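
For concreteness, the sizes implied by the numbers reported above (assuming a 4-byte int, as in the original report):

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
  size_t slice = (size_t)UINT16_MAX * sizeof(int);            // 262,140 bytes, ~256 kB per thread
  printf("allocated, 79 threads: %#zx bytes\n", 79 * slice);  // 0x13bfec4
  printf("needed,    80 threads: %#zx bytes\n", 80 * slice);  // 0x13ffec0
  return 0;
}
```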

bennet0496 and others added 2 commits July 30, 2021 15:34
…t start too many threads where the buffer might be used. Also renaming the global TMP as such. see Rdatatable#5077 and Rdatatable#5087
@bennet0496 bennet0496 marked this pull request as ready for review August 1, 2021 18:23
@mattdowle (Member) commented:

Have invited you to be a project member so, among other things, you can create branches in the main project in future. The invite should be a button you need to click to accept in GitHub, on either your projects or profile page. Have also added you to the contributors list. Thanks again!

remove reference to radix_r as users won't know what that internal function is
for consistency with other news items:
just link to issue (those interested can get to the PR from the issue)
just your name in news item. folk can get to your profile by clicking the link to the issue
…es are static global in this file (shared between functions just in this file) so adding a prefix to just one is inconsistent
src/forder.c Outdated
```c
// all groups are <=65535 and radix_r() will handle each one single-threaded. Therefore, this time
// it does make sense to start a parallel team and there will be no nestedness here either.

int threads_to_run = MIN(nth, getDTthreads(ngrp, false)); // the parallel regions below might reach code
```
@mattdowle (Member) commented Aug 3, 2021:

Calling getDTthreads() again here still leaves the same door open that this value of throttle needs to be the same as the value used to allocate on line 718. We can use nth on the #pragma line, avoid the new variable, avoid the extra call to getDTthreads() and reduce from 3 places to 2 places. Will do...

@bennet0496 (Contributor, Author) commented:

I don't think calling it again here, with the MIN limiting it to the value of nth, is necessarily the same problem. But thinking about it again and seeing the last commits you added, I'd say it was redundant: getDTthreads() with throttle=false basically only limits to the number of configured threads, which is already done by nth. So if the configured threads are the limiting factor, that is already reflected in nth, and there is no need to apply it again; ngrp can then only play a role here if ngrp < threads, but in that case ngrp < nth would also be true. So doing it like you did with MIN(nth, ngrp) is more efficient, as it saves the function call.
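
A quick numeric check of that argument, using hypothetical values (80 configured threads, 65536 rows and 65536 groups):

```c
#include <assert.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

int main(void) {
  int DTthreads = 80, ngrp = 65536;
  int nth = MIN(DTthreads, (65536 + 1023) / 1024);    // getDTthreads(nrow, true)  == 64
  int unthrottled = MIN(DTthreads, ngrp);             // getDTthreads(ngrp, false) == 80
  assert(MIN(nth, unthrottled) == MIN(nth, ngrp));    // both 64: the extra call is redundant
  return 0;
}
```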

src/forder.c Outdated
```c
static int **gs_thread=NULL; // each thread has a private buffer which gets flushed to the final gs appropriately
static int *gs_thread_alloc=NULL;
static int *gs_thread_n=NULL;
static int *global_TMP=NULL; // UINT16_MAX*sizeof(int) for each thread; used by counting sort in radix_r()
```
@bennet0496 (Contributor, Author) commented:

I renamed the variable because it made it easier to distinguish from the other block-local TMP(s?) during debugging. But yes, for consistency I probably should have renamed it back after I was done.

@bennet0496 (Contributor, Author) left a comment:

This was the wrong one of the two comment buttons, I guess. I'm sorry.

NEWS.md Outdated
28. `dplyr::arrange(DT)` uses `vctrs::vec_slice` which retains `data.table`'s class but uses C to bypass `[` method dispatch and does not adjust `data.table`'s attributes containing the index row numbers, [#5042](https://github.com/Rdatatable/data.table/issues/5042). `data.table`'s long-standing `.internal.selfref` mechanism to detect such operations by other packages was not being checked by `data.table` when using indexes, causing `data.table` filters and joins to use invalid indexes and return incorrect results after a `dplyr::arrange(DT)`. Thanks to @Waldi73 for reporting; @avimallu, @tlapak, @MichaelChirico, @jangorecki and @hadley for investigating and suggestions; and @mattdowle for the PR. The intended way to use `data.table` is `data.table::setkey(DT, col1, col2, ...)` which reorders `DT` by reference in parallel, sets the primary key for automatic use by subsequent `data.table` queries, and permits rowname-like usage such as `DT["foo",]` which returns the now-contiguous-in-memory block of rows where the first column of `DT`'s key contains `"foo"`. Multi-column-rownames (i.e. a primary key of more than one column) can be looked up using `DT[.("foo",20210728L), ]`. Using `==` in `i` is also optimized to use the key or indices, if you prefer using column names explicitly and `==`. An alternative to `setkey(DT)` is returning a new ordered result using `DT[order(col1, col2, ...), ]`.

```diff
-29. A segfault could occur when grouping or sorting in some circumstances with a large number of threads such as 80, [#5077](https://github.com/Rdatatable/data.table/issues/5077). Thanks to Bennet Becker for reporting, debugging at C level, and fixing.
+29. A segfault could occur when grouping or sorting in some circumstances with a large number of threads such as 80 on a server with 256 logical cores, [#5077](https://github.com/Rdatatable/data.table/issues/5077). Thanks to Bennet Becker for reporting, debugging at C level, and fixing.
```
@bennet0496 (Contributor, Author) commented:

When mentioning the processor used, I think it would be better to use 128 threads as the example, as this would be the default value on such a machine. The other values mentioned (such as 80 or 96) were manually configured to initially find the error threshold. And I don't know how common it is for people to manually set the number of threads to be used.

@mattdowle (Member) commented:

Thanks, I see. I'm going to try and create a test which reliably fails under ASAN (the commands are in CRAN_Release.cmd). I suspect it's not failing very often currently because there's space after the end of TMP within the last page that's effectively ok to write to (or something like that). Turning on ASAN should trap that in more cases given a suitable test. I never like using the words "in some circumstances" in the news items because I sense the reader thinking "well, what circumstances, and might my circumstances be affected?". So once the test is ready I'll go back and replace that phrase and the number of threads and cores with more detail.

@bennet0496 (Contributor, Author) commented:

Yes, with ASAN it should not be too hard to find a reliably failing example. The examples I have on my work PC might even produce the error with this, but I will not be able to access them until tomorrow noon (CEST). But I think even this example required at least 6 threads or something like that, and I don't know what is possible in the test environment or what the limits are.

And I wouldn't call it "effectively ok to write to" but rather "possible to write to without the environment noticing", as buffer overflows are always inherently problematic: the overflow may overwrite data belonging to other parts of the program, or the overflowed data may be unexpectedly overwritten by other parts of the program, especially if the data is small enough for an ordinary heap allocation (brk()/sbrk() rather than mmap()) to occur.

@mattdowle (Member) commented:

Yes that is better wording. I do know that.

@mattdowle (Member) commented:

This is enough to segfault and doesn't need ASAN:

```r
DT = data.table(grp=sample(65536))  # >=65536 necessary
setDTthreads(throttle=nrow(DT))     # increase throttle to reduce threads to 1 for this nrow
DT[, .N, by=grp]                    # segfault
setkey(DT, grp)                     # segfault
```

@bennet0496 (Contributor, Author) commented:

Oh, OK, I didn't know (or find) that the throttle value is adjustable like this. I tried to find an example with the default value of 1024, which made it much harder to find a small example that would even remotely reach the protected page.

@jangorecki jangorecki added this to the 1.15.0 milestone Oct 29, 2023
Successfully merging this pull request may close: [.data.table crashes with segfault while grouping with more than 79 threads (#5077)