Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion DESCRIPTION
Original file line number Diff line number Diff line change
Expand Up @@ -65,7 +65,8 @@ Authors@R: c(
person("Tony","Fischetti", role="ctb"),
person("Ofek","Shilon", role="ctb"),
person("Vadim","Khotilovich", role="ctb"),
person("Hadley","Wickham", role="ctb"))
person("Hadley","Wickham", role="ctb"),
person("Bennet","Becker", role="ctb"))
Depends: R (>= 3.1.0)
Imports: methods
Suggests: bit64 (>= 4.0.0), bit (>= 4.0.4), curl, R.utils, xts, nanotime, zoo (>= 1.8-1), yaml, knitr, rmarkdown, markdown
Expand Down
2 changes: 2 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -153,6 +153,8 @@

28. `dplyr::arrange(DT)` uses `vctrs::vec_slice` which retains `data.table`'s class but uses C to bypass `[` method dispatch and does not adjust `data.table`'s attributes containing the index row numbers, [#5042](https://github.com/Rdatatable/data.table/issues/5042). `data.table`'s long-standing `.internal.selfref` mechanism to detect such operations by other packages was not being checked by `data.table` when using indexes, causing `data.table` filters and joins to use invalid indexes and return incorrect results after a `dplyr::arrange(DT)`. Thanks to @Waldi73 for reporting; @avimallu, @tlapak, @MichaelChirico, @jangorecki and @hadley for investigating and suggestions; and @mattdowle for the PR. The intended way to use `data.table` is `data.table::setkey(DT, col1, col2, ...)` which reorders `DT` by reference in parallel, sets the primary key for automatic use by subsequent `data.table` queries, and permits rowname-like usage such as `DT["foo",]` which returns the now-contiguous-in-memory block of rows where the first column of `DT`'s key contains `"foo"`. Multi-column-rownames (i.e. a primary key of more than one column) can be looked up using `DT[.("foo",20210728L), ]`. Using `==` in `i` is also optimized to use the key or indices, if you prefer using column names explicitly and `==`. An alternative to `setkey(DT)` is returning a new ordered result using `DT[order(col1, col2, ...), ]`.

29. A segfault occurred when `nrow/throttle < nthread`, [#5077](https://github.com/Rdatatable/data.table/issues/5077). With the default throttle of 1024 rows (see `?setDTthreads`), at least 64 threads would be needed to trigger the segfault since there needed to be more than 65,535 rows too. It occurred on a server with 256 logical cores where `data.table` uses 128 threads by default. Thanks to Bennet Becker for reporting, debugging at C level, and fixing. It also occurred when the throttle was increased so as to use fewer threads; e.g. at the limit `setDTthreads(throttle=nrow(DT))`.

## NOTES

1. New feature 29 in v1.12.4 (Oct 2019) introduced zero-copy coercion. Our thinking is that requiring you to get the type right in the case of `0` (type double) vs `0L` (type integer) is too inconvenient for you the user. So such coercions happen in `data.table` automatically without warning. Thanks to zero-copy coercion there is no speed penalty, even when calling `set()` many times in a loop, so there's no speed penalty to warn you about either. However, we believe that assigning a character value such as `"2"` into an integer column is more likely to be a user mistake that you would like to be warned about. The type difference (character vs integer) may be the only clue that you have selected the wrong column, or typed the wrong variable to be assigned to that column. For this reason we view character to numeric-like coercion differently and will warn about it. If it is correct, then the warning is intended to nudge you to wrap the RHS with `as.<type>()` so that it is clear to readers of your code that a coercion from character to that type is intended. For example :
Expand Down
14 changes: 14 additions & 0 deletions inst/tests/tests.Rraw
Original file line number Diff line number Diff line change
Expand Up @@ -17810,3 +17810,17 @@ setDF(d)
d[1:50, "a"] = d[51:100, "a"]
setDT(d)
test(2200, nrow(d[a==99]), 2L)

# segfault in forder when nrow/throttle<nth && ngrp>=255 && nrow>=65536; #5077
# Matt ran these on clang's ASAN+OpenMP which correctly faulted v1.14.0; these tests segfault consistently without ASAN too
set.seed(1)
DT = data.table(grp=sample(255L, 65536L ,replace=TRUE)) # >=255 && >=65536 necessary
setDTthreads(throttle=nrow(DT)) # increase throttle to reduce threads to 1 for this nrow
test(2201.1, nrow(DT[, .N, by=grp]), 255L)
test(2201.2, nrow(setkey(DT, grp)), 65536L)
set.seed(1)
DT = data.table(grp=sample(65536L)) # extra case with all size 1 groups too just for fun
test(2201.3, nrow(DT[, .N, by=grp]), 65536L)
test(2201.4, nrow(setkey(DT, grp)), 65536L)
setDTthreads() # restore default throttle

9 changes: 6 additions & 3 deletions src/forder.c
Original file line number Diff line number Diff line change
Expand Up @@ -714,10 +714,12 @@ SEXP forder(SEXP DT, SEXP by, SEXP retGrpArg, SEXP sortGroupsArg, SEXP ascArg, S
Rprintf(_("nradix=%d\n"), nradix);
#endif

nth = getDTthreads(nrow, true); // this nth is relied on in cleanup()
// global nth, TMP & UGRP
nth = getDTthreads(nrow, true); // this nth is relied on in cleanup(); throttle=true/false debated for #5077
TMP = (int *)malloc(nth*UINT16_MAX*sizeof(int)); // used by counting sort (my_n<=65536) in radix_r()
UGRP = (uint8_t *)malloc(nth*256); // TODO: align TMP and UGRP to cache lines (and do the same for stack allocations too)
if (!TMP || !UGRP /*|| TMP%64 || UGRP%64*/) STOP(_("Failed to allocate TMP or UGRP or they weren't cache line aligned: nth=%d"), nth);

if (retgrp) {
gs_thread = calloc(nth, sizeof(int *)); // thread private group size buffers
gs_thread_alloc = calloc(nth, sizeof(int));
Expand Down Expand Up @@ -1222,8 +1224,9 @@ void radix_r(const int from, const int to, const int radix) {
} else {
// all groups are <=65535 and radix_r() will handle each one single-threaded. Therefore, this time
// it does make sense to start a parallel team and there will be no nestedness here either.

if (retgrp) {
#pragma omp parallel for ordered schedule(dynamic) num_threads(getDTthreads(ngrp, false))
#pragma omp parallel for ordered schedule(dynamic) num_threads(MIN(nth, ngrp)) // #5077
for (int i=0; i<ngrp; i++) {
int start = from + starts[ugrp[i]];
radix_r(start, start+my_gs[i]-1, radix+1);
Expand All @@ -1232,7 +1235,7 @@ void radix_r(const int from, const int to, const int radix) {
}
} else {
// flush() is only relevant when retgrp==true so save the redundant ordered clause
#pragma omp parallel for schedule(dynamic) num_threads(getDTthreads(ngrp, false))
#pragma omp parallel for schedule(dynamic) num_threads(MIN(nth, ngrp)) // #5077
for (int i=0; i<ngrp; i++) {
int start = from + starts[ugrp[i]];
radix_r(start, start+my_gs[i]-1, radix+1);
Expand Down