port CJ to C #3596
Merged
1dd3c40
port CJ to C
bd1fea4
Add support for list input, add tests; also closes #3597
306043f
more robust logic for #3597
4f97972
attempt at pulling INTEGER etc calls outside the rows loop
571f434
get LHS out of loop as well
59e7b0e
split of corner cases to reduce overhead of trivial %, /
5b3ed27
more coverage, missed an edge case
d5b66d2
successfully parallelized where possible
ae3adaa
add explanatory comments for future readers
7d7461d
merge master
mattdowle 41bad11
moved CJ.R back to retain history and so we can see diff within it
mattdowle 970098f
Merge branch 'master' into cj_speedup
mattdowle 121b17b
merge master and tidy
mattdowle d9f3509
coverage
mattdowle c41c1a4
batches of increasing i; simpler code; deep loop bodies minimized
mattdowle ac08142
news item refocussed and moved up
mattdowle c16fa25
nrow==0 case moved down to C level for consistency
mattdowle
@@ -0,0 +1,68 @@
SJ = function(...) {
  JDT = as.data.table(list(...))
  setkey(JDT)
}
# S for Sorted, usually used in i to sort the i table

# TO DO?: Use the CJ list() replication method for SJ (inside as.data.table.list?, #2109) too to avoid alloc.col

CJ <- function(..., sorted = TRUE, unique = FALSE)
{
  # Pass in a list of unique values, e.g. ids and dates
  # Cross Join will then produce a join table with the combination of all values (cross product).
  # The last vector is varied the quickest in the table, so dates should be last for roll for example
  l = list(...)
  emptyList <- FALSE ## fix for #2511
  if(any(vapply_1i(l, length) == 0L)){
    ## at least one column is empty The whole thing will be empty in the end
    emptyList <- TRUE
    l <- lapply(l, "[", 0L)
  }
  if (unique && !emptyList) l = lapply(l, unique)

  dups = FALSE # fix for #1513
  ncol = length(l)
  if (ncol==1L && !emptyList) {
    if (sorted && length(o <- forderv(l[[1L]]))) out = list(l[[1L]][o])
    else out = list(l[[1L]])
    nrow = length(l[[1L]])
  } else if (ncol > 1L && !emptyList) {
    # using rep.int instead of rep speeds things up considerably (but attributes are dropped).
    n = vapply_1i(l, length) #lengths(l) will work from R 3.2.0 (also above)
    nrow = prod(n)
    if (nrow > .Machine$integer.max) {
      stop("Cross product of elements provided to CJ() would result in ",nrow," rows which exceeds .Machine$integer.max == ",.Machine$integer.max)
    }

    # apply sorting
    if (sorted) l = lapply(l, function(li) {
      # fix for #1513
      if (length(o <- forderv(li, retGrp=TRUE))) li = li[o]
      if (!dups) dups <<- attr(o, 'maxgrpn') > 1L
      return(li)
    })

    # standard [ method destroys attributes, so below
    # will keep attributes only for classes with methods that impose so
    attrib = lapply(l, attributes)
    out = .Call(Ccj, l)
    for (jj in 1:ncol) if (!is.null(attributes(l[[jj]]))) attributes(out[[jj]]) = attrib[[jj]]
  # ncol == 0 || emptyList
  } else {out = l; nrow = length(l[[1L]])}
  setattr(out, "row.names", .set_row_names(nrow))
  setattr(out, "class", c("data.table", "data.frame"))
  if (getOption("datatable.CJ.names", TRUE)) { # added as FALSE in v1.11.6 with NEWS item saying TRUE in v1.12.0. TODO: remove in v1.13.0
    vnames = name_dots(...)$vnames
  } else {
    if (is.null(vnames <- names(out))) vnames = paste0("V", seq_len(ncol))
    else if (any(tt <- vnames=="")) vnames[tt] = paste0("V", which(tt))
  }
  setattr(out, "names", vnames)

  alloc.col(out) # a tiny bit wasteful to over-allocate a fixed join table (column slots only), doing it anyway for consistency, and it's possible a user may wish to use SJ directly outside a join and would expect consistent over-allocation.
  if (sorted) {
    if (!dups) setattr(out, 'sorted', names(out))
    else setkey(out) # fix #1513
  }
  out
}
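
The R wrapper above rejects a cross product whose row count exceeds `.Machine$integer.max` before anything reaches C, which matters because the C side multiplies the lengths in a plain `int`. For readers who want the same guard at the C level, here is a minimal sketch. `cross_nrow` is a hypothetical helper, not part of data.table, and it assumes zero-length inputs have already been handled separately, as `CJ()` does with `emptyList`.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not data.table API): multiply the column lengths in a
 * 64-bit accumulator and report whether the product still fits in an int.
 * Returns 1 and stores the product in *out on success, 0 otherwise.
 * Assumes every length is positive; empty inputs are expected to be
 * handled by the caller beforehand. */
static int cross_nrow(const int *lens, size_t ncol, int *out) {
    int64_t n = 1;
    for (size_t j = 0; j < ncol; j++) {
        /* n <= INT_MAX here, so n * lens[j] cannot overflow int64_t */
        n *= lens[j];
        if (n > INT_MAX) return 0;
    }
    *out = (int)n;
    return 1;
}
```

Checking after every multiplication keeps the accumulator small enough that the 64-bit product itself can never wrap.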
@@ -0,0 +1,53 @@
#include "data.table.h"

SEXP cj(SEXP base_list) {
  int JJ = LENGTH(base_list), nprotect = 0;
  SEXP out = PROTECT(allocVector(VECSXP, JJ)); nprotect++;
  int NN = 1;
  for (int j = 0; j < JJ; j++) NN *= LENGTH(VECTOR_ELT(base_list, j));
  int div = NN, modulo;

  for (int j = 0; j < JJ; j++) {
    SEXP this_v = VECTOR_ELT(base_list, j);
    modulo = div;
    div = modulo/LENGTH(VECTOR_ELT(base_list, j));
    switch(TYPEOF(this_v)) {
    case LGLSXP: {
      SEXP this_col = PROTECT(allocVector(LGLSXP, NN)); nprotect++;
      for (int i = 0; i < NN; i++) {
        LOGICAL(this_col)[i] = LOGICAL(this_v)[(i % modulo) / div];
      }
      SET_VECTOR_ELT(out, j, this_col);
    }
    break;
    case INTSXP: {
      SEXP this_col = PROTECT(allocVector(INTSXP, NN)); nprotect++;
      for (int i = 0; i < NN; i++) {
        INTEGER(this_col)[i] = INTEGER(this_v)[(i % modulo) / div];
      }
      SET_VECTOR_ELT(out, j, this_col);
    }
    break;
    case REALSXP: {
      SEXP this_col = PROTECT(allocVector(REALSXP, NN)); nprotect++;
      for (int i = 0; i < NN; i++) {
        REAL(this_col)[i] = REAL(this_v)[(i % modulo) / div];
      }
      SET_VECTOR_ELT(out, j, this_col);
    }
    break;
    case STRSXP: {
      SEXP this_col = PROTECT(allocVector(STRSXP, NN)); nprotect++;
      for (int i = 0; i < NN; i++) {
        SET_STRING_ELT(this_col, i, STRING_ELT(this_v, (i % modulo) / div));
      }
      SET_VECTOR_ELT(out, j, this_col);
    } break;
    default:
      error("Type '%s' not supported by CJ.", type2char(TYPEOF(this_v)));
    }
  }

  UNPROTECT(nprotect);
  return(out);
}
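
The indexing expression `(i % modulo) / div` in the diff above is what makes the last column vary fastest. The following is a standalone sketch of the same arithmetic on plain `int` arrays, without the R API, so it can be compiled and stepped through on its own. `cross_join_int` is a hypothetical name, and the sketch assumes all lengths are positive and that the caller has allocated each output column with room for the full product.

```c
/* Hypothetical standalone version of the loop in cj(): fills out[j][0..nn-1]
 * for each column j and returns nn, the number of rows.
 * Assumes ncol >= 1, every lens[j] > 0, and the product fits in an int. */
static int cross_join_int(const int **cols, const int *lens, int ncol, int **out) {
    int nn = 1;
    for (int j = 0; j < ncol; j++) nn *= lens[j];

    int div = nn;
    for (int j = 0; j < ncol; j++) {
        int modulo = div;        /* product of lengths of columns j..last   */
        div = modulo / lens[j];  /* product of lengths of columns after j   */
        for (int i = 0; i < nn; i++) {
            /* position within the current cycle, scaled down to an index */
            out[j][i] = cols[j][(i % modulo) / div];
        }
    }
    return nn;
}
```

With inputs `{1, 2}` and `{10, 20, 30}`, the first column comes out as `1,1,1,2,2,2` and the second as `10,20,30,10,20,30`: each value of the first column is held for `div = 3` rows, while the second column cycles every row.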
do I need to UNPROTECT in an error branch?
That's one nice thing: R will clean up all objects on error so you don't need UNPROTECT.
Yay -- you're into C! Party!
We've been taking R API usage outside loops recently since R 3.5 added overhead. So take the REAL(), INTEGER() and LOGICAL() calls outside (see other C code in data.table for examples but look at files that have been more recently revised).
Only the REAL calls? not INTEGER/LOGICAL?
REAL, INTEGER and LOGICAL
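
To illustrate the review suggestion, here is a minimal standalone sketch of hoisting accessor calls out of the row loop. It deliberately does not use the R API: `vec` and `vec_int()` are hypothetical stand-ins for an integer vector and for `INTEGER()`, and the call counter exists only to make the effect visible. In the real code the equivalent would be to take `INTEGER(this_v)` and `INTEGER(this_col)` once into local pointers before the `for` loop.

```c
/* Hypothetical stand-in for an R integer vector (not the real SEXP layout). */
typedef struct {
    int length;
    int *data;
} vec;

/* Counts accessor calls so the cost of calling inside a loop can be observed. */
static int accessor_calls = 0;

/* Hypothetical stand-in for INTEGER(): returns the data pointer. */
static int *vec_int(const vec *v) {
    accessor_calls++;
    return v->data;
}

/* Accessors are called once each, before the loop, instead of twice per row. */
static void fill_hoisted(vec *dst, const vec *src, int modulo, int div) {
    const int *sp = vec_int(src);
    int *dp = vec_int(dst);
    const int nn = dst->length;
    for (int i = 0; i < nn; i++) {
        dp[i] = sp[(i % modulo) / div];
    }
}
```

For a six-row output this version makes two accessor calls in total, where calling the accessor inside the loop body as in the diff would make twelve.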