-
Notifications
You must be signed in to change notification settings - Fork 991
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
port CJ to C #3596
port CJ to C #3596
Conversation
fixed header
096bbef
to
1dd3c40
Compare
Codecov Report
@@ Coverage Diff @@
## master #3596 +/- ##
==========================================
+ Coverage 98.23% 98.24% +<.01%
==========================================
Files 66 67 +1
Lines 12928 12972 +44
==========================================
+ Hits 12700 12744 +44
Misses 228 228
Continue to review full report at Codecov.
|
src/cj.c
Outdated
SET_VECTOR_ELT(out, j, this_col); | ||
} break; | ||
default: | ||
error("Type '%s' not supported by CJ.", type2char(TYPEOF(this_v))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do I need to UNPROTECT
in an error branch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's one nice thing: R will clean up all objects on error so you don't need UNPROTECT.
Yay -- you're into C! Party!
We've been taking R API usage outside loops recently since R 3.5 added overhead. So take the REAL(), INTEGER() and LOGICAL() calls outside (see other C code in data.table for examples but look at files that have been more recently revised).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only the REAL
calls? not INTEGER
/LOGICAL
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
REAL, INTEGER and LOGICAL
R/CJ.R
Outdated
} | ||
|
||
# apply sorting | ||
if (sorted && any(idx <- vapply_1b(l, is.list))) stop("'sorted' is TRUE but element ", which(idx), " is a list, which can't be sorted; try setting sorted = FALSE") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so it's not missed because of the diff from moving CJ to its own file -- this is the part that closes #3597
quick suggestion: use |
src/cj.c
Outdated
case INTSXP: { | ||
SEXP this_col = PROTECT(allocVector(INTSXP, NN)); nprotect++; | ||
int *this_v_ptr = INTEGER(this_v); | ||
int *this_col_ptr = INTEGER(this_col); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mattdowle is this the proper way? (also for VECSXP
/STRSXP
cases it seems there's no analog?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. You can't do it for VECSXP/STRSXP because that would be what's known as "behind the write barrier". You can partially do it though by reading (RHS) but not writing (LHS =).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm off to bed now and it's a long weekend here -- so have fun in C! Where did you get the sense that CJ() was slow and that it could be sped up -- is the benchmark you put in this PR what motivated you or is there another one?
The idea from way-back was that CJ() would not actually materialize the full table. It would just mark itself as "CJ" object and then bmerge would know how to recycle from the irregular non-expanded structure appropriately without needing to expand it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No benchmark, just from staring at the code now and then and thinking "there's gotta be a faster way", and then happening on the modulo/div version here in my scratch work on an unrelated issue but alas. I'm still surprised how fast it is... rep.int
is a beast. I guess rep.int
is doing most of the work, and the recursion is only touching scalars instead of the full data (so it's not slowing down).
The alt-rep-y form does sound promising (certainly way more memory efficient), though I do use CJ
often outside of i
for populating the crossjoin tables for other reasons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
enjoy the weekend!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep: CJ uses rep.int which is implemented in C and an .Internal() R function too (which makes it fast to call in a loop). So really your're trying to beat C with C. Rprof
on the current CJ might be misleading actually as I think someone said Rprof
doesn't count .Internal() R functions.
Are you definitely getting the same result in the same order? Doing "(i % modulo) / div" on every single iteration is inefficient because it's really just batches of increasing i. It's possible that all those % and / add up in this case (it's contiguous assign so cache effects shouldn't come into play).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's really just batches of increasing i
Not sure I follow this.
Latest commit splits out the first and last column as separate cases to skip one of the %
//
operations, gave about 10% boost.
Updated timing after moving R API calls out of the loop as suggested. Getting close to matching the pure R version on current master. |
I can't make heads or tails of the
|
It's a good case to try parallelizing. Follow the #omp pragma examples in reorder.c or subset.c maybe. Each column of CJ() result can be done independently in parallel. There isn't any random cache cache (all sequential) so you should get good speedups; e.g. 2 threads should be close to twice as fast in this case. Can't go parallel for strings in this case, but logical, integer and real should be good (and factor is integer). |
I think I'm not following how to parallelize across columns then... first iterate over the columns to see which can be done in parallel, then batch it out? Maybe parallelizing rows is the easier route? |
OK I tried parallelizing across rows on the Since really what's been parallelized is the row index, are we safe to include the |
Now that the basic idea is working & improves the current implementation, should we try and bring more of the logic into C? Not sure there's much benefit... |
very nice PR |
src/cj.c
Outdated
if (j == 0) { | ||
#pragma omp parallel for num_threads(getDTthreads()) | ||
for (int i = 0; i < NN; i++) { | ||
this_col_ptr[i] = this_v_ptr[i / div]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You iterate over all rows of the output, maybe we could do memcpy
of blocks of rows? then iterate on blocks, this can be in parallel also. One block per thread. If we can do that it actually depends on the expected order of results.
CJ(1:3, 1:5)
using pseudo code
for (i=0, i<3; i++) memcpy(add_of_res()+i*size_of_int*5, addr_of(1:5), size_of_int*5)
i*size_of_int*5
is memory offset writing to same vector, but different location in it, AFAIK safe even from multiple threads because blocks will not overlap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not quite sure I follow you here, sorry
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here is R code for that
library(data.table)
x = 1:3; nx = length(x)
y = 1:5; ny = length(y)
ans_y = vector("integer", nx*ny)
# pragma here
for (i in seq.int(nx)) ans_y[seq.int(ny)+(i-1L)*length(y)] = y
all.equal(CJ(x, y)$y, ans_y)
#[1] TRUE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ohh gotcha. just loop over y once, figure out where each element goes, and copy all at once, makes sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks trivial for 2 columns, for 3+ columns gets slightly more complicated, but I believe this is the way to go if we want to have that in C optimally
CJ(1:2, 1:3, 1:4)
V1 V2 V3
<int> <int> <int>
1: 1 1 1
2: 1 1 2
3: 1 1 3
4: 1 1 4
5: 1 2 1
6: 1 2 2
7: 1 2 3
8: 1 2 4
Please don't move CJ() at R level into its own file. This breaks the commit history: CJ.R looks like new code and we can't see what changed inside it. If we did this for all functions then we'd have a ton of files. It's not perfect currently, but I don't struggle with the way things are organized currently: fairly similar functions being together in one file. I can move CJ.R back. |
With c41c1a4 and default 50% threads (4 out of 8 on my laptop) :
Would be good to compare character type too. |
With character ids : ids = as.vector(outer(LETTERS, LETTERS, paste0))
system.time(DT <- CJ(ids, 1:72, 1:73, 1:74)) # 5GB; 250m rows
# user system elapsed
# 3.030 0.950 3.963 # v1.12.2
# 1.563 1.298 1.910 # now
ids = as.factor(ids)
system.time(DT <- CJ(ids, 1:72, 1:73, 1:74))
# user system elapsed
# 2.334 0.833 3.152 # v1.12.2
# 0.513 1.122 0.409 # now |
It just struck me to worry about how this interacts with ALTREP, are we covered there? Particularly since |
Yes should be fine. INTEGER(), REAL() etc auto-expand ALTREPs. To work with ALTREPs without expanding them, you have to switch to use different R API calls. In this case of |
Also closes #3597
Putting on my C training wheels... guidance on ways to improve appreciated 🙌
Follow-up to https://github.com/orgs/Rdatatable/teams/project-members/discussions/17
Benchmarking with
setDTthreads(2L)
Paralellization working well here:
For the record the speed of the pure-R
CJ
is really quite impressive! it's basically the same speed as declaring an empty list with that number of rows: