
port CJ to C #3596

Merged: 17 commits merged into master, Jun 17, 2019
Conversation

@MichaelChirico (Member, Author) commented May 25, 2019

Also closes #3597

Putting on my C training wheels... guidance on ways to improve appreciated 🙌

Follow-up to https://github.com/orgs/Rdatatable/teams/project-members/discussions/17

Benchmarking with setDTthreads(2L)

setDTthreads(2L)
system.time(DT <- CJ(1:2, 1:3, 1:4, 1:5, 1:6, 1:7, 1:8, 1:9, 1:10, 1:11))
  #  user  system elapsed 
  # 2.823   0.578   1.707 
nrow(DT)
rm(DT); gc()
system.time(CJ_newR(1:2, 1:3, 1:4, 1:5, 1:6, 1:7, 1:8, 1:9, 1:10, 1:11))
 #   user  system elapsed 
 # 16.663   2.202  15.514
gc()
system.time(CJ_old(1:2, 1:3, 1:4, 1:5, 1:6, 1:7, 1:8, 1:9, 1:10, 1:11))
  #  user  system elapsed 
  # 5.959   0.637   1.953 

Parallelization is working well here:

setDTthreads(8L)
system.time(DT <- CJ(1:2, 1:3, 1:4, 1:5, 1:6, 1:7, 1:8, 1:9, 1:10, 1:11))
  #  user  system elapsed 
  # 5.506   1.215   0.911 

For the record, the speed of the pure-R CJ is really quite impressive! It's basically the same speed as allocating a list of empty vectors with that number of rows:

system.time(replicate(10L, numeric(39916800L), simplify = FALSE))
#    user  system elapsed 
#   0.899   0.912   1.891 
system.time(CJ_old(1:2, 1:3, 1:4, 1:5, 1:6, 1:7, 1:8, 1:9, 1:10, 1:11))
#    user  system elapsed 
#   5.994   0.648   1.920 

codecov bot commented May 25, 2019

Codecov Report

Merging #3596 into master will increase coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master    #3596      +/-   ##
==========================================
+ Coverage   98.23%   98.24%   +<.01%     
==========================================
  Files          66       67       +1     
  Lines       12928    12972      +44     
==========================================
+ Hits        12700    12744      +44     
  Misses        228      228
Impacted Files   Coverage Δ
src/init.c       100% <ø> (ø) ⬆️
R/setkey.R       98.69% <100%> (-0.06%) ⬇️
src/cj.c         100% <100%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

src/cj.c (outdated diff):
SET_VECTOR_ELT(out, j, this_col);
} break;
default:
error("Type '%s' not supported by CJ.", type2char(TYPEOF(this_v)));
@MichaelChirico (Member, Author):

do I need to UNPROTECT in an error branch?

@mattdowle (Member), May 25, 2019:

That's one nice thing: R will clean up all objects on error so you don't need UNPROTECT.
Yay -- you're into C! Party!
We've been moving R API calls outside loops recently, since R 3.5 added overhead to them. So take the REAL(), INTEGER() and LOGICAL() calls outside the loop (see other C code in data.table for examples, but look at files that have been revised more recently).

@MichaelChirico (Member, Author):

Only the REAL calls? not INTEGER/LOGICAL?

@mattdowle (Member):

REAL, INTEGER and LOGICAL
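
For illustration, a minimal sketch of the pattern being suggested (function and variable names here are hypothetical, not the PR's code): fetch the INTEGER()/REAL()/LOGICAL() pointer once before the loop and index through it, instead of calling the accessor on every iteration.

#include <Rinternals.h>

// Sketch: recycle an integer vector x into a pre-allocated integer vector out.
// Since R 3.5, each INTEGER()/REAL()/LOGICAL() call carries some overhead,
// so hoist it out of the loop rather than writing INTEGER(out)[i] = INTEGER(x)[...].
void fill_recycle(SEXP x, SEXP out) {
  const R_xlen_t n = xlength(out), nx = xlength(x);
  const int *xp = INTEGER(x);   // fetched once, read-only
  int *outp = INTEGER(out);     // fetched once, written in the loop
  for (R_xlen_t i = 0; i < n; i++) outp[i] = xp[i % nx];
}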

R/CJ.R (outdated diff):
}

# apply sorting
if (sorted && any(idx <- vapply_1b(l, is.list))) stop("'sorted' is TRUE but element ", which(idx), " is a list, which can't be sorted; try setting sorted = FALSE")
@MichaelChirico (Member, Author):

So it's not missed because of the diff from moving CJ to its own file -- this is the part that closes #3597.

@mattdowle (Member):

Quick suggestion: use Rprof() to confirm the timing you've shown is all spent in the C code. It's feasible there's some surprising overhead at the R level before it gets to C. Best to double-check.

src/cj.c (outdated diff):
case INTSXP: {
SEXP this_col = PROTECT(allocVector(INTSXP, NN)); nprotect++;
int *this_v_ptr = INTEGER(this_v);
int *this_col_ptr = INTEGER(this_col);
@MichaelChirico (Member, Author):

@mattdowle is this the proper way? (also for VECSXP/STRSXP cases it seems there's no analog?)

@mattdowle (Member):

Correct. You can't do it for VECSXP/STRSXP because that would be writing "behind the write barrier". You can partially do it, though, by reading (RHS) but not writing (LHS =).
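
For illustration, a hedged sketch of the "partially" part for the STRSXP case (hypothetical names, not the PR's code): reads can go through STRING_ELT(), but every write must go through SET_STRING_ELT() so the garbage collector's write barrier sees it.

#include <Rinternals.h>

// Sketch: recycle a character vector x into a pre-allocated character vector out.
// Reading (RHS) with STRING_ELT is fine; writing (LHS) must use SET_STRING_ELT
// because a plain pointer assignment would bypass the write barrier.
void fill_str(SEXP x, SEXP out) {
  const R_xlen_t n = xlength(out), nx = xlength(x);
  for (R_xlen_t i = 0; i < n; i++) {
    SET_STRING_ELT(out, i, STRING_ELT(x, i % nx));
  }
}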

@mattdowle (Member), May 25, 2019:

I'm off to bed now and it's a long weekend here -- so have fun in C! Where did you get the sense that CJ() was slow and that it could be sped up -- is the benchmark you put in this PR what motivated you, or is there another one?
The idea from way back was that CJ() would not actually materialize the full table. It would just mark itself as a "CJ" object and then bmerge would know how to recycle from the irregular, non-expanded structure appropriately without needing to expand it.

@MichaelChirico (Member, Author):

No benchmark, just from staring at the code now and then and thinking "there's gotta be a faster way", and then happening upon the modulo/div version here in my scratch work on an unrelated issue. I'm still surprised how fast it is... rep.int is a beast. I guess rep.int is doing most of the work, and the recursion is only touching scalars instead of the full data (so it's not slowing things down).

The alt-rep-y form does sound promising (certainly way more memory efficient), though I do often use CJ outside of i, for populating cross-join tables for other reasons.

@MichaelChirico (Member, Author):

enjoy the weekend!

@mattdowle (Member), May 25, 2019:

Yep: CJ uses rep.int, which is implemented in C and is an .Internal() R function too (which makes it fast to call in a loop). So really you're trying to beat C with C. Rprof on the current CJ might actually be misleading, as I think someone said Rprof doesn't count .Internal() R functions.
Are you definitely getting the same result in the same order? Doing "(i % modulo) / div" on every single iteration is inefficient because it's really just batches of increasing i. It's possible that all those % and / add up in this case (it's a contiguous assign, so cache effects shouldn't come into play).
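
To make "batches of increasing i" concrete, here is a hedged sketch of the two fill strategies for one integer column (hypothetical names, not the PR's code): each of the nv source values repeats div times consecutively, and that block of nv*div rows repeats until the NN output rows are filled, so the % and / can be moved out of the innermost loop.

#include <Rinternals.h>

// Per-element form: one % and one / per output row.
void fill_modulo(int *out, const int *v, R_xlen_t NN, R_xlen_t div, R_xlen_t nv) {
  const R_xlen_t modulo = nv * div;   // rows in one full repeat of this column's pattern
  for (R_xlen_t i = 0; i < NN; i++) out[i] = v[(i % modulo) / div];
}

// Batched form: i only ever increases, so write runs of length div instead.
// Assumes NN is a multiple of nv*div, as it is for a cross join.
void fill_batched(int *out, const int *v, R_xlen_t NN, R_xlen_t div, R_xlen_t nv) {
  R_xlen_t i = 0;
  while (i < NN) {
    for (R_xlen_t k = 0; k < nv; k++)       // each source value ...
      for (R_xlen_t r = 0; r < div; r++)    // ... repeated div times
        out[i++] = v[k];
  }
}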

@MichaelChirico (Member, Author):

"it's really just batches of increasing i"

Not sure I follow this.

The latest commit splits out the first and last columns as separate cases to skip one of the modulo/division operations; that gave about a 10% boost.

@MichaelChirico (Member, Author):

Updated timing after moving R API calls out of the loop as suggested. Getting close to matching the pure R version on current master.

@MichaelChirico (Member, Author):

I can't make heads or tails of the Rprof output, though it does look like the new version is (slightly) more memory efficient:

Rprof(interval = .005, memory.profiling = TRUE)
CJ(1:2, 1:3, 1:4, 1:5, 1:6, 1:7, 1:8, 1:9, 1:10, 1:11)
Rprof(NULL)
summaryRprof(memory = 'both')
# $by.self
#                 self.time self.pct total.time total.pct mem.total
# "CJ"                3.195    99.22      3.200     99.38    1370.5
# "print.default"     0.010     0.31      0.010      0.31       0.2
# "[.data.table"      0.005     0.16      0.005      0.16       0.3
# "options"           0.005     0.16      0.005      0.16       0.3
# "pmatch"            0.005     0.16      0.005      0.16       0.0
# 
# $by.total
#                     total.time total.pct mem.total self.time self.pct
# "CJ"                     3.200     99.38    1370.5     3.195    99.22
# "<Anonymous>"            0.020      0.62       0.7     0.000     0.00
# "print.data.table"       0.020      0.62       0.7     0.000     0.00
# "print.default"          0.010      0.31       0.2     0.010     0.31
# "print"                  0.010      0.31       0.2     0.000     0.00
# "[.data.table"           0.005      0.16       0.3     0.005     0.16
# "options"                0.005      0.16       0.3     0.005     0.16
# "pmatch"                 0.005      0.16       0.0     0.005     0.16
# ".deparseOpts"           0.005      0.16       0.0     0.000     0.00
# "["                      0.005      0.16       0.3     0.000     0.00
# "char.trunc"             0.005      0.16       0.3     0.000     0.00
# "deparse"                0.005      0.16       0.0     0.000     0.00
# "do.call"                0.005      0.16       0.3     0.000     0.00
# "format.data.table"      0.005      0.16       0.3     0.000     0.00
# "format"                 0.005      0.16       0.3     0.000     0.00
# "FUN"                    0.005      0.16       0.3     0.000     0.00
# "lapply"                 0.005      0.16       0.3     0.000     0.00
# "name_dots"              0.005      0.16       0.0     0.000     0.00
# "rbindlist"              0.005      0.16       0.3     0.000     0.00
# "suppressWarnings"       0.005      0.16       0.3     0.000     0.00
# "tail.data.table"        0.005      0.16       0.3     0.000     0.00
# "tail"                   0.005      0.16       0.3     0.000     0.00
# 
# $sample.interval
# [1] 0.005
# 
# $sampling.time
# [1] 3.22

Rprof(interval = .005, memory.profiling = TRUE)
CJ_old(1:2, 1:3, 1:4, 1:5, 1:6, 1:7, 1:8, 1:9, 1:10, 1:11)
Rprof(NULL)
summaryRprof(memory = 'both')
# $by.self
#                 self.time self.pct total.time total.pct mem.total
# "CJ_old"            6.295    99.76      6.300     99.84    1472.0
# "[.data.table"      0.005     0.08      0.005      0.08       0.2
# "%in%"              0.005     0.08      0.005      0.08       0.1
# "print.default"     0.005     0.08      0.005      0.08       0.5
# 
# $by.total
#                    total.time total.pct mem.total self.time self.pct
# "CJ_old"                6.300     99.84    1472.0     6.295    99.76
# "<Anonymous>"           0.010      0.16       0.7     0.000     0.00
# "print.data.table"      0.010      0.16       0.7     0.000     0.00
# "[.data.table"          0.005      0.08       0.2     0.005     0.08
# "%in%"                  0.005      0.08       0.1     0.005     0.08
# "print.default"         0.005      0.08       0.5     0.005     0.08
# "["                     0.005      0.08       0.2     0.000     0.00
# "deparse"               0.005      0.08       0.1     0.000     0.00
# "head.data.table"       0.005      0.08       0.2     0.000     0.00
# "head"                  0.005      0.08       0.2     0.000     0.00
# "mode"                  0.005      0.08       0.1     0.000     0.00
# "name_dots"             0.005      0.08       0.1     0.000     0.00
# "print"                 0.005      0.08       0.5     0.000     0.00
# "rbindlist"             0.005      0.08       0.2     0.000     0.00
# 
# $sample.interval
# [1] 0.005
# 
# $sampling.time
# [1] 6.31

@mattdowle (Member):

It's a good case to try parallelizing. Follow the #omp pragma examples in reorder.c or subset.c, maybe. Each column of the CJ() result can be done independently in parallel. There isn't any random cache access (it's all sequential) so you should get good speedups; e.g. 2 threads should be close to twice as fast in this case. Can't go parallel for strings in this case, but logical, integer and real should be good (and factor is integer).
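
For illustration, a hedged sketch of a row-parallel fill for one integer column, loosely in the style of the #omp pragmas used elsewhere in data.table (hypothetical names; nth stands in for data.table's getDTthreads()). Every iteration writes a distinct out[i], so the writes don't conflict and each thread's chunk is accessed sequentially.

#include <Rinternals.h>

void fill_parallel(int *out, const int *v, R_xlen_t NN, R_xlen_t div, R_xlen_t nv, int nth) {
  const R_xlen_t modulo = nv * div;
  #pragma omp parallel for num_threads(nth)
  for (R_xlen_t i = 0; i < NN; i++) {
    out[i] = v[(i % modulo) / div];   // disjoint writes: safe without synchronization
  }
}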

@MichaelChirico (Member, Author):

"Can't go parallel for strings in this case"

I think I'm not following how to parallelize across columns then... first iterate over the columns to see which can be done in parallel, then batch those out?

Maybe parallelizing over rows is the easier route?

@MichaelChirico (Member, Author):

OK, I tried parallelizing across rows in the INTEGER/REAL/LOGICAL branches, and I think it worked well. With 2 threads we equal the speed of the current implementation and still get the 10% memory savings. With more threads we do better than the current version.

Since what's really been parallelized is the row index, are we safe to include the VECSXP and STRSXP cases in the parallel region as well?

@MichaelChirico (Member, Author):

Now that the basic idea is working and improves on the current implementation, should we try to bring more of the logic into C? Not sure there's much benefit...

MichaelChirico changed the title from "port CJ to C" to "port CJ to C; also closes #3597" on May 25, 2019
@jangorecki (Member):

very nice PR

src/cj.c (outdated diff):
if (j == 0) {
#pragma omp parallel for num_threads(getDTthreads())
for (int i = 0; i < NN; i++) {
this_col_ptr[i] = this_v_ptr[i / div];
@jangorecki (Member), May 25, 2019:

You iterate over all rows of the output; maybe we could memcpy blocks of rows instead? Then iterate over blocks, which can also be done in parallel, one block per thread. Whether we can do that depends on the expected order of results.

CJ(1:3, 1:5)

using pseudo code

for (i=0; i<3; i++) memcpy(addr_of_res + i*size_of_int*5, addr_of(1:5), size_of_int*5)

i*size_of_int*5 is the memory offset: we're writing to the same vector but at different locations in it, which AFAIK is safe even from multiple threads because the blocks don't overlap.
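
For illustration, a hedged C sketch of the block-memcpy idea for the CJ(x, y) example (hypothetical names, not the PR's code): the y column of the result is just nx back-to-back copies of y, and each copy writes a disjoint region, so the outer loop could also carry an omp pragma with one block per thread.

#include <string.h>
#include <Rinternals.h>

// Sketch: fill the y column of CJ(x, y), where nx = length(x), ny = length(y),
// and out_y has length nx*ny. Blocks never overlap, so this is thread-safe.
void fill_y_blocks(int *out_y, const int *y, R_xlen_t nx, R_xlen_t ny) {
  for (R_xlen_t i = 0; i < nx; i++) {
    memcpy(out_y + i * ny, y, ny * sizeof(int));
  }
}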

@MichaelChirico (Member, Author):

Not quite sure I follow you here, sorry

@jangorecki (Member), May 25, 2019:

Here is R code for that:

library(data.table)
x = 1:3; nx = length(x)
y = 1:5; ny = length(y)
ans_y = vector("integer", nx*ny)
# pragma here
for (i in seq.int(nx)) ans_y[seq.int(ny)+(i-1L)*length(y)] = y

all.equal(CJ(x, y)$y, ans_y)
#[1] TRUE

@MichaelChirico (Member, Author):

Ohh, gotcha -- just loop over y once, figure out where each copy goes, and copy it all at once. Makes sense.

@jangorecki (Member):

Looks trivial for 2 columns; for 3+ columns it gets slightly more complicated, but I believe this is the way to go if we want to have that in C optimally:

CJ(1:2, 1:3, 1:4)
      V1    V2    V3
    <int> <int> <int>
 1:     1     1     1
 2:     1     1     2
 3:     1     1     3
 4:     1     1     4
 5:     1     2     1
 6:     1     2     2
 7:     1     2     3
 8:     1     2     4

mattdowle changed the title from "port CJ to C; also closes #3597" to "port CJ to C" on May 29, 2019
@mattdowle (Member) commented May 29, 2019:

Please don't move CJ() at the R level into its own file. This breaks the commit history: CJ.R looks like new code and we can't see what changed inside it. If we did this for all functions then we'd have a ton of files. It's not perfect currently, but I don't struggle with the way things are organized: fairly similar functions sit together in one file. I can move CJ.R back. cj.c, on the other hand, is new and appropriate as its own file. And in the end the R-level CJ() will become much smaller as it is moved down into cj.c, so it makes sense to keep the (shrinking) CJ() at the R level where it is (especially as we'll want to see in the diffs how it has shrunk).

mattdowle added this to the 1.12.4 milestone on Jun 15, 2019
@mattdowle (Member) commented Jun 15, 2019:

With c41c1a4 and the default 50% threads (4 out of 8 on my laptop):

> system.time(DT <- CJ(1:2, 1:3, 1:4, 1:5, 1:6, 1:7, 1:8, 1:9, 1:10, 1:11))
   user  system elapsed   # v1.12.2
  0.966   0.288   1.216 

   user  system elapsed   # now
  0.212   0.434   0.205

Would be good to compare character type too.

@mattdowle (Member):

With character ids:

ids = as.vector(outer(LETTERS, LETTERS, paste0))
system.time(DT <- CJ(ids, 1:72, 1:73, 1:74))   # 5GB; 250m rows
#   user  system elapsed   
#  3.030   0.950   3.963  # v1.12.2
#  1.563   1.298   1.910  # now

ids = as.factor(ids)
system.time(DT <- CJ(ids, 1:72, 1:73, 1:74))
#   user  system elapsed
#  2.334   0.833   3.152  # v1.12.2
#  0.513   1.122   0.409  # now

mattdowle merged commit c9c8d09 into master on Jun 17, 2019
mattdowle deleted the cj_speedup branch on June 17, 2019, 20:02
@MichaelChirico (Member, Author):

It just struck me to worry about how this interacts with ALTREP -- are we covered there? Particularly since CJ will often be used like CJ(1:10, 1:20)...

@mattdowle (Member):

Yes, it should be fine. INTEGER(), REAL() etc. auto-expand ALTREPs. To work with ALTREPs without expanding them, you have to switch to different R API calls. In the case of CJ() that's definitely not worth it because the work grows as the product of the lengths.
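
For context, a hedged illustration of what "different R API calls" could mean (not what data.table does in this PR): element-wise accessors such as INTEGER_ELT() can read from a compact ALTREP sequence like 1:10 one value at a time, whereas taking the INTEGER() pointer forces the full vector to be materialized.

#include <Rinternals.h>

// Sketch only: sum an integer vector without forcing ALTREP expansion.
// INTEGER(x) would materialize a compact sequence; INTEGER_ELT(x, i) need not.
long long sum_elts(SEXP x) {
  long long s = 0;
  const R_xlen_t n = xlength(x);
  for (R_xlen_t i = 0; i < n; i++) s += INTEGER_ELT(x, i);
  return s;
}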

Successfully merging this pull request may close these issues:

CJ can handle list() input if sorted=FALSE but fails confusingly