
port CJ to C #3596

Merged: 17 commits merged into master, Jun 17, 2019
Conversation

@MichaelChirico (Member, Author) commented May 25, 2019

Also closes #3597

Putting on my C training wheels... guidance on ways to improve appreciated 🙌

Follow-up to https://github.com/orgs/Rdatatable/teams/project-members/discussions/17

Benchmarking with setDTthreads(2L)

setDTthreads(2L)
system.time(DT <- CJ(1:2, 1:3, 1:4, 1:5, 1:6, 1:7, 1:8, 1:9, 1:10, 1:11))
  #  user  system elapsed 
  # 2.823   0.578   1.707 
nrow(DT)
rm(DT); gc()
system.time(CJ_newR(1:2, 1:3, 1:4, 1:5, 1:6, 1:7, 1:8, 1:9, 1:10, 1:11))
 #   user  system elapsed 
 # 16.663   2.202  15.514
gc()
system.time(CJ_old(1:2, 1:3, 1:4, 1:5, 1:6, 1:7, 1:8, 1:9, 1:10, 1:11))
  #  user  system elapsed 
  # 5.959   0.637   1.953 

Parallelization is working well here:

setDTthreads(8L)
system.time(DT <- CJ(1:2, 1:3, 1:4, 1:5, 1:6, 1:7, 1:8, 1:9, 1:10, 1:11))
  #  user  system elapsed 
  # 5.506   1.215   0.911 

For the record, the speed of the pure-R CJ is really quite impressive! It's basically the same speed as allocating a list of empty vectors with that number of rows:

system.time(replicate(10L, numeric(39916800L), simplify = FALSE))
#    user  system elapsed 
#   0.899   0.912   1.891 
system.time(CJ_old(1:2, 1:3, 1:4, 1:5, 1:6, 1:7, 1:8, 1:9, 1:10, 1:11))
#    user  system elapsed 
#   5.994   0.648   1.920 

codecov bot commented May 25, 2019

Codecov Report

Merging #3596 into master will increase coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master    #3596      +/-   ##
==========================================
+ Coverage   98.23%   98.24%   +<.01%     
==========================================
  Files          66       67       +1     
  Lines       12928    12972      +44     
==========================================
+ Hits        12700    12744      +44     
  Misses        228      228
Impacted Files   Coverage Δ
src/init.c       100% <ø> (ø) ⬆️
R/setkey.R       98.69% <100%> (-0.06%) ⬇️
src/cj.c         100% <100%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

src/cj.c (outdated diff):
SET_VECTOR_ELT(out, j, this_col);
} break;
default:
error("Type '%s' not supported by CJ.", type2char(TYPEOF(this_v)));
@MichaelChirico (Member, Author):

do I need to UNPROTECT in an error branch?

@mattdowle (Member), May 25, 2019:

That's one nice thing: R will clean up all objects on error so you don't need UNPROTECT.
Yay -- you're into C! Party!
We've been moving R API calls outside loops recently, since R 3.5 added overhead to them. So take the REAL(), INTEGER() and LOGICAL() calls outside the loop (see other C code in data.table for examples, but look at files that have been revised more recently).

@MichaelChirico (Member, Author):

Only the REAL calls? not INTEGER/LOGICAL?

@mattdowle (Member):

REAL, INTEGER and LOGICAL
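
For illustration, a minimal sketch of the pattern being suggested (function and variable names here are hypothetical, not the PR's code): fetch the INTEGER()/REAL()/LOGICAL() pointer once before the loop and index through it, instead of calling the accessor on every iteration.

#include <Rinternals.h>

// Sketch: recycle an integer vector x into a pre-allocated integer vector out.
// Since R 3.5, each INTEGER()/REAL()/LOGICAL() call carries some overhead,
// so hoist it out of the loop rather than writing INTEGER(out)[i] = INTEGER(x)[...].
void fill_recycle(SEXP x, SEXP out) {
  const R_xlen_t n = xlength(out), nx = xlength(x);
  const int *xp = INTEGER(x);   // fetched once, read-only
  int *outp = INTEGER(out);     // fetched once, written in the loop
  for (R_xlen_t i = 0; i < n; i++) outp[i] = xp[i % nx];
}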

R/CJ.R (outdated diff):
}

# apply sorting
if (sorted && any(idx <- vapply_1b(l, is.list))) stop("'sorted' is TRUE but element ", which(idx), " is a list, which can't be sorted; try setting sorted = FALSE")
@MichaelChirico (Member, Author):

So it's not missed because of the diff from moving CJ to its own file -- this is the part that closes #3597.

@mattdowle (Member):

Quick suggestion: use Rprof() to confirm the timing you've shown is all spent in the C code. It's feasible there's some surprising overhead at the R level before it gets to C. Best to double-check.

src/cj.c (outdated diff):
case INTSXP: {
SEXP this_col = PROTECT(allocVector(INTSXP, NN)); nprotect++;
int *this_v_ptr = INTEGER(this_v);
int *this_col_ptr = INTEGER(this_col);
@MichaelChirico (Member, Author):

@mattdowle is this the proper way? (also for VECSXP/STRSXP cases it seems there's no analog?)

@mattdowle (Member):

Correct. You can't do it for VECSXP/STRSXP because that would be writing "behind the write barrier". You can partially do it, though, by reading (RHS) but not writing (LHS =).
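
For illustration, a hedged sketch of the "partially" part for the STRSXP case (hypothetical names, not the PR's code): reads can go through STRING_ELT(), but every write must go through SET_STRING_ELT() so the garbage collector's write barrier sees it.

#include <Rinternals.h>

// Sketch: recycle a character vector x into a pre-allocated character vector out.
// Reading (RHS) with STRING_ELT is fine; writing (LHS) must use SET_STRING_ELT
// because a plain pointer assignment would bypass the write barrier.
void fill_str(SEXP x, SEXP out) {
  const R_xlen_t n = xlength(out), nx = xlength(x);
  for (R_xlen_t i = 0; i < n; i++) {
    SET_STRING_ELT(out, i, STRING_ELT(x, i % nx));
  }
}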

@mattdowle (Member), May 25, 2019:

I'm off to bed now and it's a long weekend here -- so have fun in C! Where did you get the sense that CJ() was slow and that it could be sped up -- is the benchmark you put in this PR what motivated you, or is there another one?
The idea from way back was that CJ() would not actually materialize the full table. It would just mark itself as a "CJ" object and then bmerge would know how to recycle from the irregular, non-expanded structure appropriately without needing to expand it.

@MichaelChirico (Member, Author):

No benchmark, just from staring at the code now and then and thinking "there's gotta be a faster way", and then happening upon the modulo/div version here in my scratch work on an unrelated issue. I'm still surprised how fast it is... rep.int is a beast. I guess rep.int is doing most of the work, and the recursion is only touching scalars instead of the full data (so it's not slowing things down).

The alt-rep-y form does sound promising (certainly way more memory efficient), though I do often use CJ outside of i, for populating cross-join tables for other reasons.

@MichaelChirico (Member, Author):

enjoy the weekend!

@mattdowle (Member), May 25, 2019:

Yep: CJ uses rep.int, which is implemented in C and is an .Internal() R function too (which makes it fast to call in a loop). So really you're trying to beat C with C. Rprof on the current CJ might actually be misleading, as I think someone said Rprof doesn't count .Internal() R functions.
Are you definitely getting the same result in the same order? Doing "(i % modulo) / div" on every single iteration is inefficient because it's really just batches of increasing i. It's possible that all those % and / add up in this case (it's a contiguous assign, so cache effects shouldn't come into play).
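
To make "batches of increasing i" concrete, here is a hedged sketch of the two fill strategies for one integer column (hypothetical names, not the PR's code): each of the nv source values repeats div times consecutively, and that block of nv*div rows repeats until the NN output rows are filled, so the % and / can be moved out of the innermost loop.

#include <Rinternals.h>

// Per-element form: one % and one / per output row.
void fill_modulo(int *out, const int *v, R_xlen_t NN, R_xlen_t div, R_xlen_t nv) {
  const R_xlen_t modulo = nv * div;   // rows in one full repeat of this column's pattern
  for (R_xlen_t i = 0; i < NN; i++) out[i] = v[(i % modulo) / div];
}

// Batched form: i only ever increases, so write runs of length div instead.
// Assumes NN is a multiple of nv*div, as it is for a cross join.
void fill_batched(int *out, const int *v, R_xlen_t NN, R_xlen_t div, R_xlen_t nv) {
  R_xlen_t i = 0;
  while (i < NN) {
    for (R_xlen_t k = 0; k < nv; k++)       // each source value ...
      for (R_xlen_t r = 0; r < div; r++)    // ... repeated div times
        out[i++] = v[k];
  }
}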

@MichaelChirico (Member, Author):

"it's really just batches of increasing i"

Not sure I follow this.

The latest commit splits out the first and last columns as separate cases to skip one of the modulo/division operations; that gave about a 10% boost.

@MichaelChirico (Member, Author):

Updated timing after moving R API calls out of the loop as suggested. Getting close to matching the pure R version on current master.

@MichaelChirico (Member, Author):

I can't make heads or tails of the Rprof output, though it does look like the new version is (slightly) more memory efficient:

Rprof(interval = .005, memory.profiling = TRUE)
CJ(1:2, 1:3, 1:4, 1:5, 1:6, 1:7, 1:8, 1:9, 1:10, 1:11)
Rprof(NULL)
summaryRprof(memory = 'both')
# $by.self
#                 self.time self.pct total.time total.pct mem.total
# "CJ"                3.195    99.22      3.200     99.38    1370.5
# "print.default"     0.010     0.31      0.010      0.31       0.2
# "[.data.table"      0.005     0.16      0.005      0.16       0.3
# "options"           0.005     0.16      0.005      0.16       0.3
# "pmatch"            0.005     0.16      0.005      0.16       0.0
# 
# $by.total
#                     total.time total.pct mem.total self.time self.pct
# "CJ"                     3.200     99.38    1370.5     3.195    99.22
# "<Anonymous>"            0.020      0.62       0.7     0.000     0.00
# "print.data.table"       0.020      0.62       0.7     0.000     0.00
# "print.default"          0.010      0.31       0.2     0.010     0.31
# "print"                  0.010      0.31       0.2     0.000     0.00
# "[.data.table"           0.005      0.16       0.3     0.005     0.16
# "options"                0.005      0.16       0.3     0.005     0.16
# "pmatch"                 0.005      0.16       0.0     0.005     0.16
# ".deparseOpts"           0.005      0.16       0.0     0.000     0.00
# "["                      0.005      0.16       0.3     0.000     0.00
# "char.trunc"             0.005      0.16       0.3     0.000     0.00
# "deparse"                0.005      0.16       0.0     0.000     0.00
# "do.call"                0.005      0.16       0.3     0.000     0.00
# "format.data.table"      0.005      0.16       0.3     0.000     0.00
# "format"                 0.005      0.16       0.3     0.000     0.00
# "FUN"                    0.005      0.16       0.3     0.000     0.00
# "lapply"                 0.005      0.16       0.3     0.000     0.00
# "name_dots"              0.005      0.16       0.0     0.000     0.00
# "rbindlist"              0.005      0.16       0.3     0.000     0.00
# "suppressWarnings"       0.005      0.16       0.3     0.000     0.00
# "tail.data.table"        0.005      0.16       0.3     0.000     0.00
# "tail"                   0.005      0.16       0.3     0.000     0.00
# 
# $sample.interval
# [1] 0.005
# 
# $sampling.time
# [1] 3.22

Rprof(interval = .005, memory.profiling = TRUE)
CJ_old(1:2, 1:3, 1:4, 1:5, 1:6, 1:7, 1:8, 1:9, 1:10, 1:11)
Rprof(NULL)
summaryRprof(memory = 'both')
# $by.self
#                 self.time self.pct total.time total.pct mem.total
# "CJ_old"            6.295    99.76      6.300     99.84    1472.0
# "[.data.table"      0.005     0.08      0.005      0.08       0.2
# "%in%"              0.005     0.08      0.005      0.08       0.1
# "print.default"     0.005     0.08      0.005      0.08       0.5
# 
# $by.total
#                    total.time total.pct mem.total self.time self.pct
# "CJ_old"                6.300     99.84    1472.0     6.295    99.76
# "<Anonymous>"           0.010      0.16       0.7     0.000     0.00
# "print.data.table"      0.010      0.16       0.7     0.000     0.00
# "[.data.table"          0.005      0.08       0.2     0.005     0.08
# "%in%"                  0.005      0.08       0.1     0.005     0.08
# "print.default"         0.005      0.08       0.5     0.005     0.08
# "["                     0.005      0.08       0.2     0.000     0.00
# "deparse"               0.005      0.08       0.1     0.000     0.00
# "head.data.table"       0.005      0.08       0.2     0.000     0.00
# "head"                  0.005      0.08       0.2     0.000     0.00
# "mode"                  0.005      0.08       0.1     0.000     0.00
# "name_dots"             0.005      0.08       0.1     0.000     0.00
# "print"                 0.005      0.08       0.5     0.000     0.00
# "rbindlist"             0.005      0.08       0.2     0.000     0.00
# 
# $sample.interval
# [1] 0.005
# 
# $sampling.time
# [1] 6.31

@mattdowle (Member):

It's a good case to try parallelizing. Follow the #omp pragma examples in reorder.c or subset.c, maybe. Each column of the CJ() result can be done independently in parallel. There isn't any random cache access (it's all sequential) so you should get good speedups; e.g. 2 threads should be close to twice as fast in this case. Can't go parallel for strings in this case, but logical, integer and real should be good (and factor is integer).
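
For illustration, a hedged sketch of a row-parallel fill for one integer column, loosely in the style of the #omp pragmas used elsewhere in data.table (hypothetical names; nth stands in for data.table's getDTthreads()). Every iteration writes a distinct out[i], so the writes don't conflict and each thread's chunk is accessed sequentially.

#include <Rinternals.h>

void fill_parallel(int *out, const int *v, R_xlen_t NN, R_xlen_t div, R_xlen_t nv, int nth) {
  const R_xlen_t modulo = nv * div;
  #pragma omp parallel for num_threads(nth)
  for (R_xlen_t i = 0; i < NN; i++) {
    out[i] = v[(i % modulo) / div];   // disjoint writes: safe without synchronization
  }
}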

@MichaelChirico (Member, Author):

"Can't go parallel for strings in this case"

I think I'm not following how to parallelize across columns then... first iterate over the columns to see which can be done in parallel, then batch those out?

Maybe parallelizing over rows is the easier route?

@MichaelChirico (Member, Author):

OK, I tried parallelizing across rows in the INTEGER/REAL/LOGICAL branches, and I think it worked well. With 2 threads we equal the speed of the current implementation and still get the 10% memory savings. With more threads we do better than the current version.

Since what's really been parallelized is the row index, are we safe to include the VECSXP and STRSXP cases in the parallel region as well?

@MichaelChirico (Member, Author):

Now that the basic idea is working and improves on the current implementation, should we try to bring more of the logic into C? Not sure there's much benefit...

MichaelChirico changed the title from "port CJ to C" to "port CJ to C; also closes #3597" on May 25, 2019
@jangorecki (Member):

very nice PR

src/cj.c (outdated diff):
if (j == 0) {
#pragma omp parallel for num_threads(getDTthreads())
for (int i = 0; i < NN; i++) {
this_col_ptr[i] = this_v_ptr[i / div];
@jangorecki (Member), May 25, 2019:

You iterate over all rows of the output; maybe we could memcpy blocks of rows instead? Then iterate over blocks, which can also be done in parallel, one block per thread. Whether we can do that depends on the expected order of results.

CJ(1:3, 1:5)

using pseudo code

for (i=0; i<3; i++) memcpy(addr_of_res + i*size_of_int*5, addr_of(1:5), size_of_int*5)

i*size_of_int*5 is the memory offset: we're writing to the same vector but at different locations in it, which AFAIK is safe even from multiple threads because the blocks don't overlap.
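
For illustration, a hedged C sketch of the block-memcpy idea for the CJ(x, y) example (hypothetical names, not the PR's code): the y column of the result is just nx back-to-back copies of y, and each copy writes a disjoint region, so the outer loop could also carry an omp pragma with one block per thread.

#include <string.h>
#include <Rinternals.h>

// Sketch: fill the y column of CJ(x, y), where nx = length(x), ny = length(y),
// and out_y has length nx*ny. Blocks never overlap, so this is thread-safe.
void fill_y_blocks(int *out_y, const int *y, R_xlen_t nx, R_xlen_t ny) {
  for (R_xlen_t i = 0; i < nx; i++) {
    memcpy(out_y + i * ny, y, ny * sizeof(int));
  }
}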

@MichaelChirico (Member, Author):

Not quite sure I follow you here, sorry

@jangorecki (Member), May 25, 2019:

Here is R code for that:

library(data.table)
x = 1:3; nx = length(x)
y = 1:5; ny = length(y)
ans_y = vector("integer", nx*ny)
# pragma here
for (i in seq.int(nx)) ans_y[seq.int(ny)+(i-1L)*length(y)] = y

all.equal(CJ(x, y)$y, ans_y)
#[1] TRUE

@MichaelChirico (Member, Author):

Ohh, gotcha -- just loop over y once, figure out where each copy goes, and copy it all at once. Makes sense.

@jangorecki (Member):

Looks trivial for 2 columns; for 3+ columns it gets slightly more complicated, but I believe this is the way to go if we want to have that in C optimally:

CJ(1:2, 1:3, 1:4)
      V1    V2    V3
    <int> <int> <int>
 1:     1     1     1
 2:     1     1     2
 3:     1     1     3
 4:     1     1     4
 5:     1     2     1
 6:     1     2     2
 7:     1     2     3
 8:     1     2     4

mattdowle changed the title from "port CJ to C; also closes #3597" to "port CJ to C" on May 29, 2019
@mattdowle (Member) commented May 29, 2019:

Please don't move CJ() at the R level into its own file. This breaks the commit history: CJ.R looks like new code and we can't see what changed inside it. If we did this for all functions then we'd have a ton of files. It's not perfect currently, but I don't struggle with the way things are organized: fairly similar functions sit together in one file. I can move CJ.R back. cj.c, on the other hand, is new and appropriate as its own file. And in the end the R-level CJ() will become much smaller as it is moved down into cj.c, so it makes sense to keep the (shrinking) CJ() at the R level where it is (especially as we'll want to see in the diffs how it has shrunk).

mattdowle added this to the 1.12.4 milestone on Jun 15, 2019
@mattdowle (Member) commented Jun 15, 2019:

With c41c1a4 and the default 50% threads (4 out of 8 on my laptop):

> system.time(DT <- CJ(1:2, 1:3, 1:4, 1:5, 1:6, 1:7, 1:8, 1:9, 1:10, 1:11))
   user  system elapsed   # v1.12.2
  0.966   0.288   1.216 

   user  system elapsed   # now
  0.212   0.434   0.205

Would be good to compare character type too.

@mattdowle (Member):

With character ids:

ids = as.vector(outer(LETTERS, LETTERS, paste0))
system.time(DT <- CJ(ids, 1:72, 1:73, 1:74))   # 5GB; 250m rows
#   user  system elapsed   
#  3.030   0.950   3.963  # v1.12.2
#  1.563   1.298   1.910  # now

ids = as.factor(ids)
system.time(DT <- CJ(ids, 1:72, 1:73, 1:74))
#   user  system elapsed
#  2.334   0.833   3.152  # v1.12.2
#  0.513   1.122   0.409  # now

mattdowle merged commit c9c8d09 into master on Jun 17, 2019
mattdowle deleted the cj_speedup branch on June 17, 2019, 20:02
@MichaelChirico (Member, Author):

It just struck me to worry about how this interacts with ALTREP -- are we covered there? Particularly since CJ will often be used like CJ(1:10, 1:20)...

@mattdowle (Member):

Yes, it should be fine. INTEGER(), REAL() etc. auto-expand ALTREPs. To work with ALTREPs without expanding them, you have to switch to different R API calls. In the case of CJ() that's definitely not worth it because the work grows as the product of the lengths.
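
For context, a hedged illustration of what "different R API calls" could mean (not what data.table does in this PR): element-wise accessors such as INTEGER_ELT() can read from a compact ALTREP sequence like 1:10 one value at a time, whereas taking the INTEGER() pointer forces the full vector to be materialized.

#include <Rinternals.h>

// Sketch only: sum an integer vector without forcing ALTREP expansion.
// INTEGER(x) would materialize a compact sequence; INTEGER_ELT(x, i) need not.
long long sum_elts(SEXP x) {
  long long s = 0;
  const R_xlen_t n = xlength(x);
  for (R_xlen_t i = 0; i < n; i++) s += INTEGER_ELT(x, i);
  return s;
}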

Successfully merging this pull request may close these issues:

CJ can handle list() input if sorted=FALSE but fails confusingly