Better subsetting optimization for compound queries #2494
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #2494      +/-   ##
==========================================
+ Coverage   91.42%   91.44%   +0.01%
==========================================
  Files          63       63
  Lines       12111    12203      +92
==========================================
+ Hits        11073    11159      +86
- Misses       1038     1044       +6
Continue to review full report at Codecov.
@franknarf1 mentioned this touches on #1453; that remains a separate issue, correct?
#1453 is definitely touched, and the request from the issue title is implemented. However, within the issue, optimized joins for other operators like `>`, `<` are mentioned. That has not been addressed.
R/data.table.R
Outdated
if (!is.null(attr(x, '.data.table.locked'))) return(NULL) # fix for #958, don't create auto index on '.SD'.
if (!getOption("datatable.use.index")) {
  return(NULL) # #1422
  ## Does this option also prevent the usage of keys? It is interpreted in this way here...
Correct: an index should not be created, and if one already exists it should not be used. The main purpose of that was to make it easy to benchmark index vs. no-index.
Thanks for your good questions and comments. Here, I am wondering about keys, not indices.
Usage of keys should not be affected by this option.
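A minimal sketch of the intended semantics (behaviour as described in this thread; exact verbose output may differ):

```r
library(data.table)
DT <- data.table(a = 1:5, b = 5:1)
options(datatable.use.index = FALSE)  # indices are neither created nor used
DT[b == 3L]                 # falls back to a vector scan, no auto index
setkey(DT, a)
DT[a == 2L]                 # keyed subset: still fast, unaffected by the option
options(datatable.use.index = TRUE)   # restore the default
```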
R/data.table.R
Outdated
}
## Determine whether the nature of isub in general supports fast binary search
remainingIsub <- isub
i <- list()
I would use a more meaningful name for this variable; `i` might be easily confused with the `[.data.table` `i` arg.
Actually, this is the list that will become the `i` arg for `bmerge` after it is `CJ`ed. Therefore, I think the naming makes sense.
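A hedged sketch of that structure (illustrative, not the actual internals): the per-column values of a compound query are collected in a list, and `CJ()` expands that list into the `i` table handed to `bmerge`.

```r
library(data.table)
i <- list(a = 1:2, b = "x")  # e.g. collected from DT[a %in% 1:2 & b == "x"]
do.call(CJ, i)               # the cross product becomes the i table for bmerge
#    a b
# 1: 1 x
# 2: 2 x
```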
}
if (is.null(idx)){
  ## if nothing else helped, auto create a new index that can be used
  if (!getOption("datatable.auto.index")) return(NULL)
Just for information... on the other hand, this option allows indexes to be used only when they were created manually by calling `setindex` (or created before setting this option).
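A small sketch of that behaviour (as described here; the option and functions are real data.table API):

```r
library(data.table)
options(datatable.auto.index = FALSE)
DT <- data.table(a = 1:10)
DT[a == 3L]        # no index is auto-created for this query
setindex(DT, a)    # ...but a manually created index
DT[a == 3L]        # can still be used for the optimized subset
options(datatable.auto.index = TRUE)  # restore the default
```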
inst/tests/tests.Rraw
Outdated
@@ -5547,12 +5547,12 @@ test(1375.3, DT[,mean(Petal.Width),by=Species][V1>1,Species:=toupper(Species)]$S
  # Secondary keys a.k.a indexes ...
  DT = data.table(a=1:10,b=10:1)
  test(1376.1, indices(DT), NULL)
- test(1376.2, DT[b==7L,verbose=TRUE], DT[4L], output="Creating new index 'b'")
+ test(1376.2, DT[b==7L,verbose=TRUE], DT[4L], output="Creating new index '__b'")
I am not sure about the naming you used. I would prefer the old output; `__b` is just the internal attribute name for the `b` index.
Fine for me. I will adjust so that the message completely reflects the return value of the `indices` function: `b`, but `b__c` for multi-column indices, OK?
Very nice PR, especially isolating the logic into a new function. I put a few comments below.
inst/tests/tests.Rraw
Outdated
test(1427, indices(DT), c("bar","baz__bar","baz"))
test(1428, DT[bar==9L, verbose=TRUE], output="Using existing index 'bar'")
test(1429, indices(setnames(DT,"bar","a")), c("baz", "a", "baz__a"))
test(1426, DT[baz==4L, verbose=TRUE], output="Optimized subsetting with index '__baz__bar'")
Why does this test not create a new index anymore but use an existing one? That is potentially a breaking change.
As I argue in the first comment of this PR, I think there is no chance that this is a breaking change:
Internally, indices are now used, even if not all index columns appear in the query.
Previously, this was prevented out of fear of spurious reordering. This, however, can't happen, since the order is restored by brute force after `bmerge` (unit tested in the PR).
The unit test that I inserted to guarantee no reordering is this one:
## test no reordering in groups
DT <- data.table(x = c(1,1,1,2), y = c(3,2,1,1))
setindex(DT, x, y)
test(1437.3, DT[x==1], setindex(data.table(x = c(1,1,1), y = c(3,2,1)), x,y))
What do you think?
Adding new tests is fine, but when changing an old one you should have a reason. I don't understand why 'creating index' has changed to 'using index' in 1426. Maybe it is a copy-paste mistake which passes the tests because you create and use the index in that test. @MarkusBonsch
Sorry, I wasn't clear enough.
In the master, the index `__baz` is created in 1426. On the speedySubset branch, however, no index is created, as the existing index `__baz__bar` is used:
## on branch speedySubset, repeating test 1426
DT = data.table(a=1:5,b=10:6)
setindex(DT,b)
setnames(DT,"b","foo")
setindex(DT,a,foo)
setnames(DT,"foo","bar")
setnames(DT,"a","baz")
DT[baz==4L, verbose=TRUE]
# Optimized subsetting with index 'baz__bar'
# Starting bmerge ...done in 0.001 secs
As argued above, I believe that this doesn't cause any problems, since the original order of the data is restored by brute force by the line:
if (length(o$xo)) i = fsort(o$xo[i], internal=TRUE) else i = fsort(i, internal=TRUE) # fix for #1495
If you have the feeling that this change is too dangerous because of unknown side effects that are not captured by any of the unit tests, I will revert the change. It was just an optimization during cleanup and not the main purpose of the PR.
@MarkusBonsch It is very good that we don't need to create a new index and can reuse part of an existing one. I was not aware of this improvement in the PR. I did not look at the ordering issue and cannot say anything about it now.
What if a user subsets many times by the `baz` field? Then it might be better to create that single-column index rather than maintain the original order by brute force.
To handle that well, every multi-column index would have to include the orderings of its parent subsets of columns within the index. This wouldn't cost much to create (as it has to be computed for the desired index anyway), but it would take much more memory. FYI @arunsrinivasan
@jangorecki You are right that brute-force reordering is costly, as has been shown in #2366 and also in the benchmarks of this PR in #2472. However, it is an existing part of the master and not introduced in this PR.
No matter which index we use, `baz` or `baz__bar`, this brute-force reordering is necessary to ensure the original order of the DT. Which index is faster in terms of reordering depends on the original order of the DT. In my opinion, there is no clear indication that a single-column index will in general be substantially more ordered than a multi-column index. If a user has to do many subsets on `baz`, the fastest option is a key, since this will avoid reordering costs almost completely.
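A hedged illustration of that trade-off: a key physically sorts the table once up front, so keyed subsets come back in table order with no post-`bmerge` reordering, whereas an index leaves the data unsorted and the original order must be restored after every subset.

```r
library(data.table)
DT <- data.table(baz = sample(1:5, 10, replace = TRUE), bar = 1:10)
setkey(DT, baz)  # one upfront physical sort
DT[.(3L)]        # binary search; matching rows are already contiguous and in order
```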
I just discovered that …
Unfortunately, `bmerge` for non-equi joins requires two additional arguments: …
I have adapted the code according to @jangorecki's valuable review. It is final from my perspective.
Awesome! I added a non-blocking comment about the benchmarks in the issue where they are posted.
Since this is a reasonably big, complicated change, should we postpone it to 1.10.8, or are we confident the test coverage is good?
Agree with Matt. Maybe we can branch new code with …
@MarkusBonsch By usage of keys you mean …
@jangorecki I am referring to `setkey(dt, col1); dt[col1 == "a"]`. In the master, this does not use the fast key-based subset if `datatable.use.index = FALSE`.
Concerning test coverage: I am currently implementing extensive tests with all query combinations of up to three columns with `&`. I can implement the tests once bug #2511 is fixed.
I observe a slight downside in performance in some cases, namely when an index or key is not present and the optimized subsetting is attempted. Apologies if I'm just not implementing it correctly. (The use of `and()` is to avoid the subsetting optimization; see the comment in the code.)

library(magrittr) # for and() alias
Distribution_Course <-
data.table(Course = c("AdvDip", "AssocDegree", "Bachelors",
"BachelorsHons", "Dip", "Doctorate", "Enabling",
"GradCert","GradDip", "Masters", "Non-award",
"OUAPostgrad", "OUAUndergrad",
"Other", "Xinst. prog postgrad",
"Xinst. prog undergrad"),
N = c(255000,
520000, 55277000, 1320000, 1335000, 1215000,
771000, 788000, 2039000, 9071000, 737000,
72000, 1051000, 94000, 25000, 136000))
Distribution_Citizen <-
data.table(Citizen = c("Aust", "Humanitarian visa", "NZ",
"Permanent visa", "Resid. overseas",
"Temp visa / diplo."),
N = c(52851000, 150000, 482000,
1627000, 4906000, 14691000))
load_dummy <-
data.table(Course = sample(Distribution_Course$Course,
size = 100e6,
replace = TRUE,
prob = Distribution_Course$N),
Citizen = sample(Distribution_Citizen$Citizen,
size = 100e6,
replace = TRUE,
prob = Distribution_Citizen$N))
# Use magrittr::and() to avoid subsetting optimization
system.time(load_dummy[and(Course %chin% c("Bachelors", "BachelorsHons"), Citizen %chin% c("Aust", "Permanent visa", "NZ", "Humanitarian visa")), verbose = TRUE])
# user system elapsed
# 3.54 0.18 3.73
system.time(load_dummy[Course %chin% c("Bachelors", "BachelorsHons") & Citizen %chin% c("Aust", "Permanent visa", "NZ", "Humanitarian visa"), verbose = TRUE])
# Creating new index 'Citizen__Course'
# Optimized subsetting with index 'Citizen__Course'
# on= matches existing index, using index
# user system elapsed
# 5.32 1.04 6.37
setindex(load_dummy, NULL)
system.time(setindex(load_dummy, Course, Citizen))
# user system elapsed
# 0.69 0.21 0.90
system.time(load_dummy[Course %chin% c("Bachelors", "BachelorsHons") & Citizen %chin% c("Aust", "Permanent visa", "NZ", "Humanitarian visa"), verbose = TRUE])
# Optimized subsetting with index 'Course__Citizen'
# on= matches existing index, using index
# user system elapsed
# 3.39 0.30 3.70
setkey(load_dummy, Course, Citizen)
system.time(load_dummy[Course %chin% c("Bachelors", "BachelorsHons") & Citizen %chin% c("Aust", "Permanent visa", "NZ", "Humanitarian visa"), verbose = TRUE])
# Optimized subsetting with key 'Course, Citizen'
# on= matches existing key, using key
# Starting bmerge ...done in 0 secs
# user system elapsed
# 1.05 0.09 1.17
@HughParsonage thanks very much for the test. You are completely right: if no index is present, it can definitely be slower. This is probably also true for the existing "optimizations" of `%in%` and `==` queries in the master branch. I will include this case in my large-scale benchmark in the top post ASAP. Then we can make an informed decision about the best optimization strategy. It may be worthwhile to optimize on existing keys and indices but to drop auto-indexing. Let's see what the benchmark says.
I updated the benchmark in the top post. Very interesting findings, also about the existing optimization in the master branch. At first sight, these are my thoughts:
Maybe the whole idea of auto indexing needs to be reconsidered, at least for `%in%` queries? Further analysis needed...
IMO it is good to have this in dev for a while to detect potential issues, and to avoid cases like the earlier `i` optimization that was implemented shortly before a release.
@jangorecki I tend to agree, especially since it is not yet clear which optimization strategy is really optimal.

@mattdowle Confirmed the slowdown of …
I've encountered a possible bug with this branch. To verify:

library(TeXCheckR)
validate_bibliography(file = "https://raw.githubusercontent.com/HughParsonage/grattex/master/bib/Grattan-Master-Bibliography.bib")
#> Error in stub[[1L]]: object of type 'symbol' is not subsettable

or with more focus (not defending my implementation, but I think it should still work!):

library(data.table)
library(magrittr)
just_key_journal_urls <- fread("https://raw.githubusercontent.com/HughParsonage/grattex/master/bib/Grattan-Master-Bibliography.bib",
sep = NULL)[[1]]
newspapers_pattern <-
paste0("^(url).*",
"(",
"(((theguardian)|(afr))\\.com)",
"|",
"(((theaustralian)|(theage)|(smh)|(canberratimes)|(greatlakesadvocate))\\.com\\.au)",
"|",
"(theconversation\\.((edu\\.au)|(com)))",
"|",
"(insidestory\\.org\\.au)",
")")
journal_actual_vs_journal_expected <-
data.table(text = just_key_journal_urls) %>%
.[, entry_no := cumsum(grepl("^@", text))] %>%
.[, is_article := any(grepl("^@Article", text)), by = entry_no] %>%
.[, both_url_and_journal := .N == 4L, by = entry_no] %>%
.[, is_newspaper := any(grepl(newspapers_pattern, text, perl = TRUE)), by = entry_no] %>%
.[is_article & both_url_and_journal & is_newspaper]
#> Error in stub[[1L]]: object of type 'symbol' is not subsettable
FWIW, I think the speed improvements for keyed subsets are enough to merit inclusion, especially if this particular optimization can be avoided in cases where it is known not to improve performance. (Could we not just check whether a key is present, and only attempt it if the key matches the subset? See the sketch below.)

I think Markus's point about the principle behind automatic optimization for subsetting in general is a good one, though tricky: for me at least, there are cases where I need to query the same column in a small data table multiple times, and others where I only need to subset a big data table once. And in both cases there are instances where it's worth reviewing the documentation thoroughly to eke out the best run time for that particular scenario, and instances where doing that would be no faster than even the worst case under the default setting.
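A hypothetical sketch of that check (helper names are mine, not the PR's actual code): attempt the optimization only when an existing key or index already covers the query columns as a leading prefix, and otherwise fall back to a vector scan instead of auto-creating an index.

```r
library(data.table)

# hypothetical helper: do the index columns cover the query columns as a
# leading prefix? (order within the prefix does not matter for bmerge)
covers <- function(index_cols, query_cols) {
  length(index_cols) >= length(query_cols) &&
    setequal(index_cols[seq_along(query_cols)], query_cols)
}

# hypothetical decision: prefer the key, then any covering index, else scan
subset_plan <- function(DT, query_cols) {
  if (!is.null(key(DT)) && covers(key(DT), query_cols)) return("key")
  for (idx in indices(DT))
    if (covers(strsplit(idx, "__", fixed = TRUE)[[1L]], query_cols))
      return(paste0("index '", idx, "'"))
  "vector scan"  # do not auto-create an index here
}

DT <- data.table(baz = 1:5, bar = 5:1)
setindex(DT, baz, bar)
subset_plan(DT, "baz")  # "index 'baz__bar'"
subset_plan(DT, "bar")  # "vector scan" (bar is not a leading prefix)
```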
Great. Good benchmarking. OK, so IIUC there's agreement that the optimization should be applied if there's already an appropriate key or index existing. It's such a big improvement, and would be what users expect, that using an existing key or index should be on by default. Then turn off auto index creation when more than one column is involved, for now, but keep auto index creation when one column is involved, since that's beneficial IIRC. Would that result in no downside by default, so it could be merged?

[Aside 1: In future, building indexes could perhaps be done in a background thread. When the index creation finished, it could be stored on the table. Clearly much more complicated with higher risks, but useful if it worked!]

[Aside 2: The tradeoffs are going to change when index creation is parallelized, so it's probably best to delay tradeoff optimization for now.]
@mattdowle I have found one or two other tweaks for speed improvements of the optimized mode.

@mattdowle Concerning the slowdown of `test.data.table()`: I investigated and found that it is really overhead. You'll find the details in an update to your original comment above.
I discovered a way to speed up the reordering after bmerge (see #2366). This is now implemented for subsets larger than 1e6 rows. The updated benchmark in the first post shows that this significantly increases the speed for large subsets. For me, the optimal optimization strategy based on the benchmark results would be: …
Would it be 'good' (feasible, performant, etc.) to provide an argument to …
@HughParsonage I like the idea. I would prefer an optimize option, however: …
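As an illustration only (the comment above is truncated), a hedged sketch of steering this via the optimization level, using `options(datatable.optimize)` from the top post; per this PR, subset optimization kicks in at level 3:

```r
library(data.table)
DT <- data.table(a = 1:5, b = 5:1)
options(datatable.optimize = 2L)   # below 3: optimized subsetting is skipped
DT[a == 2L & b == 4L, verbose = TRUE]  # plain vector scan
options(datatable.optimize = Inf)  # restore the default (all optimizations)
```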
Truly awesome and compelling benchmarks!
Great comments and tests too.
Cool, thank you! I am so happy :)
I think this is an edge case (https://stackoverflow.com/q/50810759/3001626) that fails because of `.prepareFastSubset`. This is at least what my investigation brought up (though I may be mistaken).
The base problem is with how non-equi joins are parsed... there's an open issue about that, but I couldn't find it with 5 minutes of searching.
David found the issue, thanks: #2931
Closes issue #2472.
This implementation makes fast subsetting with `bmerge` and keys/indices applicable to a wider range of subsetting queries in `i`. Detailed benchmarks can be found in my post in issue #2472.
Logic changes
- Optimized subsetting is active at `options(datatable.optimize) = 3`.
- `%chin%` queries like `DT[char %chin% c("A", "B")]` are now optimized.
- Compound queries are optimized if the subqueries are connected by `&` and each subquery satisfies the criteria, i.e. the operator is `==`, `%in%`, or `%chin%`, the left-hand side is a column of the data.table, and the right-hand side fulfills several complicated criteria (that I just copy-pasted); see the example after this list.
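A hedged illustration of a compound query that qualifies under these criteria (data and query constructed for this post; verbose output omitted rather than guessed):

```r
library(data.table)
DT <- data.table(char = c("A", "B", "C", "A"), num = c(1L, 2L, 3L, 1L))
# both subqueries use %chin%/== with a plain column on the left-hand side,
# so the whole conjunction can be answered by one bmerge on a two-column index
DT[char %chin% c("A", "B") & num == 1L, verbose = TRUE]
```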
- Keys are now used even if `datatable.use.index` is `FALSE`. Previously, this option didn't only prevent the usage of indices but also the usage of keys. According to @jangorecki, the new behaviour is the desired one.
Implications for unit tests
- Optimized subsetting now also applies to queries with `which = TRUE` and to grouped queries using `by` (1437.28ff).

Structure changes
I factored out the whole logic of determining whether a fast subset can be executed into the new function `prepareFastSubset`. The advantage is that, for future changes of these conditions, it is clear where to adapt the code.
Previously, optimized subsets had their own call to `bmerge`. Now, they are redirected to the normal join implementation. This causes a small slowdown (to be investigated) but offers the advantage of easier code maintenance and the possibility of including non-equi operators once the non-equi joins have been brought up to speed.
Benchmark
Code at the end of this post. Setup: a data.table with 1e9 rows and 3 columns (~50GB). Each subsetting query was tested in three versions, for 4 different package settings.
Things that are optimized now and weren't before
Things that were optimized before and are still optimized
Things where optimization was tested but discarded
I made a lot of effort to implement optimization for non-equi operators, but it turned out that non-equi joins are much slower than vector-based subsets. Until this is fixed, they won't be "optimized".
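For context, a hedged sketch of the comparison behind this decision (timings will vary by machine; the non-equi form uses `which = TRUE` to recover the matching row numbers):

```r
library(data.table)
DT <- data.table(x = runif(1e7))
system.time(DT[x > 0.5])  # vector-based subset
# the same subset expressed as a non-equi join
system.time(DT[DT[.(lo = 0.5), on = .(x > lo), which = TRUE]])
```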
Here is the code for my benchmarks:
Cheers,
Markus