
Better subsetting optimization for compound queries #2494

Merged: 31 commits merged from speedySubset into master on Jan 31, 2018

Conversation

@MarkusBonsch (Contributor) commented Nov 22, 2017

Closes issue #2472.
This implementation makes fast subsetting with bmerge and keys/indices applicable to a wider range of
subsetting queries in i.
Detailed benchmarks can be found in my post in issue #2472.

Logic changes

  • No results of operations on data.tables are changed.
  • The use of fast subsetting is now enabled via options(datatable.optimize = 3).
  • Internally, the use of bmerge for fast subsetting has been extended to the following query types (see the sketch after this list):
    • queries with %chin%, like DT[char %chin% c("A", "B")]
    • compound queries, if the connector is & and each subquery satisfies the criteria,
      i.e. the operator is ==, %in%, or %chin%, the left-hand side is a column of the data.table,
      and the right-hand side fulfills several complicated criteria (that I just copy-pasted).
  • Internally, fast subsetting with keys is now also used if datatable.use.index is FALSE. Previously, this option prevented not only the usage of indices, but also the usage of keys. According to @jangorecki, the new behaviour is the desired one.
  • Reordering of irows after bmerge has been sped up for large subsets of > 1e6 rows, see "var %in% vec bmerge is slow" #2366.
  • Fast subsetting with bmerge for non-equi operators (<, >) has been tested but dismissed because it actually slowed things down considerably.
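
A minimal sketch of the newly covered query shapes (the data here is illustrative; the options() call is the switch described above):

library(data.table)
DT <- data.table(intCol  = sample(1:10, 1e5, replace = TRUE),
                 charCol = sample(LETTERS, 1e5, replace = TRUE))
options(datatable.optimize = 3L)                # enable the extended fast-subset path
DT[charCol %chin% c("A", "B")]                  # %chin% query, now a candidate for bmerge
DT[intCol == 2L & charCol %chin% c("A", "B")]   # compound '&' query where each part qualifies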

Implications for unit tests

  • Several unit tests had to be adapted because of changed verbose messages.
  • 27 dedicated unit tests (1437.1 - 1437.27) have been added to test aspects of the new implementation that appeared critical to me.
  • A large set of unit tests (> 100) has been added that tests for equal behaviour with optimization levels 2 and 3. This involves queries with different combinations of subsets in i, combined with different expressions in j, and extended to which = TRUE and to grouped queries using 'by' (1437.28ff).

Structure changes
I factored the whole logic for determining whether a fast subset can be executed out into the new function prepareFastSubset.
The advantage is that, for future changes to these conditions, it is clear where to adapt the code.
Previously, optimized subsets had their own call to bmerge. Now, they are redirected to the normal join implementation. This causes a small slowdown (to be investigated) but offers the advantage of easier code maintenance and the possibility of including non-equi operators once the non-equi joins have been sped up.
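
To make the redirection concrete, here is a hedged, user-level sketch (not the internal code path): a qualifying compound subset corresponds to a join on the same columns, with the original row order restored afterwards.

library(data.table)
DT <- data.table(intCol  = sample(1:10, 100, replace = TRUE),
                 charCol = sample(LETTERS[1:5], 100, replace = TRUE))
## vector-scan form: matching row numbers
rows1 <- which(DT$intCol == 2L & DT$charCol %chin% c("A", "B"))
## join form on the same columns; CJ() builds the i table, sort() restores the original order
rows2 <- sort(DT[CJ(intCol = 2L, charCol = c("A", "B")),
                 on = c("intCol", "charCol"), which = TRUE, nomatch = 0L])
identical(rows1, rows2)   # TRUE: same rows in the same original order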

Benchmark
Code at the end of this post.
data.table with 1e9 rows and 3 columns (~50GB):

DT <- data.table(intCol = sample(1L:10L, n, replace = T),
                 doubleCol = sample(1:10, n, replace=T),
                 charCol   = sample(LETTERS, n, replace = T))

Tested each subsetting query in three versions:

  • WITH NOTHING: no index or key exists. An index needs to be created first if optimization is turned on.
  • WITH INDEX: an appropriate index for this query exists.
  • WITH KEY: an appropriate key for this query exists (each scenario is set up as sketched below).
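
A hedged sketch of how each scenario is prepared (this mirrors the benchmark code at the end of the post; charCol stands for whichever columns the query uses):

setindex(DT, NULL); setkey(DT, NULL)   # WITH NOTHING: neither key nor index
setindex(DT, charCol)                  # WITH INDEX: an appropriate index exists
setkey(DT, charCol)                    # WITH KEY: an appropriate key exists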

Tested for 4 different package settings:

  • master_opt: master branch with optimization switched on
  • master_raw: master branch with optimization switched off
  • speedy_opt: speedySubset branch with optimization switched on
  • speedy_raw: speedySubset branch with optimization switched off

Things that are optimized now and weren't before

| query | master_opt (s) | master_raw (s) | speedy_opt (s) | speedy_raw (s) | comment |
| --- | --- | --- | --- | --- | --- |
| DT[charCol %chin% c("A", "B", "Y","Z")] (WITH NOTHING) | 12.1 | 12.1 | 18.1 | 12.4 | If index needs to be created first, we see a significant slowdown in optimized mode. |
| DT[charCol %chin% c("A", "B", "Y","Z")] (WITH INDEX) | 14.5 | 14.4 | 8.2 | 14.9 | Some benefit despite reorder after bmerge. |
| DT[charCol %chin% c("A", "B", "Y","Z")] (WITH KEY) | 12.6 | 12.6 | 3.5 | 12.8 | Significant speed-up due to optimization if proper key exists. |
| DT[intCol == 2L & doubleCol %in% 1:5 & charCol %chin% c("A", "B", "Y","Z")] (WITH NOTHING) | 27.8 | 28 | 23.3 | 28 | Slight speed improvement despite index creation. |
| DT[intCol == 2L & doubleCol %in% 1:5 & charCol %chin% c("A", "B", "Y","Z")] (WITH INDEX) | 43.5 | 44.6 | 0.6 | 45 | Tremendous speed improvement if a proper index exists. Reorder after bmerge doesn't spoil it because the result set is small. |
| DT[intCol == 2L & doubleCol %in% 1:5 & charCol %chin% c("A", "B", "Y","Z")] (WITH KEY) | 27.9 | 28.1 | 0.18 | 27.8 | Tremendous speed-up if proper key already exists. |

Things that were optimized before and are still optimized

| query | master_opt (s) | master_raw (s) | speedy_opt (s) | speedy_raw (s) | comment |
| --- | --- | --- | --- | --- | --- |
| DT[intCol == 2L] (WITH NOTHING) | 5.3 | 5 | 5.2 | 5.2 | No observable malus from index creation in optimized mode. |
| DT[intCol == 2L] (WITH INDEX) | 3.9 | 7.1 | 4.7 | 7.1 | Optimization helps. |
| DT[intCol == 2L] (WITH KEY) | 2.4 | 5.2 | 2.3 | 5.2 | Optimization helps. Here, speedy is not slower than master. |
| DT[doubleCol %in% 1:5] (WITH NOTHING) | 36.7 | 20.9 | 30 | 20.9 | Index creation leads to slower execution when optimized. Less pronounced in speedySubset because of smarter reordering after bmerge. |
| DT[doubleCol %in% 1:5] (WITH INDEX) | 41.8 | 31.9 | 24 | 33.5 | New optimization implementation is fast with existing index. |
| DT[doubleCol %in% 1:5] (WITH KEY) | 12.2 | 22 | 11.6 | 22.3 | With key, optimization really speeds things up. |
| DT[!intCol == 3L] (WITH NOTHING) | 28.3 | 21.8 | 291 | 23 | Significant slowdown. See #2567. |
| DT[!intCol == 3L] (WITH INDEX) | 28.8 | 22.8 | 29 | 24 | Still slowdown, see #2567. |
| DT[!intCol == 3L] (WITH KEY) | 27.5 | 24.2 | 28 | 25 | Wow! Even with a key, optimization is not beneficial. See #2567. |

Things where optimization was tested but discarded
I made a lot of effort to implement optimization for non-equi operators.
But it turned out that non-equi joins are much slower than vector-based subsets. Until this is fixed, they won't be "optimized".

| query | master (s) | speedy (s) | comment |
| --- | --- | --- | --- |
| DT[intCol > 2L] (WITH INDEX) | 21.6 | 78.9 | Unbearable slowdown |
| DT[intCol > 2L] (WITH KEY) | 20.8 | 26.9 | Slowdown |

Here is the code for my benchmarks:

library(data.table)
library(microbenchmark)

n <- 1e9
times <- 1L 
set.seed(12563)

## switch off optimization in master
# options(datatable.use.index = FALSE)
## switch off optimization in speedySubset
# options(datatable.optimize = 2L)

DT <- data.table(intCol = sample(1L:10L, n, replace = T),
                 doubleCol = sample(1:10, n, replace=T),
                 charCol   = sample(LETTERS, n, replace = T))

## first test: subset with indices
## one call to auto create relevant indices
DT[intCol == 2L, verbose = TRUE]
DT[doubleCol %in% 1:5, verbose = TRUE]
DT[charCol %chin% c("A", "B", "Y","Z"), verbose = TRUE]
DT[intCol == 2L & doubleCol %in% 1:5 & charCol %chin% c("A", "B", "Y","Z"), verbose = TRUE]
DT[!intCol == 3L, verbose = TRUE]
DT[intCol > 2L, verbose = TRUE]
DT[intCol > 2L & doubleCol <= 6 & charCol %chin% c("A", "B", "Y","Z"), verbose = TRUE]
print("starting index subsets")
test <- microbenchmark(equalIndex    = o <- DT[intCol == 2L],
                       inIndex       = o <- DT[doubleCol %in% 1:5],
                       notjoinIndex  = o <- DT[!intCol == 3L],
                       chinIndex     = o <- DT[charCol %chin% c("A", "B", "Y","Z")],
                       non_equiIndex      = o <- DT[intCol > 2L],
                       combinedIndex = o <- DT[intCol == 2L & doubleCol %in% 1:5 & charCol %chin% c("A", "B", "Y","Z")],
                       non_equi_combinedIndex = o <- DT[intCol > 2L & doubleCol <= 6 & charCol %chin% c("A", "B", "Y","Z")],
                       times = times, unit = "s")

out <- summary(test)

## add tests with an index that contains more columns
setindex(DT, NULL)
setindex(DT, intCol, doubleCol, charCol)
out <- rbind(out, summary(microbenchmark(overIndex = o <- DT[intCol == 2L], times = times, unit = "s")))
print("starting key subset")
## now add the tests with keys
setindex(DT, NULL)

print("starting key subset")
setkey(DT, intCol, doubleCol, charCol)
out <- rbind(out, summary(microbenchmark(equalKey = o <- DT[intCol == 2L], times = times, unit = "s")))

print("starting key subset")
setkey(DT, doubleCol, charCol)
out <- rbind(out, summary(microbenchmark(inKey = o <- DT[doubleCol %in% 1:5], times = times, unit = "s")))

print("starting key subset")
setkey(DT, charCol)
out <- rbind(out, summary(microbenchmark(chinKey = o <- DT[charCol %chin% c("A", "B", "Y","Z")], times = times, unit = "s")))

print("starting key subset")
setkey(DT, intCol, doubleCol, charCol)
out <- rbind(out, summary(microbenchmark(combinedKey = o <- DT[intCol == 2L & doubleCol %in% 1:5 & charCol %chin% c("A", "B", "Y","Z")], times = times, unit = "s")))

print("starting key subset")
setkey(DT, intCol)
out <- rbind(out, summary(microbenchmark(notjoinKey = o <- DT[!intCol == 3L], times = times, unit = "s")))

print("starting key subset")
setkey(DT, intCol)
out <- rbind(out, summary(microbenchmark(non_equiKey = o <- DT[intCol > 2L], times = times, unit = "s")))

print("starting key subset")
setkey(DT, intCol, doubleCol, charCol)
out <- rbind(out, summary(microbenchmark(non_equi_combinedKey = o <- DT[intCol > 2L & doubleCol <= 6 & charCol %chin% c("A", "B", "Y","Z")], times = times, unit = "s")))

## add benchmark without any existing index or key. Can be done only once.
setindex(DT, NULL)
setkey(DT, NULL)

print("starting raw subset")
out <- rbind(out, summary(microbenchmark(equalRaw = o <- DT[intCol == 2L], times = 1L, unit = "s")))

print("starting raw subset")
setindex(DT, NULL)
out <- rbind(out, summary(microbenchmark(inRaw = o <- DT[doubleCol %in% 1:5], times = 1L, unit = "s")))

print("starting raw subset")
setindex(DT, NULL)
out <- rbind(out, summary(microbenchmark(chinRaw = o <- DT[charCol %chin% c("A", "B", "Y","Z")], times = 1L, unit = "s")))

print("starting raw subset")
setindex(DT, NULL)
out <- rbind(out, summary(microbenchmark(combinedRaw = o <- DT[intCol == 2L & doubleCol %in% 1:5 & charCol %chin% c("A", "B", "Y","Z")], times = 1L, unit = "s")))

print("starting raw subset")
setindex(DT, NULL)
out <- rbind(out, summary(microbenchmark(notjoinRaw = o <- DT[!intCol == 3L], times = 1L, unit = "s")))

print("starting raw subset")
setindex(DT, NULL)
out <- rbind(out, summary(microbenchmark(non_equiRaw = o <- DT[intCol > 2L], times = 1L, unit = "s")))

print("starting key subset")
setkey(DT, intCol, doubleCol, charCol)
out <- rbind(out, summary(microbenchmark(non_equi_combinedKey = o <- DT[intCol > 2L & doubleCol <= 6 & charCol %chin% c("A", "B", "Y","Z")], times = times, unit = "s")))

print("saving data")
saveRDS(out, file = "benchmark.rds")
print("done")

Cheers,
Markus

@codecov-io commented Nov 22, 2017

Codecov Report

Merging #2494 into master will increase coverage by 0.01%.
The diff coverage is 96.45%.


@@            Coverage Diff             @@
##           master    #2494      +/-   ##
==========================================
+ Coverage   91.42%   91.44%   +0.01%     
==========================================
  Files          63       63              
  Lines       12111    12203      +92     
==========================================
+ Hits        11073    11159      +86     
- Misses       1038     1044       +6
| Impacted Files | Coverage Δ |
| --- | --- |
| R/data.table.R | 97.06% <96.45%> (-0.12%) ⬇️ |
| R/setkey.R | 93.79% <0%> (-0.37%) ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@MichaelChirico (Member)
@franknarf1 mentioned this touches on #1453; that remains a separate issue, correct?

@MarkusBonsch (Contributor, Author) commented Nov 22, 2017

#1453 is definitely touched and the request from the issue title is implemented. However, within the issue, optimized joins for other operators like >, < are mentioned. That has not been addressed.
I added a review request to @arunsrinivasan because he seems to have implemented auto-indexing originally and should definitely have a look at my modifications.
(Edit by Matt: I think I implemented auto-indexing originally. It was done in a rush for the Datacamp course so we could use DT[col==val] in the course and not have to explain that setkey(DT,col); DT[.(val)] was absolutely necessary for speed.)

R/data.table.R Outdated
if (!is.null(attr(x, '.data.table.locked'))) return(NULL) # fix for #958, don't create auto index on '.SD'.
if (!getOption("datatable.use.index")) {
return(NULL) # #1422
## Does this option also prevent the usage of keys? It is interpreted in this way here...
@jangorecki (Member) commented Nov 22, 2017

Correct: the index should not be created and, if it already exists, it should not be used. The main purpose of that option was to make it easy to benchmark index vs no-index.

@MarkusBonsch (Contributor, Author)

Thanks for your good questions and comments. Here, I am wondering about keys, not indices.

@jangorecki (Member) commented Nov 23, 2017

Usage of keys should not be affected by this option.

R/data.table.R Outdated
}
## Determine, whether the nature of isub in general supports fast binary search
remainingIsub <- isub
i <- list()
Member

I would use a more meaningful name for this variable; i might easily be confused with the i argument of [.data.table.

Contributor Author

Actually, this is the list that will become the i argument for bmerge after it is CJed. Therefore, I think the naming makes sense.
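
A small illustration of what "after it is CJed" means: the per-column value sets are crossed into the i table handed to bmerge (values here are just examples):

CJ(intCol = 2L, charCol = c("A", "B"))
#    intCol charCol
# 1:      2       A
# 2:      2       B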

}
if (is.null(idx)){
## if nothing else helped, auto create a new index that can be used
if (!getOption("datatable.auto.index")) return(NULL)
Member

Just for information: on the other hand, this option allows indices to be used only when they were created manually by calling setindex (or created before this option was set).
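
In other words (a hedged summary of the two options as discussed in this thread):

options(datatable.use.index = FALSE)   # neither create nor use indices; keys are still honoured
options(datatable.auto.index = FALSE)  # only use indices created explicitly via setindex()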

@@ -5547,12 +5547,12 @@ test(1375.3, DT[,mean(Petal.Width),by=Species][V1>1,Species:=toupper(Species)]$S
# Secondary keys a.k.a indexes ...
DT = data.table(a=1:10,b=10:1)
test(1376.1, indices(DT), NULL)
test(1376.2, DT[b==7L,verbose=TRUE], DT[4L], output="Creating new index 'b'")
test(1376.2, DT[b==7L,verbose=TRUE], DT[4L], output="Creating new index '__b'")
Member

I am not sure about the naming you used. I would prefer the old output. __b is just the internal attribute name for the b index.

Contributor Author

Fine for me. I will adjust so that the message exactly reflects the return value of the indices function:
b, but b__c for multi-column indices, OK?

@jangorecki (Member) left a comment

Very nice PR, especially isolating the logic into a new function. I put a few comments below.

test(1427, indices(DT), c("bar","baz__bar","baz"))
test(1428, DT[bar==9L, verbose=TRUE], output="Using existing index 'bar'")
test(1429, indices(setnames(DT,"bar","a")), c("baz", "a", "baz__a"))
test(1426, DT[baz==4L, verbose=TRUE], output="Optimized subsetting with index '__baz__bar'")
Member

Why doesn't this test create a new index anymore but use an existing one? It is a potentially breaking change.

Contributor Author

As I argue in the first comment of this PR, I think there is no chance that this is a breaking change:

Internally, indices are now used, even if not all index columns appear in the query.
Previously, this was prevented out of fear of spurious reordering. This, however, can't happen, since the order is restored by brute force after bmerge (unit tested in the PR).

The unit test that I inserted to guarantee no reordering is this one:

## test no reordering in groups
 DT <- data.table(x = c(1,1,1,2), y = c(3,2,1,1))
 setindex(DT, x, y)
 test(1437.3, DT[x==1], setindex(data.table(x = c(1,1,1), y = c(3,2,1)), x,y))

What do you think?

@jangorecki (Member) commented Nov 26, 2017

Adding new tests is fine, but when changing an old one you should have a reason. I don't understand why 'creating index' has changed to 'using index' in 1426. Maybe it is a copy-paste mistake which passes the tests because you create and use the index in that test. @MarkusBonsch

Contributor Author

Sorry, I wasn't clear enough.
In the master, the index __baz is created in 1426. In the speedySubset, however, no index is created, as the existing index __baz__bar is used:

## on branch speedySubset, repeating test 1426
DT = data.table(a=1:5,b=10:6)
setindex(DT,b)
setnames(DT,"b","foo")
setindex(DT,a,foo)
setnames(DT,"foo","bar")
setnames(DT,"a","baz")
DT[baz==4L, verbose=TRUE]
# Optimized subsetting with index 'baz__bar'
# Starting bmerge ...done in 0.001 secs

As argued above, I believe that this doesn't cause any problems, since the original order of the data is restored by brute force by the line:
if (length(o$xo)) i = fsort(o$xo[i], internal=TRUE) else i = fsort(i, internal=TRUE) # fix for #1495
If you have the feeling that this change is too dangerous because of unknown side-effects that are not captured by any of the unit-tests, I will revert the change. It was just an optimization during cleanup and not the main purpose of the PR.

@jangorecki (Member) commented Nov 28, 2017

@MarkusBonsch It is very good that we don't need to create a new index and can reuse part of an existing one. I was not aware of this improvement in the PR. I did not look at the ordering issue and cannot say anything about it now.
What if a user subsets many times by the baz field? Then it might be better to create that single-column index rather than maintaining the original order by brute force.
To handle that well, every multi-column index would have to include the order of its parent subsets of columns within the index. This wouldn't cost much to create (as it has to be computed for the desired index anyway), but it would take much more memory. FYI @arunsrinivasan

Contributor Author

@jangorecki You are right that brute force reordering is costly, as has been shown here: #2366 and also in the benchmarks of this PR in #2472
However, it is an existing part of the master and not introduced in this PR.
No matter which index we use, baz or baz__bar, this brute-force reordering is necessary to ensure the original order of the DT. Which index is faster in terms of reordering depends on the original order of the DT. In my opinion, there is no clear indication that a single-column index will in general be substantially more ordered than a multi-column index. If a user has to do many subsets on baz, the fastest option is a key, since this avoids reordering costs almost completely (see the sketch below).
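
To make that last point concrete (a hedged sketch, reusing the small DT from the 1426 example above):

setkey(DT, baz)   # physically sorts DT by baz once
DT[baz == 4L]     # binary search on the key; rows come back already in key order, no reordering pass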

@MarkusBonsch (Contributor, Author)

I just discovered that bmerge supports non-equi joins. This makes it so easy to introduce support for fast >=, >, <=, < subsets. I am currently working on that and will update the PR ASAP.

@MarkusBonsch (Contributor, Author)

Unfortunately, bmerge for non-equi joins requires two additional arguments, nqgrp and nqmaxgrp, that are complicated to calculate. Therefore, support for non-equi operators is beyond the scope of this PR (see the illustration below).
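
For context, the user-level shape of such a query (a hedged illustration only; this PR does not take this path):

DT[.(lim = 2L), on = .(intCol > lim), which = TRUE]   # rows where intCol > 2, served by a non-equi join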

@MarkusBonsch (Contributor, Author)

I have adapted the code following @jangorecki's valuable review. It is final from my perspective.

mattdowle previously approved these changes Nov 28, 2017
@mattdowle (Member) left a comment

Awesome! I added a non-blocking comment in the Issue where the benchmarks are, related to the benchmarks.

@mattdowle (Member)

Since this is a reasonably big complicated change, should we postpone it to 1.10.8 or are we confident the test coverage is good?

@jangorecki (Member) commented Nov 28, 2017

Agree with Matt. Maybe we can branch the new code with if (getOption("datatable.multindex", FALSE)) _new_ else _old_ to make it optional for 1.10.6, and the default starting from 1.10.7. This will help people test the upcoming feature if they want to.

@jangorecki (Member) commented Nov 28, 2017

> Internally, fast subsetting with keys is now also used if datatable.use.index is FALSE. Previously, this option prevented not only the usage of indices, but also the usage of keys.

@MarkusBonsch By usage of keys, do you mean setkey(dt, col1); dt[.("a")] and dt["a", on="col1"]? Or something else? Because already now datatable.use.index=FALSE should not prevent using the key.

@MarkusBonsch (Contributor, Author)

@jangorecki I am referring to setkey(dt, col1); dt[col1 == "a"]. In master, this does not use the fast subset based on the key if datatable.use.index = FALSE (a sketch follows below).
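
A hedged sketch of that case:

dt <- data.table(col1 = c("a", "b", "c"), v = 1:3)
setkey(dt, col1)
options(datatable.use.index = FALSE)
dt[col1 == "a", verbose = TRUE]   # with this PR: still optimized via the key; in master: plain vector scan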

@MarkusBonsch (Contributor, Author)

Concerning test coverage: I am currently implementing extensive tests with all query combinations of up to three columns joined with '&'. I can implement the tests once bug #2511 is fixed.
Concerning benchmarks: I will scale them up as soon as possible.

@HughParsonage (Member) commented Jan 13, 2018

I observe a slight downside in performance in some cases, namely when an index or key is not present and the optimized subsetting is attempted. Apologies if I'm just not implementing it correctly.

(The use of verbose = TRUE does not have a significant impact on performance as far as I can see.)

library(data.table)
library(magrittr) # for the and() alias of `&`

Distribution_Course <-
  data.table(Course = c("AdvDip", "AssocDegree", "Bachelors",
                        "BachelorsHons", "Dip", "Doctorate", "Enabling",
                        "GradCert","GradDip", "Masters", "Non-award",
                        "OUAPostgrad", "OUAUndergrad",
                        "Other", "Xinst. prog postgrad",
                        "Xinst. prog undergrad"),
             N = c(255000,
                   520000, 55277000, 1320000, 1335000, 1215000,
                   771000, 788000, 2039000, 9071000, 737000,
                   72000, 1051000, 94000, 25000, 136000))

Distribution_Citizen <-
  data.table(Citizen = c("Aust", "Humanitarian visa", "NZ",
                         "Permanent visa", "Resid. overseas",
                         "Temp visa / diplo."),
              N = c(52851000, 150000, 482000,
                    1627000, 4906000, 14691000))

load_dummy <-
  data.table(Course = sample(Distribution_Course$Course,
                             size = 100e6,
                             replace = TRUE,
                             prob = Distribution_Course$N),
             Citizen = sample(Distribution_Citizen$Citizen,
                             size = 100e6,
                             replace = TRUE,
                             prob = Distribution_Citizen$N))

# Use magrittr::and() to avoid subsetting optimization
system.time(load_dummy[and(Course %chin% c("Bachelors", "BachelorsHons"), Citizen %chin% c("Aust", "Permanent visa", "NZ", "Humanitarian visa")), verbose = TRUE])
# user  system elapsed
# 3.54    0.18    3.73

system.time(load_dummy[Course %chin% c("Bachelors", "BachelorsHons") & Citizen %chin% c("Aust", "Permanent visa", "NZ", "Humanitarian visa"), verbose = TRUE])
# Creating new index 'Citizen__Course'
# Optimized subsetting with index 'Citizen__Course'
# on= matches existing index, using index
# user  system elapsed
# 5.32    1.04    6.37

setindex(load_dummy, NULL)

system.time(setindex(load_dummy, Course, Citizen))
# user  system elapsed
# 0.69    0.21    0.90
system.time(load_dummy[Course %chin% c("Bachelors", "BachelorsHons") & Citizen %chin% c("Aust", "Permanent visa", "NZ", "Humanitarian visa"), verbose = TRUE])
# Optimized subsetting with index 'Course__Citizen'
# on= matches existing index, using index
# user  system elapsed
# 3.39    0.30    3.70

setkey(load_dummy, Course, Citizen)

system.time(load_dummy[Course %chin% c("Bachelors", "BachelorsHons") & Citizen %chin% c("Aust", "Permanent visa", "NZ", "Humanitarian visa"), verbose = TRUE])
# Optimized subsetting with key 'Course, Citizen'
# on= matches existing key, using key
# Starting bmerge ...done in 0 secs
# user  system elapsed
# 1.05    0.09    1.17

@MarkusBonsch (Contributor, Author)

@HughParsonage thanks very much for the test. You are completely right: if no index is present, it can definitely be slower. This is probably also true for the existing "optimizations" of %in% and == queries in the master branch. I will include this case in my large-scale benchmark in the top post ASAP. Then we can make an informed decision about the best optimization strategy. It may be worthwhile to optimize on existing keys and indices but to drop auto-indexing. Let's see what the benchmark says.

@MarkusBonsch (Contributor, Author) commented Jan 13, 2018

I updated the benchmark in the top post. Very interesting findings, also about the existing optimization in the master branch. At first sight, these are my thoughts:

  • Optimization of notjoin queries doesn't make sense until a fast != operator is implemented in bmerge. See "notjoin" joins are slow #2567.
  • Optimization if a key exists is definitely a good idea.
  • Optimization if an index exists often makes little difference. In some cases it can be very good, but it can also slow things down, especially for %in% queries. I need to investigate whether this is only due to the large result set that needs to be reordered.
  • Optimization where an index is calculated upfront (auto-indexing) slows things down (except for the compound query).
  • Optimization of non-equi queries doesn't make sense until the implementation has been brought to speed (if at all possible)

Maybe the whole idea of auto-indexing needs to be reconsidered, at least for %in% queries? Further analysis needed...

@jangorecki (Member)

IMO it is good to have this in dev for a while to detect potential issues, to avoid cases like before, when the i optimization was implemented shortly before a release.

@MarkusBonsch (Contributor, Author) commented Jan 14, 2018

@jangorecki I tend to agree, especially since it is not yet clear which optimization strategy is really optimal.

@mattdowle Confirmed the slowdown of test.data.table() and that it is not due to newly introduced tests. I will investigate.

@HughParsonage (Member)

I've encountered a possible bug with this branch. To verify:

library(TeXCheckR)
validate_bibliography(file = "https://raw.githubusercontent.com/HughParsonage/grattex/master/bib/Grattan-Master-Bibliography.bib")
#> Error in stub[[1L]]: object of type 'symbol' is not subsettable

or with more focus (not defending my implementation, but I think it should still work!)

library(data.table)
library(magrittr)
just_key_journal_urls <- fread("https://raw.githubusercontent.com/HughParsonage/grattex/master/bib/Grattan-Master-Bibliography.bib", 
  sep = NULL)[[1]]

newspapers_pattern <-
  paste0("^(url).*",
         "(",
         "(((theguardian)|(afr))\\.com)",
         "|",
         "(((theaustralian)|(theage)|(smh)|(canberratimes)|(greatlakesadvocate))\\.com\\.au)",
         "|",
         "(theconversation\\.((edu\\.au)|(com)))",
         "|",
         "(insidestory\\.org\\.au)",
         ")")

journal_actual_vs_journal_expected <-
  data.table(text = just_key_journal_urls) %>%
  .[, entry_no := cumsum(grepl("^@", text))] %>%
  .[, is_article := any(grepl("^@Article", text)), by = entry_no] %>%
  .[, both_url_and_journal := .N == 4L, by = entry_no] %>%
  .[, is_newspaper := any(grepl(newspapers_pattern, text, perl = TRUE)), by = entry_no] %>%
  .[is_article & both_url_and_journal & is_newspaper]
#> Error in stub[[1L]]: object of type 'symbol' is not subsettable

@HughParsonage (Member)

FWIW, I think the speed improvements for keyed subsets are enough to merit inclusion, especially if this particular optimization can be avoided in cases where it is known not to improve performance. (Could we not just check whether a key is present, and only attempt it if the key matches the subset?)

I think Markus's point about the principle behind automatic optimization for subsetting in general is a good one, though tricky: for me at least, there are cases where I need to query the same column in a small data.table multiple times, and others where I only need to subset a big data.table once. And in both cases there are instances where it's worth reviewing the documentation thoroughly to eke out the best run time for that particular scenario, and instances where doing that would be no faster than even the worst case under the default settings.

@MarkusBonsch (Contributor, Author) commented Jan 14, 2018

@HughParsonage

  • Thank you for reporting the bug. It should be fixed now.
  • Concerning optimization strategy, your examples hit the nail on the head: it is very tricky and depends a lot on the context and the task at hand. The only thing that is definitely clear is that optimization should take place when a key is present (as you said as well).

@mattdowle (Member) commented Jan 16, 2018

Great. Good benchmarking. OK, so IIUC there's agreement that the optimization should be applied if an appropriate key or index already exists. It's such a big improvement, and would be what users expect, that using an existing key or index should be on by default. Then turn off auto index creation when more than one column is involved, for now. But keep auto index creation when one column is involved, since that's beneficial IIRC. Would that result in no downside by default, so it could be merged?

[ Aside 1 : In future, building indexes could be done in a background thread, perhaps. When the index creation finished it could be stored on the table. Clearly much more complicated with higher risks. But useful if it worked! ]

[ Aside 2 : The tradeoffs are going to change when index creation is parallelized so probably best to delay tradeoff optimization for now. ]

@MarkusBonsch (Contributor, Author)

@mattdowle I have found one or two other tweaks for speed improvements of the optimized mode.
I will implement them, update the benchmark, and then comment on the best optimization strategy.

@mattdowle Concerning the slowdown of test.data.table(): I investigated and found that it is really overhead. You will find the details in an update to your original comment above.

@MarkusBonsch (Contributor, Author)

I discovered a way to speed up the reordering after bmerge (see #2366). This is implemented for subsets larger than 1e6 rows now. The updated benchmark in the first post shows that this significantly increases the speed for DT[doubleCol %in% 1:5] (WITH INDEX) from ~40 to ~20 seconds and others as well.

For me, the optimal optimization strategy based on the benchmark results would be (sketched in code after this list):

  1. Never optimize for small data.tables (<1e3 rows) because of too much overhead
  2. Never optimize notjoin queries until we have brought them to speed ("notjoin" joins are slow #2567)
  3. Always (with above exceptions) optimize if a key is present.
  4. Always (with above exceptions) optimize if a proper index is present
  5. Auto-indexing is mostly beneficial if multiple subsets on the same columns are executed. The only exception is the combined query DT[intCol == 2L & doubleCol %in% 1:5 & charCol %chin% c("A", "B", "Y","Z")] (WITH NOTHING), which also profits on first execution. Since the speed decrease in the other cases is not too terrible and we may have faster, parallelized index calculation in the future, I would keep doing it for all queries.
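
A hedged sketch of that decision rule in code (a hypothetical helper for illustration, not the actual internals):

useFastSubset <- function(nrowDT, isNotJoin, hasKey, hasIndex) {
  if (nrowDT < 1e3L) return(FALSE)  # 1. small tables: overhead dominates
  if (isNotJoin)     return(FALSE)  # 2. notjoin queries stay unoptimized until #2567 is fixed
  if (hasKey)        return(TRUE)   # 3. a key is present
  if (hasIndex)      return(TRUE)   # 4. a proper index is present
  TRUE                              # 5. otherwise auto-index and optimize anyway
}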

@HughParsonage (Member) commented Jan 27, 2018

Would it be 'good' (feasible, performant, etc.) to provide an argument to [ that specifies that an auto index should not be created? It appears that setting options(datatable.auto.index = FALSE) would do the same thing, but something like DT[id == 100L, auto.index = FALSE] would make things a bit more transparent and also make chaining possible. I don't know R intimately enough to judge the relative performance of an additional argument versus a getOption call.

@MarkusBonsch (Contributor, Author)

@HughParsonage I like the idea. I would prefer an optimize option, however: DT[id == 100L, optimize = 2L].

@mattdowle (Member) left a comment

Truly awesome and compelling benchmarks!
Great comments and tests too.

@mattdowle mattdowle merged commit 05c5122 into master Jan 31, 2018
@mattdowle mattdowle deleted the speedySubset branch January 31, 2018 02:18
@MarkusBonsch (Contributor, Author)

Cool, thank you! I am so happy :)

@DavidArenburg (Member)

I think this is an edge case that fails because of .prepareFastSubset. This is at least what my investigation brought up (though I may be mistaken).

@MichaelChirico (Member)

MichaelChirico commented Jun 12, 2018 via email

@mattdowle (Member)

David found the issue, thanks: #2931
