Better subsetting optimization for compound queries #2494
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #2494      +/-   ##
==========================================
+ Coverage   91.42%   91.44%   +0.01%
==========================================
  Files          63       63
  Lines       12111    12203      +92
==========================================
+ Hits        11073    11159      +86
- Misses       1038     1044       +6
Continue to review full report at Codecov.
@franknarf1 mentioned this touches on #1453; that remains a separate issue, correct?
#1453 is definitely touched, and the request from the issue title is implemented. However, within the issue, optimized joins for other operators like `>`, `<` are mentioned. That has not been addressed.
R/data.table.R
Outdated
if (!is.null(attr(x, '.data.table.locked'))) return(NULL) # fix for #958, don't create auto index on '.SD'.
if (!getOption("datatable.use.index")) {
  return(NULL) # #1422
  ## Does this option also prevent the usage of keys? It is interpreted in this way here...
Correct: an index should not be created, and if one already exists it should not be used. The main purpose of that was to make it easy to benchmark index vs. no-index.
Thanks for your good questions and comments. Here, I am wondering about keys, not indices.
Usage of keys should not be affected by this option.
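A minimal sketch of the intended semantics (behaviour as described in this thread; exact verbose output may differ):

```r
library(data.table)
DT <- data.table(a = 1:5, b = 5:1)
options(datatable.use.index = FALSE)  # indices are neither created nor used
DT[b == 3L]                 # falls back to a vector scan, no auto index
setkey(DT, a)
DT[a == 2L]                 # keyed subset: still fast, unaffected by the option
options(datatable.use.index = TRUE)   # restore the default
```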
R/data.table.R
Outdated
}
## Determine whether the nature of isub in general supports fast binary search
remainingIsub <- isub
i <- list()
I would use a more meaningful name for this variable; `i` might be easily confused with the `[.data.table` `i` arg.
Actually, this is the list that will become the `i` arg for `bmerge` after it is `CJ`ed. Therefore, I think the naming makes sense.
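A hedged sketch of that structure (illustrative, not the actual internals): the per-column values of a compound query are collected in a list, and `CJ()` expands that list into the `i` table handed to `bmerge`.

```r
library(data.table)
i <- list(a = 1:2, b = "x")  # e.g. collected from DT[a %in% 1:2 & b == "x"]
do.call(CJ, i)               # the cross product becomes the i table for bmerge
#    a b
# 1: 1 x
# 2: 2 x
```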
}
if (is.null(idx)){
  ## if nothing else helped, auto create a new index that can be used
  if (!getOption("datatable.auto.index")) return(NULL)
Just for information... on the other hand, this option allows indexes to be used only when they were created manually by calling `setindex` (or created before setting this option).
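A small sketch of that behaviour (as described here; the option and functions are real data.table API):

```r
library(data.table)
options(datatable.auto.index = FALSE)
DT <- data.table(a = 1:10)
DT[a == 3L]        # no index is auto-created for this query
setindex(DT, a)    # ...but a manually created index
DT[a == 3L]        # can still be used for the optimized subset
options(datatable.auto.index = TRUE)  # restore the default
```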
inst/tests/tests.Rraw
Outdated
@@ -5547,12 +5547,12 @@ test(1375.3, DT[,mean(Petal.Width),by=Species][V1>1,Species:=toupper(Species)]$S
  # Secondary keys a.k.a indexes ...
  DT = data.table(a=1:10,b=10:1)
  test(1376.1, indices(DT), NULL)
- test(1376.2, DT[b==7L,verbose=TRUE], DT[4L], output="Creating new index 'b'")
+ test(1376.2, DT[b==7L,verbose=TRUE], DT[4L], output="Creating new index '__b'")
I am not sure about the naming you used. I would prefer the old output; `__b` is just the internal attribute name for the `b` index.
Fine for me. I will adjust so that the message completely reflects the return value of the `indices` function: `b`, but `b__c` for multi-column indices, OK?
Very nice PR, especially isolating the logic into a new function. I put a few comments below.
inst/tests/tests.Rraw
Outdated
test(1427, indices(DT), c("bar","baz__bar","baz"))
test(1428, DT[bar==9L, verbose=TRUE], output="Using existing index 'bar'")
test(1429, indices(setnames(DT,"bar","a")), c("baz", "a", "baz__a"))
test(1426, DT[baz==4L, verbose=TRUE], output="Optimized subsetting with index '__baz__bar'")
Why does this test not create a new index anymore but use an existing one? That is potentially a breaking change.
As I argue in the first comment of this PR, I think there is no chance that this is a breaking change:
Internally, indices are now used, even if not all index columns appear in the query.
Previously, this was prevented out of fear of spurious reordering. This, however, can't happen, since the order is restored by brute force after `bmerge` (unit tested in the PR).
The unit test that I inserted to guarantee no reordering is this one:
## test no reordering in groups
DT <- data.table(x = c(1,1,1,2), y = c(3,2,1,1))
setindex(DT, x, y)
test(1437.3, DT[x==1], setindex(data.table(x = c(1,1,1), y = c(3,2,1)), x,y))
What do you think?
Adding new tests is fine, but when changing an old one you should have a reason. I don't understand why 'creating index' has changed to 'using index' in 1426. Maybe it is a copy-paste mistake which passes the tests because you create and use the index in that test. @MarkusBonsch
Sorry, I wasn't clear enough.
In the master, the index `__baz` is created in 1426. On the speedySubset branch, however, no index is created, as the existing index `__baz__bar` is used:
## on branch speedySubset, repeating test 1426
DT = data.table(a=1:5,b=10:6)
setindex(DT,b)
setnames(DT,"b","foo")
setindex(DT,a,foo)
setnames(DT,"foo","bar")
setnames(DT,"a","baz")
DT[baz==4L, verbose=TRUE]
# Optimized subsetting with index 'baz__bar'
# Starting bmerge ...done in 0.001 secs
As argued above, I believe that this doesn't cause any problems, since the original order of the data is restored by brute force by the line:
if (length(o$xo)) i = fsort(o$xo[i], internal=TRUE) else i = fsort(i, internal=TRUE) # fix for #1495
If you have the feeling that this change is too dangerous because of unknown side effects that are not captured by any of the unit tests, I will revert the change. It was just an optimization during cleanup and not the main purpose of the PR.
@MarkusBonsch It is very good that we don't need to create a new index and can reuse part of an existing one. I was not aware of this improvement in the PR. I did not look at the ordering issue and cannot say anything about it now.
What if a user subsets many times by the `baz` field? Then it might be better to create that single-column index rather than maintain the original order by brute force.
To handle that well, every multi-column index would have to include the orderings of its parent subsets of columns within the index. This wouldn't cost much to create (as it has to be computed for the desired index anyway), but it would take much more memory. FYI @arunsrinivasan
@jangorecki You are right that brute-force reordering is costly, as has been shown in #2366 and also in the benchmarks of this PR in #2472. However, it is an existing part of the master and not introduced in this PR.
No matter which index we use, `baz` or `baz__bar`, this brute-force reordering is necessary to ensure the original order of the DT. Which index is faster in terms of reordering depends on the original order of the DT. In my opinion, there is no clear indication that a single-column index will in general be substantially more ordered than a multi-column index. If a user has to do many subsets on `baz`, the fastest option is a key, since this will avoid reordering costs almost completely.
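A hedged illustration of that trade-off: a key physically sorts the table once up front, so keyed subsets come back in table order with no post-`bmerge` reordering, whereas an index leaves the data unsorted and the original order must be restored after every subset.

```r
library(data.table)
DT <- data.table(baz = sample(1:5, 10, replace = TRUE), bar = 1:10)
setkey(DT, baz)  # one upfront physical sort
DT[.(3L)]        # binary search; matching rows are already contiguous and in order
```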
I just discovered that …
Unfortunately, `bmerge` for non-equi joins requires two additional arguments: …
I have adapted the code according to @jangorecki's valuable review. It is final from my perspective.
Awesome! I added a non-blocking comment about the benchmarks in the issue where they are posted.
Since this is a reasonably big, complicated change, should we postpone it to 1.10.8, or are we confident the test coverage is good?
Agree with Matt. Maybe we can branch new code with …
@MarkusBonsch By usage of keys you mean …
@jangorecki I am referring to `setkey(dt, col1); dt[col1 == "a"]`. In the master, this does not use the fast key-based subset if `datatable.use.index = FALSE`.
Concerning test coverage: I am currently implementing extensive tests with all query combinations of up to three columns with `&`. I can implement the tests once bug #2511 is fixed.
I observe a slight downside in performance in some cases, namely when an index or key is not present and the optimized subsetting is attempted. Apologies if I'm just not implementing it correctly. (The use of `and()` is to avoid the subsetting optimization; see the comment in the code.)

library(magrittr) # for and() alias
Distribution_Course <-
data.table(Course = c("AdvDip", "AssocDegree", "Bachelors",
"BachelorsHons", "Dip", "Doctorate", "Enabling",
"GradCert","GradDip", "Masters", "Non-award",
"OUAPostgrad", "OUAUndergrad",
"Other", "Xinst. prog postgrad",
"Xinst. prog undergrad"),
N = c(255000,
520000, 55277000, 1320000, 1335000, 1215000,
771000, 788000, 2039000, 9071000, 737000,
72000, 1051000, 94000, 25000, 136000))
Distribution_Citizen <-
data.table(Citizen = c("Aust", "Humanitarian visa", "NZ",
"Permanent visa", "Resid. overseas",
"Temp visa / diplo."),
N = c(52851000, 150000, 482000,
1627000, 4906000, 14691000))
load_dummy <-
data.table(Course = sample(Distribution_Course$Course,
size = 100e6,
replace = TRUE,
prob = Distribution_Course$N),
Citizen = sample(Distribution_Citizen$Citizen,
size = 100e6,
replace = TRUE,
prob = Distribution_Citizen$N))
# Use magrittr::and() to avoid subsetting optimization
system.time(load_dummy[and(Course %chin% c("Bachelors", "BachelorsHons"), Citizen %chin% c("Aust", "Permanent visa", "NZ", "Humanitarian visa")), verbose = TRUE])
# user system elapsed
# 3.54 0.18 3.73
system.time(load_dummy[Course %chin% c("Bachelors", "BachelorsHons") & Citizen %chin% c("Aust", "Permanent visa", "NZ", "Humanitarian visa"), verbose = TRUE])
# Creating new index 'Citizen__Course'
# Optimized subsetting with index 'Citizen__Course'
# on= matches existing index, using index
# user system elapsed
# 5.32 1.04 6.37
setindex(load_dummy, NULL)
system.time(setindex(load_dummy, Course, Citizen))
# user system elapsed
# 0.69 0.21 0.90
system.time(load_dummy[Course %chin% c("Bachelors", "BachelorsHons") & Citizen %chin% c("Aust", "Permanent visa", "NZ", "Humanitarian visa"), verbose = TRUE])
# Optimized subsetting with index 'Course__Citizen'
# on= matches existing index, using index
# user system elapsed
# 3.39 0.30 3.70
setkey(load_dummy, Course, Citizen)
system.time(load_dummy[Course %chin% c("Bachelors", "BachelorsHons") & Citizen %chin% c("Aust", "Permanent visa", "NZ", "Humanitarian visa"), verbose = TRUE])
# Optimized subsetting with key 'Course, Citizen'
# on= matches existing key, using key
# Starting bmerge ...done in 0 secs
# user system elapsed
# 1.05 0.09 1.17
@HughParsonage thanks very much for the test. You are completely right: if no index is present, it can definitely be slower. This is probably also true for the existing "optimizations" of `%in%` and `==` queries in the master branch. I will include this case in my large-scale benchmark in the top post ASAP. Then we can make an informed decision about the best optimization strategy. It may be worthwhile to optimize on existing keys and indices but to drop auto-indexing. Let's see what the benchmark says.
I updated the benchmark in the top post. Very interesting findings, also about the existing optimization in the master branch. At first sight, these are my thoughts:
Maybe the whole idea of auto indexing needs to be reconsidered, at least for `%in%` queries? Further analysis needed...
IMO it is good to have this in dev for a while to detect potential issues, and to avoid cases like the earlier `i` optimization that was implemented shortly before a release.
@jangorecki I tend to agree, especially since it is not yet clear which optimization strategy is really optimal.

@mattdowle Confirmed the slowdown of …
I've encountered a possible bug with this branch. To verify:

library(TeXCheckR)
validate_bibliography(file = "https://raw.githubusercontent.com/HughParsonage/grattex/master/bib/Grattan-Master-Bibliography.bib")
#> Error in stub[[1L]]: object of type 'symbol' is not subsettable

or with more focus (not defending my implementation, but I think it should still work!):

library(data.table)
library(magrittr)
just_key_journal_urls <- fread("https://raw.githubusercontent.com/HughParsonage/grattex/master/bib/Grattan-Master-Bibliography.bib",
sep = NULL)[[1]]
newspapers_pattern <-
paste0("^(url).*",
"(",
"(((theguardian)|(afr))\\.com)",
"|",
"(((theaustralian)|(theage)|(smh)|(canberratimes)|(greatlakesadvocate))\\.com\\.au)",
"|",
"(theconversation\\.((edu\\.au)|(com)))",
"|",
"(insidestory\\.org\\.au)",
")")
journal_actual_vs_journal_expected <-
data.table(text = just_key_journal_urls) %>%
.[, entry_no := cumsum(grepl("^@", text))] %>%
.[, is_article := any(grepl("^@Article", text)), by = entry_no] %>%
.[, both_url_and_journal := .N == 4L, by = entry_no] %>%
.[, is_newspaper := any(grepl(newspapers_pattern, text, perl = TRUE)), by = entry_no] %>%
.[is_article & both_url_and_journal & is_newspaper]
#> Error in stub[[1L]]: object of type 'symbol' is not subsettable
FWIW, I think the speed improvements for keyed subsets are enough to merit inclusion, especially if this particular optimization can be avoided in cases where it is known not to improve performance. (Could we not just check whether a key is present, and only attempt it if the key matches the subset? See the sketch below.)

I think Markus's point about the principle behind automatic optimization for subsetting in general is a good one, though tricky: for me at least, there are cases where I need to query the same column in a small data table multiple times, and others where I only need to subset a big data table once. And in both cases there are instances where it's worth reviewing the documentation thoroughly to eke out the best run time for that particular scenario, and instances where doing that would be no faster than even the worst case under the default setting.
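A hypothetical sketch of that check (helper names are mine, not the PR's actual code): attempt the optimization only when an existing key or index already covers the query columns as a leading prefix, and otherwise fall back to a vector scan instead of auto-creating an index.

```r
library(data.table)

# hypothetical helper: do the index columns cover the query columns as a
# leading prefix? (order within the prefix does not matter for bmerge)
covers <- function(index_cols, query_cols) {
  length(index_cols) >= length(query_cols) &&
    setequal(index_cols[seq_along(query_cols)], query_cols)
}

# hypothetical decision: prefer the key, then any covering index, else scan
subset_plan <- function(DT, query_cols) {
  if (!is.null(key(DT)) && covers(key(DT), query_cols)) return("key")
  for (idx in indices(DT))
    if (covers(strsplit(idx, "__", fixed = TRUE)[[1L]], query_cols))
      return(paste0("index '", idx, "'"))
  "vector scan"  # do not auto-create an index here
}

DT <- data.table(baz = 1:5, bar = 5:1)
setindex(DT, baz, bar)
subset_plan(DT, "baz")  # "index 'baz__bar'"
subset_plan(DT, "bar")  # "vector scan" (bar is not a leading prefix)
```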
Great. Good benchmarking. OK, so IIUC there's agreement that the optimization should be applied if there's already an appropriate key or index existing. It's such a big improvement, and would be what users expect, that using an existing key or index should be on by default. Then turn off auto index creation when more than one column is involved, for now, but keep auto index creation when one column is involved, since that's beneficial IIRC. Would that result in no downside by default, so it could be merged?

[Aside 1: In future, building indexes could perhaps be done in a background thread. When the index creation finished, it could be stored on the table. Clearly much more complicated with higher risks, but useful if it worked!]

[Aside 2: The tradeoffs are going to change when index creation is parallelized, so it's probably best to delay tradeoff optimization for now.]
@mattdowle I have found one or two other tweaks for speed improvements of the optimized mode.

@mattdowle Concerning the slowdown of `test.data.table()`: I investigated and found that it is really overhead. You'll find the details in an update to your original comment above.
I discovered a way to speed up the reordering after bmerge (see #2366). This is now implemented for subsets larger than 1e6 rows. The updated benchmark in the first post shows that this significantly increases the speed for large subsets. For me, the optimal optimization strategy based on the benchmark results would be: …
Would it be 'good' (feasible, performant, etc.) to provide an argument to …
@HughParsonage I like the idea. I would prefer an optimize option, however: …
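As an illustration only (the comment above is truncated), a hedged sketch of steering this via the optimization level, using `options(datatable.optimize)` from the top post; per this PR, subset optimization kicks in at level 3:

```r
library(data.table)
DT <- data.table(a = 1:5, b = 5:1)
options(datatable.optimize = 2L)   # below 3: optimized subsetting is skipped
DT[a == 2L & b == 4L, verbose = TRUE]  # plain vector scan
options(datatable.optimize = Inf)  # restore the default (all optimizations)
```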
Truly awesome and compelling benchmarks!
Great comments and tests too.
Cool, thank you! I am so happy :)
I think this is an edge case (https://stackoverflow.com/q/50810759/3001626) that fails because of `.prepareFastSubset`. This is at least what my investigation brought up (though I may be mistaken).
The base problem is with how non-equi joins are parsed... there's an open issue about that, but I couldn't find it with 5 minutes of searching.
David found the issue, thanks: #2931
Closes issue #2472.
This implementation makes fast subsetting with `bmerge` and keys/indices applicable to a wider range of subsetting queries in `i`. Detailed benchmarks can be found in my post in issue #2472.
Logic changes
- Optimized subsetting is active at `options(datatable.optimize) = 3`.
- `%chin%` queries like `DT[char %chin% c("A", "B")]` are now optimized.
- Compound queries are optimized if the subqueries are connected by `&` and each subquery satisfies the criteria, i.e. the operator is `==`, `%in%`, or `%chin%`, the left-hand side is a column of the data.table, and the right-hand side fulfills several complicated criteria (that I just copy-pasted); see the example after this list.
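A hedged illustration of a compound query that qualifies under these criteria (data and query constructed for this post; verbose output omitted rather than guessed):

```r
library(data.table)
DT <- data.table(char = c("A", "B", "C", "A"), num = c(1L, 2L, 3L, 1L))
# both subqueries use %chin%/== with a plain column on the left-hand side,
# so the whole conjunction can be answered by one bmerge on a two-column index
DT[char %chin% c("A", "B") & num == 1L, verbose = TRUE]
```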
- Keys are now used even if `datatable.use.index` is `FALSE`. Previously, this option didn't only prevent the usage of indices but also the usage of keys. According to @jangorecki, the new behaviour is the desired one.
Implications for unit tests
- Optimized subsetting now also applies to queries with `which = TRUE` and to grouped queries using `by` (1437.28ff).

Structure changes
I factored out the whole logic of determining whether a fast subset can be executed into the new function `prepareFastSubset`. The advantage is that, for future changes of these conditions, it is clear where to adapt the code.
Previously, optimized subsets had their own call to `bmerge`. Now, they are redirected to the normal join implementation. This causes a small slowdown (to be investigated) but offers the advantage of easier code maintenance and the possibility of including non-equi operators once the non-equi joins have been brought up to speed.
Benchmark
Code at the end of this post. Setup: a data.table with 1e9 rows and 3 columns (~50GB). Each subsetting query was tested in three versions, for 4 different package settings.
Things that are optimized now and weren't before
Things that were optimized before and are still optimized
Things where optimization was tested but discarded
I made a lot of effort to implement optimization for non-equi operators, but it turned out that non-equi joins are much slower than vector-based subsets. Until this is fixed, they won't be "optimized".
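For context, a hedged sketch of the comparison behind this decision (timings will vary by machine; the non-equi form uses `which = TRUE` to recover the matching row numbers):

```r
library(data.table)
DT <- data.table(x = runif(1e7))
system.time(DT[x > 0.5])  # vector-based subset
# the same subset expressed as a non-equi join
system.time(DT[DT[.(lo = 0.5), on = .(x > lo), which = TRUE]])
```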
Here is the code for my benchmarks:
Cheers,
Markus