Filter Helper Functions #4133

Closed
ColeMiller1 opened this issue Dec 20, 2019 · 8 comments

@ColeMiller1
Contributor

dplyr has many helper functions for filtering, including filter_at(.tbl, .vars, .vars_predicate, .preserve = FALSE) and top_n(x, n, wt). These functions are particularly helpful in combination with group_by(), for example:

library(dplyr)
tib <- as_tibble(iris)
tib %>%
  group_by(Species) %>%
  top_n(3, Sepal.Length)

While optimizations have been made for data.table in .SD[], a helper function in the i could 1) be more performant and 2) allow update-by-reference in the j. Additionally, since the dt[dt[, .I[1:3], by = fct]$V1] idiom is typically faster than the .SD[] method, a helper function would let us reach that peak performance and could be used directly in chained operations instead of saving an intermediate data.table object.
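For reference, the two existing data.table idioms being compared would look roughly like this for the iris example above (a sketch; the data.table dt, the ordering column, and the cutoff of 3 are only illustrative):

library(data.table)
dt <- as.data.table(iris)

# .SD idiom: order, then subset each group's rows
dt[order(-Sepal.Length), .SD[1:3], by = Species]

# .I idiom: collect row numbers per group, then subset once
dt[dt[order(-Sepal.Length), .I[1:3], by = Species]$V1]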

On Stack Overflow, these kinds of helper functions really make dplyr shine. I would really like to help data.table here, but this is too extreme a change for an unsolicited PR. I have done enough work that non-exported top.n(n, wt, by) and filter.at(cols, logic, by, all.vars = TRUE) currently work, with all tests checking out OK on my PC. As for performance:

top.n:

foo2 <- data.table(
  x = runif(n = 10^7),
  grp = sample(letters[1:5], 10^7, replace = TRUE)
)
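The columns below match bench::mark() output; a sketch of how such a comparison could be run (the exact call and arguments are an assumption, and top.n() only exists on my branch):

library(bench)
bench::mark(
  foo2[top.n(-3L, x, grp)],                               # proposed helper
  foo2[foo2[, .I[frank(x) <= 3L], grp]$V1],
  foo2[, .SD[frank(x, ties.method = "min") <= 3L], grp],
  check = FALSE,   # results differ in row/column order across the three forms
  iterations = 10
)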

# A tibble: 3 x 13
  expression                                              min median `itr/sec` mem_alloc `gc/sec` n_itr
  <bch:expr>                                            <bch> <bch:>     <dbl> <bch:byt>    <dbl> <int>
1 foo2[top.n(-3L, x, grp)]                               1.7s  2.04s     0.503     298MB    0.805    10
2 foo2[foo2[, .I[frank(x) <= 3L], grp]$V1]              1.76s  2.04s     0.490     336MB    0.980    10
3 foo2[, .SD[frank(x, ties.method = "min") <= 3L], grp] 1.71s   1.9s     0.513     298MB    0.872    10

filter.at:

foo <- data.table(
  x = as.character(runif(n = 10^6)),
  y = as.character(runif(n = 10^6)),
  z = as.character(runif(n = 10^6))
)
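Again, a sketch of the comparison in the same style (the exact bench::mark() call is an assumption; filter.at() only exists on my branch):

bench::mark(
  foo[filter.at(c("x", "y", "z"), like(x, "123"))],        # proposed helper
  foo[like(x, "123")][like(y, "123")][like(z, "123")],
  foo[like(x, "123") & like(y, "123") & like(z, "123")],
  check = FALSE
)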

# A tibble: 3 x 13
  expression                                              min median `itr/sec` mem_alloc `gc/sec` n_itr
  <bch:expr>                                            <bch> <bch:>     <dbl> <bch:byt>    <dbl> <int>
1 foo[filter.at(c("x", "y", "z"), like(x, "123"))]      290ms  301ms      3.32     7.9MB    0.831     4
2 foo[like(x, "123")][like(y, "123")][like(z, "123")]   291ms  291ms      3.35    8.13MB    0         5
3 foo[like(x, "123") & like(y, "123") & like(z, "123")] 858ms  884ms      1.10    22.9MB    0         5

Notice that filter.at() has a nice performance gain and would help in closing issues like #4105, as the code includes optimizations.

Is this something that is wanted? If it is, I could start working on a PR that would also include filter.all, top.n.bottom, sample.n, subset.n, and slice.n. If it is not wanted, I completely understand. Thank you for all of your work.

@jangorecki
Member

It is much better to optimize head(n = 2) etc. rather than add new functions. I don't know what the body of the filter.at function is, but note the comments in #4105; the solution proposed there is not appropriate because it can actually increase run time.
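For context, the head()-based pattern referred to here is presumably the existing grouped-subset idiom (illustrated on as.data.table(iris); the data and cutoffs are assumptions):

library(data.table)
dt <- as.data.table(iris)

# first 2 rows of each group; optimizing this is preferred over a new top-n helper
dt[, head(.SD, 2), by = Species]

# ordered first, head() then gives a per-group top 3 by Sepal.Length
dt[order(-Sepal.Length), head(.SD, 3), by = Species]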

@MichaelChirico
Member

see also #3804

@ColeMiller1
Contributor Author

ColeMiller1 commented Dec 20, 2019

Both functions rely heavily on NSE. EDIT: for the filter.at() example above, this is roughly the code that the optimized filter.at() forms:

# start with the rows of the first column that match
ind = which(like(x[['x']], '123'))
# then test only the remaining candidate rows against each further column
for (col in c('y', 'z')) {
  ind = ind[like(x[[col]][ind], '123')]
}

All of the other functions substitute their arguments in to largely end up with dt[, .I[...], by = ...]$V1 or, in the case of logical evaluations, dt[, Reduce('&', lapply(.SD, function(x) ...)), by = ...]$V1. EDIT: removed a lot of code which would be better off in a PR.
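Spelled out for the foo example above, the ungrouped substitution is roughly the following (a sketch, not the exact generated code):

cols <- c("x", "y", "z")
# build one logical vector across the selected columns, then use it in i
foo[foo[, Reduce(`&`, lapply(.SD, function(v) like(v, "123"))), .SDcols = cols]]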

@jangorecki
Member

jangorecki commented Dec 21, 2019

Your initial pseudocode chunk seems to be missing a closing bracket. AFAIU it will not address the problem, because it can quite easily add overhead (time and memory) when the filter returns large portions of the data.

@ColeMiller1
Contributor Author

I made edits and will be more careful in the future. My apologies.

This issue is more about consistency and syntactic sugar than pure performance. For users who want a top-n, it would be great to allow it by group. Unfortunately, the current API does not make that straightforward:

library(data.table)
dt = as.data.table(iris)

dt[order(-Sepal.Length)[1:3]]

#unintended results
dt[order(-Sepal.Length)[1:3], by = Species] 

#largely correct results with column re-ordering
dt[order(-Sepal.Length), .SD[1:3], by = Species]

#largely correct with by reordered based on top group result but more performant
dt[dt[order(-Sepal.Length), .I[1:3], by = Species]$V1]

# proposed
dt[top.n(3, Sepal.Length)]                       ## evaluates dt[order(-Sepal.Length)[1:min(.N, 3)]]
dt[top.n(3, Sepal.Length, Species)]              ## evaluates dt[dt[order(-Sepal.Length), .I[1:min(.N, 3)], by = Species]$V1]
dt[top.n(3, Sepal.Length, Species, ties = TRUE)] ## evaluates dt[dt[, .I[frank(-Sepal.Length, ties.method = 'min') <= 3], by = Species]$V1]

For filtering, I agree that for predicates that return mostly TRUE there would be a performance penalty with a filter.at approach. Still, users are very interested in functional programming thanks to the tidyverse documentation, which makes dt[filter.at(cols, fx)] very attractive compared to dt[dt[, Reduce('&', lapply(.SD, fx)), .SDcols = cols]], and similarly dt[filter.at(cols, fx, fct)] compared to dt[dt[, .I[Reduce('&', lapply(.SD, fx))], .SDcols = cols, by = fct]$V1].

In essence, these helper functions would help unite the i argument with the .SDcols and by arguments. My initial implementation of them performs well while trying to keep a consistent API. I would be happy to make a PR, but again, I understand if this is too much and am content if the issue is closed.

@jangorecki
Member

jangorecki commented Dec 23, 2019

The aim for consistency in data.table is consistency with base R. That is why we prefer head() rather than a new topn() function. Of course we can consider new helper functions; that is why it is probably best to file a WIP PR and show how it would help in practice. Please don't invest too much time in it, just a single use case that presents the usefulness of your proposal well. And don't focus on performance now, but be ready to explain how an efficient implementation would be made. If the helper is just meant to replace Reduce(), then there probably won't be much of a performance gain in the implementation.

ColeMiller1 mentioned this issue Dec 27, 2019
@sritchie73
Contributor

@ColeMiller1 perhaps it would be more useful to use or modify the dtplyr package?

@ColeMiller1
Contributor Author

@sritchie73 I will close this, as there is no interest in it from the community.
