Filter Helper Functions #4133

Closed
ColeMiller1 opened this issue Dec 20, 2019 · 8 comments

@ColeMiller1
Contributor

dplyr has many helper functions for filtering, including filter_at(.tbl, .vars, .vars_predicate, .preserve = FALSE) and top_n(x, n, wt). These functions are particularly helpful in combination with group_by(), for example:

library(dplyr)
tib <- as_tibble(iris)
tib %>%
  group_by(Species) %>%
  top_n(3, Sepal.Length)

While optimizations have been made for data.table in .SD[], a helper function in the i could 1) be more performant and 2) allow update-by-reference in the j. Additionally, since the dt[dt[, .I[1:3], by = fct]$V1] idiom is typically faster than the .SD[] method, a helper function would let us reach that peak performance and could be used directly in chained operations instead of saving an intermediate data.table object.
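For reference, the two existing data.table idioms being compared would look roughly like this for the iris example above (a sketch; the data.table dt, the ordering column, and the cutoff of 3 are only illustrative):

library(data.table)
dt <- as.data.table(iris)

# .SD idiom: order, then subset each group's rows
dt[order(-Sepal.Length), .SD[1:3], by = Species]

# .I idiom: collect row numbers per group, then subset once
dt[dt[order(-Sepal.Length), .I[1:3], by = Species]$V1]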

On Stack Overflow, these kinds of helper functions really make dplyr shine. I would really like to help data.table here, but this is too extreme a change for an unsolicited PR. I have done enough work that non-exported top.n(n, wt, by) and filter.at(cols, logic, by, all.vars = TRUE) currently work, with all tests checking out OK on my PC. As for performance:

top.n:

foo2 <- data.table(
  x = runif(n = 10^7),
  grp = sample(letters[1:5], 10^7, replace = TRUE)
)
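The columns below match bench::mark() output; a sketch of how such a comparison could be run (the exact call and arguments are an assumption, and top.n() only exists on my branch):

library(bench)
bench::mark(
  foo2[top.n(-3L, x, grp)],                               # proposed helper
  foo2[foo2[, .I[frank(x) <= 3L], grp]$V1],
  foo2[, .SD[frank(x, ties.method = "min") <= 3L], grp],
  check = FALSE,   # results differ in row/column order across the three forms
  iterations = 10
)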

# A tibble: 3 x 13
  expression                                              min median `itr/sec` mem_alloc `gc/sec` n_itr
  <bch:expr>                                            <bch> <bch:>     <dbl> <bch:byt>    <dbl> <int>
1 foo2[top.n(-3L, x, grp)]                               1.7s  2.04s     0.503     298MB    0.805    10
2 foo2[foo2[, .I[frank(x) <= 3L], grp]$V1]              1.76s  2.04s     0.490     336MB    0.980    10
3 foo2[, .SD[frank(x, ties.method = "min") <= 3L], grp] 1.71s   1.9s     0.513     298MB    0.872    10

filter.at:

foo <- data.table(
  x = as.character(runif(n = 10^6)),
  y = as.character(runif(n = 10^6)),
  z = as.character(runif(n = 10^6))
)
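Again, a sketch of the comparison in the same style (the exact bench::mark() call is an assumption; filter.at() only exists on my branch):

bench::mark(
  foo[filter.at(c("x", "y", "z"), like(x, "123"))],        # proposed helper
  foo[like(x, "123")][like(y, "123")][like(z, "123")],
  foo[like(x, "123") & like(y, "123") & like(z, "123")],
  check = FALSE
)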

# A tibble: 3 x 13
  expression                                              min median `itr/sec` mem_alloc `gc/sec` n_itr
  <bch:expr>                                            <bch> <bch:>     <dbl> <bch:byt>    <dbl> <int>
1 foo[filter.at(c("x", "y", "z"), like(x, "123"))]      290ms  301ms      3.32     7.9MB    0.831     4
2 foo[like(x, "123")][like(y, "123")][like(z, "123")]   291ms  291ms      3.35    8.13MB    0         5
3 foo[like(x, "123") & like(y, "123") & like(z, "123")] 858ms  884ms      1.10    22.9MB    0         5

Notice that filter.at() has a nice performance gain and would help in closing issues like #4105, as the code includes optimizations.

Is this something that is wanted? If it is, I could start working on a PR that would also include filter.all, top.n.bottom, sample.n, subset.n, and slice.n. If it is not wanted, I completely understand. Thank you for all of your work.

@jangorecki
Member

It is much better to optimize head(n = 2) etc. rather than add new functions. I don't know what the body of the filter.at function is, but note the comments in #4105; the solution proposed there is not appropriate because it can actually increase run time.
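For context, the head()-based pattern referred to here is presumably the existing grouped-subset idiom (illustrated on as.data.table(iris); the data and cutoffs are assumptions):

library(data.table)
dt <- as.data.table(iris)

# first 2 rows of each group; optimizing this is preferred over a new top-n helper
dt[, head(.SD, 2), by = Species]

# ordered first, head() then gives a per-group top 3 by Sepal.Length
dt[order(-Sepal.Length), head(.SD, 3), by = Species]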

@MichaelChirico
Member

see also #3804

@ColeMiller1
Contributor Author

ColeMiller1 commented Dec 20, 2019

Both functions rely heavily on NSE. EDIT: for the filter.at() example above, this is roughly the code that the optimized filter.at() forms:

# start with the rows of the first column that match
ind = which(like(x[['x']], '123'))
# then test only the remaining candidate rows against each further column
for (col in c('y', 'z')) {
  ind = ind[like(x[[col]][ind], '123')]
}

All of the other functions substitute their arguments in to largely end up with dt[, .I[...], by = ...]$V1 or, in the case of logical evaluations, dt[, Reduce('&', lapply(.SD, function(x) ...)), by = ...]$V1. EDIT: removed a lot of code which would be better off in a PR.
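Spelled out for the foo example above, the ungrouped substitution is roughly the following (a sketch, not the exact generated code):

cols <- c("x", "y", "z")
# build one logical vector across the selected columns, then use it in i
foo[foo[, Reduce(`&`, lapply(.SD, function(v) like(v, "123"))), .SDcols = cols]]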

@jangorecki
Member

jangorecki commented Dec 21, 2019

Your initial pseudocode chunk seems to be missing a closing bracket. AFAIU it will not address the problem, because it can quite easily add overhead (time and memory) when the filter returns large portions of the data.

@ColeMiller1
Contributor Author

I made edits and will be more careful in the future. My apologies.

This issue is more about consistency and syntactic sugar than pure performance. For users who want a top-n, it would be great to allow it by group. Unfortunately, the current API does not make that straightforward:

library(data.table)
dt = as.data.table(iris)

dt[order(-Sepal.Length)[1:3]]

#unintended results
dt[order(-Sepal.Length)[1:3], by = Species] 

#largely correct results with column re-ordering
dt[order(-Sepal.Length), .SD[1:3], by = Species]

#largely correct with by reordered based on top group result but more performant
dt[dt[order(-Sepal.Length), .I[1:3], by = Species]$V1]

# proposed
dt[top.n(3, Sepal.Length)]                       ## evaluates dt[order(-Sepal.Length)[1:min(.N, 3)]]
dt[top.n(3, Sepal.Length, Species)]              ## evaluates dt[dt[order(-Sepal.Length), .I[1:min(.N, 3)], by = Species]$V1]
dt[top.n(3, Sepal.Length, Species, ties = TRUE)] ## evaluates dt[dt[, .I[frank(-Sepal.Length, ties.method = 'min') <= 3], by = Species]$V1]

For filtering, I agree that for predicates that return mostly TRUE there would be a performance penalty with a filter.at approach. Still, users are very interested in functional programming thanks to the tidyverse documentation, which makes dt[filter.at(cols, fx)] very attractive compared to dt[dt[, Reduce('&', lapply(.SD, fx)), .SDcols = cols]], and similarly dt[filter.at(cols, fx, fct)] compared to dt[dt[, .I[Reduce('&', lapply(.SD, fx))], .SDcols = cols, by = fct]$V1].

In essence, these helper functions would help unite the i argument with the .SDcols and by arguments. My initial implementation of them performs well while trying to keep a consistent API. I would be happy to make a PR, but again, I understand if this is too much and am content if the issue is closed.

@jangorecki
Member

jangorecki commented Dec 23, 2019

The aim for consistency in data.table is consistency with base R. That is why we prefer head() rather than a new topn() function. Of course we can consider new helper functions; that is why it is probably best to file a WIP PR and show how it would help in practice. Please don't invest too much time in it, just a single use case that presents the usefulness of your proposal well. And don't focus on performance now, but be ready to explain how an efficient implementation would be made. If the helper is just meant to replace Reduce(), then there probably won't be much of a performance gain in the implementation.

ColeMiller1 mentioned this issue Dec 27, 2019
@sritchie73
Contributor

@ColeMiller1 perhaps it would be more useful to use or modify the dtplyr package?

@ColeMiller1
Contributor Author

@sritchie73 I will close this, as there is no interest in it from the community.
