Filter Helper Functions #4133
It is much better to optimize `head(n=2)` etc. than to add new functions. I don't know what the body of the `filter.at` function is, but note the comments in #4105; the solution proposed there is not appropriate because it can actually increase time.
See also #3804.
Both functions are highly NSE. The optimized … All other functions substitute in to largely end up with …
Your initial pseudo-code chunk seems to be missing a closing square bracket. AFAIU it will not address the problem, because it can quite easily add overhead (time and memory) when the filter returns large portions of the data.
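A minimal sketch of the overhead concern (the data and variable names here are my own, not from the original pseudo-code): computing row indices first and subsetting afterwards materializes a large integer vector, which a direct logical subset avoids when most rows pass the filter.

```r
library(data.table)

dt <- data.table(x = runif(1e6))

# Direct logical subset: a single pass over the data
a <- dt[x > 0.01]

# Index-first subset: materializes ~1e6 integer indices before
# subsetting, adding time and memory when most rows pass the filter
idx <- dt[, .I[x > 0.01]]
b <- dt[idx]

identical(a, b)
```

Both forms return the same rows; the difference is purely in the intermediate allocations.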
I made edits and will be more careful in the future. My apologies. This issue is more about consistency and syntactic sugar than pure performance. For users who want to do …
For filtering, I agree that for functions that return mostly TRUE results there would be a performance penalty using a … In essence, these helper functions would help unite the …
The aim in data.table is consistency with base R. That is why we prefer `head()` over a new `topn()` function. Of course we can consider new helper functions; that is why it is probably best to file a WIP PR and show how it would help in practice. Please don't invest too much time in it: a single use case presenting the usefulness of your proposal well is enough. And don't focus on performance now, but be ready to explain how an efficient implementation would be made. If the helper is just meant to replace `Reduce()`, then there probably will not be much of a performance-wise implementation to be had.
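For context, a helper that is "just a `Reduce()` replacement" might look like the sketch below (the helper name and signature are hypothetical, not data.table API, and the toy data is my own):

```r
library(data.table)

# Hypothetical helper: keep rows where `pred` is TRUE for all of `cols`.
# This is essentially Reduce() over per-column predicate results,
# so there is little room for a faster internal implementation.
filter_at_all <- function(dt, cols, pred) {
  keep <- Reduce(`&`, lapply(cols, function(cl) pred(dt[[cl]])))
  dt[keep]
}

dt <- data.table(a = c(1, -2, 3), b = c(4, 5, -6))
res <- filter_at_all(dt, c("a", "b"), function(x) x > 0)
# keeps only the row where both a and b are positive (a = 1, b = 4)
```

Any real gain would have to come from pushing the combined condition into the optimized `i` evaluation rather than from the helper itself.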
@ColeMiller1 perhaps it would be more useful to use/modify the dtplyr package?
@sritchie73 I will close as there is no interest in this from the community. |
dplyr has many helper functions for filtering, including `filter_at(.tbl, .vars, .vars_predicate, .preserve = FALSE)` and `top_n(x, n, wt)`. These functions are particularly helpful in combination with `group_by()`, such as: …

While optimizations have been made for data.table in `.SD[]`, a helper function in the `i` could be 1) more performant and 2) allow update-by-reference in the `j`. Additionally, since `dt[dt[, .I[1:3], by = fct]$V1]` is typically faster than the `.SD[]` method, a helper function would allow us to better achieve that peak performance and could be used in chaining operations instead of saving an intermediate data.table object.

On SO, these types of helper functions really help dplyr shine. I would really like to assist data.table, but this is too extreme for a random PR. I have done enough work that non-exported `top.n(n, wt, by)` and `filter.at(cols, logic, by, all.vars = TRUE)` currently work, with all tests checking out OK on my PC. As for performance:

top.n: …

filter.at: …

Notice `filter.at()` has a nice performance gain and would help in closing issues like #4105, as the code includes optimizations.

Is this something that is wanted? If it is, I could start working on a PR that would also include filter.all, top.n.bottom, sample.n, subset.n, and slice.n. If it is not wanted, I completely understand. Thank you for all of your work.
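The two grouped-subset idioms named in the issue can be sketched side by side (toy data and column names are mine):

```r
library(data.table)

dt <- data.table(fct = rep(c("a", "b", "c"), each = 5), v = 1:15)

# The .SD[] method: subset within each group
r1 <- dt[, .SD[1:2], by = fct]

# The .I idiom the issue refers to: collect row numbers per group,
# then subset once; typically faster and usable in a chain without
# saving an intermediate object
r2 <- dt[dt[, .I[1:2], by = fct]$V1]

identical(r1, r2)  # both select the first two rows of each group
```

A helper would let users write the second form without repeating `dt` twice inside a chain.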