-
Notifications
You must be signed in to change notification settings - Fork 990
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Subset.i #4139
Subset.i #4139
Conversation
Example uses for filter.at
# for anonymous functions vs. functions in the environment | ||
is_fx = any(as.character(logic) == 'x') || length(as.character(logic)) > 1 | ||
|
||
# Predicates wtih rle and cumsum do not work with chaining i.e. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if there will be a user defined function that behaves like rle
or cumsum
, how we are going to handle that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The is_fx
is poorly named. Mainly, the optimization is for if the user supplies an anonymous function. A previously defined function would be evaluated using Reduce
. That is, fx = cumsum(x) > 3; dt[filter.at(TRUE, fx)]
would not be optimized in a for loop. However, dt[filter.at(TRUE, some_function(x))]
, would be optimized as long as some_function(x)
is not 'rle|cumsum|min|max|sum'
. TO DO: change is_fx
to is_anon_fx
for clarity but I am interested in what you are working on before resolving.
if (getOption("datatable.optimize")>=1L) assign("order", forder, ienv) | ||
isub = tryCatch(eval(.massagei(l[['i']]), x, ienv), error=function(e) .checkTypos(e, names_x)) | ||
} else if (is.null(l$by)) { | ||
isub = do.call(`[.data.table`, l) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is generally a bad idea to call [.data.table
from inside [.data.table
. I recall there was a single(?) place where we already did that, and we wanted to get rid of it. If there is another way, then we should consider to use that instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
line 2003 - i = x[i, which=TRUE]
. I can untangle the calls without by
- just need to include extra code to handle the .SDcols
argument. I would need to dig very deeply to remove the call with the by
but I will look into it. It would be easier if #852 were further along to make the various by
and j
checks more modular.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jangorecki I have taken this comment under consideration. I am starting to look into C methods to do a quicker topn
which should assist me with filter.at
. Plus, the former is an actual feature that has upvotes.
For example, using forderv
can get head by group with:
dt = data.table(V1 = sample(50, 500, TRUE))
tmp = data.table:::forderv(dt[['V1']], sort = FALSE, retGrp = T)
dt[tmp[attr(tmp, 'starts')]]
But obviously, we allocate a large integer vector just to immediately subset. A new method that only returns a vector of what is used should be more performant.
Also, I appreciate your time and comments. I will likely close this is in a couple of days as while this is speedy as is, I acknowledge it is somewhat hacky.
Thanks for PR, looks interesting. I put some initial comments just by looking at the code. I like your test scripts. I am doing exactly the same in the initial development of new features. Removing |
Codecov Report
@@ Coverage Diff @@
## master #4139 +/- ##
==========================================
+ Coverage 99.41% 99.42% +<.01%
==========================================
Files 72 72
Lines 13909 13977 +68
==========================================
+ Hits 13828 13896 +68
Misses 81 81
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure about those helper functions. They just makes an API to efficient operations more user friendly, but eventually when we optimize current API well, they will became redundant.
|
||
#' @param cols required; can accept all values that .SDcols accepts | ||
#' @param logic required; a function or unquoted text that results in logic evaluation. The | ||
#' unquoted text must include `x` as an argument. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unquoted text must include
x
as an argument.
why you chose such an API? I haven't seen it anywhere in base R and package that I am using. IMO it is better to at least expect function(x) ...logic...
then x
is a naturally expected object in the logic. User may have x
variable defined in their code and expect it to be found in current scenario. This is how scoping in R should behave, proposed API works against it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
purrr
uses ~ .x
shorthand - I admit that I may have have taken it to the extreme.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is counterinituitive, the purr approach, because ~
has quite different context in base R. And yes, your approach is even more...
Allows for two new helper functions in i:
filter.at(cols, logic, by, all.vars = TRUE)
andtop.n(n, wt, by, ties = FALSE)
. Closes #4133 and #3804. WIPThe helper functions assist end-users by helping make the
by
function available ini
while also allowing developers to optimize for the most likely use cases. For example, depending on the arguments provided,top.n
allows for similarities tohead(dt, n)
, optimized non-grouping withdt[order(-wt)[1:n]]
, and then performant groupings withdt[dt[order(-wt), .I[1:n], by = grp]$V1]
.See the
filter.at.R
andtop.n.R
for many use cases which I would use to make a vignette if this is ultimately merged, but here are a few examples:filter.at:
top.n