-
Notifications
You must be signed in to change notification settings - Fork 990
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
last/first get argument na.rm #5168
base: master
Are you sure you want to change the base?
Conversation
edit: consistency issue fixed options(datatable.verbose=TRUE)
DT = data.table(a=c(1:3,NA), b=1:2, c=c(1L, rep(NA, 3)))
options(datatable.optimize=2L)
DT[, last(.SD, na.rm=TRUE), b]
#> Argument 'by' after substitute: b
#> Finding groups using forderv ... forder.c received 4 rows and 1 columns
#> 0.000s elapsed (0.001s cpu)
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
#> Getting back original order ... forder.c received a vector type 'integer' length 2
#> 0.000s elapsed (0.000s cpu)
#> lapply optimization changed j from 'last(.SD, na.rm = TRUE)' to 'list(last(a, na.rm = TRUE), last(c, na.rm = TRUE))'
#> GForce optimized j to 'list(glast(a, na.rm = TRUE), glast(c, na.rm = TRUE))'
#> Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.000
#> gforce assign high and low took 0.000
#> gforce eval took 0.000
#> 0.001s elapsed (0.001s cpu)
#> b a c
#> 1: 1 3 1
#> 2: 2 2 NA
options(datatable.optimize=1L)
DT[, last(.SD, na.rm=TRUE), b]
#> Argument 'by' after substitute: b
#> Finding groups using forderv ... forder.c received 4 rows and 1 columns
#> 0.000s elapsed (0.000s cpu)
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
#> Getting back original order ... forder.c received a vector type 'integer' length 2
#> 0.000s elapsed (0.001s cpu)
#> lapply optimization changed j from 'last(.SD, na.rm = TRUE)' to 'list(last(a, na.rm = TRUE), last(c, na.rm = TRUE))'
#> Old mean optimization is on, left j unchanged.
#> Making each group and running j (GForce FALSE) ... last: using last(x[!is.na(x)]): na.rm=TRUE
#> last: using last(x[!is.na(x)]): na.rm=TRUE
#> last: using last(x[!is.na(x)]): na.rm=TRUE
#> last: using last(x[!is.na(x)]): na.rm=TRUE
#>
#> collecting discontiguous groups took 0.000s for 2 groups
#> eval(j) took 0.000s for 2 calls
#> 0.000s elapsed (0.000s cpu)
#> b a c
#> 1: 1 3 1
#> 2: 2 2 NA
options(datatable.optimize=0L)
DT[, last(.SD, na.rm=TRUE), b]
#> Argument 'by' after substitute: b
#> Finding groups using forderv ... forder.c received 4 rows and 1 columns
#> 0.000s elapsed (0.001s cpu)
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
#> Getting back original order ... forder.c received a vector type 'integer' length 2
#> 0.000s elapsed (0.000s cpu)
#> All optimizations are turned off
#> Making each group and running j (GForce FALSE) ... last: using x[, lapply(.SD, last, na.rm=TRUE)]): na.rm=TRUE
#> last: using last(x[!is.na(x)]): na.rm=TRUE
#> last: using last(x[!is.na(x)]): na.rm=TRUE
#> The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.
#> last: using x[, lapply(.SD, last, na.rm=TRUE)]): na.rm=TRUE
#> last: using last(x[!is.na(x)]): na.rm=TRUE
#> last: using last(x[!is.na(x)]): na.rm=TRUE
#>
#> collecting discontiguous groups took 0.000s for 2 groups
#> eval(j) took 0.001s for 2 calls
#> 0.001s elapsed (0.001s cpu)
#> b a c
#> 1: 1 3 1
#> 2: 2 2 NA |
Codecov Report
@@ Coverage Diff @@
## master #5168 +/- ##
==========================================
- Coverage 99.51% 99.49% -0.02%
==========================================
Files 78 77 -1
Lines 14756 14675 -81
==========================================
- Hits 14684 14601 -83
- Misses 72 74 +2
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
@mattdowle I fixed expected behavior by turning off apply optimization for |
…1 (melt) with 'match.call(fun, orig_call): invalid definition argument' after running the new tests at the prompt; so removed variables and created DT directly in new tests
R/data.table.R
Outdated
} | ||
# only lapply optimize if first/last has na.rm=FALSE see also #5168 | ||
headopt = jsub[[1L]] == "head" || jsub[[1L]] == "tail" | ||
firstopt = jsub[[1L]] == "first" || jsub[[1L]] == "last" && !narm_arg(first, jsub) ## fix for #2030 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iiuc, the ||
part here needs parens around it. and if so, there's a test missing because DT[, first(.SD, na.rm=TRUE), by=]
is still being optimized but that's not intended. Will do ...
In fact I'm about to change it so that first(DT, na.rm=TRUE)
does it column-wise, so then first(.SD, na.rm=TRUE)
can be optimized with a consistent result. na.rm="row"
can remove rows containing any NA.
… to use named arguments now that n and na.rm can be items 3 and 4 with either appearing first
If it's only output to test differently we can do c( |
…ength-1 and pad with NA
"wip shift multiple n return data.table rather than list" commit. |
@jangorecki shift multiple n returning list rather than data.table was causing problems in this PR yes. Since a data.table is a list too, I don't see much of an issue. But as I wrote it is wip. Since the list returned contains vectors of the same length, marking it as a data.table so anything using it knows it has a regular shape makes some sense to me. I don't think calling it out as a 'serious' breaking change helps. |
@ben-schwen do you want to update this PR to |
Ping @ben-schwen appreciate your help resolving conflicts here :) |
Will do. Not sure if I find the time today but definitely over the next week! |
Done all but |
324ce38
to
966f00b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SET_TRUELENGTH is not API
That was my plan. Matt's change in gsumm.c look favorable and make sense but I want to split them into multiple PRs since they are hard to grasp |
@ben-schwen I'm wondering whether to include this for 1.17.0, WDYT? |
These changes almost certainly have a stumbling block. I'm already working on the PR for it but wouldn't pull it into the 1.17 release. I guess we better aim for a stable release. |
Closes #4446
Closes #4239
Adds an
na.rm
argument with defaultna.rm=FALSE
tofirst
/last
and their respective GForce optimized versionsgfirst
/glast
.Here should be noted that
gfirst(na.rm=TRUE)
andglast(na.rm=TRUE)
return the first resp. last non NA value if such a value exists. If such a value does not existgfirst
/glast
will still return NA similar to gmedian(), gvar() and gsd().