How do .I and .SD work? #3668

Closed
hadley opened this issue Jun 27, 2019 · 9 comments

@hadley
Contributor

hadley commented Jun 27, 2019

I can access .N from inside a function inside a data table by evaluating in the parent frame. The same technique doesn't work for .I and .SD. Are they only created lazily if they appear in the AST?

library(data.table)

dt_N <- function() eval(quote(.N), parent.frame())
dt_I <- function() eval(quote(.I), parent.frame())
dt_SD <- function() eval(quote(.SD), parent.frame())

dt <- data.table(x = 1:10)

dt[, dt_N()]
#> [1] 10
dt[, dt_I()]
#> [1] 0
dt[, dt_SD()]
#> Null data.table (0 rows and 0 cols)

Created on 2019-06-27 by the reprex package (v0.2.1.9000)

@franknarf1
Contributor

From what I can tell, the j expression is searched for the symbols .I and .SD. E.g.,

use.I = ".I" %chin% av

If they are found, then they'll be added to the environment, so these variants work:

dt[, {.SD; dt_SD()}]
dt[, {.I; dt_I()}]

This is not idiomatic metaprogramming, and will interfere with optimization:

dt[, lapply({.SD; dt_SD()}, max), by=rep(1:2, each=5), verbose=TRUE]

The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.

... while dt[, lapply(.SD, max), by=rep(1:2, each=5), verbose=TRUE] will use GForce, know what to do with the names, etc.
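
The symbol search described above can be illustrated in base R. This is only a sketch of the idea, not data.table's actual implementation: it assumes the check amounts to collecting the variable names that appear in the unevaluated j expression, and it uses base `%in%` in place of data.table's `%chin%`. The names `j_direct` and `j_hidden` are made up for illustration.

```r
# Sketch only: how a static scan of j can miss symbols that are
# referenced inside a helper function rather than in j itself.
j_direct <- quote({.SD; dt_SD()})  # .SD appears literally in j
j_hidden <- quote(dt_SD())         # .SD is only used inside dt_SD()

# all.vars() lists the variable names in an expression without evaluating it
vars_direct <- all.vars(j_direct)
vars_hidden <- all.vars(j_hidden)

uses_SD_direct <- ".SD" %in% vars_direct  # TRUE
uses_SD_hidden <- ".SD" %in% vars_hidden  # FALSE
```

A scan like this only sees the text of j, which would explain why mentioning .SD or .I anywhere in j is enough to make the helper functions work.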

A couple of idioms are eval(quote(DT[stuff])) or DT[eval(myi), eval(myj), by=eval(myby)]. The first way may fit better with chaining:

> (my_lazy_call <- call("[", x = as.name("dt"), j = as.name(".SD")))
dt[j = .SD]
> (my_lazier_call <- call("[", x = my_lazy_call, j = as.name(".N")))
dt[j = .SD][j = .N]
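
As a hedged sketch of the first idiom, the call can be assembled with base R's substitute() and then evaluated, so that .SD appears literally in j by the time [.data.table inspects it. The placeholder name `J` and the variable names `my_j`, `my_call` and `res` are hypothetical.

```r
library(data.table)

dt <- data.table(x = 1:10)

# Build the unevaluated call dt[, .SD] by filling in the placeholder J
my_j <- quote(.SD)
my_call <- substitute(dt[, J], list(J = my_j))

# Evaluating the finished call lets data.table see .SD in j
res <- eval(my_call)
```

Because the expression is spliced in before evaluation, no helper function sits between data.table and the special symbol.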

@hadley
Contributor Author

hadley commented Jun 27, 2019

For my purposes it seems like it'll be easiest to generate .I myself (with seq_len(.N)) and I'll transform . to .SD in the expressions that I pass on to data.table.
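
A minimal base-R sketch of the second half of that plan, rewriting . to .SD in a captured expression before it is handed to data.table. This is an illustration under assumptions, not dtplyr's actual code; the helper name `dot_to_SD` is hypothetical.

```r
# Hypothetical helper: replace the symbol `.` with `.SD` in an expression.
# do.call() passes the already-captured expression on to substitute().
dot_to_SD <- function(expr) {
  do.call("substitute", list(expr, list(. = quote(.SD))))
}

rewritten <- dot_to_SD(quote(lapply(., max)))
# rewritten is now the call lapply(.SD, max)
```

Note that substitute() also rewrites `.` when it is used as a function name, as in .(a, b), so a real implementation would need to treat that case separately.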

@ToeKneeFan

@hadley If you're planning to make dt_I() a user-visible function in dtplyr that exactly reproduces .I, note that seq_len(.N) will not do that when .I occurs in the same [.data.table call as a grouping by = operation: .I gives the row indices in the original data.table, not the positions within the group-level subset.

Here's an example to illustrate this:

library(data.table)
dt <- data.table(X = rep(c("A", "B", "C"), each = 3))
dt[, seq_len(.N), by = X]
#    X V1
# 1: A  1
# 2: A  2
# 3: A  3
# 4: B  1
# 5: B  2
# 6: B  3
# 7: C  1
# 8: C  2
# 9: C  3
dt[, .I, by = X]
#    X I
# 1: A 1
# 2: A 2
# 3: A 3
# 4: B 4
# 5: B 5
# 6: B 6
# 7: C 7
# 8: C 8
# 9: C 9

@hadley
Contributor Author

hadley commented Jun 27, 2019

Ok, then seq_len(.N) is definitely what I'm looking for. Thanks!

@ghowoo

ghowoo commented Jun 27, 2019

That is a very nice explanation. Could you also explain the difference between seq_len(.N) and 1:.N? I have been using 1:.N, and this is the first time I have come across seq_len(.N).
It works the same way here:
dt[, V1 := 1:.N, by = X]
X V1
1: A 1
2: A 2
3: A 3
4: B 1
5: B 2
6: B 3
7: C 1
8: C 2
9: C 3

But it doesn't work this way:
dt[, 1:.N, by = X]
Error in [.data.table(dt, , 1:.N, by = X) :
Item 3 of j is 3 which is outside the column number range [1,ncol=2]

@ToeKneeFan

ToeKneeFan commented Jun 27, 2019

@ghowoo
dt[, V1 := 1:.N, by = X] is a shortcut for dt[, c("V1") := list(1:.N), by = X], which reads as: take dt, then assign 1:.N to a column named "V1", grouping by X. See the data.table help pages and vignettes for more details.

dt[, 1:.N, by = X] fails because a j consisting only of integers implicitly sets with = FALSE; that is, [.data.table thinks you mean "select the columns numbered from 1 to n, where n happens to also be the number of rows in the group" (which fails when .N is greater than the number of columns). This is due to a change in v1.9.8 (25 Nov 2016) (from old NEWS.0.md):

When j contains no unquoted variable names (whether column names or not), with= is now automatically set to FALSE. Thus, DT[,1], DT[,"someCol"], DT[,c("colA","colB")] and DT[,100:109] now work as we all expect them to; i.e., returning columns, #1188, #1149. Since there are no variable names there is no ambiguity as to what was intended. DT[,colName1:colName2] no longer needs with=FALSE either since that is also unambiguous. That is a single call to the : function so with=TRUE could make no sense, despite the presence of unquoted variable names. These changes can be made since nobody can be using the existing behaviour of returning back the literal j value since that can never be useful. This provides a new ability and should not break any existing code. Selecting a single column still returns a 1-column data.table (not a vector, unlike data.frame by default) for type consistency for code (e.g. within DT[...][...] chains) that can sometimes select several columns and sometime one, as has always been the case in data.table. In future, DT[,myCols] (i.e. a single variable name) will look for myCols in calling scope without needing to set with=FALSE too, just as a single symbol appearing in i does already. The new behaviour can be turned on now by setting the tersely named option: options(datatable.WhenJisSymbolThenCallingScope=TRUE). The default is currently FALSE to give you time to change your code. In this future state, one way (i.e. DT[,theColName]) to select the column as a vector rather than a 1-column data.table will no longer work leaving the two other ways that have always worked remaining (since data.table is still just a list after all): DT[["someCol"]] and DT$someCol. Those base R methods are faster too (when iterated many times) by avoiding the small argument checking overhead inside the more flexible DT[...] syntax as has been highlighted in example(data.table) for many years. 
In the next release, DT[,someCol] will continue with old current behaviour but start to warn if the new option is not set. Then the default will change to TRUE to nudge you to move forward whilst still retaining a way for you to restore old behaviour for this feature only, whilst still allowing you to benefit from other new features of the latest release without changing your code. Then finally after an estimated 2 years from now, the option will be removed.
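
A short sketch of two ways to get per-group row numbers without j being read as a column selection: wrap the expression in .() (an alias for list() inside j), or use seq_len(.N) as shown earlier in the thread. The column name `idx` and the result names are arbitrary.

```r
library(data.table)

dt <- data.table(X = rep(c("A", "B", "C"), each = 3))

# Wrapping in .() makes j a list expression rather than a bare 1:.N
res_list <- dt[, .(idx = 1:.N), by = X]

# seq_len(.N) is an ordinary function call in j
res_seq <- dt[, .(idx = seq_len(.N)), by = X]
```

Both forms should return the counter restarting at 1 within each group of X.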

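seq_len(a) and 1:a are equivalent in value for positive integers; the former is faster and avoids surprising answers when a is zero (1:0 counts down and gives c(1, 0), whereas seq_len(0) gives an empty integer vector). For negative a, seq_len(a) throws an error instead of silently counting down.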
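
To make the zero-length case concrete, here is a small base-R sketch; the variable names are only for illustration.

```r
n <- 0L

down <- 1:n          # counts down from 1 to 0: c(1L, 0L)
empty <- seq_len(n)  # integer(0)

# A loop over 1:n runs twice when n is 0; a loop over seq_len(n) never runs
iters_colon <- 0L
for (i in 1:n) iters_colon <- iters_colon + 1L

iters_seq <- 0L
for (i in seq_len(n)) iters_seq <- iters_seq + 1L
```

This matters most for empty groups or empty tables, where .N can be 0.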

Please note that these sorts of questions are usually best addressed on StackOverflow (so that bugs, feature requests, and other issues are not lost amid usage questions). I assume hadley is being given a little leeway because he is developing dtplyr (#3641) and some of his questions pertain to implementation, which is understandable. Nevertheless, I hope this helps!

@ghowoo

ghowoo commented Jun 28, 2019

Thanks for the detailed explanation. I understand this is more of a usage question than an actual issue. It really helps.

hadley closed this as completed Jul 1, 2019

@MichaelChirico
Member

@hadley I think @franknarf1's explanation is pretty much on point.

FWIW I'll add this related issue: #1206 -- .i would alias seq_len(.N)... eventually

@hadley
Contributor Author

hadley commented Jul 2, 2019

@MichaelChirico FWIW in dplyr we avoid doing computation when it's not needed by using active bindings in the evaluation environment; that avoids the need to do static analysis of the input expressions (which we try to avoid as much as possible).
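
The active-binding approach can be sketched in base R with makeActiveBinding(). This is a toy illustration of the general mechanism, not dplyr's implementation; the names `mask`, `row_ids` and `n_computed` are hypothetical.

```r
# Sketch: a value that is only computed when an expression actually uses it
mask <- new.env()
n_computed <- 0L

makeActiveBinding("row_ids", function() {
  n_computed <<- n_computed + 1L  # count how often the value is built
  seq_len(10L)
}, mask)

# This expression never mentions row_ids, so the function is not called
a <- eval(quote(1 + 1), mask)
calls_after_unused <- n_computed

# This one does, which triggers the computation on access
b <- eval(quote(length(row_ids)), mask)
calls_after_used <- n_computed
```

Since the work happens on access, it does not matter whether the symbol appears directly in the expression or only inside a function that looks it up in the evaluation environment.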
