How do .I and .SD work? #3668

hadley · 2019-06-27T14:48:51Z

I can access .N from inside a function inside a data table by evaluating in the parent frame. The same technique doesn't work for .I and .SD. Are they only created lazily if they appear in the AST?

library(data.table)

dt_N <- function() eval(quote(.N), parent.frame())
dt_I <- function() eval(quote(.I), parent.frame())
dt_SD <- function() eval(quote(.SD), parent.frame())

dt <- data.table(x = 1:10)

dt[, dt_N()]
#> [1] 10
dt[, dt_I()]
#> [1] 0
dt[, dt_SD()]
#> Null data.table (0 rows and 0 cols)

Created on 2019-06-27 by the reprex package (v0.2.1.9000)


franknarf1 · 2019-06-27T17:41:06Z

From what I can tell, the j expression is searched for the symbols .I and .SD. E.g.,

data.table/R/data.table.R

Line 891 in dbb0d0b

use.I = ".I" %chin% av

If they are found, then they'll be added to the environment, so these variants work:

dt[, {.SD; dt_SD()}]
dt[, {.I; dt_I()}]

This is not idiomatic metaprogramming, and will interfere with optimization:

dt[, lapply({.SD; dt_SD()}, max), by=rep(1:2, each=5), verbose=TRUE]

The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.

... while dt[, lapply(.SD, max), by=rep(1:2, each=5), verbose=TRUE] will use GForce, know what to do with the names, etc.
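The symbol search described above can be illustrated with a small base R sketch. This is not data.table's actual code: `uses_symbol` is a hypothetical helper, and base `%in%` stands in for data.table's `%chin%`. It only shows why a `.I` hidden inside a helper function is invisible to a scan of the j expression.

```r
# Hypothetical helper: does a symbol occur anywhere in a quoted expression?
# all.vars() returns variable names only, so function names such as
# dt_I or `{` are not reported.
uses_symbol <- function(expr, sym) sym %in% all.vars(expr)

j_plain  <- quote(dt_I())          # .I is hidden inside dt_I()
j_forced <- quote({.I; dt_I()})    # .I appears literally in j

uses_symbol(j_plain, ".I")   # FALSE
uses_symbol(j_forced, ".I")  # TRUE
```

Because the scan sees only the text of j, mentioning `.I` or `.SD` once anywhere in the expression is enough for it to be detected.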

A couple of idioms are eval(quote(DT[stuff])) or DT[eval(myi), eval(myj), by=eval(myby)]. The first way may fit better with chaining:

> (my_lazy_call <- call("[", x = as.name("dt"), j = as.name(".SD")))
dt[j = .SD]
> (my_lazier_call <- call("[", x = my_lazy_call, j = as.name(".N")))
dt[j = .SD][j = .N]

hadley · 2019-06-27T18:36:17Z

For my purposes it seems like it'll be easiest to generate .I myself (with seq_len(.N)) and I'll transform . to .SD in the expressions that I pass on to data.table.
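One way such a rewrite of `.` to `.SD` could look in base R is sketched below. It is an illustration only, not dtplyr's implementation; `dot_to_SD` is a hypothetical name.

```r
# Hypothetical helper: replace the symbol `.` with `.SD` in a quoted expression.
# do.call() is used so that substitute() receives the expression itself
# rather than the name of the variable holding it.
dot_to_SD <- function(expr) {
  do.call(substitute, list(expr, list(. = quote(.SD))))
}

j <- dot_to_SD(quote(lapply(., max)))
j  # lapply(.SD, max)
```

Note that substitute() also rewrites a `.` in function position, so an expression such as .(a, b) would need separate handling.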

ToeKneeFan · 2019-06-27T21:24:22Z

@hadley If you're planning to make dt_I() a user-visible function in dtplyr that exactly reproduces .I, note that seq_len(.N) will not do so when .I occurs in the same [.data.table call as a grouping by = operation. .I gives the indices in the original data.table, not in the group-level subset.

Here's an example to illustrate this:

library(data.table)
dt <- data.table(X = rep(c("A", "B", "C"), each = 3))
dt[, seq_len(.N), by = X]
#    X V1
# 1: A  1
# 2: A  2
# 3: A  3
# 4: B  1
# 5: B  2
# 6: B  3
# 7: C  1
# 8: C  2
# 9: C  3
dt[, .I, by = X]
#    X I
# 1: A 1
# 2: A 2
# 3: A 3
# 4: B 4
# 5: B 5
# 6: B 6
# 7: C 7
# 8: C 8
# 9: C 9

hadley · 2019-06-27T22:26:18Z

Ok, then seq_len(.N) is definitely what I'm looking for. Thanks!

ghowoo · 2019-06-27T22:51:02Z

That is a very nice explanation. Could you please also explain the difference between seq_len(.N) and 1:.N? I have been using 1:.N, and this is the first time I have come across seq_len(.N).
It works in the same way:
dt[, V1 := 1:.N, by = X]
   X V1
1: A  1
2: A  2
3: A  3
4: B  1
5: B  2
6: B  3
7: C  1
8: C  2
9: C  3

But, it doesn't work in this way:
dt[, 1:.N, by = X]
Error in [.data.table(dt, , 1:.N, by = X) :
Item 3 of j is 3 which is outside the column number range [1,ncol=2]

ToeKneeFan · 2019-06-27T23:45:40Z

@ghowoo
dt[, V1 := 1:.N, by = X] is a shortcut for dt[, c("V1") := list(1:.N), by = X]: "take dt, then calculate the assignment of 1:.N to a column named "V1", grouping by X." See the help documentation for data.table and vignettes for more details.

dt[, 1:.N, by = X] fails because using integers only in the j field implicitly sets with = F; that is, [.data.table thinks you mean "select the columns numbered from 1 to n, where n happens to also be how many rows there are" (which fails when .N is greater than the number of columns). This is due to a change in v1.9.8 (25 Nov 2016) (from old NEWS.0.md):

When j contains no unquoted variable names (whether column names or not), with= is now automatically set to FALSE. Thus, DT[,1], DT[,"someCol"], DT[,c("colA","colB")] and DT[,100:109] now work as we all expect them to; i.e., returning columns, #1188, #1149. Since there are no variable names there is no ambiguity as to what was intended. DT[,colName1:colName2] no longer needs with=FALSE either since that is also unambiguous. That is a single call to the : function so with=TRUE could make no sense, despite the presence of unquoted variable names. These changes can be made since nobody can be using the existing behaviour of returning back the literal j value since that can never be useful. This provides a new ability and should not break any existing code. Selecting a single column still returns a 1-column data.table (not a vector, unlike data.frame by default) for type consistency for code (e.g. within DT[...][...] chains) that can sometimes select several columns and sometime one, as has always been the case in data.table. In future, DT[,myCols] (i.e. a single variable name) will look for myCols in calling scope without needing to set with=FALSE too, just as a single symbol appearing in i does already. The new behaviour can be turned on now by setting the tersely named option: options(datatable.WhenJisSymbolThenCallingScope=TRUE). The default is currently FALSE to give you time to change your code. In this future state, one way (i.e. DT[,theColName]) to select the column as a vector rather than a 1-column data.table will no longer work leaving the two other ways that have always worked remaining (since data.table is still just a list after all): DT[["someCol"]] and DT$someCol. Those base R methods are faster too (when iterated many times) by avoiding the small argument checking overhead inside the more flexible DT[...] syntax as has been highlighted in example(data.table) for many years. 
In the next release, DT[,someCol] will continue with old current behaviour but start to warn if the new option is not set. Then the default will change to TRUE to nudge you to move forward whilst still retaining a way for you to restore old behaviour for this feature only, whilst still allowing you to benefit from other new features of the latest release without changing your code. Then finally after an estimated 2 years from now, the option will be removed.

seq_len(a) and 1:a are equivalent in value for positive integers; the former is faster and will avoid weird answers when a is nonpositive (e.g., 1:0 will give c(1,0)).
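The zero-length case can be checked in plain R without data.table. In this small sketch, `count_iterations` is a hypothetical helper used only to make the loop behaviour visible.

```r
n <- 0L

1:n         # c(1L, 0L): counts down instead of being empty
seq_len(n)  # integer(0)

# Hypothetical helper: count how many times a loop over idx would run.
count_iterations <- function(idx) {
  k <- 0L
  for (i in idx) k <- k + 1L
  k
}

count_iterations(1:n)         # 2
count_iterations(seq_len(n))  # 0
```

This matters most in code that may see an empty group or an empty table, where a loop over 1:n silently runs twice.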

Please note that these sorts of questions are usually best addressed on StackOverflow (so that bugs, feature requests, and other issues are not lost amid usage questions). I assume hadley is being given a little leeway because he is developing dtplyr (#3641) and some of his questions pertain to implementation, which is understandable. Nevertheless, I hope this helps!

ghowoo · 2019-06-28T13:44:54Z

@
Thanks for the detailed explanation. I understand that this is more of a usage question than an issue report. It really helps.

MichaelChirico · 2019-07-02T03:28:30Z

@hadley I think @franknarf1's explanation is pretty much on point.

FWIW I'll add this related issue: #1206 -- .i would alias seq_len(.N)... eventually

hadley · 2019-07-02T12:14:53Z

@MichaelChirico FWIW in dplyr we avoid doing computation when it's not needed by using active bindings in the evaluation environment; that avoids the need to do static analysis of the input expressions (which we try to avoid as much as possible).
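The active-binding approach can be sketched with base R's makeActiveBinding(). This is an illustration under assumptions, not dplyr's or data.table's real code: `eval_env` is a hypothetical evaluation environment, the fixed length of 10 mirrors the example table at the top of the thread, and `n_computed` is a counter that exists only to show when the value is computed.

```r
# Counter to show when the binding's function actually runs.
n_computed <- 0L

# Hypothetical evaluation environment with a lazily computed .I.
eval_env <- new.env()
makeActiveBinding(".I", function() {
  n_computed <<- n_computed + 1L
  seq_len(10L)
}, eval_env)

dt_I <- function() eval(quote(.I), parent.frame())

# An expression that never touches .I does not trigger the computation.
eval(quote(1 + 1), eval_env)
n_computed  # still 0

# .I is found even though the symbol is hidden inside a helper function.
idx <- eval(quote(dt_I()), eval_env)
n_computed  # now 1
```

With this design no scan of the input expression is needed: the value is produced on first access, wherever that access happens.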

hadley closed this as completed Jul 1, 2019
