
Passing named lists to .SDcols / .SD #5020

Open
grantmcdermott opened this issue May 21, 2021 · 26 comments
Labels
programming (parameterizing queries: get, mget, eval, env), top request (One of our most-requested issues)

@grantmcdermott
Contributor

grantmcdermott commented May 21, 2021

Is there any scope/appetite for supporting multiple .SD and .SDcols?

Motivation: I frequently encounter situations where I need to perform different aggregation tasks on distinct column groups. One group of columns will be aggregated as means, another group will be aggregated as medians, yet another group will be aggregated as sums, etc. In these cases, only one group can be passed through the convenience features of .SD(cols), while the other group(s) must all be aggregated manually.

Here's a simple (and somewhat ill-advised) example that illustrates the mechanics:

library(data.table)
d = as.data.table(iris)

d[, 
  c(lapply(.SD, sum),
    list(Petal.Length = mean(Petal.Length), Petal.Width = mean(Petal.Width))), 
  .SDcols = patterns("^Sepal"),
  by = Species]
#>       Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1:     setosa        250.3       171.4        1.462       0.246
#> 2: versicolor        296.8       138.5        4.260       1.326
#> 3:  virginica        329.4       148.7        5.552       2.026

Here the summed Sepal columns get the .SD convenience treatment, while I have to manually take the mean of the Petal columns separately (and name them; see also #1227 (comment)).

My proposal is to allow something like this instead:

d[, 
  c(lapply(.SD, sum), lapply(.SD2, mean)), 
  .SDcols = patterns("^Sepal"), .SDcols2 = patterns("^Petal"),
  by = Species]

I'm assuming here that you can match the relevant subsets based on the index (.SD2 => .SDcols2). If this is easy to do for one additional subset, then in principle it seems possible for any additional subsets (.SD3 => .SDcols3, etc). Of course, this may impose some small overhead that doesn't pass the cost-benefit test.

Feel free to close if this seems undesirable / too much work. Thanks for considering!

Update. Related: #1063 (comment) and possibly #4970

Proposed solution in the former is:

s_cols = grep("^Sepal", names(d), value = TRUE)
p_cols = grep("^Petal", names(d), value = TRUE)
d[, 
  {
   SD = unclass(.SD)
   c(lapply(SD[s_cols], sum), lapply(SD[p_cols], mean))
  },
  .SDcols = c(s_cols, p_cols),
  by = Species]
#>       Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1:     setosa        250.3       171.4        1.462       0.246
#> 2: versicolor        296.8       138.5        4.260       1.326
#> 3:  virginica        329.4       148.7        5.552       2.026
SessionInfo

> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux

Matrix products: default
BLAS/LAPACK: /usr/lib/libopenblas_haswellp-r0.3.13.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

@renkun-ken
Member

I have such use cases too.

If we don't want to introduce new parameters, it might make some sense to allow .SDcols to accept a list of character vectors, so that .SD provides a list of data.tables for this purpose. Then the example code could look like

dt[, rowSums(.SD[[1]]) / rowSums(.SD[[2]]), .SDcols = list(c("a", "b"), c("x", "y", "z"))]

or a named list of .SDcols:

dt[, rowSums(.SD$a) / rowSums(.SD$b), .SDcols = list(a = c("m", "n"), b = c("x", "y", "z"))]

or a list of patterns:

dt[, rowSums(.SD$a) / rowSums(.SD$b), .SDcols = list(a = patterns("^m\\d+$"), b = patterns("^n\\d+$"))]
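For comparison, a version of this that already runs with current data.table is the mget() idiom, where the column groups are spelled out in j. A minimal sketch, assuming a toy table with the hypothetical columns a, b, x, y, z used above:

library(data.table)
# toy data standing in for the hypothetical columns in the examples above
dt = data.table(a = 1:3, b = 4:6, x = 1:3, y = 2:4, z = 3:5)

# mget() returns each group as a named list; as.data.frame() lets rowSums() work on it
dt[, rowSums(as.data.frame(mget(c("a", "b")))) /
     rowSums(as.data.frame(mget(c("x", "y", "z"))))]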

@grantmcdermott
Contributor Author

@renkun-ken Oh, I like that too.

@MichaelChirico
Member

Thanks for raising. I've had something like @renkun-ken's suggestion in mind for a long time; I would have sworn it's already an issue, but I couldn't find anything searching.

Unfortunately I don't think it can be so simple, since .SD[[i]] currently refers to the i-th column of .SD, right?

@renkun-ken
Member

Unfortunately I don't think it can be so simple, since .SD[[i]] currently refers to the i-th column of .SD, right?

Yes, when .SDcols is a character vector. If .SDcols accepts a list, then it seems to require that the user knows that .SD would then become a list of data.tables instead.

@avimallu
Contributor

avimallu commented May 22, 2021

This would work very well with what I proposed partially in #4970 (more like requested as a feature). Building upon @renkun-ken's idea of .SDcols accepting a list, we can potentially do the following without many workarounds or breaking changes to the existing implementation:

  1. .SDcols can accept a list of character vectors, functions that return logical or integer output, or integer vectors.
  2. This is much like how the current .SDcols accepts any one of the above argument classes individually.
  3. When .SDcols is named, we can refer to each group as .SD["name"], or we can always use their indices.
  4. We would still need the user to refer to the first/last or nth item of .SD with .SD["name"][1L], or to a specific column with .SD["name"][["column"]].
  5. In the event that .SDcols has length 1 or is not a list, we could revert to the existing behaviour.

If my understanding of the code is right, we currently have the ability to differentiate between all three types of arguments to .SDcols above, so with some minor changes (which I'm guessing will be implementable in R itself), we should be able to add that feature. It would make code like the following possible:

as.data.table(Lahman::Batting)[
  , c(
    lapply(.SD["run_types"], sum),
    lapply(.SD["stints"], uniqueN),
    lapply(.SD["patt"], \(x) sum(x)/uniqueN(yearID))
  ),
  playerID,
  .SDcols = list(
    run_types = c("R", "IBB", "SO"),
    stints = 3:4,
    patt = patterns("^R$|(X.B)|HR"))]

I think the potential syntactic inconsistency of .SD[["column"]] not referring to a column in .SD can be forgiven, similar to how DT[x] does not filter for the values indicated by x when x is a data.table object (without bending our definition of a join into a filter).

If/when #4970 and #4883 are implemented, it would allow for ridiculously flexible operations involving across, multiple functions and multiple .SD without breaking existing syntax.

@renkun-ken
Member

renkun-ken commented May 22, 2021

Thanks @avimallu for the thoughts on this.

In the event that .SDcols has length 1 or is not a list, we could revert to the existing behaviour.

If we are building consistent behavior for programming purposes, I guess we should not let a length-1 list revert to the existing behavior, but rather produce a list of one data.table, so that the following example could behave consistently.

dt_rowSums_sd <- function(data, col_groups) {
  d1 <- data[, lapply(.SD, rowSums), keyby = name, .SDcols = col_groups]
  d1[, lapply(.SD, sd, na.rm = TRUE), keyby = name, .SDcols = -"name"]
}

dt_rowSums_sd(data, list(g1 = c("a", "b")))
dt_rowSums_sd(data, list(g1 = c("a", "b"), g2 = c("x", "y")))

I guess if a user provides a list of one character vector instead of the vector directly, it is most likely on purpose.

@jangorecki
Member

jangorecki commented May 22, 2021

This may sound like self-promotion of the new metaprogramming interface, but I would advocate using it instead of extending .SD.
It would look like

dt[, rowSums(.l1) / rowSums(.l2), env = list(
  .l1=list("a", "b"),
  .l2=list("x", "y", "z")
)]

substitute2(rowSums(.l1) / rowSums(.l2), env = list(.l1=list("a", "b"), .l2=list("x", "y", "z")))
#rowSums(list(a, b))/rowSums(list(x, y, z))

Providing columns by patterns should not be an issue, because we can do whatever we want to prepare the env argument.

.SD is just a special case of env usage

DT[, .SD, env=list(.SD=as.list(names(DT)))]

So if a more general solution is already available, why not use it?
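As an illustration of preparing the env argument from a regex, here is a sketch (it assumes a data.table version that has the env argument, and reuses the iris example from the top of the thread):

library(data.table)
d = as.data.table(iris)

# resolve the columns up front with a regex, then pass them as a *named* list
# so that the substituted j keeps the original column names
sepal_cols = grep("^Sepal", names(d), value = TRUE)
d[, lapply(.cols, sum), by = Species,
  env = list(.cols = as.list(setNames(nm = sepal_cols)))]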

@jangorecki jangorecki added the programming parameterizing queries: get, mget, eval, env label May 22, 2021
@grantmcdermott
Contributor Author

Very cool @jangorecki, I hadn't seen the new metaprogramming interface. I'll upgrade to the dev version of data.table to try it out shortly.

In the meantime, just so I have something concrete to compare, would you mind showing me how you would implement my example at the very top of the thread? Don't worry about pattern matching, I'm just interested to see how it would work.

@jangorecki
Member

jangorecki commented May 22, 2021

## mockup function that will return character column names
patterns = function(x) paste(sub("^", "", x, fixed=TRUE), c("Width","Length"), sep=".")

substitute2(c(
  lapply(.SD, sum), lapply(.SD2, mean)
  ), env = list(
    .SD = as.list(patterns("^Sepal")),
    .SD2 = as.list(patterns("^Petal"))
))
#c(lapply(list(Sepal.Width, Sepal.Length), sum), lapply(list(Petal.Width, Petal.Length), mean))

It is probably not wise to use the .SD name in env. If we used .SDcols as well, it would be ignored, because .SD from env is substituted at the very beginning and no longer exists by the time .SDcols would be processed.

d[, 
  c(lapply(.SD, sum), lapply(.SD2, mean)), 
  env = list(.SD = as.list(patterns("^Sepal")), .SD2 = as.list(patterns("^Petal"))),
  by = Species]

@grantmcdermott
Contributor Author

Oh, very nice.

It's very close to what I was looking for originally... and it looks like it just about provides a more generalisable, drop-in replacement for .SD. Which I assume was partly your goal?

My one remaining issue is the naming of the columns. Retaining the original column names is a side-feature of .SD that I really like. (Again, discussed elsewhere.) Is there a way to do that here, instead of returning "V1", "V2", etc.?

@jangorecki
Member

jangorecki commented May 22, 2021

Possibly this will do

patterns = function(x) setNames(nm=paste(sub("^", "", x, fixed=TRUE), c("Width","Length"), sep="."))
#...
#c(lapply(list(Sepal.Width = Sepal.Width, Sepal.Length = Sepal.Length), 
#    sum), lapply(list(Petal.Width = Petal.Width, Petal.Length = Petal.Length), 
#    mean))

@jangorecki
Member

jangorecki commented May 22, 2021

Replacing .SD was not the goal, because it works pretty well. The intention was more to replace get/mget, which AFAIK can carry performance overhead.
This manual page can be useful: https://rdatatable.gitlab.io/data.table/library/data.table/html/substitute2.html

@grantmcdermott
Contributor Author

grantmcdermott commented May 22, 2021

Excellent, thanks Jan. (Sorry for sporadic replies; I'm sneaking time online in between weekend parenting...)

Stepping back, here's my quick summary of this thread so far. Feel free to push back on or add to what I'm about to say.

  1. The advantage of sticking with / modifying .SD is that it provides a familiar interface that will be intuitive to all data.table users. I particularly like @renkun-ken's suggestion of allowing .SD to accept (named) lists that could then be referenced accordingly.
  2. Having said that, I've just stumbled on this example from Jan in a related thread, which I missed earlier. It gets me (us!) close to what I was originally looking for. However, I do worry that it's a bit opaque for new users and, let's be honest, not as syntax-friendly as equivalent operations in other packages or languages. I also don't think it's quite so straightforward to combine it with other convenience features like patterns, but I might be wrong here.
s_cols = grep("^Sepal", names(d), value = TRUE)
p_cols = grep("^Petal", names(d), value = TRUE)

d[, 
  {
   SD = unclass(.SD)
   c(lapply(SD[s_cols], sum), lapply(SD[p_cols], mean))
  },
  .SDcols = c(s_cols, p_cols),
  by = Species]
  3. The new metaprogramming interface is very cool and I need to spend more time with it. As a fairly experienced data.table user and advocate, I'm a bit concerned that we're sending mixed signals to new users about when to use .SD and when to switch to env. I like this earlier example from Jan (slightly adapted):
patterns2 = function(x) setNames(nm=paste(sub("^", "", x, fixed=TRUE), c("Width","Length"), sep="."))
d[, 
  c(lapply(SD1, sum), lapply(SD2, mean)), 
  env = list(SD1 = as.list(patterns2("^Sepal")), SD2 = as.list(patterns2("^Petal"))),
  by = Species]

It's probably just a limitation of this mock-up patterns2 function, but ideally I'd like to be able to combine it with another variable. However, this leaves a blank column:

d[, 
  c(lapply(SD1, sum), lapply(SD2, mean)), 
  env = list(SD1 = as.list(c(patterns2("^Sepal"), "Petal.Width")), SD2 = as.list(patterns2("^Petal"))),
  by = Species]
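One plausible explanation (a guess, not confirmed here): the extra "Petal.Width" element is unnamed after c(), so the substituted list() call contains an unnamed element and that column loses its name. Re-applying setNames() over the combined vector appears to avoid the blank name:

d[, 
  c(lapply(SD1, sum), lapply(SD2, mean)), 
  env = list(SD1 = as.list(setNames(nm = c(patterns2("^Sepal"), "Petal.Width"))),
             SD2 = as.list(patterns2("^Petal"))),
  by = Species]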

@matthewgson

I'm happy to find this post, and env looks easy to comprehend. Would there be a way to assign column names inside the bracket expression in the following example?

cols1 = c('var1','var2')
cols2 = c('var3','var4')
DT[, 
      c( lapply(SD1, sum), lapply(SD2, mean) ), 
      env = list(SD1 = as.list(cols1), SD2 = as.list(cols2)),
      by = Species]

# pseudo code that I'd like to perform
DT[, 
      c(cols1, cols2) = c( lapply(SD1, sum), lapply(SD2, mean) ), 
      env = list(SD1 = as.list(cols1), SD2 = as.list(cols2)),
      by = Species]

@jangorecki
Member

Yes, there is a way. I am not at a workstation now; if you play around a little you should figure it out. Try the substitute2 manual examples. Also be aware that setNames can be used here as well, but I would check with verbose=TRUE whether it switches off GForce optimization.

@matthewgson

matthewgson commented May 29, 2021

@jangorecki Thank you, the setNames function was what I was looking for, and it feels handy. I'm not clear about GForce optimization, but here's what I got:

DT = as.data.table(iris)
cols1 = c('Sepal.Length','Sepal.Width')
cols2 = c('Petal.Length','Petal.Width')
DT[, setNames(c(lapply(SD1, mean), lapply(SD2, median)), c(cols1,cols2)),
   by = Species,
   env = list(SD1 = as.list(cols1), SD2 = as.list(cols2)),
   verbose = TRUE]

Argument 'by' after substitute: Species
Argument 'j'  after substitute: setNames(c(lapply(list(Sepal.Length, Sepal.Width), mean), lapply(list(Petal.Length, Petal.Width), median)), c(cols1, cols2))
Detected that j uses these columns: [Sepal.Length, Sepal.Width, Petal.Length, Petal.Width]
Finding groups using forderv ... forder.c received 150 rows and 1 columns
0.000s elapsed (0.001s cpu) 
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
lapply optimization is on, j unchanged as 'setNames(c(lapply(list(Sepal.Length, Sepal.Width), mean), lapply(list(Petal.Length, Petal.Width), median)), c(cols1, cols2))'
GForce is on, left j unchanged
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ... The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.

  memcpy contiguous groups took 0.000s for 3 groups
  eval(j) took 0.000s for 3 calls
0.001s elapsed (0.001s cpu) 

      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1:     setosa        5.006       3.428         1.50         0.2
2: versicolor        5.936       2.770         4.35         1.3
3:  virginica        6.588       2.974         5.55         2.0

@jangorecki
Member

I see nice verbose info, and GForce is disabled, so this setNames approach is not ideal. I will have a look later at what the alternatives are.

@jangorecki
Member

jangorecki commented May 31, 2021

@matthewgson so...
this works fine

DT[, c(lapply(SD1, sum), lapply(SD2, mean)), by=Species,
   env = list(
     SD1 = as.list(setNames(nm=cols1)),
     SD2 = as.list(setNames(nm=cols2))
)]

The problem is that, as of now, it will not be GForce-optimized; I described it in #5032. To make GForce work as of now, you need to combine those two lists manually.

DT[, .j, by=Species,
  env = list(.j = c(
    lapply(lapply(setNames(nm=cols1), as.name), function(v) call("sum", v)),
    lapply(lapply(setNames(nm=cols2), as.name), function(v) call("mean", v))
))]

Note that the latter would be much more complex using base R substitute, because substitute2 has an "enlisting" feature that turns a list (the result from c()) into a list call, so it is interpreted in j as if written "by hand".
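A quick way to see what that enlisting produces is to inspect the substituted call on its own (a sketch; it only prints the constructed j rather than running the query):

library(data.table)
cols1 = c("Sepal.Length", "Sepal.Width")
cols2 = c("Petal.Length", "Petal.Width")

# the named list of sum()/mean() calls is turned into a single list(...) call,
# i.e. a j that looks hand-written and that GForce can then optimize
substitute2(.j, env = list(.j = c(
  lapply(lapply(setNames(nm = cols1), as.name), function(v) call("sum", v)),
  lapply(lapply(setNames(nm = cols2), as.name), function(v) call("mean", v))
)))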

@renkun-ken
Member

I didn't know the env= interface turns a list into a list call. Then I think env= is good enough for the purpose here.

@jangorecki
Member

Just for reference, it is in the examples in the manual:

## list elements are enlist'ed into list calls
(cl1 = substitute(f(lst), list(lst = list(1L, 2L))))
#f(list(1L, 2L))
(cl2 = substitute2(f(lst), I(list(lst = list(1L, 2L)))))
#f(list(1L, 2L))
(cl3 = substitute2(f(lst), list(lst = I(list(1L, 2L)))))
#f(list(1L, 2L))
(cl4 = substitute2(f(lst), list(lst = quote(list(1L, 2L)))))
#f(list(1L, 2L))
(cl5 = substitute2(f(lst), list(lst = list(1L, 2L))))
#f(list(1L, 2L))
cl1[[2L]] ## base R substitute with list element
#[[1]]
#[1] 1
#
#[[2]]
#[1] 2
#
cl2[[2L]] ## same
#[[1]]
#[1] 1
#
#[[2]]
#[1] 2
#
cl3[[2L]] ## same
#[[1]]
#[1] 1
#
#[[2]]
#[1] 2
#
cl4[[2L]] ## desired
#list(1L, 2L)
cl5[[2L]] ## automatically
#list(1L, 2L)

@grantmcdermott
Contributor Author

Personally, I'd still like to be able to pass a named list on to .SD, so that users can easily handle multiple aggregations in a familiar interface. Together with the aforementioned PR #4883 (which looks good to go from my local testing?), this would eliminate one of my few remaining data.table bugbears ;-)

Having said that, it's not clear to me that enough other users feel the same. And --- don't get me wrong --- the env functionality looks great. I'll try to incorporate it once it fully supports GForce, per Jan's comment above.

I'll close this issue then in the interests of reducing clutter for the DT devs. Thanks everyone for the useful discussion.

@Kamgang-B
Contributor

Kamgang-B commented Sep 11, 2021

My 2 cents.

After reading the discussion above and seeing the great new env argument, I still think that a named-list .SDcols has significant benefits that env does not.

The env argument can certainly achieve what we can do with a list .SDcols; in this sense, env encapsulates the .SDcols functionality. But when piping/chaining, a list .SDcols can do what env cannot. This is because, in many queries, env requires the user to have direct access to the column names beforehand when using a subset of columns, which is doable only at the beginning of a chain (see 4 below).

env is really great as a programming interface; e.g. it makes it easy and powerful to use data.tables inside functions.

When it comes to data cleaning/wrangling, where interaction with the data is very frequent, I think the env argument is very verbose, hard to use when chaining/piping, and requires more mental effort to write or to understand in other users' code than a list .SDcols, in the cases where both can achieve the same purpose.

I will try to explain the benefits that a list .SDcols brings over the env argument, in addition to the great arguments already advanced in @grantmcdermott's comment above.

By list .SDcols, I mean an extension of the current .SDcols to a list where each element has exactly the same properties/features as the current implementation of .SDcols, just like multiple .SDcols stored in a list.

Like the current .SDcols, a list .SDcols would support column ranges (startcolname:endcolname or startcolint:endcolint), negation (leading - or !), regexes, logical vectors, character vectors and functions in a handy and straightforward way. The equivalent way to achieve the same result using env is quite verbose, less readable, more user-specific, and harder to code.

# data for an illustration
A = as.data.table(ggplot2::diamonds)

env is quite verbose compared to a named-list .SDcols

# 1- list .SDcols
A[j = c(first2=lapply(SD1, first, 2), rg=lapply(SD2, range), med=lapply(SD3, median), ndist=lapply(SD4, uniqueN)),
  .SDcols = list(SD1=is.ordered, SD2=x:z, SD3=5:7, SD4=!is.numeric)]

# 2- equivalent using env
A[j = c(first2=lapply(SD1, first, 2), rg=lapply(SD2, range), med=lapply(SD3, median), ndist=lapply(SD4, uniqueN)),
  env = list(
	SD1=as.list(setNames(nm=names(A)[sapply(A, is.ordered)])),
	SD2=as.list(setNames(nm=names(A[, x:z]))),
	SD3=as.list(setNames(nm=names(A)[5:7])),
	SD4=as.list(setNames(nm=names(A)[sapply(A, function(x) !is.numeric(x))]))
  )]

The env approach is really verbose and not friendly; it requires significantly more mental effort to understand what is being done. Note that this would still be true even if a user created a custom function to abstract the repetitive parts of the env argument. While in the list-.SDcols case we can pass functions, column ranges (startcolname:endcolname or startcolint:endcolint), character vectors, logical vectors, etc., all of these possibly combined with the negation operator (e.g. !is.numeric, !patterns(...), !c("x", "y"), etc.), this is definitely not straightforward with the env argument.

Piping/chaining with env is not practical, because it requires the user to have access to the column names when chaining.

Assuming that we want to pipe the data.table from the previous query for additional operations, we can do something like

# 3- list .SDcols case
A |> 
  DT(j = c(first2=lapply(SD1, first, 2), rg=lapply(SD2, range), med=lapply(SD3, median), ndist=lapply(SD4, uniqueN)),
     .SDcols = list(SD1=is.ordered, SD2=x:z, SD3=5:7, SD4=!is.numeric)) |> 
  DT(j = c(lapply(SD1, droplevels), lapply(SD2, mean), SD3),
     .SDcols = list(SD1=is.factor, SD2=is.double, SD3=patterns("^ndist")))  # here the columns that satisfy the conditions are internally determined using the most recent data.table in the chain

# 4- what is the `env` equivalent ?
A |> 
  DT(j = c(first2=lapply(SD1, first, 2), rg=lapply(SD2, range), med=lapply(SD3, median), ndist=lapply(SD4, uniqueN)),
     env = list(
	 SD1=as.list(setNames(nm=names(A)[sapply(A, is.ordered)])),
	 SD2=as.list(setNames(nm=names(A[, x:z]))),
	 SD3=as.list(setNames(nm=names(A)[5:7])),
	 SD4=as.list(setNames(nm=names(A)[sapply(A, function(x) !is.numeric(x))]))
     )) |> 
  # HOW TO GET THE COLUMN NAMES IN THE env WHEN CHAINING?
  DT(j = c(lapply(SD1, droplevels), lapply(SD2, mean), SD3),
     env = list(
	 SD1=as.list(setNames(nm=names(?????)[sapply(?????, is.factor)])),          # ??
	 SD2=as.list(setNames(nm=names(?????)[sapply(?????, is.double)])),          # ??
	 SD3=as.list(setNames(nm=grep("^ndist", names(?????), value=TRUE)))))       # ??

As said above, the env argument is a great and powerful programming-interface tool (for instance when using a data.table inside a function) and offers a better alternative to the get/mget functions. I don't really see it as a good alternative to a named-list .SDcols, though.
A named list would be a very handy, intuitive, concise, and easy-to-learn extension of .SDcols, better suited to data cleaning/wrangling, where users spend a great amount of time interacting continuously/frequently with the data.

I really wonder whether this issue should not be reopened. A named-list .SDcols is just very nice and straightforward!
I would suggest reopening it unless it is absolutely certain that this feature cannot be provided.

@grantmcdermott
Contributor Author

grantmcdermott commented Sep 11, 2021

Thanks for the detailed comment @Kamgang-B. The truth is that I've somewhat regretted closing this issue, because I too feel that it would be a really nice feature. (In addition and complementary to env.) I've reopened it and changed the title to explicitly reference a named .SDcols list, because I think that's what we've settled on.

Jan and the other DT devs should ofc feel free to re-close if they disagree.

@grantmcdermott grantmcdermott changed the title Multiple .SD and .SDcols support (.SD2, .SDcols2, etc) Passing named lists to .SDcols / .SD Sep 11, 2021
@renkun-ken
Member

I think @Kamgang-B made a very good point about the simplicity and power of using .SDcols over env= for the use case of quickly selecting columns.

@jangorecki
Member

Well-explained advantages of .SD over env. Agreed, this FR makes sense.

@mattdowle mattdowle added this to the 1.14.3 milestone Oct 12, 2021
@jangorecki jangorecki modified the milestones: 1.14.3, 1.14.5 Jul 19, 2022
@jangorecki jangorecki modified the milestones: 1.14.11, 1.15.1 Oct 29, 2023
@MichaelChirico MichaelChirico added the top request One of our most-requested issues label Apr 14, 2024
@MichaelChirico MichaelChirico modified the milestones: 1.16.0, 1.17.0 Jul 10, 2024
@MichaelChirico
Member

Really nice thread and comment by @Kamgang-B. It also resolves my concern about the earlier suggestions to do .SDcols = list(a = ..., b = ...) combined with .SD$a, which creates a fundamental ambiguity about whether $a refers to a column or a data.table. @avimallu's approach to use [ instead is an improvement (we'd maybe want to make .SD a new S3 class like ("sd.data.table", "data.table", "data.frame")?), but still hits some fundamental conflicts since DT["key"] is already valid for other data.tables (even though there are no hits for this pattern as of today).

The approach of requiring .SDcols to be a named list, whose names would then become corresponding symbols available for evaluation in j, looks good to me (it will be a bit harder to implement).
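Until that lands, a rough emulation of the accepted design is possible with the env= interface (a sketch only: the list names become symbols usable in j, as the proposal describes, but the column groups have to be resolved up front rather than through the .SDcols conveniences):

library(data.table)
d = as.data.table(iris)

# resolve the groups up front; under the proposal, .SDcols = list(sepal = ..., petal = ...)
# would do this resolution internally
sdgroups = list(sepal = grep("^Sepal", names(d), value = TRUE),
                petal = grep("^Petal", names(d), value = TRUE))

d[, c(lapply(sepal, sum), lapply(petal, mean)),
  env = lapply(sdgroups, function(cols) as.list(setNames(nm = cols))),
  by = Species]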
