Passing named lists to .SDcols / .SD #5020
Comments
I have such use cases too. If we don't want to introduce new parameters, it might make some sense to allow:

an unnamed list: dt[, rowSums(.SD[[1]]) / rowSums(.SD[[2]]), .SDcols = list(c("a", "b"), c("x", "y", "z"))]

a named list: dt[, rowSums(.SD$a) / rowSums(.SD$b), .SDcols = list(a = c("m", "n"), b = c("x", "y", "z"))]

or a list of patterns: dt[, rowSums(.SD$a) / rowSums(.SD$b), .SDcols = list(a = patterns("^m\\d+$"), b = patterns("^n\\d+$"))]
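For concreteness, a minimal sketch of what the unnamed-list variant would replace today (the table dt and the two column groups are made up for illustration):

library(data.table)
dt <- data.table(a = 1:3, b = 4:6, x = 7:9, y = 10:12, z = 13:15)
grp1 <- c("a", "b"); grp2 <- c("x", "y", "z")

# current workaround: one .SD holding both groups, subset manually inside j
dt[, {
  sd <- unclass(.SD)   # plain named list of the .SD columns
  rowSums(as.data.frame(sd[grp1])) / rowSums(as.data.frame(sd[grp2]))
}, .SDcols = c(grp1, grp2)]

A list-valued .SDcols would let the two groups be declared directly in the call instead.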
@renkun-ken Oh, I like that too.
Thanks for raising. I've had something like @renkun-ken's suggestion in mind for a long time; I would have sworn it's already an issue, but I couldn't find anything searching. Unfortunately I don't think it can be so simple, since .SD[[i]] currently refers to the i-th column of .SD, right?
Yes, when .SDcols is given as it is today, .SD[[i]] refers to the i-th column of .SD, so the list form would need a different way to access the groups.
This would work very well with what I proposed partially in #4970 (more like requested a feature), building upon @renkun-ken's idea of a named list for .SDcols.

If my understanding of the code is right, we currently have the ability to differentiate between all three different types of arguments to .SDcols:

as.data.table(Lahman::Batting)[
  , .(
    lapply(.SD["run_types"], sum),
    lapply(.SD["stints"], uniqueN),
    lapply(.SD["patt"], \(x) sum(x)/uniqueN(yearID))
  ),
  playerID,
  .SDcols = list(
    run_types = c("R", "IBB", "SO"),
    stints = 3:4,
    patt = patterns("$R^|(X.B)|HR"))]

I think the potential syntactical inconsistency with the current .SDcols behaviour is manageable. If/when #4970 and #4883 are implemented, it would allow for ridiculously flexible operations involving .SD.
Thanks @avimallu for the thoughts on this.
If we are building consistent behavior for programming purposes, I guess we should not let a length-1 list revert to the existing behavior, but should treat it as a list of one data.table instead, so that the following example behaves consistently:

dt_rowSums_sd <- function(data, col_groups) {
  d1 <- data[, lapply(.SD, rowSums), keyby = name, .SDcols = col_groups]
  d1[, lapply(.SD, sd, na.rm = TRUE), keyby = name, .SDcols = -"name"]
}
dt_rowSums_sd(data, list(g1 = c("a", "b")))
dt_rowSums_sd(data, list(g1 = c("a", "b"), g2 = c("x", "y")))

I guess if the user provides a list of one character vector instead of the vector directly, it is most likely on purpose.
This may sound like self-promotion of the new metaprogramming interface, but I would advocate using it instead of extending .SDcols:

dt[, rowSums(.l1) / rowSums(.l2), env = list(
  .l1 = list("a", "b"),
  .l2 = list("x", "y", "z")
)]

substitute2(rowSums(.l1) / rowSums(.l2), env = list(.l1 = list("a", "b"), .l2 = list("x", "y", "z")))
#rowSums(list(a, b))/rowSums(list(x, y, z))

Providing columns by patterns should not be an issue, because we can do whatever we want when preparing the env list:

DT[, .SD, env = list(.SD = as.list(names(DT)))]

So if a more general solution is already available, why not use it?
Very cool @jangorecki, I hadn't seen the new metaprogramming interface. I'll upgrade to the dev version of data.table to try it out shortly. In the meantime, just so I have something concrete to compare, would you mind showing me how you would implement my example at the very top of the thread? Don't worry about pattern matching, I'm just interested to see how it would work.
## mockup function that will return character column names
patterns = function(x) paste(sub("^", "", x, fixed=TRUE), c("Width","Length"), sep=".")
substitute2(c(
lapply(.SD, sum), lapply(.SD2, mean)
), env = list(
.SD = as.list(patterns("^Sepal")),
.SD2 = as.list(patterns("^Petal"))
))
#c(lapply(list(Sepal.Width, Sepal.Length), sum), lapply(list(Petal.Width, Petal.Length), mean))

It is probably not wise to reuse the .SD name inside env=, but to keep the comparison close to your original:

d[,
  c(lapply(.SD, sum), lapply(.SD2, mean)),
  env = list(.SD = as.list(patterns("^Sepal")), .SD2 = as.list(patterns("^Petal"))),
  by = Species]
Oh, very nice. It's very close to what I was looking for originally... and looks like it just about provides a more generalisable, drop-in replacement for .SD/.SDcols. My one remaining issue is the naming of the columns. Retaining the original column names is a side-feature of lapply()-ing over .SD that I'd like to keep.
Possibly this will do:

patterns = function(x) setNames(nm=paste(sub("^", "", x, fixed=TRUE), c("Width","Length"), sep="."))
#...
#c(lapply(list(Sepal.Width = Sepal.Width, Sepal.Length = Sepal.Length),
# sum), lapply(list(Petal.Width = Petal.Width, Petal.Length = Petal.Length),
# mean))
Replacing .SD was not the goal because it works pretty well. Intention was more to replace workarounds like get()/mget() and eval(parse()).
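For example (my own illustration, not from the thread; the table and column names are made up), the kind of indirection env is meant to replace:

library(data.table)
DT  <- data.table(grp = c("a", "a", "b"), val = 1:3)
col <- "val"

DT[, sum(get(col)), by = grp]                       # old-style indirection via get()
DT[, sum(.col), by = grp, env = list(.col = col)]   # env=: j is substituted to sum(val)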
Excellent, thanks Jan. (Sorry for sporadic replies; I'm sneaking time online in between weekend parenting...) Stepping back, here's my quick summary of this thread so far. Feel free to push back on or add to what I'm about to say.

First, the status-quo workaround, splitting an unclassed .SD by hand:

s_cols = grep("^Sepal", names(d), value = TRUE)
p_cols = grep("^Petal", names(d), value = TRUE)
d[,
  {
    SD = unclass(.SD)
    c(lapply(SD[s_cols], sum), lapply(SD[p_cols], mean))
  },
  .SDcols = c(s_cols, p_cols),
  by = Species]

Second, the env= approach with a names-preserving helper:

patterns2 = function(x) setNames(nm=paste(sub("^", "", x, fixed=TRUE), c("Width","Length"), sep="."))
d[,
  c(lapply(SD1, sum), lapply(SD2, mean)),
  env = list(SD1 = as.list(patterns2("^Sepal")), SD2 = as.list(patterns2("^Petal"))),
  by = Species]

It's probably just a limitation of this knock-up patterns2() mockup, but mixing in columns from outside the pattern feels a bit clunky:

d[,
  c(lapply(SD1, sum), lapply(SD2, mean)),
  env = list(SD1 = as.list(c(patterns2("^Sepal"), "Petal.Width")), SD2 = as.list(patterns2("^Petal"))),
  by = Species]
I'm happy to find this post, and I wonder whether there is a way to keep the original column names when aggregating two groups of columns like this?
Yes, there is a way. I am not at a workstation now; if you play with it a little you should figure it out. Try the substitute2 manual examples. Also be aware that setNames can be used here as well, but I would check with verbose=TRUE that it doesn't switch off GForce optimization.
@jangorecki Thank you -

DT = as.data.table(iris)
cols1 = c('Sepal.Length','Sepal.Width')
cols2 = c('Petal.Length','Petal.Width')
DT[, setNames(c(lapply(SD1, mean), lapply(SD2, median)), c(cols1,cols2)),
by = Species,
env = list(SD1 = as.list(cols1), SD2 = as.list(cols2)),
verbose = TRUE]
Argument 'by' after substitute: Species
Argument 'j' after substitute: setNames(c(lapply(list(Sepal.Length, Sepal.Width), mean), lapply(list(Petal.Length, Petal.Width), median)), c(col1, col2))
Detected that j uses these columns: [Sepal.Length, Sepal.Width, Petal.Length, Petal.Width]
Finding groups using forderv ... forder.c received 150 rows and 1 columns
0.000s elapsed (0.001s cpu)
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
lapply optimization is on, j unchanged as 'setNames(c(lapply(list(Sepal.Length, Sepal.Width), mean), lapply(list(Petal.Length, Petal.Width), median)), c(col1, col2))'
GForce is on, left j unchanged
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ... The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.
memcpy contiguous groups took 0.000s for 3 groups
eval(j) took 0.000s for 3 calls
0.001s elapsed (0.001s cpu)
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1: setosa 5.006 3.428 1.50 0.2
2: versicolor 5.936 2.770 4.35 1.3
3: virginica 6.588 2.974 5.55 2.0
I see, nice verbose info: GForce is disabled, so this setNames approach is not ideal. I will have a look later at what the alternatives are.
@matthewgson so...

DT[, c(lapply(SD1, sum), lapply(SD2, mean)), by = Species,
  env = list(
    SD1 = as.list(setNames(nm = cols1)),
    SD2 = as.list(setNames(nm = cols2))
  )]

The problem is that, as of now, it will not be GForce-optimized; I described it in #5032. To make GForce work as of now, you need to combine those two lists manually:

DT[, .j, by = Species,
  env = list(.j = c(
    lapply(lapply(setNames(nm = cols1), as.name), function(v) call("sum", v)),
    lapply(lapply(setNames(nm = cols2), as.name), function(v) call("mean", v))
  ))]

Note that the latter would be much more complex using base R.
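A side note (mine, not from the thread), assuming cols1 and cols2 as defined a few comments above: you can inspect what .j holds before substitution. It is a named list of calls, which env= then splices into j roughly as list(Sepal.Length = sum(Sepal.Length), ..., Petal.Width = mean(Petal.Width)), a shape GForce can recognise.

# assumes cols1/cols2 from the earlier comment
jcalls <- c(
  lapply(lapply(setNames(nm = cols1), as.name), function(v) call("sum", v)),
  lapply(lapply(setNames(nm = cols2), as.name), function(v) call("mean", v))
)
str(jcalls)   # named list of language objects: sum(Sepal.Length), ..., mean(Petal.Width)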
Didn't know the way list elements get enlisted into a list() call, thanks!
Just for reference, it is in the examples in the manual:

## list elements are enlist'ed into list calls
(cl1 = substitute(f(lst), list(lst = list(1L, 2L))))
#f(list(1L, 2L))
(cl2 = substitute2(f(lst), I(list(lst = list(1L, 2L)))))
#f(list(1L, 2L))
(cl3 = substitute2(f(lst), list(lst = I(list(1L, 2L)))))
#f(list(1L, 2L))
(cl4 = substitute2(f(lst), list(lst = quote(list(1L, 2L)))))
#f(list(1L, 2L))
(cl5 = substitute2(f(lst), list(lst = list(1L, 2L))))
#f(list(1L, 2L))
cl1[[2L]] ## base R substitute with list element
#[[1]]
#[1] 1
#
#[[2]]
#[1] 2
#
cl2[[2L]] ## same
#[[1]]
#[1] 1
#
#[[2]]
#[1] 2
#
cl3[[2L]] ## same
#[[1]]
#[1] 1
#
#[[2]]
#[1] 2
#
cl4[[2L]] ## desired
#list(1L, 2L)
cl5[[2L]] ## automatically
#list(1L, 2L)
Personally, I'd still like to be able to pass a named list on to .SDcols. Having said that, it's not clear to me that enough other users feel the same. And --- don't get me wrong --- the env interface covers most of what I need. I'll close this issue then in the interests of reducing clutter for the DT devs. Thanks everyone for the useful discussion.
My 2 cents. After reading the discussion above and the great new env argument, I still think this feature request is worth pursuing.

When it comes to data cleaning/wrangling, where interactivity with data is very frequent, I think .SD/.SDcols is quicker and more natural to reach for than env, so I will try to explain the benefit that a list .SDcols would bring.

By a list .SDcols, I mean an extension of the current .SDcols: just as the current .SDcols selects one group of columns exposed as .SD, a named list would define several groups, each accessible under its own name.
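To make that comparison concrete, here is a small sketch (mine, not from the comment; the commented-out list .SDcols syntax is hypothetical and not implemented):

library(data.table)
d <- as.data.table(iris)

# today: each column group has to be wired up through env=
d[, c(lapply(SD1, sum), lapply(SD2, mean)),
  by = Species,
  env = list(SD1 = as.list(setNames(nm = grep("^Sepal", names(d), value = TRUE))),
             SD2 = as.list(setNames(nm = grep("^Petal", names(d), value = TRUE))))]

# hypothetical list .SDcols: name the groups once and reference them directly,
# much closer to ordinary interactive .SD usage
# d[, c(lapply(.SD$s, sum), lapply(.SD$p, mean)),
#   by = Species,
#   .SDcols = list(s = patterns("^Sepal"), p = patterns("^Petal"))]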
Thanks for the detailed comment @Kamgang-B. The truth is that I've somewhat regretted closing this issue, because I too feel that it would be a really nice feature. (In addition and complementary to the env interface.) Jan and the other DT devs should ofc feel free to re-close if they disagree.
[Issue retitled from ".SD and .SDcols support (.SD2, .SDcols2, etc)" to "Passing named lists to .SDcols / .SD".]
I think @Kamgang-B made a very good point about the simplicity and power of using .SD/.SDcols.
Well-explained advantages of .SD over env. Agree this FR makes sense.
Really nice thread and comment by @Kamgang-B. It also resolves my concern about the earlier suggestions to overload .SD[[i]]. The approach of requiring a named list in .SDcols, with groups accessed by name, avoids the clash with .SD[[i]] currently meaning the i-th column of .SD.
Is there any scope/appetite for supporting multiple .SD and .SDcols?

Motivation: I frequently encounter situations where I need to perform different aggregation tasks on distinct column groups. One group of columns will be aggregated as means, another group will be aggregated as medians, yet another group will be aggregated as sums, etc. In these cases, only one group can be passed through the convenience features of .SD(cols), while the other group(s) must all be aggregated manually.

Here's a simple (and somewhat ill-advised) example that illustrates the mechanics:
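A minimal reconstruction of the example described here (the original snippet is not reproduced; the iris column names are assumed from the rest of the thread):

library(data.table)
d <- as.data.table(iris)

# the Sepal.* sums get the .SD convenience treatment, but the Petal.* means
# have to be written out (and named) by hand
d[, c(lapply(.SD, sum),
      list(Petal.Length = mean(Petal.Length),
           Petal.Width  = mean(Petal.Width))),
  by = Species,
  .SDcols = patterns("^Sepal")]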
Here the summed Sepal columns get the .SD convenience treatment, while I have to manually take the mean of the Petal columns separately (and name them; see also #1227 (comment)).

My proposal is to allow something like this instead:
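A sketch of the proposed call, reconstructed from the description that follows (hypothetical syntax, not valid in current data.table):

# hypothetical: a second special symbol .SD2, populated from .SDcols2
d[, c(lapply(.SD, sum), lapply(.SD2, mean)),
  by = Species,
  .SDcols  = patterns("^Sepal"),
  .SDcols2 = patterns("^Petal")]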
I'm assuming here that you can match the relevant subsets based on the index (.SD2 => .SDcols2). If this is easy to do for one additional subset, then in principle it seems possible for any additional subsets (.SD3 => .SDcols3, etc). Of course, this may impose some small overhead that doesn't pass the cost-benefit test.

Feel free to close if this seems undesirable / too much work. Thanks for considering!
Update. Related: #1063 (comment) and possibly #4970
Proposed solution in the former is: