FR: .SDcols support list of expressions #6619
Comments
I'm not sure I understand your proposition. A patch of the man page would help.
Sorry it wasn't quite clear, I'll try again. (And I think it helps to bring the example of the diff, sorry about that.) The notion has nothing to do with multiple rows, it is all about selecting columns using more than one of the available My
What if I want to collect all numeric columns plus one column that is not numeric? Let's say I have a time column, some numeric columns, some character columns, and perhaps others. Currently, I would need to do something like:
DT = data.table(int1=1L, int2=2L, chr1="A", chr2="B", num1=1, num2=2, lgl=TRUE)
DT[, .SD, .SDcols = c("chr1", names(which(sapply(DT, is.numeric))))]
#      chr1  int1  int2  num1  num2
#    <char> <int> <int> <num> <num>
# 1:      A     1     2     1     2
### suggested replacement syntax for that:
DT[, .SD, .SDcols = .("chr1", is.numeric)]
It's all about terse heterogeneous column selection. What calculations are done on the data is not really important. Another example:
### double all columns that end in "2" that are not strings
DT[, names(.SD) := lapply(.SD, `*`, 2), .SDcols = .(patterns("2$"), --is.character)]
DT
#     int1  int2   chr1   chr2  num1  num2    lgl
# 1:     1     4      A      B     1     4   TRUE
I'm not looking to reshape anything (long-to-wide or wide-to-long), so
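For anyone following along, the "always OR, with optional negation" selection discussed in this comment can be approximated today outside of `[.data.table`. What follows is only a rough sketch in base R under stated assumptions: `pat()`, `resolve_one()`, `resolve_cols()`, and the `exclude` argument are hypothetical names made up for illustration. They are not data.table functions and not the code from the patch. A plain data.frame is used so the sketch runs without any package.

```r
# Hypothetical marker for a regex selector (a stand-in for patterns()).
pat <- function(rx) structure(rx, class = "pat")

# Resolve ONE selector against the columns of x, returning column names.
resolve_one <- function(x, sel) {
  nms <- names(x)
  if (inherits(sel, "pat")) {
    # regex matched against the column names
    grep(unclass(sel), nms, value = TRUE)
  } else if (is.function(sel)) {
    # predicate applied to each column, as with .SDcols = is.numeric
    nms[vapply(x, function(col) isTRUE(sel(col)), logical(1))]
  } else if (is.character(sel)) {
    # literal column names; unknown names are an error
    bad <- setdiff(sel, nms)
    if (length(bad)) stop("column(s) not found: ", paste(bad, collapse = ", "))
    sel
  } else if (is.numeric(sel)) {
    # column positions
    nms[sel]
  } else {
    stop("unsupported selector")
  }
}

# OR together everything in `include`, then subtract everything in `exclude`.
# The result keeps the order in which columns were first selected.
resolve_cols <- function(x, include, exclude = list()) {
  keep <- unique(unlist(lapply(include, resolve_one, x = x)))
  drop <- unique(unlist(lapply(exclude, resolve_one, x = x)))
  setdiff(keep, drop)
}

DF <- data.frame(int1 = 1L, int2 = 2L, chr1 = "A", chr2 = "B",
                 num1 = 1, num2 = 2, lgl = TRUE, stringsAsFactors = FALSE)

cols1 <- resolve_cols(DF, list("chr1", is.numeric))
# "chr1" "int1" "int2" "num1" "num2"

cols2 <- resolve_cols(DF, list(pat("2$")), exclude = list(is.character))
# "int2" "num2"
```

Either result is an ordinary character vector, so it could be passed as `.SDcols = cols1` with the existing syntax. The separate `exclude` list is just the simplest way to express subtraction in a sketch; it sidesteps the question of which unary operator to use.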
Note that this proposal conflicts directly with the proposed extension to
I agree mixing pattern- and type-based filters is clunky as of now. FWIW I would only use two filters:
cols = union(names(which(sapply(DT, is.numeric))), grep("^lgl$|2$", names(DT), value=TRUE))
A different approach could be to accept
.SDcols = is.numeric | patterns("2$") | "lgl"
That requires some metaprogramming which reduces its appeal; it will also overload usage of |
I was looking at #5020 and I don't know that it and this FR are incompatible: I could see the utility in allowing (say)
.SDcols = list(a = .("lgl", is.numeric), b = .("oth", is.character))
I understand that the three selections I demo'd with could be reduced based on more patterns. The "power" of this FR is that it supports function calls in addition to non-function-calls, which are not something that can be combined without the preprocessing of your
I'm not opposed to the use of
.SDcols = (is.numeric & patterns("2$")) | (is.character & patterns("1$"))
though I have no use-case atm where that would be justified. In my design and PR, I toiled momentarily with the "always OR" vs "always AND", choosing the former with no clear way to get the latter. Your suggestion of
.SDcols = list(a = is.numeric & patterns("2$"), b = is.character & patterns("1$"))
For the use of
.SDcols = ~ (is.numeric & patterns("2$")) | (is.character & patterns("1$"))
.SDcols = list(a = ~ is.numeric & patterns("2$"), b = is.character)
I toyed briefly with how to implement #5020, thinking it might be simpler to do both in one PR, but I'm not sure how to implement/support
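To make the AND/OR combination concrete, here is a small sketch of what `(is.numeric & patterns("2$")) | (is.character & patterns("1$"))` would presumably select, computed with plain logical vectors in base R. It is an illustration of the intended semantics as I read them, not the proposed implementation. The variable names are made up, and a data.frame stands in for the `DT` used earlier in the thread.

```r
DF <- data.frame(int1 = 1L, int2 = 2L, chr1 = "A", chr2 = "B",
                 num1 = 1, num2 = 2, lgl = TRUE, stringsAsFactors = FALSE)
nms <- names(DF)

# One logical mask per filter, each with one entry per column.
is_num <- vapply(DF, is.numeric, logical(1))
is_chr <- vapply(DF, is.character, logical(1))
ends_2 <- grepl("2$", nms)
ends_1 <- grepl("1$", nms)

# AND within each group, OR across groups.
sel <- nms[(is_num & ends_2) | (is_chr & ends_1)]
# "int2" "chr1" "num2"
```

Because every filter reduces to a mask over the same columns, `&` and `|` compose without any metaprogramming. The cost is that the masks have to be built by hand before the `[` call, which is the clunkiness this FR is about.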
There are use-cases where I want to use .SDcols=function and add an extra column or two. Currently, I think I need to do something like:
(where chrvar is a non-numeric column that I want included in the output). What I'd like to be able to do is
The below patch supports this notion. While it might be easy to infer that this is similar to dplyr::select, I didn't intend to follow it perfectly. Some points about my attempt so far:
.SDcols= .patterns(.), just as a non-list argument would.
--patterns("r2$"); I chose -- because - is already taken for negation, !! is used for other (symbol-evaluation), and I couldn't think of another intuitive and not-otherwise-used unary operator;
DT[, .SD, .SDcols=c(1,1,2,3)]), so a user can still choose repeated columns, but must do it explicitly within more of the expressions.
I think I have it working completely:
All tests in the current repo pass without changes.
Possible extension: patterns(..., mustWork=FALSE), similar to dplyr::any_of() semantics, i.e., no error if not found. This also passes all tests but I didn't want to combine with this PR since it's mostly different additive functionality. And because I think it might go against design intentions of the package.
The minimal patch for this.
This does not include documentation, and I minimized indentation changes for the sake of highlighting only. I provide it here only for terse discussion; a PR will be a better place to go over more details.
The vast majority of the code is due to:
for loop to iterate over the .SDcols expressions (stored in colsubs, iterated as colsub)
.SDcols <- to colsub <-
ansvars, sdvars, and ansvals, and then combine-ing the new values correctly (add/subtract) with it
Test performance comparison (no significant difference with this patch).
I'm happy to convert this into a PR.
Output from `sessionInfo()`.