Implement DT[, across(.SD, fun1, fun2, fun3), by=group] #4970
My 2¢:
df %>%
  group_by(g1, g2) %>%
  summarise(
    across(where(is.numeric), mean),
    across(where(is.factor), nlevels),
    n = n(),
  )
Where is this useful?
as.data.table(Lahman::Batting)[, .(
  across(patterns("^R$|(X.B)|HR"), .(sum, mean)),
  across(c("stint", "teamID"), .(last, uniqueN)),
  across(.SD, .(uniqueN, \(x) sum(x)/uniqueN(yearID)))),
  playerID,
  .SDcols = c("R", "IBB", "SO")]
My understanding of baseball is a little rusty. What I was aiming for was to create a single table by player that gives me
all in one shot. I often have to revert to summarizing my data in Excel or, if the data is too big, calculate all these independently in R and do a join at the end. In this specific case, we could even employ the proposed
P.S. - I realize that .SD provides a list, while the others provide character vectors. I was aiming for more flexibility as opposed to using only
Hi @mattdowle, I think what you're proposing here is closely related to what's being discussed in #1063, specifically the discussion around "colwise". #1063 groups together row operations and column operations into one issue, but they seem pretty separate to me (and I think rowwise operations would be better solved by implementing more functions like pmin/pmax as you suggested for psum in #3467). I also like your suggestion of across (rather than colwise), and the proposed syntax since it will be familiar to
Before we can implement across, I think we should solve #2311 by merging my PR #4883, since this addresses how columns are named in this situation. Once #4883 is merged, we could just implement across so that it expands into several lapply(.SD,) calls concatenated together with c(). This will ensure GForce optimization is used without any additional work. E.g.:
would just internally be expanded to
and the resulting column names would be
An outstanding question is how we might name columns when functions are not explicitly named:
Interactively it would be convenient for them to be
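For reference, the expansion described in this comment (one lapply(.SD, ...) call per function, joined with c()) can already be written by hand. The following is only a rough sketch of that shape: the table DT and its columns g, x and y are made up for illustration, and the output columns are renamed explicitly so the example does not depend on whatever automatic naming a given data.table version applies.

```r
library(data.table)

# Illustrative data only; not taken from the thread
DT <- data.table(g = c("a", "a", "b"), x = c(1, 2, 3), y = c(10, 20, 30))

# Hand-written form of the expansion: one lapply(.SD, f) per function,
# concatenated with c(), evaluated per group
res <- DT[, c(lapply(.SD, mean), lapply(.SD, sum)),
          by = g, .SDcols = c("x", "y")]

# Set the output names by position instead of relying on automatic naming
setnames(res, c("g", "x_mean", "y_mean", "x_sum", "y_sum"))
print(res)
```

The result has one row per group, with the mean columns first and the sum columns second, in the order the lapply() calls were written.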
Also see this discussion with Hadley (tidyverse/dtplyr#173) on translating dplyr::across to data.table syntax for the dtplyr package. Note that the dplyr across allows arbitrary specification of how the function name and input column names are combined to determine how the output columns are named (specifying both order and the separator) but I'm not sure that's a road we want to go down (see the names argument here: https://dplyr.tidyverse.org/reference/across.html). Sticking with base R's naming behavior (e.g.
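To make "base R's naming behavior" concrete, here is a small base R illustration of how c() names the elements of concatenated lists. The lists means and sums are stand-ins invented for this example; it only shows what c() does and says nothing about how data.table itself names columns.

```r
# Stand-ins for the results of lapply(.SD, mean) and lapply(.SD, sum)
means <- list(x = 1.5, y = 15)
sums  <- list(x = 3, y = 30)

# Named arguments: c() prefixes each element name with the argument name
named <- c(mean = means, sum = sums)
print(names(named))    # "mean.x" "mean.y" "sum.x" "sum.y"

# Unnamed arguments: the element names are kept as-is, so they repeat
unnamed <- c(means, sums)
print(names(unnamed))  # "x" "y" "x" "y"
```

The second case is the one the naming question turns on: without explicit function names, plain concatenation yields duplicate names.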
Any thought to reinvigorating this? #4883 was merged a couple of months ago. The code there is a little more detailed than I want to dive in on to try to implement this myself (at least, not at this moment).
BTW, I further suggest the first argument should default to the current
across = function(x = .SD, funs) {
}
As for the notion of
Part of me wants to go with the first option to remove any/all ambiguity, but I'm conscious of interactive usability.
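As a rough user-level approximation of the signature suggested above, a plain R function can apply several functions to every column of .SD and flatten the results. This is a hypothetical helper (named across_sketch here), not anything in data.table; it assumes .SD is passed explicitly rather than used as a default, requires the functions to be named, and makes no claim about GForce optimization.

```r
library(data.table)

# Hypothetical helper, not part of data.table: apply each function in
# `funs` to every column of `x`, then flatten one level so the result is
# a single list of columns named "<function>.<column>"
across_sketch <- function(x, funs) {
  unlist(lapply(funs, function(f) lapply(x, f)), recursive = FALSE)
}

# Illustrative data only
DT <- data.table(g = c("a", "a", "b"), x = c(1, 2, 3), y = c(10, 20, 30))

# .SD is passed explicitly; with no .SDcols it holds all non-grouping columns
res <- DT[, across_sketch(.SD, list(mean = mean, sum = sum)), by = g]
print(res)
```

Because the functions are named in the list, the output names follow the base R prefixing convention, which sidesteps the ambiguity about unnamed functions.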
Inspired by dplyr::across and triggered by JuliaData/DataFrames.jl#2725 (comment).
Instead of:
it could be
I didn't find any related issues or PRs in a quick search. If there are any, and S.O. questions, please link them here.