aggregate same variable as groupby on-the-fly expression won't make it length 1 #4079

Open
jangorecki opened this issue Nov 26, 2019 · 3 comments

@jangorecki

Reported by @geofflazzarini in #3103 (comment).
There should be no difference in the TotalA column between the two results below.

library(data.table)
dt = data.table(SomeNumberA=c(1,1,1),SomeNumberB=c(1,1,1))
dt[, .(.N, TotalA=sum(SomeNumberA), TotalB=sum(SomeNumberB)), by=SomeNumberA]
#   SomeNumberA     N TotalA TotalB
#         <num> <int>  <num>  <num>
#1:           1     3      1      3
dt[, .(.N, TotalA=sum(SomeNumberA), TotalB=sum(SomeNumberB)), by=as.factor(SomeNumberA)]
#   as.factor     N TotalA TotalB
#      <fctr> <int>  <num>  <num>
#1:         1     3      3      3

This might be tricky to change because of the behaviour described in the FAQ entry "Inside each group, why are the group variables length-1?".
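Until this is settled, one way to get the full-length values in j is to group by a differently named copy of the column. This is only a minimal sketch: the helper column `grp` and the intermediate `tmp` are names made up for illustration, not part of the original report.

```r
library(data.table)

dt = data.table(SomeNumberA = c(1, 1, 1), SomeNumberB = c(1, 1, 1))

# Work on a copy so the original table is not modified by reference.
tmp = copy(dt)

# `grp` is a hypothetical helper column that carries the grouping values.
tmp[, grp := SomeNumberA]

# SomeNumberA is no longer a grouping column, so it keeps its full
# per-group length inside j and sum() sees all three rows.
res = tmp[, .(.N, TotalA = sum(SomeNumberA), TotalB = sum(SomeNumberB)), by = grp]

# Restore the original name of the grouping column in the result.
setnames(res, "grp", "SomeNumberA")
res
# TotalA and TotalB are both 3 here
```

The cost is one extra column copy, which may matter on large tables.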

@jangorecki

Another example where making the grouping column length 1 can cause problems, spotted by @st-pasha.
V2 is all NA because shift(A*2) operates on a scalar A, and shift on a scalar returns NA.

DT = data.table(A=c(1, 2, 1, 1, 2), B=3:7)
DT[, .(A*2, shift(A*2), B*2, shift(B*2)), by=A]
#       A    V1    V2    V3    V4
#   <num> <num> <num> <num> <num>
#1:     1     2    NA     6    NA
#2:     1     2    NA    10     6
#3:     1     2    NA    12    10
#4:     2     4    NA     8    NA
#5:     2     4    NA    14     8
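For comparison, expanding the scalar back to the group size with rep(A, .N) makes shift behave like it does for B. This is a sketch of a possible workaround under current behaviour, not a proposed fix; the output column names `shiftedA2` and `shiftedB2` are made up for illustration.

```r
library(data.table)

DT = data.table(A = c(1, 2, 1, 1, 2), B = 3:7)

# Inside j, A is length 1 because it is the grouping column.
# rep(A, .N) rebuilds a vector as long as the current group,
# so shift() has something to lag over.
res = DT[, .(shiftedA2 = shift(rep(A, .N) * 2),
             shiftedB2 = shift(B * 2)),
         by = A]
res
# shiftedA2 is NA only on the first row of each group
```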

@jangorecki

While having grouping columns be length 1 within a group is a useful feature, it comes at the cost of consistency. If we imagine a shiny app where the user just chooses which columns to aggregate and which to group by, it is not difficult to hit such cases. I think it would be useful to optionally provide the grouping column at the full length of the group.

@jangorecki

jangorecki commented May 8, 2020

More or less the example I was talking about in my last post, where this feature actually bites: http://www.pivottabler.org.uk/articles/v12-performance.html#a-warning-about-data-table

The data.table has a few inconsistencies when grouping and aggregating on the same variables. This can cause a pivot table to have inconsistencies e.g. pivot table cells with the wrong values that is most obvious when row or columns totals that don’t equal the sum of the values in the row or column:

A possible solution, though a significant breaking change, would be to provide the scalar values inside .BY but leave the full (repeated) values in the regular variables in j.
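To illustrate the starting point for that idea: .BY already holds the scalar group value today, while the grouping variable itself is also length 1 in j. The sketch below only shows current behaviour; the column names `len_A` and `by_val` are made up for illustration.

```r
library(data.table)

DT = data.table(A = c(1, 2, 1, 1, 2), B = 3:7)

# length(A) shows how long the grouping variable is inside j,
# .BY$A is the scalar group value, .N is the real group size.
res = DT[, .(len_A = length(A), by_val = .BY$A, group_size = .N), by = A]
res
```

Under the proposal, len_A would equal group_size, and code that needs the scalar would read it from .BY.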
