aggregate same variable as groupby on-the-fly expression won't make it length 1 #4079

jangorecki · 2019-11-26T15:16:36Z

Reported by @geofflazzarini in #3103 (comment)
There should be no difference in the TotalA column between the two results below.

library(data.table)
dt = data.table(SomeNumberA=c(1,1,1),SomeNumberB=c(1,1,1))
dt[, .(.N, TotalA=sum(SomeNumberA), TotalB=sum(SomeNumberB)), by=SomeNumberA]
#   SomeNumberA     N TotalA TotalB
#         <num> <int>  <num>  <num>
#1:           1     3      1      3
dt[, .(.N, TotalA=sum(SomeNumberA), TotalB=sum(SomeNumberB)), by=as.factor(SomeNumberA)]
#   as.factor     N TotalA TotalB
#      <fctr> <int>  <num>  <num>
#1:         1     3      3      3

This might be tricky to change because of the behavior described under "Inside each group, why are the group variables length-1?" in the FAQ.
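To make the length difference explicit, here is a minimal sketch that reuses the `dt` from the report and only measures `length(SomeNumberA)` inside `j`. The names `by_col` and `by_expr` are made up for illustration. Going by the results reported above, the first call should see a length-1 vector and the second the full group.

```r
library(data.table)

dt = data.table(SomeNumberA = c(1, 1, 1), SomeNumberB = c(1, 1, 1))

# Grouping by the bare column: SomeNumberA is collapsed to length 1 in j.
by_col = dt[, .(len = length(SomeNumberA)), by = SomeNumberA]

# Grouping by an on-the-fly expression: SomeNumberA keeps the group length in j.
by_expr = dt[, .(len = length(SomeNumberA)), by = as.factor(SomeNumberA)]

print(by_col)
print(by_expr)
```

This is why `sum(SomeNumberA)` returns the group value in one case and the group total in the other.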


jangorecki · 2019-11-26T15:20:21Z

Another example where making the grouping column length 1 might cause problems, spotted by @st-pasha.
V2 is all NA because shift(A*2) operates on a scalar A, and shift on a scalar gives NA.

DT = data.table(A=c(1, 2, 1, 1, 2), B=3:7)
DT[, .(A*2, shift(A*2), B*2, shift(B*2)), by=A]
#       A    V1    V2    V3    V4
#   <num> <num> <num> <num> <num>
#1:     1     2    NA     6    NA
#2:     1     2    NA    10     6
#3:     1     2    NA    12    10
#4:     2     4    NA     8    NA
#5:     2     4    NA    14     8
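A possible workaround under the current behavior, not something proposed in this thread, is to expand the length-1 grouping value back to the group size with `rep(A, .N)` before passing it to `shift`. The sketch below reuses the `DT` from the example above; `res` is an illustrative name.

```r
library(data.table)

DT = data.table(A = c(1, 2, 1, 1, 2), B = 3:7)

# Inside each group A is length 1, so rep(A, .N) rebuilds a vector of
# the group's length before shift() is applied.
res = DT[, .(V1 = A * 2, V2 = shift(rep(A, .N) * 2)), by = A]

print(res)
```

With this change V2 is NA only for the first row of each group, which matches how V4 behaves for the non-grouping column B.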

jangorecki · 2019-11-26T15:22:35Z

While the feature of having grouping columns be length 1 within a group is useful, it comes at the cost of consistency. If we imagine a shiny app where the user just chooses which columns to aggregate and which to group by, it is not difficult to hit such cases. I think it would be useful to optionally provide the grouping column at the full length of the group.

jangorecki · 2020-05-08T01:29:09Z

More or less the example I was talking about in my last post, where this feature actually bites: http://www.pivottabler.org.uk/articles/v12-performance.html#a-warning-about-data-table

The data.table has a few inconsistencies when grouping and aggregating on the same variables. This can cause a pivot table to have inconsistencies e.g. pivot table cells with the wrong values that is most obvious when row or columns totals that don’t equal the sum of the values in the row or column:

A possible solution, though a significant breaking change, would be to provide scalar values inside .BY but leave the full (repeated) values in regular variables in j.
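For reference, `.BY` already holds a length-1 value per grouping item today. The sketch below only emulates the proposed semantics by hand, assuming the `dt` from the original report. The local variable `full_A` is hypothetical and stands in for what a regular variable in `j` would contain after the change.

```r
library(data.table)

dt = data.table(SomeNumberA = c(1, 1, 1), SomeNumberB = c(1, 1, 1))

res = dt[, {
  # Scalar group value, as .BY provides it now.
  key = .BY$SomeNumberA
  # Hypothetical stand-in for the full-length grouping column.
  full_A = rep(key, .N)
  list(N = .N, key_len = length(key),
       TotalA = sum(full_A), TotalB = sum(SomeNumberB))
}, by = SomeNumberA]

print(res)
```

Here TotalA and TotalB agree, which is the consistency the report asks for, while the scalar group value stays available through `.BY`.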

jangorecki added the consistency label Nov 26, 2019

jangorecki mentioned this issue Nov 21, 2020

Columns appearing in the function in by= disappers in j #1427

Open

jangorecki added this to the 2.0.0 milestone Nov 30, 2023

jangorecki mentioned this issue Nov 30, 2023

Inconsistent lengths of col when in both by and .SDcols #4317

Open
