Columns appearing in the function in `by=` disappers in j #1427

renkun-ken · 2015-11-10T05:57:22Z

It seems that all columns appearing the the function that produces the by column will not be included in j nor in .SD.

require(data.table)
dt <- data.table(a=c(1001,1002,1011,1012), x = c(1,2,3,4))
dt[, .SD, by = .(i = substr(a, 3, 4))]
#     i x
#1: 01 1
#2: 02 2
#3: 11 3
#4: 12 4
dt[, .SD, by = .(i = substr(a, 1, 3))]
#      i x
#1: 100 1
#2: 100 2
#3: 101 3
#4: 101 4

For example, I have a data table of a long list of yyyyMMdd dates from 20150101 to 20151001 and I use by = substr(date, 1, 6) to group the data into year-months. But in each group accessed either from .SD or in the scope of j, date column disappears so that I cannot get the original date in this way. I am not sure if previous versions had this problem (I remember its behavior does not look like this before or I'm wrong).

To walk around I have to first make the new column year_month first and then by = year_month.

I'm using latest version of data.table (v1.9.6) in CRAN.

The text was updated successfully, but these errors were encountered:

DavidArenburg · 2015-11-10T07:48:15Z

Yeah, I agree it's quite inconvenient and was thinking about a similar FR myself. So +1 for this. Though I wonder what the majority of users prefer. It might need to add another argument in order to be possible to generate both options.

franknarf1 · 2015-11-10T17:44:49Z

I've long been annoyed by this and had assumed it was filed as a bug already. Even if you try to be clever and get around it with dt[, .SD, by = .(i = substr(dt$a, 3, 4))], the same thing happens.

jangorecki · 2015-11-10T18:15:15Z

Cannot that be tricked with .SDcols?

franknarf1 · 2015-11-10T19:04:06Z

@jangorecki Maybe. Also dt[, .SD, by = .(i = substr(dt[["a"]], 3, 4))].

Anyway, I'm not sure that we need to change the default, but my workaround feels awfully hacky.

MichaelChirico · 2015-12-02T03:04:47Z

Just wrote an SO answer where this would have made for a shiny clean solution.

Would also be nice (though sub-optimal, IMO) to be able to add by to .SD with := if necessary, but we can't as of now:

.SD is locked. Using := in .SD's j is reserved for possible future use; a tortuously flexible way to modify by group. Use := in j directly to modify by group by reference.

smingerson · 2020-06-22T18:06:33Z

Just ran into the same issue, this was unexpected. My use was to spit out one CSV for each year, while retaining original date for comparison. My expectation was that .SD would include columns used in by if they were not used directly as the grouping variable. For example, .SD would not include the column purchase_date if by=list(purchase_date), but if by=list(year(purchase_date)) or similar, .SD would include purchase_date.

For now, I'll just generate the column beforehand. I do see that the documentation for .SD does say "excluding any columns used in by", I guess I had never noticed that bit before.

MichaelChirico · 2020-06-22T18:18:01Z

The by column values are available in j via .BY, e.g. .BY$purchase_date.

I would say it's redundant to store the column data both in the file name and in the file, so using .BY$purchase_date for the file name and also including it as a column can be avoided.

I do understand the convenience factor of the redundant column, and I think fwrite(c(.BY, .SD), ...) would suit that purpose, but I think the default of excluding the by columns from .SD still makes sense in this light.

Two more notes:

We should strive make it easier for column info to be read from the file name as part of fread for directories #2582
Spark is also taking this approach by default for partitioned tables -- in spark.partitionBy('year', 'month', 'day').write.parquet(...), year/month/day are excluded from the output files & only included in the output path like year=2020/month=06/day=10/file-...parquet.gz.

smingerson · 2020-06-22T18:35:07Z

I think we had a misunderstanding. I am using fwrite, not fread. My only reason for splitting the file is to make it manageable in Excel (☹).

I want to write out one file for each year's worth of data. But, each row in the data should maintain the actual column. The example below should make this clearer than my ramblings did.

library(data.table)

dt <- data.table(a = 1, d = as.Date(c('2020-01-02', '2018-01-02')))

dt
#>    a          d
#> 1: 1 2020-01-02
#> 2: 1 2018-01-02
# Agree it makes sense to exclude the by column, since it
# is exactly as is present in the data.;
dt[,print(.SD), by = list(d)]
#>    a
#> 1: 1
#>    a
#> 1: 1
#> Empty data.table (0 rows and 1 cols): d

# This is the case that, to me, worked unexpectedly. On one hand,
# it leads to a more consistent interface. On the other, it leads
# to less convenience (in this single case of writing files that
# I have come up with...)
# Pretend that is fwrite instead of print, with a filename like
# paste0(.BY$d, "check.csv"))
dt[,print(.SD), by = list(year = year(d))]
#>    a
#> 1: 1
#>    a
#> 1: 1
#> Empty data.table (0 rows and 1 cols): year
# Ideally, this would write a,d to file, not just a.

Created on 2020-06-22 by the reprex package (v0.3.0)

My thought process was the by columns are the columns generated which encode the actual grouping. In the second example, d does not encode the actual grouping, so it doesn't count as one of the by columns.

To be clear, not saying the current behavior is wrong, or bad, just that it was unexpected in my mental model.

As I said, it's not much to ask to create the grouping column explicitly before writing, so I have done that.

MichaelChirico · 2020-06-23T02:29:45Z

Yes, recognize you're talking about fwrite -- I mentioned fread tangentially as part of the I/O loop -- reading back into R with fread data that you've written with fwrite.

Your use case is slightly different in that the next input step is Excel, not R.

fwrite(c(.BY, .SD)) doesn't work -- I was reminded of this issue just now: #4368

ColeMiller1 · 2020-06-23T11:47:16Z

Another work around is cbind(by_var, .SD):

library(data.table)
dt <- data.table(a = 1, d = as.Date(c('2020-01-02', '2018-01-02')))
dt[,print(cbind(d, .SD)), by = list(year = year(d))]
#>             d     a
#>        <Date> <num>
#> 1: 2020-01-02     1
#>             d     a
#>        <Date> <num>
#> 1: 2018-01-02     1
#> Empty data.table (0 rows and 1 cols): year

These are the two lines that cause the byvars to not be in .SD:

data.table/R/data.table.R

Line 752 in ad7b67c

allbyvars = intersect(all.vars(bysub), names_x)

data.table/R/data.table.R

Line 920 in ad7b67c

    
           ansvars = sdvars = dupdiff(names_x, union(bynames, allbyvars))   # TO DO: allbyvars here for vars used by 'by'. Document.

While I would expect a different output, @Henrik-P pointed this help text out in #3262

.SD is a data.table containing the Subset of x's Data for each group, excluding any columns used in by

Here, column d is technically be used in by. Although in this case d is only being used to make another column as part of a call.

If we wanted to address this, I would start with refactoring how the bysub is probed with NSE.

smingerson · 2020-06-23T12:22:47Z

Thank you for the alternative workaround, always good to learn something new. I didn't know that d would be available in J, since it isn't contained in .BY as one of the grouping variables. I have given a careful read of the special-symbols docs several times, but I misunderstood the text in .SD and .BY.

What confused me is I considered the columns used in by to be those which are returned when computing something by group.

jangorecki · 2020-11-21T18:22:39Z

related: #4079

UweBlock · 2021-07-25T22:12:55Z

I find it confusing and contra-intuitive that a computed expression in by= removes all the columns involved in the computation from .SD (although this is well documented in ?.SD). I had expected that a computed grouping variable would create an extra column and keeping the other columns untouched.

I came across this issue while answering this question on SO which requires to create a grouping variable on-the-fly.

My expectation was that the code below should be working

# sample data
ch <- data.table(list_name=c("a","a","","b","","c","c","","d"),
                 name=c(1,2,"",3,"",4,5,"",6))
cl <- data.table(ques=c("a","b","c"),value=c("AB","CD","EF"))
# my expectation
ch[, rbind(.SD, cl[ques %in% list_name], use.names = FALSE), 
   by = .(grp = rleid(list_name))]

but gave the error message

Error in rbindlist(l, use.names, fill, idcol) :
Item 2 has 2 columns, inconsistent with item 1 which has 1 columns. To fill missing columns use fill=TRUE.

because column list_name was missing from .SD.

I had to ressort to the workaround proposed by ColeMiller1:

ch[, cbind(grp = rleid(list_name), .SD)][
  , rbind(.SD, cl[ques %in% list_name], use.names = FALSE), by = grp]

I found three other workarounds

ch[, rbind(.SD, cl[ques %in% list_name], use.names = FALSE), 
   by = .(grp = rleid(ch[["list_name"]]))]

and

ch[, cbind(list_name, .SD), by = .(grp = rleid(list_name))][
  , rbind(.SD, cl[ques %in% list_name], use.names = FALSE), by = grp]

and

ch[, grp := rleid(list_name)][
  , rbind(.SD, cl[ques %in% list_name], use.names = FALSE), by = grp]
ch[, grp := NULL]

which look even more convoluted.

On the other hand, I understand that it might be annoying as well if grouping variables appear twice in the result: First as grouping var in front and then a second time as part of .SD. (And I can't think of a simple workaround to remove those duplicate columns).

Would it be possible to distinguish in the by= and keyby= clauses between

Simple column names which will be removed from .SD and
columns which are used in expressions to create computed grouping variables and which will be kept in .SD?

Examples:

df1 <- data.table(x = c("a1", "a1", "a2"), y = 11:13, z = 21:23)

# case 1
df1[, lapply(.SD, toString), by = x]
##     x      y      z
## 1: a1 11, 12 21, 22
## 2: a2     13     23

# case 2 (actual behaviour)
df1[, lapply(.SD, toString), by = .(grp = substr(x, 1, 1))]
##    grp          y          z
## 1:   a 11, 12, 13 21, 22, 23

# case 2 (suggested behaviour)
df1[, lapply(.SD, toString), by = .(grp = substr(x, 1, 1))]
##    grp          x          y          z
## 1:   a a1, a1, a2 11, 12, 13 21, 22, 23

# case 2 (suggested behaviour, default name for computed grouping variable)
df1[, lapply(.SD, toString), by = .(substr(x, 1, 1))]
##     g1          x          y          z
## 1:   a a1, a1, a2 11, 12, 13 21, 22, 23

# case 2 (suggested behaviour, force column to appear in .SD)
df1[, lapply(.SD, toString), by = .(I(x))]
##    g1      x      y      z
## 1: a1 a1, a1 11, 12 21, 22
## 2: a2     a2     13     23

The default names g1, g2, etc for computed grouping variables are named in a similar way as unnamed expressions in j (v1, v2, etc).

The I() function in by= forces a column to appear in .SD . This is taken from base R (the new substitute2() function uses I() as well).

arunsrinivasan added the bug label Nov 19, 2015

arunsrinivasan added this to the v2.0.0 milestone Nov 19, 2015

renkun-ken mentioned this issue Sep 23, 2017

one column being needed both in grouping and in j #2379

Closed

mattdowle removed this from the Candidate milestone May 10, 2018

Henrik-P mentioned this issue Jan 9, 2019

Empty data.table produced with .SD when grouping by all columns #3262

Open

jangorecki mentioned this issue Jun 5, 2020

f[] under groupby h2oai/datatable#2470

Open

jangorecki added this to the 2.0.0 milestone Nov 30, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Columns appearing in the function in `by=` disappers in j #1427

Columns appearing in the function in `by=` disappers in j #1427

renkun-ken commented Nov 10, 2015 •

edited by arunsrinivasan

Loading

DavidArenburg commented Nov 10, 2015

franknarf1 commented Nov 10, 2015

jangorecki commented Nov 10, 2015

franknarf1 commented Nov 10, 2015

MichaelChirico commented Dec 2, 2015

smingerson commented Jun 22, 2020

MichaelChirico commented Jun 22, 2020 •

edited

Loading

smingerson commented Jun 22, 2020

MichaelChirico commented Jun 23, 2020

ColeMiller1 commented Jun 23, 2020

smingerson commented Jun 23, 2020

jangorecki commented Nov 21, 2020

UweBlock commented Jul 25, 2021 •

edited

Loading

Columns appearing in the function in by= disappers in j #1427

Columns appearing in the function in by= disappers in j #1427

Comments

renkun-ken commented Nov 10, 2015 • edited by arunsrinivasan Loading

DavidArenburg commented Nov 10, 2015

franknarf1 commented Nov 10, 2015

jangorecki commented Nov 10, 2015

franknarf1 commented Nov 10, 2015

MichaelChirico commented Dec 2, 2015

smingerson commented Jun 22, 2020

MichaelChirico commented Jun 22, 2020 • edited Loading

smingerson commented Jun 22, 2020

MichaelChirico commented Jun 23, 2020

ColeMiller1 commented Jun 23, 2020

smingerson commented Jun 23, 2020

jangorecki commented Nov 21, 2020

UweBlock commented Jul 25, 2021 • edited Loading

Columns appearing in the function in `by=` disappers in j #1427

Columns appearing in the function in `by=` disappers in j #1427

renkun-ken commented Nov 10, 2015 •

edited by arunsrinivasan

Loading

MichaelChirico commented Jun 22, 2020 •

edited

Loading

UweBlock commented Jul 25, 2021 •

edited

Loading