Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Columns appearing in the function in by= disappers in j #1427

Open
renkun-ken opened this issue Nov 10, 2015 · 13 comments
Open

Columns appearing in the function in by= disappers in j #1427

renkun-ken opened this issue Nov 10, 2015 · 13 comments
Labels
Milestone

Comments

@renkun-ken
Copy link
Member

renkun-ken commented Nov 10, 2015

It seems that all columns appearing the the function that produces the by column will not be included in j nor in .SD.

require(data.table)
dt <- data.table(a=c(1001,1002,1011,1012), x = c(1,2,3,4))
dt[, .SD, by = .(i = substr(a, 3, 4))]
#     i x
#1: 01 1
#2: 02 2
#3: 11 3
#4: 12 4
dt[, .SD, by = .(i = substr(a, 1, 3))]
#      i x
#1: 100 1
#2: 100 2
#3: 101 3
#4: 101 4

For example, I have a data table of a long list of yyyyMMdd dates from 20150101 to 20151001 and I use by = substr(date, 1, 6) to group the data into year-months. But in each group accessed either from .SD or in the scope of j, date column disappears so that I cannot get the original date in this way. I am not sure if previous versions had this problem (I remember its behavior does not look like this before or I'm wrong).

To walk around I have to first make the new column year_month first and then by = year_month.

I'm using latest version of data.table (v1.9.6) in CRAN.

@DavidArenburg
Copy link
Member

Yeah, I agree it's quite inconvenient and was thinking about a similar FR myself. So +1 for this. Though I wonder what the majority of users prefer. It might need to add another argument in order to be possible to generate both options.

@franknarf1
Copy link
Contributor

I've long been annoyed by this and had assumed it was filed as a bug already. Even if you try to be clever and get around it with dt[, .SD, by = .(i = substr(dt$a, 3, 4))], the same thing happens.

@jangorecki
Copy link
Member

Cannot that be tricked with .SDcols?

@franknarf1
Copy link
Contributor

@jangorecki Maybe. Also dt[, .SD, by = .(i = substr(dt[["a"]], 3, 4))].

Anyway, I'm not sure that we need to change the default, but my workaround feels awfully hacky.

@arunsrinivasan arunsrinivasan added this to the v2.0.0 milestone Nov 19, 2015
@MichaelChirico
Copy link
Member

Just wrote an SO answer where this would have made for a shiny clean solution.

Would also be nice (though sub-optimal, IMO) to be able to add by to .SD with := if necessary, but we can't as of now:

.SD is locked. Using := in .SD's j is reserved for possible future use; a tortuously flexible way to modify by group. Use := in j directly to modify by group by reference.

@smingerson
Copy link

Just ran into the same issue, this was unexpected. My use was to spit out one CSV for each year, while retaining original date for comparison. My expectation was that .SD would include columns used in by if they were not used directly as the grouping variable. For example, .SD would not include the column purchase_date if by=list(purchase_date), but if by=list(year(purchase_date)) or similar, .SD would include purchase_date.

For now, I'll just generate the column beforehand. I do see that the documentation for .SD does say "excluding any columns used in by", I guess I had never noticed that bit before.

@MichaelChirico
Copy link
Member

MichaelChirico commented Jun 22, 2020

The by column values are available in j via .BY, e.g. .BY$purchase_date.

I would say it's redundant to store the column data both in the file name and in the file, so using .BY$purchase_date for the file name and also including it as a column can be avoided.

I do understand the convenience factor of the redundant column, and I think fwrite(c(.BY, .SD), ...) would suit that purpose, but I think the default of excluding the by columns from .SD still makes sense in this light.

Two more notes:

  • We should strive make it easier for column info to be read from the file name as part of fread for directories #2582
  • Spark is also taking this approach by default for partitioned tables -- in spark.partitionBy('year', 'month', 'day').write.parquet(...), year/month/day are excluded from the output files & only included in the output path like year=2020/month=06/day=10/file-...parquet.gz.

@smingerson
Copy link

I think we had a misunderstanding. I am using fwrite, not fread. My only reason for splitting the file is to make it manageable in Excel (☹).

I want to write out one file for each year's worth of data. But, each row in the data should maintain the actual column. The example below should make this clearer than my ramblings did.

library(data.table)

dt <- data.table(a = 1, d = as.Date(c('2020-01-02', '2018-01-02')))

dt
#>    a          d
#> 1: 1 2020-01-02
#> 2: 1 2018-01-02
# Agree it makes sense to exclude the by column, since it
# is exactly as is present in the data.;
dt[,print(.SD), by = list(d)]
#>    a
#> 1: 1
#>    a
#> 1: 1
#> Empty data.table (0 rows and 1 cols): d

# This is the case that, to me, worked unexpectedly. On one hand,
# it leads to a more consistent interface. On the other, it leads
# to less convenience (in this single case of writing files that
# I have come up with...)
# Pretend that is fwrite instead of print, with a filename like
# paste0(.BY$d, "check.csv"))
dt[,print(.SD), by = list(year = year(d))]
#>    a
#> 1: 1
#>    a
#> 1: 1
#> Empty data.table (0 rows and 1 cols): year
# Ideally, this would write a,d to file, not just a.

Created on 2020-06-22 by the reprex package (v0.3.0)

My thought process was the by columns are the columns generated which encode the actual grouping. In the second example, d does not encode the actual grouping, so it doesn't count as one of the by columns.

To be clear, not saying the current behavior is wrong, or bad, just that it was unexpected in my mental model.

As I said, it's not much to ask to create the grouping column explicitly before writing, so I have done that.

@MichaelChirico
Copy link
Member

Yes, recognize you're talking about fwrite -- I mentioned fread tangentially as part of the I/O loop -- reading back into R with fread data that you've written with fwrite.

Your use case is slightly different in that the next input step is Excel, not R.

fwrite(c(.BY, .SD)) doesn't work -- I was reminded of this issue just now: #4368

@ColeMiller1
Copy link
Contributor

Another work around is cbind(by_var, .SD):

library(data.table)
dt <- data.table(a = 1, d = as.Date(c('2020-01-02', '2018-01-02')))
dt[,print(cbind(d, .SD)), by = list(year = year(d))]
#>             d     a
#>        <Date> <num>
#> 1: 2020-01-02     1
#>             d     a
#>        <Date> <num>
#> 1: 2018-01-02     1
#> Empty data.table (0 rows and 1 cols): year

These are the two lines that cause the byvars to not be in .SD:

allbyvars = intersect(all.vars(bysub), names_x)

ansvars = sdvars = dupdiff(names_x, union(bynames, allbyvars)) # TO DO: allbyvars here for vars used by 'by'. Document.

While I would expect a different output, @Henrik-P pointed this help text out in #3262

.SD is a data.table containing the Subset of x's Data for each group, excluding any columns used in by

Here, column d is technically be used in by. Although in this case d is only being used to make another column as part of a call.

If we wanted to address this, I would start with refactoring how the bysub is probed with NSE.

@smingerson
Copy link

Thank you for the alternative workaround, always good to learn something new. I didn't know that d would be available in J, since it isn't contained in .BY as one of the grouping variables. I have given a careful read of the special-symbols docs several times, but I misunderstood the text in .SD and .BY.

What confused me is I considered the columns used in by to be those which are returned when computing something by group.

@jangorecki
Copy link
Member

related: #4079

@UweBlock
Copy link
Contributor

UweBlock commented Jul 25, 2021

I find it confusing and contra-intuitive that a computed expression in by= removes all the columns involved in the computation from .SD (although this is well documented in ?.SD). I had expected that a computed grouping variable would create an extra column and keeping the other columns untouched.

I came across this issue while answering this question on SO which requires to create a grouping variable on-the-fly.

My expectation was that the code below should be working

# sample data
ch <- data.table(list_name=c("a","a","","b","","c","c","","d"),
                 name=c(1,2,"",3,"",4,5,"",6))
cl <- data.table(ques=c("a","b","c"),value=c("AB","CD","EF"))
# my expectation
ch[, rbind(.SD, cl[ques %in% list_name], use.names = FALSE), 
   by = .(grp = rleid(list_name))]

but gave the error message

Error in rbindlist(l, use.names, fill, idcol) :
Item 2 has 2 columns, inconsistent with item 1 which has 1 columns. To fill missing columns use fill=TRUE.

because column list_name was missing from .SD.

I had to ressort to the workaround proposed by ColeMiller1:

ch[, cbind(grp = rleid(list_name), .SD)][
  , rbind(.SD, cl[ques %in% list_name], use.names = FALSE), by = grp]

I found three other workarounds

ch[, rbind(.SD, cl[ques %in% list_name], use.names = FALSE), 
   by = .(grp = rleid(ch[["list_name"]]))]

and

ch[, cbind(list_name, .SD), by = .(grp = rleid(list_name))][
  , rbind(.SD, cl[ques %in% list_name], use.names = FALSE), by = grp]

and

ch[, grp := rleid(list_name)][
  , rbind(.SD, cl[ques %in% list_name], use.names = FALSE), by = grp]
ch[, grp := NULL]

which look even more convoluted.


On the other hand, I understand that it might be annoying as well if grouping variables appear twice in the result: First as grouping var in front and then a second time as part of .SD. (And I can't think of a simple workaround to remove those duplicate columns).

Would it be possible to distinguish in the by= and keyby= clauses between

  1. Simple column names which will be removed from .SD and
  2. columns which are used in expressions to create computed grouping variables and which will be kept in .SD?

Examples:

df1 <- data.table(x = c("a1", "a1", "a2"), y = 11:13, z = 21:23)

# case 1
df1[, lapply(.SD, toString), by = x]
##     x      y      z
## 1: a1 11, 12 21, 22
## 2: a2     13     23

# case 2 (actual behaviour)
df1[, lapply(.SD, toString), by = .(grp = substr(x, 1, 1))]
##    grp          y          z
## 1:   a 11, 12, 13 21, 22, 23

# case 2 (suggested behaviour)
df1[, lapply(.SD, toString), by = .(grp = substr(x, 1, 1))]
##    grp          x          y          z
## 1:   a a1, a1, a2 11, 12, 13 21, 22, 23

# case 2 (suggested behaviour, default name for computed grouping variable)
df1[, lapply(.SD, toString), by = .(substr(x, 1, 1))]
##     g1          x          y          z
## 1:   a a1, a1, a2 11, 12, 13 21, 22, 23

# case 2 (suggested behaviour, force column to appear in .SD)
df1[, lapply(.SD, toString), by = .(I(x))]
##    g1      x      y      z
## 1: a1 a1, a1 11, 12 21, 22
## 2: a2     a2     13     23



The default names g1, g2, etc for computed grouping variables are named in a similar way as unnamed expressions in j (v1, v2, etc).

The I() function in by= forces a column to appear in .SD . This is taken from base R (the new substitute2() function uses I() as well).

@jangorecki jangorecki added this to the 2.0.0 milestone Nov 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

10 participants