Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Melt on multiple columns: undocumented behavior #4047

Closed
otoomet opened this issue Nov 16, 2019 · 10 comments
Closed

Melt on multiple columns: undocumented behavior #4047

otoomet opened this issue Nov 16, 2019 · 10 comments
Labels
reshape dcast melt
Milestone

Comments

@otoomet
Copy link

otoomet commented Nov 16, 2019

Consider a simple data table:

dt <- data.table(i = 1:2, na = rnorm(2), nb=rnorm(2), 
                          ua=runif(2), ub=runif(2))
dt
##    i         na        nb         ua        ub
## 1: 1  0.8675148 1.1900491 0.09394934 0.2717421
## 2: 2 -0.1700282 0.9188715 0.58017687 0.5443863

when melting it into multiple colums we get:

melt(dt, measure.vars=list(c("na", "nb"), c("ua", "ub")))
##    i variable     value1     value2
## 1: 1        1  0.8675148 0.09394934
## 2: 2        1 -0.1700282 0.58017687
## 3: 1        2  1.1900491 0.27174213
## 4: 2        2  0.9188715 0.54438634

In particular variable is a factor with levels "1" and "2". This behavior seems to be undocumented. ?melt tells

‘list’ is a generalization of the vector version - each
element of the list (which should be ‘integer’ or
‘character’ as above) will become a ‘melt’ed column.

and

From version ‘1.9.6’, ‘melt’ gains a feature with ‘measure.vars’
accepting a list of ‘character’ or ‘integer’ vectors as well to
melt into multiple columns in a single function call efficiently.
The function ‘patterns’ can be used to provide regular expression
patterns. When used along with ‘melt’, if ‘cols’ argument is not
provided, the patterns will be matched against ‘names(data)’, for
convenience.

However, I cannot find anything about

  1. the fact that the variable will use numbered labels to denote the original columns, and
  2. what is the relationship between the numeric label and the original column name.

I know there are related feature requests (#2551and #3396). I am also aware of related solutions (e.g. on SO) that revolve around renaming the corresponding factor levels. However, for such solutions to be considered safe, the behavior of numeric levels should be documented and considered part of the API.

data.table 1.12.6; R 3.4, 3.6.

@jangorecki jangorecki added the reshape dcast melt label Dec 1, 2019
@ssh352
Copy link

ssh352 commented Jan 19, 2020

I have the same question. It would be nice for variable column to have factors "a" and "b" instead of 1 and 2.

@ggrothendieck
Copy link

ggrothendieck commented Apr 17, 2020

Look at this example, which has data.table, tidyr and base reshape versions of the same problem for comparison. The data.table version is also shown below.

https://stackoverflow.com/questions/61259478/manipulating-data-frame-format-in-base-r/61260008#61260008

variable.name is the name of an index into the times. We need to perform that indexing in post processing where pivot_longer does it automatically. Even base reshape is slightly better here as there is a times= argument that provides the times which are automatically indexed into so you don't have to do it yourself. Maybe melt could have such an argument or perhaps some other method of doing this. This would slightly simplify the second last line of code below.

Also it would be nice if melt had a drop_na argument to drop rows having NAs like pivot_longer does and have a key= argument to sort the result to reduce the amount of post processing. This would eliminate the last line in the code below.

library(data.table)

input <-
structure(list(id = c(1, 3, 6), school = structure(1:3, .Label = c("A", 
"B", "C"), class = "factor"), read_1 = c(20, 22, 24), read_1_sp = c(TRUE, 
FALSE, TRUE), read_2 = c(45, 47, 49), read_2_sp = c(FALSE, FALSE, 
FALSE), math_1 = c(20, 22, NA), math_1_sp = c(TRUE, FALSE, NA
), math_2 = c(NA, 35, 37), math_2_sp = c(NA, FALSE, FALSE)), class = "data.frame", 
row.names = c(NA, -3L))

input2 <- as.data.table(input, key = c("id", "school"))
times <- grep("\\d$", names(input2), value = TRUE)  # score col names

melt(input2, measure = patterns(sp = "sp", score = "\\d$"), variable.name = "subject")[, 
  c("subject", "no"):= fread(text = times[subject], sep = "_")][,   
  na.omit(.SD), key = key(input2)]  

@tdhock
Copy link
Member

tdhock commented Sep 29, 2020

hi this is solved by a new function in nc package which uses melt.data.table internally,

library(data.table)
dt <- data.table(
  i = 1:2, na = rnorm(2), nb=rnorm(2), 
  ua=runif(2), ub=runif(2))  
nc::capture_melt_multiple(dt, column="[un]", letter="[ab]")
#>    i letter          n         u
#> 1: 1      a  0.5509765 0.7095506
#> 2: 2      a  0.7278650 0.2971809
#> 3: 1      b -0.4690630 0.9605627
#> 4: 2      b -1.4568312 0.3414062
nc::capture_melt_multiple(dt, letter="[un]", column="[ab]")
#>    i letter         a          b
#> 1: 1      n 0.5509765 -0.4690630
#> 2: 2      n 0.7278650 -1.4568312
#> 3: 1      u 0.7095506  0.9605627
#> 4: 2      u 0.2971809  0.3414062

@tdhock tdhock closed this as completed Sep 29, 2020
@jangorecki
Copy link
Member

Again, we should first discuss if that functionality is going to be in scope of DT before closing.

@jangorecki jangorecki reopened this Sep 29, 2020
@tdhock
Copy link
Member

tdhock commented Oct 1, 2020

the original question: "what is the relationship between the numeric label and the original column name?" it is true that we should add documentation about what values go into the variable column. This is related to #4455 in which the problem is that the numeric label in the variable column is not consistent between na.rm=TRUE and FALSE.

AFAICT the closest documentation is that variable.name is name for the measured variable names column which is somewhat ambiguous. I would recommend revising to something like:
variable.name: name of output column containing information about which input columns the data came from. If measure.vars is an integer/character vector, then this column contains names of melted columns from the input data table. If measure.vars is a list of integer/character vectors, then this column contains integers indicating the index/position in each of those vectors.

@tdhock
Copy link
Member

tdhock commented Oct 7, 2020

even better just avoid the variable column altogether, use new functionality in #4731

remotes::install_github("Rdatatable/data.table@melt-custom-variable")
#> Skipping install of 'data.table' from a github remote, the SHA1 (c02fa9e8) has not changed since last install.
#>   Use `force = TRUE` to force installation
library(data.table)
dt <- data.table(
  i = 1:2, na = rnorm(2), nb=rnorm(2), 
  ua=runif(2), ub=runif(2))  
melt(dt, measure.vars=measure(value.name, letter, pattern="([un])([ab])"))
#>    i letter          n         u
#> 1: 1      a -0.6042333 0.3756086
#> 2: 2      a  0.4125218 0.2719224
#> 3: 1      b  0.2163859 0.5793461
#> 4: 2      b -0.6725394 0.1945757
melt(dt, measure.vars=measure(letter, value.name, pattern="([un])([ab])"))
#>    i letter          a          b
#> 1: 1      n -0.6042333  0.2163859
#> 2: 2      n  0.4125218 -0.6725394
#> 3: 1      u  0.3756086  0.5793461
#> 4: 2      u  0.2719224  0.1945757

@otoomet
Copy link
Author

otoomet commented Oct 8, 2020

Just a reminder here: the issue is about missing documentation of the current behavior.

Myself, I would prefer the new functionality, but I can work around with the existing one. But without details laid out in documentation, it feels somewhat unsafe to assume how the current approach works. In particular, to assume that the numeric values correspond to the original values in alphabetic order.

@tdhock
Copy link
Member

tdhock commented Oct 8, 2020

I have proposed a doc fix in #4723
\item{variable.name}{name (default \code{'variable'}) of output column containing information about which input column(s) were melted. If \code{measure.vars} is an integer/character vector, then each entry of this column contains the name of a melted column from \code{data}. If \code{measure.vars} is a list of integer/character vectors, then each entry of this column contains an integer indicating an index/position in each of those vectors.}

@tdhock
Copy link
Member

tdhock commented May 24, 2021

hi again the doc fix in #4723 was recently merged into master, so if this is good enough for you then @otoomet can you please close this issue?
From ?melt

variable.name: name (default ‘'variable'’) of output column containing
          information about which input column(s) were melted. If
          ‘measure.vars’ is an integer/character vector, then each
          entry of this column contains the name of a melted column
          from ‘data’. If ‘measure.vars’ is a list of integer/character
          vectors, then each entry of this column contains an integer
          indicating an index/position in each of those vectors.

@MichaelChirico MichaelChirico added this to the 1.14.1 milestone Jun 8, 2021
@jangorecki jangorecki modified the milestones: 1.14.3, 1.14.5 Jul 19, 2022
@jangorecki jangorecki modified the milestones: 1.14.11, 1.15.1 Oct 29, 2023
@MichaelChirico MichaelChirico modified the milestones: 1.16.0, 1.17.0 Jul 10, 2024
@tdhock tdhock closed this as completed Nov 29, 2024
@tdhock
Copy link
Member

tdhock commented Nov 29, 2024

closed since I believe the doc changes address the issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
reshape dcast melt
Projects
None yet
Development

No branches or pull requests

6 participants