Melt on multiple columns: undocumented behavior #4047

otoomet · 2019-11-16T00:52:20Z

Consider a simple data table:

dt <- data.table(i = 1:2, na = rnorm(2), nb=rnorm(2), 
                          ua=runif(2), ub=runif(2))
dt
##    i         na        nb         ua        ub
## 1: 1  0.8675148 1.1900491 0.09394934 0.2717421
## 2: 2 -0.1700282 0.9188715 0.58017687 0.5443863

when melting it into multiple colums we get:

melt(dt, measure.vars=list(c("na", "nb"), c("ua", "ub")))
##    i variable     value1     value2
## 1: 1        1  0.8675148 0.09394934
## 2: 2        1 -0.1700282 0.58017687
## 3: 1        2  1.1900491 0.27174213
## 4: 2        2  0.9188715 0.54438634

In particular variable is a factor with levels "1" and "2". This behavior seems to be undocumented. ?melt tells

‘list’ is a generalization of the vector version - each
element of the list (which should be ‘integer’ or
‘character’ as above) will become a ‘melt’ed column.

and

From version ‘1.9.6’, ‘melt’ gains a feature with ‘measure.vars’
accepting a list of ‘character’ or ‘integer’ vectors as well to
melt into multiple columns in a single function call efficiently.
The function ‘patterns’ can be used to provide regular expression
patterns. When used along with ‘melt’, if ‘cols’ argument is not
provided, the patterns will be matched against ‘names(data)’, for
convenience.

However, I cannot find anything about

the fact that the variable will use numbered labels to denote the original columns, and
what is the relationship between the numeric label and the original column name.

I know there are related feature requests (#2551and #3396). I am also aware of related solutions (e.g. on SO) that revolve around renaming the corresponding factor levels. However, for such solutions to be considered safe, the behavior of numeric levels should be documented and considered part of the API.

data.table 1.12.6; R 3.4, 3.6.

The text was updated successfully, but these errors were encountered:

ssh352 · 2020-01-19T08:34:16Z

I have the same question. It would be nice for variable column to have factors "a" and "b" instead of 1 and 2.

ggrothendieck · 2020-04-17T12:10:11Z

Look at this example, which has data.table, tidyr and base reshape versions of the same problem for comparison. The data.table version is also shown below.

https://stackoverflow.com/questions/61259478/manipulating-data-frame-format-in-base-r/61260008#61260008

variable.name is the name of an index into the times. We need to perform that indexing in post processing where pivot_longer does it automatically. Even base reshape is slightly better here as there is a times= argument that provides the times which are automatically indexed into so you don't have to do it yourself. Maybe melt could have such an argument or perhaps some other method of doing this. This would slightly simplify the second last line of code below.

Also it would be nice if melt had a drop_na argument to drop rows having NAs like pivot_longer does and have a key= argument to sort the result to reduce the amount of post processing. This would eliminate the last line in the code below.

library(data.table)

input <-
structure(list(id = c(1, 3, 6), school = structure(1:3, .Label = c("A", 
"B", "C"), class = "factor"), read_1 = c(20, 22, 24), read_1_sp = c(TRUE, 
FALSE, TRUE), read_2 = c(45, 47, 49), read_2_sp = c(FALSE, FALSE, 
FALSE), math_1 = c(20, 22, NA), math_1_sp = c(TRUE, FALSE, NA
), math_2 = c(NA, 35, 37), math_2_sp = c(NA, FALSE, FALSE)), class = "data.frame", 
row.names = c(NA, -3L))

input2 <- as.data.table(input, key = c("id", "school"))
times <- grep("\\d$", names(input2), value = TRUE)  # score col names

melt(input2, measure = patterns(sp = "sp", score = "\\d$"), variable.name = "subject")[, 
  c("subject", "no"):= fread(text = times[subject], sep = "_")][,   
  na.omit(.SD), key = key(input2)]

tdhock · 2020-09-29T03:58:30Z

hi this is solved by a new function in nc package which uses melt.data.table internally,

library(data.table)
dt <- data.table(
  i = 1:2, na = rnorm(2), nb=rnorm(2), 
  ua=runif(2), ub=runif(2))  
nc::capture_melt_multiple(dt, column="[un]", letter="[ab]")
#>    i letter          n         u
#> 1: 1      a  0.5509765 0.7095506
#> 2: 2      a  0.7278650 0.2971809
#> 3: 1      b -0.4690630 0.9605627
#> 4: 2      b -1.4568312 0.3414062
nc::capture_melt_multiple(dt, letter="[un]", column="[ab]")
#>    i letter         a          b
#> 1: 1      n 0.5509765 -0.4690630
#> 2: 2      n 0.7278650 -1.4568312
#> 3: 1      u 0.7095506  0.9605627
#> 4: 2      u 0.2971809  0.3414062

jangorecki · 2020-09-29T12:41:50Z

Again, we should first discuss if that functionality is going to be in scope of DT before closing.

tdhock · 2020-10-01T05:23:14Z

the original question: "what is the relationship between the numeric label and the original column name?" it is true that we should add documentation about what values go into the variable column. This is related to #4455 in which the problem is that the numeric label in the variable column is not consistent between na.rm=TRUE and FALSE.

AFAICT the closest documentation is that variable.name is name for the measured variable names column which is somewhat ambiguous. I would recommend revising to something like:
variable.name: name of output column containing information about which input columns the data came from. If measure.vars is an integer/character vector, then this column contains names of melted columns from the input data table. If measure.vars is a list of integer/character vectors, then this column contains integers indicating the index/position in each of those vectors.

tdhock · 2020-10-07T04:01:39Z

even better just avoid the variable column altogether, use new functionality in #4731

remotes::install_github("Rdatatable/data.table@melt-custom-variable")
#> Skipping install of 'data.table' from a github remote, the SHA1 (c02fa9e8) has not changed since last install.
#>   Use `force = TRUE` to force installation
library(data.table)
dt <- data.table(
  i = 1:2, na = rnorm(2), nb=rnorm(2), 
  ua=runif(2), ub=runif(2))  
melt(dt, measure.vars=measure(value.name, letter, pattern="([un])([ab])"))
#>    i letter          n         u
#> 1: 1      a -0.6042333 0.3756086
#> 2: 2      a  0.4125218 0.2719224
#> 3: 1      b  0.2163859 0.5793461
#> 4: 2      b -0.6725394 0.1945757
melt(dt, measure.vars=measure(letter, value.name, pattern="([un])([ab])"))
#>    i letter          a          b
#> 1: 1      n -0.6042333  0.2163859
#> 2: 2      n  0.4125218 -0.6725394
#> 3: 1      u  0.3756086  0.5793461
#> 4: 2      u  0.2719224  0.1945757

otoomet · 2020-10-08T17:06:33Z

Just a reminder here: the issue is about missing documentation of the current behavior.

Myself, I would prefer the new functionality, but I can work around with the existing one. But without details laid out in documentation, it feels somewhat unsafe to assume how the current approach works. In particular, to assume that the numeric values correspond to the original values in alphabetic order.

tdhock · 2020-10-08T17:24:31Z

I have proposed a doc fix in #4723
\item{variable.name}{name (default \code{'variable'}) of output column containing information about which input column(s) were melted. If \code{measure.vars} is an integer/character vector, then each entry of this column contains the name of a melted column from \code{data}. If \code{measure.vars} is a list of integer/character vectors, then each entry of this column contains an integer indicating an index/position in each of those vectors.}

tdhock · 2021-05-24T15:57:46Z

hi again the doc fix in #4723 was recently merged into master, so if this is good enough for you then @otoomet can you please close this issue?
From ?melt

variable.name: name (default ‘'variable'’) of output column containing
          information about which input column(s) were melted. If
          ‘measure.vars’ is an integer/character vector, then each
          entry of this column contains the name of a melted column
          from ‘data’. If ‘measure.vars’ is a list of integer/character
          vectors, then each entry of this column contains an integer
          indicating an index/position in each of those vectors.

tdhock · 2024-11-29T20:01:45Z

closed since I believe the doc changes address the issue

jangorecki added the reshape dcast melt label Dec 1, 2019

mrdwab mentioned this issue May 18, 2020

Ability to explicitly pass index names in melt.data.table measure.vars argument #4027

Closed

tdhock closed this as completed Sep 29, 2020

jangorecki reopened this Sep 29, 2020

myoung3 mentioned this issue Mar 22, 2021

pivot_longer(): Incorrect "id" labels when using ".value" in names_to tidyverse/dtplyr#228

Closed

MichaelChirico added this to the 1.14.1 milestone Jun 8, 2021

jangorecki modified the milestones: 1.14.3, 1.14.5 Jul 19, 2022

jangorecki modified the milestones: 1.14.11, 1.15.1 Oct 29, 2023

MichaelChirico modified the milestones: 1.16.0, 1.17.0 Jul 10, 2024

tdhock closed this as completed Nov 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Melt on multiple columns: undocumented behavior #4047

Melt on multiple columns: undocumented behavior #4047

otoomet commented Nov 16, 2019

ssh352 commented Jan 19, 2020

ggrothendieck commented Apr 17, 2020 •

edited

Loading

tdhock commented Sep 29, 2020

jangorecki commented Sep 29, 2020

tdhock commented Oct 1, 2020

tdhock commented Oct 7, 2020

otoomet commented Oct 8, 2020

tdhock commented Oct 8, 2020

tdhock commented May 24, 2021

tdhock commented Nov 29, 2024

Melt on multiple columns: undocumented behavior #4047

Melt on multiple columns: undocumented behavior #4047

Comments

otoomet commented Nov 16, 2019

ssh352 commented Jan 19, 2020

ggrothendieck commented Apr 17, 2020 • edited Loading

tdhock commented Sep 29, 2020

jangorecki commented Sep 29, 2020

tdhock commented Oct 1, 2020

tdhock commented Oct 7, 2020

otoomet commented Oct 8, 2020

tdhock commented Oct 8, 2020

tdhock commented May 24, 2021

tdhock commented Nov 29, 2024

ggrothendieck commented Apr 17, 2020 •

edited

Loading