Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possibility to provide group labels in melt variable column when melting a list of columns simultaneously #2551

Closed
matthiasgomolka opened this issue Jan 5, 2018 · 11 comments · Fixed by #4731
Assignees
Milestone

Comments

@matthiasgomolka
Copy link

matthiasgomolka commented Jan 5, 2018

Right now, there is a loss of information when one melts a list of columns simultaneously since the variable column only gives the number of the group (as a factor by default), not a name.

It would be great if the functionality of melting a list of columns simultaneously would be extended such that it is possible to provide labels for the factor in the variable ###column. I would suggest an additional argument, e.g. variable.labels which makes it possible to provide custom labels.

# Minimal reproducible example

library(data.table)

DT <- data.table(val_loss = 1:2,
                 val_acc = 3:4,
                 loss = 5:6,
                 acc = 7:8,
                 id = c("A", "B"))

DT_melted <- 
  melt(DT,
       measure.vars = list(c("loss", "acc"), c("val_loss", "val_acc")),
# the following line shows what I think would be ideal
#      variable.labels = c("group 1", "group 2"),
       value = c("training", "validation"))

# the following step should become obsolete by providing the labels during the melt call above
DT_melted[, variable := factor(variable, levels = 1:2, labels = c("group 1", "group 2"))]

# Output of sessionInfo()

R version 3.3.3 (2017-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C                    LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.4-3

loaded via a namespace (and not attached):
[1] tools_3.3.3

I know that the R Version is not up to date but at my Company it takes a while until we get the most recent version.

Sorry, for not labeling the issue. Somehow I could not figure out, how to set them.

@matthiasgomolka matthiasgomolka changed the title Possibility to provide group labels in variable column when melting a list of columns simultaneously #Enhancement #melt Possibility to provide group labels in variable column when melting a list of columns simultaneously @Enhancement @melt Jan 5, 2018
@matthiasgomolka matthiasgomolka changed the title Possibility to provide group labels in variable column when melting a list of columns simultaneously @Enhancement @melt Possibility to provide group labels in variable column when melting a list of columns simultaneously label:Enhancement #melt Jan 5, 2018
@matthiasgomolka matthiasgomolka changed the title Possibility to provide group labels in variable column when melting a list of columns simultaneously label:Enhancement #melt Possibility to provide group labels in variable column when melting a list of columns simultaneously #Enhancement #melt Jan 5, 2018
@MichaelChirico
Copy link
Member

I don't understand your comment about "information loss." There's a 1-1 mapping between your desired output and the actual output?

I don't see much wrong with:

levels(DT_melted$variable) = paste('group', levels(DT_melted$variable))

@matthiasgomolka
Copy link
Author

What I mean by "information loss" is that you need to attach the labels afterwards as you did above and as I did in the last line of the minimal example. I feel that this increases the risk of mixing up the labels.

But you are right, it can be done afterwards using the same vector of group names. Therefore, the proposed feature would be rather "nice to have" than adding something which cannot be done right now.

@Henrik-P
Copy link

Henrik-P commented Jan 5, 2018

@MichaelChirico
Copy link
Member

@Henrik-P yes appears quite related. In fact playing around with OP's example, I inadvertently tried some of the fixes for #1547 that don't work yet...

@mrdwab
Copy link

mrdwab commented Jan 18, 2018

@MichaelChirico , Glad to hear this is being worked on. I don't know how much of a priority it is for you guys, but it would be great if you also considered how to deal with "unbalanced" data that is in a panel form.

Building on #2564, let's imagine we were missing a few columns:

library(data.table)
set.seed(2334)
DT <- data.table(
  a_alpha = rnorm(3), a_gamma = rnorm(3),
  b_beta = rnorm(3), b_gamma = rnorm(3),
  id = c(1:3)
)

It would still be fantastic to be able to use the measure.vars = patterns() approach, but that does not give the right result, and you can't actually "restore" labels as is demonstrated in this issue.

melt(DT, measure.vars = patterns("a_", "b_"))
#    id variable     value1       value2
# 1:  1        1 -0.1183107  1.682018318
# 2:  2        1  1.2370906 -0.573611132
# 3:  3        1  0.8088209 -0.057320032
# 4:  1        2 -0.7656264 -0.706428227
# 5:  2        2 -0.5919939  0.001899857
# 6:  3        2  0.5279071  1.063851211

To get the correct output, you would have to do the more traditional melt + strsplit + dcast routine:

melt(DT, "id")[, c("stub", "var") := tstrsplit(variable, "_")][, dcast(.SD, id + var ~ stub)]
#    id   var          a            b
# 1:  1 alpha -0.1183107           NA
# 2:  1  beta         NA  1.682018318
# 3:  1 gamma -0.7656264 -0.706428227
# 4:  2 alpha  1.2370906           NA
# 5:  2  beta         NA -0.573611132
# 6:  2 gamma -0.5919939  0.001899857
# 7:  3 alpha  0.8088209           NA
# 8:  3  beta         NA -0.057320032
# 9:  3 gamma  0.5279071  1.063851211

I had previously written merged.stack as part of "splitstackshape" that melted columns separately and then merged them together. Needless to say, not a very efficient approach.

library(splitstackshape)
merged.stack(DT, var.stubs = c("a_", "b_"), sep = "var.stubs")
#    id .time_1         a_           b_
# 1:  1   alpha -0.1183107           NA
# 2:  1    beta         NA  1.682018318
# 3:  1   gamma -0.7656264 -0.706428227
# 4:  2   alpha  1.2370906           NA
# 5:  2    beta         NA -0.573611132
# 6:  2   gamma -0.5919939  0.001899857
# 7:  3   alpha  0.8088209           NA
# 8:  3    beta         NA -0.057320032
# 9:  3   gamma  0.5279071  1.063851211

More recently, working on a replacement for merged.stack I used a different approach where I added in the missing columns before melting.

ReshapeLong_(DT, c(a = "a_", b = "b_"))
#    id variable          a            b
# 1:  1    alpha -0.1183107           NA
# 2:  2    alpha  1.2370906           NA
# 3:  3    alpha  0.8088209           NA
# 4:  1     beta         NA  1.682018318
# 5:  2     beta         NA -0.573611132
# 6:  3     beta         NA -0.057320032
# 7:  1    gamma -0.7656264 -0.706428227
# 8:  2    gamma -0.5919939  0.001899857
# 9:  3    gamma  0.5279071  1.063851211

This works, but:

  1. I'm sure there's a better way
  2. Since column types aren't guessed before adding the empty column, data.table complains about the molten values not being all the same type, so you have lots of warnings

Anyway--as I mentioned, great to hear that this is being worked on. Just thought I would throw this out as a possible test you would want to look into.

@mrdwab
Copy link

mrdwab commented Jan 18, 2018

Also, I don't know whether Stata handles unbalanced wide to long transformation, but from what I can tell, pandas handles it with wide_to_long:

import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame({"A1970" : {0 : "a", 1 : "b", 2 : "c"},
                   "A1980" : {0 : "d", 1 : "e", 2 : "f"},
                   "A1990" : {0 : "g", 1 : "h", 2 : "i"},
                   "B1970" : {0 : 2.5, 1 : 1.2, 2 : .7},
                   "B1980" : {0 : 3.2, 1 : 1.3, 2 : .1},
                   "X"     : dict(zip(range(3), np.random.randn(3)))
                  })
df["id"] = df.index

print(df)

  A1970 A1980 A1990  B1970  B1980         X  id
0     a     d     g    2.5    3.2 -1.085631   0
1     b     e     h    1.2    1.3  0.997345   1
2     c     f     i    0.7    0.1  0.282978   2

print(pd.wide_to_long(df, ["A", "B"], i="id", j="year"))

                X  A    B
id year                  
0  1970 -1.085631  a  2.5
   1980 -1.085631  d  3.2
   1990 -1.085631  g  NaN
1  1970  0.997345  b  1.2
   1980  0.997345  e  1.3
   1990  0.997345  h  NaN
2  1970  0.282978  c  0.7
   1980  0.282978  f  0.1
   1990  0.282978  i  NaN

@MichaelChirico
Copy link
Member

@mrdwab Could you file a separate issue with this? Looks very useful

@MichaelChirico MichaelChirico self-assigned this Feb 2, 2018
@mattdowle mattdowle modified the milestones: v1.10.6, Candidate Mar 30, 2018
@mattdowle mattdowle removed this from the Candidate milestone May 10, 2018
@MichaelChirico MichaelChirico changed the title Possibility to provide group labels in variable column when melting a list of columns simultaneously #Enhancement #melt Possibility to provide group labels in melt variable column when melting a list of columns simultaneously Nov 20, 2019
@jangorecki jangorecki added the reshape dcast melt label Apr 5, 2020
@tdhock
Copy link
Member

tdhock commented Sep 29, 2020

hi the original issue is solved via

library(data.table)
DT <- data.table(
  val_loss = 1:2,
  val_acc = 3:4,
  loss = 5:6,
  acc = 7:8,
  id = c("A", "B"))
nc::capture_melt_multiple(
  DT,
  column="val_|", function(x)ifelse(x=="", "training", "validation"),
  variable="loss|acc")
#>    id variable training validation
#> 1:  A      acc        7          3
#> 2:  B      acc        8          4
#> 3:  A     loss        5          1
#> 4:  B     loss        6          2
nc::capture_melt_multiple(
  DT,
  set="val_|", function(x)ifelse(x=="", "training", "validation"),
  column="loss|acc")
#>    id        set acc loss
#> 1:  A   training   7    5
#> 2:  B   training   8    6
#> 3:  A validation   3    1
#> 4:  B validation   4    2

@tdhock tdhock closed this as completed Sep 29, 2020
@jangorecki
Copy link
Member

jangorecki commented Sep 29, 2020

Code is not reproducible due to unkown namespace nc. Anyway before closing it is good to merge unit test to ensure requested functionality will keep working in future.

@jangorecki jangorecki reopened this Sep 29, 2020
@tdhock
Copy link
Member

tdhock commented Sep 29, 2020

install.packages('nc') ;)

@tdhock
Copy link
Member

tdhock commented Oct 7, 2020

pure data.table solution using #4731

remotes::install_github("Rdatatable/data.table@melt-custom-variable")
#> Skipping install of 'data.table' from a github remote, the SHA1 (c02fa9e8) has not changed since last install.
#>   Use `force = TRUE` to force installation
library(data.table)
DT <- data.table(
  val_loss = 1:2,
  val_acc = 3:4,
  loss = 5:6,
  acc = 7:8,
  id = c("A", "B"))
melt(DT, measure.vars=measure(
  value.name=function(x)ifelse(x=="", "training", "validation"),
  variable, pattern="(val_|)(loss|acc)"))
#>    id variable validation training
#> 1:  A     loss          1        5
#> 2:  B     loss          2        6
#> 3:  A      acc          3        7
#> 4:  B      acc          4        8
melt(DT, measure.vars=measure(
  set=function(x)ifelse(x=="", "training", "validation"),
  value.name, pattern="(val_|)(loss|acc)"))
#>    id        set loss acc
#> 1:  A validation    1   3
#> 2:  B validation    2   4
#> 3:  A   training    5   7
#> 4:  B   training    6   8

@mattdowle mattdowle added this to the 1.14.1 milestone May 9, 2021
@jangorecki jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants