Skip to content

Index gets lost when DataFrame melt method is used #17440

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
NiklasKeck opened this issue Sep 5, 2017 · 5 comments · Fixed by #33659
Closed

Index gets lost when DataFrame melt method is used #17440

NiklasKeck opened this issue Sep 5, 2017 · 5 comments · Fixed by #33659
Labels
API Design Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Milestone

Comments

@NiklasKeck
Copy link

Index gets lost when DataFrame melt method is used

import pandas as pd
import numpy as np
df = pd.DataFrame({"Numbers_1":range(0,3),
                   "Numbers_2":range(3,6),
                   "Letters":["A","B","C"]})
df.set_index("Letters",inplace=True)
print(df)
Letters Numbers_1 Numbers_2
A 0 3
B 1 4
C 2 5
df_melted = df.melt()
print(df_melted)
. variable value
0 Numbers_1 0
1 Numbers_1 1
2 Numbers_1 2
3 Numbers_2 3
4 Numbers_2 4
5 Numbers_2 5

Problem description

When melting a dataframe, I expected the original index to be reused. However, the original index is lost in the melt method. This is probably meant by wesm's comment (# TODO: what about the existing index?)

# TODO: what about the existing index?

Expected Output

I would expect something like

n_row,n_col = df.shape
index_melted = list(df.index.get_values())*n_col
melt_id = list(np.arange(n_col).repeat(n_row))
temp = list(zip(*[index_melted,melt_id]))

index_melted_uniq = pd.MultiIndex.from_tuples(temp,names=[df.index.names[0], 'melt_id'])
index_numbers = list(range(df.shape[1]))*n_row

data = {'variable':df.columns.repeat(n_row),
        "value":df.values.ravel('F')}

df_expected = pd.DataFrame(data,columns = ["variable","value"], index=index_melted_uniq)
print(df_expected)
Letters melt_id variable value
A 0 Numbers_1 0
B 0 Numbers_1 1
C 0 Numbers_1 2
A 1 Numbers_2 3
B 1 Numbers_2 4
C 1 Numbers_2 5

Where Letters and melt_id are two multiindex levels and variable and value are actual columns.

Output of pd.show_versions()

[paste the output of pd.show_versions() here below this line]
INSTALLED VERSIONS

commit: d0f62c2816ada96a991f5a624a52c9a4f09617f7
python: 3.6.2.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

pandas: 0.21.0.dev+420.gd0f62c2
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.2.2.post20170724
Cython: 0.26
numpy: 1.13.1
scipy: None
pyarrow: None
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Copy link
Contributor

Thanks @NiklasKeck. I thought about proposing this a while back, but never wrote up an issue. melt is already quite complex as is, but this seems worthwhile to avoid an awkward .reset_index() / .set_index() dance.

I never worked out the correct way to handle the interaction between the existing index and the id_vars. Your melt_id is an option but I'll need to think about it more. In an ideal world, I think that df.index + id_vars would always be unique, and we'd use that as the MI:

In [34]: df.reset_index().melt(id_vars=['Letters']).set_index(['Letters', 'variable'])
Out[34]:
                   value
Letters variable
A       Numbers_1      0
B       Numbers_1      1
C       Numbers_1      2
A       Numbers_2      3
B       Numbers_2      4
C       Numbers_2      5

but that may not be true in general.

Anyway, I think this would be a useful addition (as an option keyword, to preserve backwards compatibility)

@TomAugspurger TomAugspurger added API Design Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Sep 5, 2017
@NiklasKeck
Copy link
Author

NiklasKeck commented Sep 5, 2017

Thanks @TomAugspurger for the rapid and nice response. Your idea to use the original column names (variable) for the additional level is more intuitive than the arbitrary melt_id. I think the result should be unique if the original index and column names were unique. I will think about how to implement that.
A boolean optional keyword argument like keep_index=False might be an idea to ensure backwards compatibility.

NiklasKeck added a commit to NiklasKeck/pandas that referenced this issue Sep 6, 2017
Setting keep_index to True will reuse the original DataFrame index +
names of melted columns as additional level. closes issue pandas-dev#17440
NiklasKeck added a commit to NiklasKeck/pandas that referenced this issue Sep 7, 2017
Setting keep_index to True will reuse the original DataFrame index +
names of melted columns as additional level. closes issue pandas-dev#17440
NiklasKeck added a commit to NiklasKeck/pandas that referenced this issue Sep 7, 2017
Setting keep_index to True will reuse the original DataFrame index +
names of melted columns as additional level. closes issue pandas-dev#17440
@gitgithan
Copy link

I just met this issue too and felt it's such a pity for losing valuable index information. What happened to this feature addition and what can i do to make it happen?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Dec 11, 2018 via email

@nickdelgrosso
Copy link

I'm very interested in this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
5 participants