ENH: Support Multi-Index for columns in parquet format #34777

Closed
yohplala opened this issue Jun 14, 2020 · 4 comments · Fixed by #36305


yohplala commented Jun 14, 2020

Is your feature request related to a problem?

I would like to save a DataFrame whose columns are a MultiIndex to the parquet format.
This is currently not possible.

import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    columns=pd.MultiIndex.from_product([["1"], ["a", "b", "c"]]),
)
# Fails regardless of the engine:
# ValueError: parquet must have string column names
df.to_parquet("test.parquet")

Describe alternatives you've considered

As a workaround with the current state of the library, a snippet has been shared on Stack Overflow.
I will use it for now.

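import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Write through pyarrow directly, bypassing the column-name check in DataFrame.to_parquet
table = pa.Table.from_pandas(df)
pq.write_table(table, "test.parquet")
# Reading back goes through pandas as usual
df_test_read = pd.read_parquet("test.parquet")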

Thanks for your help!
Bests,

yohplala added the Enhancement and Needs Triage labels Jun 14, 2020

rhshadrach (Member) commented

After simply removing this check:

# must have value column names (strings only)
if df.columns.inferred_type not in {"string", "empty"}:
    raise ValueError("parquet must have string column names")

the example works, although I'm not sure what the limitations are. I've tested a couple of other cases, such as columns=[1, 2, 3], and things seem to generally work. Even odd cases like columns=[(1, 2), 3, 4] "work", although the labels are converted to ["('1', '2')", '3', '4'] when read back in (with a warning message).
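To illustrate why the check quoted above rejects the DataFrame from the original report, it may help to inspect `inferred_type` directly. The following is a small sketch; the variable names are illustrative and not taken from the pandas source.

```python
import pandas as pd

# The set of inferred types accepted by the check quoted above
allowed = {"string", "empty"}

flat = pd.Index(["a", "b", "c"])
ints = pd.Index([1, 2, 3])
multi = pd.MultiIndex.from_product([["1"], ["a", "b", "c"]])

flat_ok = flat.inferred_type in allowed    # plain string labels pass
ints_ok = ints.inferred_type in allowed    # integer labels are rejected
multi_ok = multi.inferred_type in allowed  # the MultiIndex as a whole is rejected

# Each individual level of the MultiIndex, however, holds only strings
levels_ok = all(level.inferred_type in allowed for level in multi.levels)

print(flat.inferred_type, ints.inferred_type, multi.inferred_type)
print([level.inferred_type for level in multi.levels])
```

So a MultiIndex fails the check as a whole even when every level is string-typed, which suggests validating each level separately.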

TomAugspurger (Contributor) commented

The check needs to be updated to handle MultiIndexes. Something like

# Column labels are what parquet requires to be strings, so check df.columns
if df.columns.nlevels > 1:
    if not all(x.inferred_type in {"string", "empty"} for x in df.columns.levels):
        raise ValueError(...)
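A combined check covering both flat and MultiIndex columns could look roughly like the sketch below. It is written against `df.columns`, since the issue concerns column labels. The helper name `check_parquet_column_names` and the wording of the MultiIndex error message are assumptions made for illustration, not the code that was eventually merged.

```python
import pandas as pd

ALLOWED_TYPES = {"string", "empty"}


def check_parquet_column_names(df: pd.DataFrame) -> None:
    """Hypothetical helper: raise ValueError unless all column labels are strings."""
    columns = df.columns
    if isinstance(columns, pd.MultiIndex):
        # Validate every level, because a MultiIndex as a whole never
        # reports "string" as its inferred type.
        if not all(level.inferred_type in ALLOWED_TYPES for level in columns.levels):
            # Message text is an assumption for this sketch
            raise ValueError("parquet must have string column names for all levels")
    elif columns.inferred_type not in ALLOWED_TYPES:
        raise ValueError("parquet must have string column names")


# Usage: the DataFrame from the original report passes the per-level check
df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    columns=pd.MultiIndex.from_product([["1"], ["a", "b", "c"]]),
)
check_parquet_column_names(df)
```

Using `isinstance(columns, pd.MultiIndex)` rather than `nlevels > 1` is a design choice in this sketch; either condition separates the two cases for the example in this thread.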

TomAugspurger added the IO Parquet and good first issue labels and removed the Needs Triage label Sep 4, 2020
TomAugspurger added this to the Contributions Welcome milestone Sep 4, 2020
hweecat (Contributor) commented Sep 5, 2020

Hello! I could make an attempt at updating the check to handle MultiIndexes for the parquet format, if that's okay. :)

TomAugspurger (Contributor) commented

That'd be great, thanks.

hweecat added a commit to hweecat/pandas that referenced this issue Sep 13, 2020
1. Update check to handle MultiIndex columns for parquet format
2. Edit whatsnew entry.
3. Add test for writing MultiIndex columns with string column names
hweecat added a commit to hweecat/pandas that referenced this issue Sep 13, 2020
1. Include issue number as a comment on added test
hweecat added a commit to hweecat/pandas that referenced this issue Sep 14, 2020
1. Add tests for writing Indexes and MultiIndexes for columns
2. Edit message for check to handle MultiIndex columns for parquet
3. Edit whatsnew entry to move entry to other enhancements
hweecat added a commit to hweecat/pandas that referenced this issue Sep 14, 2020
1. Fix PEP8 issue for error message in check for MultiIndex columns
jreback modified the milestones: Contributions Welcome, 1.2 Oct 10, 2020
hweecat added a commit to hweecat/pandas that referenced this issue Oct 11, 2020
add whatsnew entry: enhancements in 1.2