DataFrame(recarray, columns=MultiIndex) disregards input data, gives empty DataFrame #13415

Closed
jzwinck opened this issue Jun 9, 2016 · 2 comments

jzwinck commented Jun 9, 2016

I previously posted this as a question (not knowing it was a bug) here: http://stackoverflow.com/questions/37732403/pandas-dataframe-from-multiindex-and-numpy-structured-array-recarray

First I create a two-level MultiIndex:

import numpy as np
import pandas as pd

ind = pd.MultiIndex.from_product([('X','Y'), ('a','b')])

I can use it like this:

pd.DataFrame(np.zeros((3,4)), columns=ind)

Which gives:

     X         Y     
     a    b    a    b
0  0.0  0.0  0.0  0.0
1  0.0  0.0  0.0  0.0
2  0.0  0.0  0.0  0.0

But now I'm trying to do this:

dtype = [('Xa','f8'), ('Xb','i4'), ('Ya','f8'), ('Yb','i4')]
pd.DataFrame(np.zeros(3, dtype), columns=ind)

But that gives me an empty DataFrame!

Empty DataFrame
Columns: [(X, a), (X, b), (Y, a), (Y, b)]
Index: []

I expected it to do the same thing as this:

df = pd.DataFrame(np.zeros(3, dtype))
df.columns = ind
df

Which is:

     X       Y   
     a  b    a  b
0  0.0  0  0.0  0
1  0.0  0  0.0  0
2  0.0  0  0.0  0
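As a minimal sketch of that workaround, the two steps can be wrapped in a small helper. The name `frame_from_records` is hypothetical (it is not a pandas function), and the length check is an added assumption about what a caller would want:

```python
import numpy as np
import pandas as pd


def frame_from_records(records, columns):
    """Hypothetical helper: build a DataFrame from a structured array,
    then relabel its columns positionally (no reindexing involved)."""
    df = pd.DataFrame(records)
    if len(columns) != df.shape[1]:
        raise ValueError("number of labels must match number of fields")
    # Assigning to .columns overwrites the labels and keeps the data.
    df.columns = columns
    return df


ind = pd.MultiIndex.from_product([('X', 'Y'), ('a', 'b')])
dtype = [('Xa', 'f8'), ('Xb', 'i4'), ('Ya', 'f8'), ('Yb', 'i4')]

df = frame_from_records(np.zeros(3, dtype), ind)
```

Each field keeps its own dtype this way, so the float and integer columns stay distinct.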

INSTALLED VERSIONS

commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-86-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.0
pip: 8.1.1
setuptools: 20.7.0
numpy: 1.10.0
scipy: 0.16.0
statsmodels: 0.6.1
IPython: 3.2.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
tables: 3.2.2
numexpr: 2.5.2
matplotlib: 1.4.3

jorisvandenbossche commented Jun 10, 2016

This is a common pitfall: currently, passing columns in DataFrame() does a reindex and does not overwrite the columns.

If your data already has column name information, pd.DataFrame(np.zeros(3, dtype), columns=ind) behaves more like:

df = pd.DataFrame(np.zeros(3, dtype))
df = df.reindex(columns=ind)

rather than:

df = pd.DataFrame(np.zeros(3, dtype))
df.columns = ind

as you expected.

So, knowing this, the output you see is correct: the reindex finds no matching column names and returns an empty DataFrame.
There are some related issues about this, and some discussion of changing the behavior (though the question is also whether it is worth the breaking change).
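To make the reindex-versus-overwrite distinction concrete, here is a small sketch. It uses flat, made-up labels (`new_labels` is hypothetical) instead of the MultiIndex to keep the example simple, and it calls `reindex` explicitly rather than going through the constructor:

```python
import numpy as np
import pandas as pd

dtype = [('Xa', 'f8'), ('Xb', 'i4'), ('Ya', 'f8'), ('Yb', 'i4')]
data = np.zeros(3, dtype)
data['Xa'] = [1.0, 2.0, 3.0]  # give one field recognizable values

df = pd.DataFrame(data)  # columns come from the field names
new_labels = ['p', 'q', 'r', 's']  # hypothetical labels, none match

# Reindex: looks the new labels up among the existing ones.
# Nothing matches, so every column is filled with NaN.
reindexed = df.reindex(columns=new_labels)

# Overwrite: relabels the existing columns by position, data is kept.
renamed = df.copy()
renamed.columns = new_labels
```

Here `reindexed` holds only NaN while `renamed` still carries the original values under the new names.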

@jorisvandenbossche

xref discussion in #9237
