from_dict(..., orient='index') row order preservation inconsistent #24859

Closed
tdamsma opened this issue Jan 21, 2019 · 5 comments
Labels
Bug DataFrame DataFrame data structure

Comments

@tdamsma
Contributor

tdamsma commented Jan 21, 2019

Code Sample

import pandas as pd

data = {'B': [1], 'A': [2], 'C': [3]}
print(pd.DataFrame.from_dict(data, orient='index'))

#    0
# B  1
# A  2
# C  3

data = {'B': dict(col1=1), 'A': dict(col1=2), 'C': dict(col1=3)}
print(pd.DataFrame.from_dict(data, orient='index'))

#    col1
# A     2
# B     1
# C     3

Problem description

When the values of the dict passed to pd.DataFrame.from_dict(data, orient='index') are themselves dicts, the index of the resulting DataFrame is sorted (not expected). When the values are lists, the index keeps the insertion order of the dict (expected).

Expected Output

data = {'B': dict(col1=1), 'A': dict(col1=2), 'C': dict(col1=3)}
print(pd.DataFrame.from_dict(data, orient='index'))

#    col1
# B     1
# A     2
# C     3

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 4.0.1
pip: 18.1
setuptools: 40.6.2
Cython: 0.29.1
numpy: 1.15.4
scipy: None
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: 1.8.2
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: 1.1.2
lxml: None
bs4: 4.6.3
html5lib: None
sqlalchemy: 1.2.14
pymysql: None
psycopg2: 2.7.6.1 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@tdamsma
Contributor Author

tdamsma commented Jan 21, 2019

The difference is caused by the code branching to _from_nested_dict when the dict values are dicts. As far as I can tell, this bug can be resolved by removing the branch condition.
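For readers unfamiliar with what a nested-dict branch has to do, here is a minimal pure-Python sketch of turning a row-keyed nested dict into a column-keyed one. The helper name `invert_nested_dict` is hypothetical and this is not pandas' actual `_from_nested_dict`; it only illustrates, assuming dicts that preserve insertion order, that the inversion step on its own does not need to reorder the row keys.

```python
def invert_nested_dict(data):
    """Turn {row: {col: value}} into {col: {row: value}}.

    Hypothetical helper for illustration only, not pandas code.
    Row keys are stored in the order they are encountered.
    """
    inverted = {}
    for row_key, row in data.items():
        for col_key, value in row.items():
            # Create the per-column dict on first use, then record the row.
            inverted.setdefault(col_key, {})[row_key] = value
    return inverted


data = {"B": {"col1": 1}, "A": {"col1": 2}, "C": {"col1": 3}}
inverted = invert_nested_dict(data)
print(inverted)
# {'col1': {'B': 1, 'A': 2, 'C': 3}}
```

In this sketch the row labels come out as B, A, C, the same order in which they were inserted.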

@mroeschke
Member

I imagine that completely removing that branch would break existing tests, but investigation and potentially patching _from_nested_dict itself is welcome!

@mroeschke mroeschke added Bug DataFrame DataFrame data structure labels Jan 22, 2019
@tdamsma
Contributor Author

tdamsma commented Jan 22, 2019

Indeed, the change breaks one existing test, pandas.tests.frame.test_constructors.TestDataFrameConstructors::test_constructor_list_of_series
(see https://travis-ci.org/tdamsma/pandas/jobs/482430391).

Compare the following:

# In the input dict, neither the columns nor the index are in sorted order
example = {'B': {'Y': 1, 'X': 2}, 'A': {'Y': 3, 'X': 4}}

# With from_dict: the index is sorted, the columns are not
print(pd.DataFrame.from_dict(example, orient='index'))

#    Y  X
# A  3  4
# B  1  2

# What would happen if we don't use _from_nested_dict:
# the columns are sorted, the index is not
data, index = list(example.values()), list(example.keys())
print(pd.DataFrame(data, index=index))

#    X  Y
# B  2  1
# A  4  3

@WillAyd
Member

WillAyd commented May 3, 2019

I think we've had quite a few similar conversations about this recently, but we generally come back to the point that it's arguably impossible to guarantee dict insertion order when dealing with more than one dimension. It's also not something guaranteed by the specifications of formats that allow structures like this, such as JSON.

Closing this out as such, though feel free to ping if you strongly disagree.
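One possible reading of the ambiguity, as a pure-Python sketch (the helper `first_seen_columns` is hypothetical and not part of pandas): when the inner dicts list their keys in different orders, no single column order matches every row's insertion order, so a rule such as "first seen wins" is a convention rather than a guarantee.

```python
def first_seen_columns(data):
    """Collect column keys in the order they are first encountered, row by row.

    Hypothetical helper for illustration only.
    """
    columns = []
    for row in data.values():
        for col in row:
            if col not in columns:
                columns.append(col)
    return columns


# The two rows disagree about the order of their keys.
rows = {"B": {"Y": 1, "X": 2}, "A": {"X": 4, "Y": 3}}
print(first_seen_columns(rows))
# ['Y', 'X']  (matches row "B", but row "A" lists X before Y)
```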

@Alex-ley

Alex-ley commented Nov 23, 2021

This also threw me off for a while, especially because insertion order is maintained in lots of other cases (e.g. across columns).

A workaround seems to be the following (assuming your dict preserves insertion order, e.g. Python >= 3.6):

data = {"B": dict(col1=1), "A": dict(col1=2), "C": dict(col1=3)}
# 1st Option (fails - keys are sorted):
# df = pd.DataFrame.from_dict(data, orient="index")
# 2nd Option (works but seems overly verbose to type this out each time):
# df = pd.DataFrame.from_records(list(data.values()), index=list(data.keys()))
# 3rd Option (nice and concise): insertion order is maintained across
# columns, so build the frame column-wise and then transpose it
df = pd.DataFrame(data).T
print(df)
#    col1
# B     1
# A     2
# C     3
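If this comes up often, the same idea can be wrapped in a small helper. This is only a sketch: the name `frame_from_row_dict` is made up for illustration, and it relies on passing the row labels explicitly to the `pd.DataFrame` constructor, as in the comparison earlier in this thread, rather than on any ordering guarantee of `from_dict`. It assumes every value in `data` is a dict.

```python
import pandas as pd


def frame_from_row_dict(data):
    """Build a DataFrame whose index follows the insertion order of `data`.

    Hypothetical helper; assumes every value in `data` is a dict of
    column -> value.
    """
    # Pass the row labels explicitly so the index is not inferred.
    return pd.DataFrame(list(data.values()), index=list(data.keys()))


data = {"B": {"col1": 1}, "A": {"col1": 2}, "C": {"col1": 3}}
df = frame_from_row_dict(data)
print(list(df.index))
# ['B', 'A', 'C']
```

Note that this fixes only the row order; the column order is still whatever the constructor produces for a list of dicts in your pandas version.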

4 participants