BUG: Preserve index order when constructing DataFrame from dict #26113

Closed
wants to merge 12 commits into from

Conversation

@bchu bchu commented Apr 16, 2019

This resolves the issue of pd.DataFrame(dict) or pd.DataFrame.from_dict(dict) not preserving the order of the index specified in dict.

This required fixing up tests that relied on alphabetical ordering. The changes to these tests also test for this new preserved ordering; let me know if I should add an additional explicit test.

codecov bot commented Apr 17, 2019

Codecov Report

Merging #26113 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #26113      +/-   ##
==========================================
- Coverage   91.99%   91.98%   -0.01%     
==========================================
  Files         175      175              
  Lines       52384    52385       +1     
==========================================
- Hits        48189    48186       -3     
- Misses       4195     4199       +4
| Flag | Coverage Δ |
|---|---|
| #multiple | 90.53% <100%> (ø) ⬆️ |
| #single | 40.73% <100%> (-0.14%) ⬇️ |

| Impacted Files | Coverage Δ |
|---|---|
| pandas/core/indexes/api.py | 99.01% <100%> (ø) ⬆️ |
| pandas/core/internals/construction.py | 95.88% <100%> (ø) ⬆️ |
| pandas/io/gbq.py | 75% <0%> (-12.5%) ⬇️ |
| pandas/core/frame.py | 96.9% <0%> (-0.12%) ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 68b1da7...f9d82eb.

codecov bot commented Apr 17, 2019

Codecov Report

Merging #26113 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #26113      +/-   ##
==========================================
- Coverage   91.99%   91.98%   -0.01%     
==========================================
  Files         175      175              
  Lines       52384    52385       +1     
==========================================
- Hits        48189    48186       -3     
- Misses       4195     4199       +4
| Flag | Coverage Δ |
|---|---|
| #multiple | 90.53% <100%> (ø) ⬆️ |
| #single | 40.73% <100%> (-0.14%) ⬇️ |

| Impacted Files | Coverage Δ |
|---|---|
| pandas/core/frame.py | 96.9% <ø> (-0.12%) ⬇️ |
| pandas/core/indexes/api.py | 99.01% <100%> (ø) ⬆️ |
| pandas/core/internals/construction.py | 95.88% <100%> (ø) ⬆️ |
| pandas/io/gbq.py | 75% <0%> (-12.5%) ⬇️ |
| pandas/core/tools/datetimes.py | 84.59% <0%> (ø) ⬆️ |
| pandas/core/window.py | 96.39% <0%> (ø) ⬆️ |
| pandas/core/groupby/generic.py | 89.02% <0%> (ø) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 68b1da7...adff27b.

@@ -146,7 +146,8 @@ def _union_indexes(indexes, sort=True):
     if len(indexes) == 1:
         result = indexes[0]
         if isinstance(result, list):
-            result = Index(sorted(result))
+            result = sorted(result) if sort else result
Member

Can this branch be removed altogether, or does that break Python 3.5 compatibility?

Author

Yeah, it breaks Python 3.5 compatibility.

Member

Why not just move the compat check from the construction module to here, then? That would be simpler.

Author

It would be a little more complicated here, since we would have to do another compat check in the len(indexes) > 1 case.
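
To make the two cases in this thread concrete, here is a minimal pure-Python sketch of a union helper with a `sort` flag. The function `union_keys` is hypothetical and works on plain lists; it is not the pandas `_union_indexes` implementation, only an illustration of why the multi-input case needs its own ordering decision. It assumes Python 3.7+, where dicts keep insertion order.

```python
def union_keys(key_lists, sort=True):
    """Combine several lists of labels into one list without duplicates.

    Hypothetical stand-in for illustration only; not pandas code.
    """
    if len(key_lists) == 1:
        # Single input: either sort it or keep the caller's order.
        keys = list(key_lists[0])
        return sorted(keys) if sort else keys

    # Several inputs: keep the first occurrence of each label.
    # Relies on dicts preserving insertion order (Python 3.7+).
    seen = {}
    for keys in key_lists:
        for key in keys:
            seen.setdefault(key, None)
    merged = list(seen)
    return sorted(merged) if sort else merged


print(union_keys([["b", "a"]], sort=True))   # ['a', 'b']
print(union_keys([["b", "a"]], sort=False))  # ['b', 'a']
```

Both branches have to honor the flag, which is the extra check the author refers to for the multi-input case.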

@WillAyd WillAyd added the DataFrame DataFrame data structure label Apr 17, 2019
@@ -305,7 +305,7 @@ Conversion
 ^^^^^^^^^^
 
 - Bug in :func:`DataFrame.astype()` when passing a dict of columns and types the `errors` parameter was ignored. (:issue:`25905`)
--
+- Bug in constructing :class:`DataFrame` from dict, where the index would be sorted instead of using the dict insertion order. (:issue:`24859`)
Member

Would be preferable if you could reference the method here instead of just the class

Author

Would I say DataFrame.__init__(), in that case?

Member

:meth:`DataFrame.from_dict`

Author

The "root" cause of the bug is in DataFrame.__init__({ some dictionary }), though.

Member

You could also say "DataFrame construction", then.

Author

bchu commented Apr 18, 2019

PTAL

@@ -301,7 +301,7 @@ def extract_index(data):
                              ' an index')
 
     if have_series or have_dicts:
-        index = _union_indexes(indexes)
+        index = _union_indexes(indexes, sort=not PY36)
Contributor

This needs a comment here explaining why.
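
One way to read the `sort=not PY36` argument is "sort only on interpreters where dict order cannot be relied on". The sketch below assumes `PY36` is a boolean equivalent to `sys.version_info >= (3, 6)`; the helper name `should_sort_dict_keys` is made up for illustration and is not part of pandas.

```python
import sys

# Assumption: PY36 mirrors a compat flag meaning "running on Python 3.6 or newer".
PY36 = sys.version_info >= (3, 6)


def should_sort_dict_keys(py36=PY36):
    """Return True when dict keys should be sorted to get a stable order.

    Hypothetical helper: before 3.6 dicts had no dependable order, so
    sorting was the only way to get a deterministic result.
    """
    return not py36


print(should_sort_dict_keys())
```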

@@ -305,7 +305,7 @@ Conversion
 ^^^^^^^^^^
 
 - Bug in :func:`DataFrame.astype()` when passing a dict of columns and types the `errors` parameter was ignored. (:issue:`25905`)
--
+- Bug in :class:`DataFrame` construction from dict, where the index would be sorted instead of using dict insertion order. (:issue:`24859`)
Contributor

Please make clear how this affects Python 3.5 and 3.6+.

Let's move this to the "other API changes" section, as it's not a bug but rather an API change.

Member

WillAyd commented Apr 18, 2019

The more I am looking at this the less I am thinking we can actually support this. We had a similar discussion here:

#25915 (comment)

The original issue being linked is pretty explicit about orient='index' which I think could work in and of itself since index items would always be the keys in the top level dict, but this PR is starting to stray further down into nested dicts which I don't think is generalizable.

For instance, what is the expectation if this occurs:

data = {
    'A': {'bar': 0, 'foo': 1},
    'B': {'foo': 0, 'bar': 1}
}
pd.DataFrame(data)

Even this one modified in the PR gets a little suspect:

df.agg({'A' : ['max', 'min'], 'B' : ['min', 'sum']})

Author

bchu commented Apr 18, 2019

I thought this was a bug because the API was changed to follow insertion order in #25915 (https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.23.0.html#instantiation-from-dicts-preserves-dict-insertion-order-for-python-3-6).

I agree the example of mismatched dictionary indexes is not clear and the choice will either be arbitrary (order of first dict, which is what the behavior would be under this PR), or will be a special case (fall back to sorting?)
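
The two options mentioned here can be compared on the mismatched example from the previous comment. This is a pure-Python sketch of the policies only; `first_seen_order` and `sorted_order` are hypothetical names, and neither is claimed to match what pandas does internally. It assumes Python 3.7+ dict ordering.

```python
data = {
    "A": {"bar": 0, "foo": 1},
    "B": {"foo": 0, "bar": 1},
}


def first_seen_order(nested):
    """Row labels in the order they are first encountered."""
    order = {}
    for inner in nested.values():
        for label in inner:
            order.setdefault(label, None)
    return list(order)


def sorted_order(nested):
    """Row labels sorted, independent of insertion order."""
    return sorted({label for inner in nested.values() for label in inner})


# Same content, but with the top-level keys inserted in the other order.
swapped = {"B": data["B"], "A": data["A"]}

print(first_seen_order(data))     # ['bar', 'foo']
print(first_seen_order(swapped))  # ['foo', 'bar']
print(sorted_order(data), sorted_order(swapped))  # ['bar', 'foo'] ['bar', 'foo']
```

Under the first-seen policy the row order depends on which nested dict happens to come first; sorting gives the same answer either way but discards insertion order.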

But one case where following insertion order is useful is for round-tripping the conversion of a DataFrame to/from a dict. E.g. df.equals(pd.DataFrame(df.to_dict())), and the same for round-tripping to/from json (to_json and read_json).

Member

WillAyd commented Apr 18, 2019

Yea so the distinction I was trying to make before was that we can make guarantees about ordering from the top most level of the dict, but making guarantees about nested dicts probably isn't feasible.

> But one case where following insertion order is useful is for round-tripping the conversion of a DataFrame to/from a dict. E.g. df.equals(pd.DataFrame(df.to_dict())), and the same for round-tripping to/from json (to_json and read_json).

I think the former here holds true so long as you are only dealing with one top-level dictionary. Standard JSON objects by definition have no order, so there's no way to explicitly support the latter (though it may work implicitly, or through alternate specifications like the table schema).
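
The JSON point can be seen with the standard library alone. The sketch below uses only Python's `json` module (no pandas) and assumes Python 3.7+; `sort_keys=True` stands in for any other producer that is free to emit the same object with its keys in a different order.

```python
import json

record = {"b": 1, "a": 2}

# Python's json module writes keys in dict order and reads them back
# in document order, so this particular round trip keeps the order.
round_tripped = json.loads(json.dumps(record))
print(list(round_tripped))  # ['b', 'a']

# Another producer may emit the same object with reordered keys;
# sort_keys=True stands in for such a producer here.
reordered = json.loads(json.dumps(record, sort_keys=True))
print(list(reordered))      # ['a', 'b']

# The two are equal as dicts: key order carries no meaning for equality.
print(round_tripped == reordered)  # True
```

So order survives only as long as every tool in the chain happens to preserve it, which is the "implicit" support described above.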

Author

bchu commented Apr 19, 2019

Making guarantees about nested dicts would still be needed for round-tripping. If the top-level of df.to_dict() were the DataFrame's index, then the nested level would be the columns, and so the columns of pd.DataFrame(df.to_dict()) will still end up in a different order than the original dataframe df.

Contributor

jreback commented Apr 20, 2019

> Making guarantees about nested dicts would still be needed for round-tripping. If the top-level of df.to_dict() were the DataFrame's index, then the nested level would be the columns, and so the columns of pd.DataFrame(df.to_dict()) will still end up in a different order than the original dataframe df.

Round-tripping with unordered things is not guaranteed at all, nor are we trying to 'fix' it. If you need a guarantee, then you need to use lists, which guarantee ordering. It is an implementation detail that Python happens to guarantee insertion order starting in 3.6; this is certainly not true for anything JSON.

Author

bchu commented Apr 20, 2019

Insertion order is publicly guaranteed for dicts starting with Python 3.7

Member

WillAyd commented May 3, 2019

Thanks for the PR @bchu! Based on the discussion above, though, I think there isn't a huge appetite for this one at the moment, so I'm going to close it as is.

Certainly welcome contributions on other outstanding issues if you have any you'd like to tackle

Labels
DataFrame DataFrame data structure
Projects
None yet
Development

Successfully merging this pull request may close these issues.

from_dict(..., orient='index') row order preservation inconsistent
3 participants