BUG: Preserve index order when constructing DataFrame from dict #26113

bchu · 2019-04-16T23:05:15Z

closes from_dict(..., orient='index') row order preservation inconsistent #24859
tests added / passed
(no new tests added)
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

This resolves the issue of pd.DataFrame(dict) or pd.DataFrame.from_dict(dict) not preserving the order of the index specified in dict.

This required fixing up tests that relied on alphabetical ordering. The changes to these tests also test for this new preserved ordering; let me know if I should add an additional explicit test.

codecov · 2019-04-17T02:52:43Z

Codecov Report

Merging #26113 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #26113      +/-   ##
==========================================
- Coverage   91.99%   91.98%   -0.01%     
==========================================
  Files         175      175              
  Lines       52384    52385       +1     
==========================================
- Hits        48189    48186       -3     
- Misses       4195     4199       +4

Flag	Coverage Δ
#multiple	`90.53% <100%> (ø)`	⬆️
#single	`40.73% <100%> (-0.14%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/indexes/api.py	`99.01% <100%> (ø)`	⬆️
pandas/core/internals/construction.py	`95.88% <100%> (ø)`	⬆️
pandas/io/gbq.py	`75% <0%> (-12.5%)`	⬇️
pandas/core/frame.py	`96.9% <0%> (-0.12%)`	⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 68b1da7...f9d82eb.

codecov · 2019-04-17T02:52:45Z

Codecov Report

Merging #26113 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #26113      +/-   ##
==========================================
- Coverage   91.99%   91.98%   -0.01%     
==========================================
  Files         175      175              
  Lines       52384    52385       +1     
==========================================
- Hits        48189    48186       -3     
- Misses       4195     4199       +4

Flag	Coverage Δ
#multiple	`90.53% <100%> (ø)`	⬆️
#single	`40.73% <100%> (-0.14%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/frame.py	`96.9% <ø> (-0.12%)`	⬇️
pandas/core/indexes/api.py	`99.01% <100%> (ø)`	⬆️
pandas/core/internals/construction.py	`95.88% <100%> (ø)`	⬆️
pandas/io/gbq.py	`75% <0%> (-12.5%)`	⬇️
pandas/core/tools/datetimes.py	`84.59% <0%> (ø)`	⬆️
pandas/core/window.py	`96.39% <0%> (ø)`	⬆️
pandas/core/groupby/generic.py	`89.02% <0%> (ø)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 68b1da7...adff27b.

WillAyd · 2019-04-17T15:53:43Z

pandas/core/indexes/api.py

@@ -146,7 +146,8 @@ def _union_indexes(indexes, sort=True):
    if len(indexes) == 1:
        result = indexes[0]
        if isinstance(result, list):
-            result = Index(sorted(result))
+            result = sorted(result) if sort else result


Can this branch be removed altogether, or does that break Python 3.5 compatibility?

Yeah, it breaks python 3.5 compat.

Why not just move the compat check from the construction module to here then? Would be simpler

It would be a little more complicated here since we would have to do another compat check in the len(indexes) > 1 case
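
To make the trade-off in this thread concrete, here is a minimal pure-Python sketch of a label union with a `sort` switch. `union_labels` is a hypothetical stand-in written for illustration, not pandas's `_union_indexes`; it only shows that the single-list branch and the multi-list branch would each need to respect the flag.

```python
def union_labels(label_lists, sort=True):
    """Union several lists of hashable labels.

    Hypothetical helper for illustration only; this is not the pandas
    implementation of _union_indexes.
    """
    if len(label_lists) == 1:
        # Single input: either sort it or keep the caller's order.
        labels = list(label_lists[0])
        return sorted(labels) if sort else labels

    # Several inputs: keep the first occurrence of each label.
    seen = set()
    merged = []
    for labels in label_lists:
        for label in labels:
            if label not in seen:
                seen.add(label)
                merged.append(label)
    return sorted(merged) if sort else merged


print(union_labels([["b", "a"]], sort=False))              # ['b', 'a']
print(union_labels([["b", "a"]], sort=True))               # ['a', 'b']
print(union_labels([["b", "a"], ["c", "a"]], sort=False))  # ['b', 'a', 'c']
```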

WillAyd · 2019-04-17T15:55:01Z

doc/source/whatsnew/v0.25.0.rst

@@ -305,7 +305,7 @@ Conversion
 ^^^^^^^^^^

 - Bug in :func:`DataFrame.astype()` when passing a dict of columns and types the `errors` parameter was ignored. (:issue:`25905`)
-
+- Bug in constructing :class:`DataFrame` from dict, where the index would be sorted instead of using the dict insertion order. (:issue:`24859`)


Would be preferable if you could reference the method here instead of just the class

Would I say DataFrame.__init__(), in that case?

:meth:`DataFrame.from_dict`

The "root" cause of the bug is in DataFrame.__init__({ some dictionary }), though.

Can also say DataFrame construction then

bchu · 2019-04-18T20:26:04Z

PTAL

jreback · 2019-04-18T22:08:19Z

pandas/core/internals/construction.py

@@ -301,7 +301,7 @@ def extract_index(data):
                             ' an index')

        if have_series or have_dicts:
-            index = _union_indexes(indexes)
+            index = _union_indexes(indexes, sort=not PY36)


this needs a comment on why here
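
One possible reading of the `sort=not PY36` argument is: sort only on interpreters where dict order cannot be relied on. The sketch below defines its own `PY36` flag from `sys.version_info` (an assumption, so that the snippet is self-contained) and a hypothetical `should_sort` helper that spells out the reasoning a code comment might carry.

```python
import sys

# Local stand-in for a compat flag; defined here so the snippet runs alone.
PY36 = sys.version_info >= (3, 6)


def should_sort(py36=PY36):
    """Return True when labels should be sorted for a deterministic result.

    Hypothetical helper: before Python 3.6 a dict's iteration order is
    arbitrary, so sorting is the only way to get a stable index. From
    CPython 3.6 on, dicts iterate in insertion order, so sorting can be
    skipped and the caller's order kept.
    """
    return not py36


print(should_sort(py36=False))  # True
print(should_sort(py36=True))   # False
```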

jreback · 2019-04-18T22:08:58Z

doc/source/whatsnew/v0.25.0.rst

@@ -305,7 +305,7 @@ Conversion
 ^^^^^^^^^^

 - Bug in :func:`DataFrame.astype()` when passing a dict of columns and types the `errors` parameter was ignored. (:issue:`25905`)
-
+- Bug in :class:`DataFrame` construction from dict, where the index would be sorted instead of using dict insertion order. (:issue:`24859`)


Please make clear how this affects Python 3.5 and 3.6+.

Let's move this to the other API changes section, as it's not a bug but rather an API change.

WillAyd · 2019-04-18T22:10:10Z

The more I am looking at this the less I am thinking we can actually support this. We had a similar discussion here:

#25915 (comment)

The original issue being linked is pretty explicit about orient='index' which I think could work in and of itself since index items would always be the keys in the top level dict, but this PR is starting to stray further down into nested dicts which I don't think is generalizable.

For instance, what is the expectation if this occurs:

import pandas as pd

data = {
    'A': {'bar': 0, 'foo': 1},
    'B': {'foo': 0, 'bar': 1}
}
pd.DataFrame(data)

Even this one modified in the PR gets a little suspect:

df.agg({'A' : ['max', 'min'], 'B' : ['min', 'sum']})
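
The ambiguity in the nested-dict example can be shown without pandas. The sketch below collects inner keys in first-seen order; `first_seen_labels` is a hypothetical helper, and it assumes Python 3.7+ dict ordering. The point is only that the result changes when the top-level keys are reordered, while sorting gives one answer either way.

```python
def first_seen_labels(nested):
    """Collect inner keys in the order they are first encountered."""
    labels = []
    for inner in nested.values():
        for key in inner:
            if key not in labels:
                labels.append(key)
    return labels


data = {
    "A": {"bar": 0, "foo": 1},
    "B": {"foo": 0, "bar": 1},
}
# Same columns, but B inserted before A.
swapped = {"B": data["B"], "A": data["A"]}

print(first_seen_labels(data))             # ['bar', 'foo']
print(first_seen_labels(swapped))          # ['foo', 'bar']
print(sorted(first_seen_labels(swapped)))  # ['bar', 'foo']
```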

bchu · 2019-04-18T22:50:52Z

I thought this was a bug because the API was changed to follow insertion order in #25915 (https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.23.0.html#instantiation-from-dicts-preserves-dict-insertion-order-for-python-3-6).

I agree the example of mismatched dictionary indexes is not clear and the choice will either be arbitrary (order of first dict, which is what the behavior would be under this PR), or will be a special case (fall back to sorting?)

But one case where following insertion order is useful is for round-tripping the conversion of a DataFrame to/from a dict. E.g. df.equals(pd.DataFrame(df.to_dict())), and the same for round-tripping to/from json (to_json and read_json).

WillAyd · 2019-04-18T22:58:22Z

Yea so the distinction I was trying to make before was that we can make guarantees about ordering from the top most level of the dict, but making guarantees about nested dicts probably isn't feasible.

> But one case where following insertion order is useful is for round-tripping the conversion of a DataFrame to/from a dict. E.g. df.equals(pd.DataFrame(df.to_dict())), and the same for round-tripping to/from json (to_json and read_json).

I think the former here holds true so long as you are only dealing with one top-level dictionary. Standard JSON objects by definition have no order, so there's no way to explicitly support the latter (though it may be supported implicitly or through alternate specifications like the table schema).
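
As a small illustration of the gap between specification and implementation: Python's standard `json` module happens to keep key order on a round trip, but that is a property of this one implementation. A producer that uses `sort_keys=True`, or a different JSON library, may reorder the keys. The sketch assumes Python 3.7+.

```python
import json

record = {"b": 1, "a": 2}

# Python's json module writes keys in dict order and reads them back
# in document order, so this round trip keeps the order.
same_order = json.loads(json.dumps(record))
print(list(same_order))  # ['b', 'a']

# A producer is free to reorder keys; sort_keys=True is one example.
reordered = json.loads(json.dumps(record, sort_keys=True))
print(list(reordered))   # ['a', 'b']

# The two results compare equal as mappings despite the different order.
print(same_order == reordered)  # True
```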

bchu · 2019-04-19T23:32:09Z

Making guarantees about nested dicts would still be needed for round-tripping. If the top-level of df.to_dict() were the DataFrame's index, then the nested level would be the columns, and so the columns of pd.DataFrame(df.to_dict()) will still end up in a different order than the original dataframe df.
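
A pandas-free sketch of the point about nested levels: if a table is stored as `{row: {column: value}}`, the column order lives only in the inner dicts, so rebuilding it by sorting the inner keys loses the original order. `columns_from_rows` is a hypothetical helper and the data is made up for illustration; Python 3.7+ dict ordering is assumed.

```python
def columns_from_rows(rows, sort):
    """Recover column labels from a {row: {column: value}} mapping."""
    columns = []
    for inner in rows.values():
        for column in inner:
            if column not in columns:
                columns.append(column)
    return sorted(columns) if sort else columns


original_columns = ["y", "x"]
rows = {
    "r1": {"y": 1, "x": 2},
    "r2": {"y": 3, "x": 4},
}

print(columns_from_rows(rows, sort=False) == original_columns)  # True
print(columns_from_rows(rows, sort=True) == original_columns)   # False
```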

jreback · 2019-04-20T15:58:22Z

> Making guarantees about nested dicts would still be needed for round-tripping. If the top-level of df.to_dict() were the DataFrame's index, then the nested level would be the columns, and so the columns of pd.DataFrame(df.to_dict()) will still end up in a different order than the original dataframe df.

Round-tripping with unordered things is not guaranteed at all, nor are we trying to 'fix' it. If you need a guarantee, then you need to use lists, which guarantee ordering. It is an implementation detail that Python happens to guarantee insertion order starting in 3.6; this is certainly not true for anything JSON.
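
A minimal sketch of the "use lists" suggestion, with made-up data: when row labels, column labels, and values are all stored in lists, the order survives a JSON round trip even if the object keys themselves are reordered. The layout is similar in spirit to the 'split' orientation of `DataFrame.to_dict`, but nothing here calls pandas.

```python
import json

# Order is carried by lists, which are ordered in both Python and JSON.
table = {
    "index": ["r2", "r1"],
    "columns": ["y", "x"],
    "data": [[3, 4], [1, 2]],
}

# sort_keys=True reorders the object keys, but not the list contents.
restored = json.loads(json.dumps(table, sort_keys=True))

print(restored["index"])    # ['r2', 'r1']
print(restored["columns"])  # ['y', 'x']
print(restored == table)    # True
```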

bchu · 2019-04-20T23:42:18Z

Insertion order is publicly guaranteed for dicts starting with Python 3.7
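
For reference, a short sketch of what that guarantee covers on Python 3.7+: iteration follows insertion order, updating a value keeps the key's position, and deleting then re-adding a key moves it to the end.

```python
scores = {}
scores["b"] = 1
scores["a"] = 2
order_after_insert = list(scores)
print(order_after_insert)  # ['b', 'a']

# Updating an existing key keeps its position.
scores["b"] = 10
order_after_update = list(scores)
print(order_after_update)  # ['b', 'a']

# Deleting and re-adding a key moves it to the end.
del scores["b"]
scores["b"] = 1
order_after_readd = list(scores)
print(order_after_readd)   # ['a', 'b']
```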

WillAyd · 2019-05-03T05:26:34Z

Thanks for the PR @bchu! Based on the discussion above, though, I think there isn't a huge appetite for this one at the moment, so I'm going to close it as is.

Certainly welcome contributions on other outstanding issues if you have any you'd like to tackle

bchu added 11 commits April 12, 2019 19:24

No sorting

81d9214

Fix single index case

cb57553

Fix tests

b0e46eb

Fix alter axes test

a972fb7

Fix test

91b9a38

Fix test

d419ee4

Lint

b54592f

Whitespace

0aed1ca

Whats new

3201e19

Backwards compat

bbbeec7

Fix doctest

f9d82eb

WillAyd reviewed Apr 17, 2019

WillAyd added the DataFrame label Apr 17, 2019

WillAyd requested changes Apr 17, 2019

Tweak whats new

adff27b

jreback requested changes Apr 18, 2019

WillAyd closed this May 3, 2019

jorisvandenbossche mentioned this pull request Jul 10, 2019

ENH: Preserve key order when passing list of dicts to DataFrame on py 3.6+ #27309

Merged

4 tasks
