-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG: Preserve index order when constructing DataFrame from dict #26113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #26113 +/- ##
==========================================
- Coverage 91.99% 91.98% -0.01%
==========================================
Files 175 175
Lines 52384 52385 +1
==========================================
- Hits 48189 48186 -3
- Misses 4195 4199 +4
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #26113 +/- ##
==========================================
- Coverage 91.99% 91.98% -0.01%
==========================================
Files 175 175
Lines 52384 52385 +1
==========================================
- Hits 48189 48186 -3
- Misses 4195 4199 +4
Continue to review full report at Codecov.
|
@@ -146,7 +146,8 @@ def _union_indexes(indexes, sort=True): | |||
if len(indexes) == 1: | |||
result = indexes[0] | |||
if isinstance(result, list): | |||
result = Index(sorted(result)) | |||
result = sorted(result) if sort else result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this branch be removed altogether or does that break 35 compatibility?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it breaks python 3.5 compat.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just move the compat check from the construction module to here then? Would be simpler
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be a little more complicated here since we would have to do another compat check in the len(indexes) > 1
case
doc/source/whatsnew/v0.25.0.rst
Outdated
@@ -305,7 +305,7 @@ Conversion | |||
^^^^^^^^^^ | |||
|
|||
- Bug in :func:`DataFrame.astype()` when passing a dict of columns and types the `errors` parameter was ignored. (:issue:`25905`) | |||
- | |||
- Bug in constructing :class:`DataFrame` from dict, where the index would be sorted instead of using the dict insertion order. (:issue:`24859`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be preferable if you could reference the method here instead of just the class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would I say DataFrame.__init__()
, in that case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:meth:DataFrame.from_dict
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "root" cause of the bug is in DataFrame.__init__({ some dictionary })
, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can also say DataFrame construction then
PTAL |
@@ -301,7 +301,7 @@ def extract_index(data): | |||
' an index') | |||
|
|||
if have_series or have_dicts: | |||
index = _union_indexes(indexes) | |||
index = _union_indexes(indexes, sort=not PY36) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this needs a comment on why here
@@ -305,7 +305,7 @@ Conversion | |||
^^^^^^^^^^ | |||
|
|||
- Bug in :func:`DataFrame.astype()` when passing a dict of columns and types the `errors` parameter was ignored. (:issue:`25905`) | |||
- | |||
- Bug in :class:`DataFrame` construction from dict, where the index would be sorted instead of using dict insertion order. (:issue:`24859`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls make clear of how this affects 3.5 and 3.6+
let's move this to the other api changes section as its not a bug rather an api change.
The more I am looking at this the less I am thinking we can actually support this. We had a similar discussion here: The original issue being linked is pretty explicit about For instance, what is the expectation if this occurs: data = {
'A': {'bar': 0, 'foo': 1},
'B': {'foo': 0, 'bar': 1}
}
pd.DataFrame(data) Even this one modified in the PR gets a little suspect: df.agg({'A' : ['max', 'min'], 'B' : ['min', 'sum']}) |
I thought this was a bug because the API was changed to follow insertion order in #25915 (https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.23.0.html#instantiation-from-dicts-preserves-dict-insertion-order-for-python-3-6). I agree the example of mismatched dictionary indexes is not clear and the choice will either be arbitrary (order of first dict, which is what the behavior would be under this PR), or will be a special case (fall back to sorting?) But one case where following insertion order is useful is for round-tripping the conversion of a DataFrame to/from a dict. E.g. |
Yea so the distinction I was trying to make before was that we can make guarantees about ordering from the top most level of the dict, but making guarantees about nested dicts probably isn't feasible.
I think the former here holds True so long as you are only dealing with one top level dictionary. Standard JSON objects by definition have no order so there's no way to explicitly support the latter (though it may be implicitly or through alternate specifications like the table schema) |
Making guarantees about nested dicts would still be needed for round-tripping. If the top-level of |
round-tripping with unordered things is not guaranteed at all, nor are why trying to 'fix' it. If you need a guarantee then you need to use lists which guarantee ordering. It is an implementation detail that python happens to guarantee insertion order starting in 3.6, this is certainly not true for anything JSON. |
Insertion order is publicly guaranteed for dicts starting with Python 3.7 |
Thanks for the PR @bchu ! I think based on discussion above though that there isn't a huge appetite for this one at the moment and going to close as is. Certainly welcome contributions on other outstanding issues if you have any you'd like to tackle |
(no new tests added)
git diff upstream/master -u -- "*.py" | flake8 --diff
This resolves the issue of
pd.DataFrame(dict)
orpd.DataFrame.from_dict(dict)
not preserving the order of the index specified indict
.This required fixing up tests that relied on alphabetical ordering. The changes to these tests also test for this new preserved ordering; let me know if I should add an additional explicit test.