BUG: MultiIndex variable length tuples #14823
pandas/indexes/multi.py
Outdated
@@ -5,6 +5,11 @@
from functools import partial
from sys import getsizeof

try:
    from itertools import zip_longest
except ImportError:
this should be in pandas.compat (might be there already actually) so put / import from there
pandas/tests/indexes/test_multi.py
Outdated
@@ -1601,6 +1601,19 @@ def test_from_tuples(self):
        idx = MultiIndex.from_tuples(((1, 2), (3, 4)), names=['a', 'b'])
        self.assertEqual(len(idx), 2)

    def test_from_tuples_variable_length(self):
        # check that len(MultiIndex) == max(len(iterables))
        T = ((1,), (2, 3), (4, 5, 6))
add the issue number as a comment
actually on 2nd thought i think this is an invalid index construction and should raise an error - we require fully balanced tuples
If that's the case then
    min(map(len, tuples)) == max(map(len, tuples))
might be a good check.
Edit: Actually this might be a slightly more efficient check.
    len_0 = len(tuples[0])
    all(len_0 == x for x in map(len, tuples))
I know from_arrays requires balanced inputs (there's a check for it). It seems that from_tuples attempts to balance the input before passing it to from_arrays.
yes not sure how to do this efficiently, this could potentially be hit in a lot of places, so pls check perf.
(for my own reference) a quick timing shows that
    len_0 = len(tuples[0])
    for i in tuples:
        if len_0 != len(i):
            break
is the fastest check so far. This is similar to the check that from_arrays does, but moves half of the len calls outside the loop.
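The early-exit loop above can be wrapped into a small helper. The following is a minimal sketch only: the name `all_same_length` is hypothetical (it is not the PR's `_check_equal_length`), and it assumes the input supports `len()` and indexing.

```python
def all_same_length(tuples):
    """Return True if every item in `tuples` has the same length.

    Hypothetical helper for illustration; an empty input counts as balanced.
    """
    if len(tuples) == 0:
        return True
    len_0 = len(tuples[0])  # computed once, outside the loop
    for item in tuples:
        if len(item) != len_0:
            return False  # stop at the first mismatch
    return True


print(all_same_length([(1, 2), (3, 4)]))           # True
print(all_same_length(((1,), (2, 3), (4, 5, 6))))  # False
```

Returning at the first mismatch means unbalanced input is rejected without scanning the rest of the sequence; balanced input still costs one `len` call per tuple.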
pandas/tests/indexes/test_multi.py
Outdated
        T = ((1,), (2, 3), (4, 5, 6))

        idx = MultiIndex.from_tuples(T)
        self.assertEqual(len(idx), 3)
use an expected index and assert_index_equal
I just followed the pattern from the other test case for MultiIndex.from_tuples. I'll change it though. Do you want me to update the other test case?
yes
doc/source/whatsnew/v0.20.0.txt
Outdated
@@ -108,3 +108,5 @@ Performance Improvements

Bug Fixes
~~~~~~~~~

- ``MultiIndex.from_tuples`` correctly handles sequences of variable length tuples (:issue:`14794`)
this can be in 0.19.2
this also could impact performance
Are unbalanced indexes invalid? When passing in an unbalanced list, the holes get filled with NaN. My goal was to make the behavior consistent between lists and other containers. I'm fine with raising an error if unbalanced input is invalid.
@groutr unbalanced are invalid, it's not normal to do this and causes all kinds of indexing issues, they are barely supported, so a helpful error up front is useful.
can you rebase / update
Yeah, I can rebase and update to throw an error. I went through the test suite as well, and saw that a couple of tests construct these invalid multi-indexes. I will try to update those tests too.
@groutr thanks!
@jreback I was working on updating this PR the other day and before changing too much like I did in #14806, I wanted to get your ideas on a particular issue. The distinction in my mind between The Of course, we don't want to change an API willy-nilly. My thought is to have a new "private" |
@groutr actually
the check in |
Codecov Report
@@ Coverage Diff @@
## master #14823 +/- ##
==========================================
- Coverage 90.98% 85.28% -5.71%
==========================================
Files 161 144 -17
Lines 49288 50972 +1684
==========================================
- Hits 44846 43469 -1377
- Misses 4442 7503 +3061
Tests coming soon.
pandas/tests/indexes/test_multi.py
Outdated
        # check that len(MultiIndex) == max(len(iterables))
        T = ((1,), (2, 3), (4, 5, 6))

        idx = MultiIndex.from_tuples(T)
shouldn't this raise?
We should be raising on variable length tuples. Have the whatsnew reflect that.
Factor out equal length check into separate method.
No need to import zip_longest anymore
b0060c5 to 927d439
pandas/indexes/multi.py
Outdated
@@ -983,6 +983,7 @@ def from_tuples(cls, tuples, sortorder=None, names=None):
        ----------
        tuples : list / sequence of tuple-likes
            Each tuple is the index of one row/column.
            A ValueError will be raised if all tuples are not the same length.
make this a separate Raises section after Returns (for the raising conditions)
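For reference, a numpydoc-style docstring with the raising condition moved into its own Raises section might look like the sketch below. The function `from_tuples_checked` is a hypothetical stand-in written for illustration; it is not the pandas implementation and returns plain lists rather than a MultiIndex.

```python
def from_tuples_checked(tuples):
    """
    Convert a sequence of equal-length tuples into per-level lists.

    Hypothetical stand-in used only to illustrate the docstring layout.

    Parameters
    ----------
    tuples : list / sequence of tuple-likes
        Each tuple is the index of one row/column.

    Returns
    -------
    levels : list of lists
        One list per position in the tuples.

    Raises
    ------
    ValueError
        If the tuples are not all the same length.
    """
    lengths = {len(t) for t in tuples}
    if len(lengths) > 1:
        raise ValueError("all tuples must be the same length")
    return [list(level) for level in zip(*tuples)]
```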
        expected = MultiIndex.from_tuples([('a', 'b', 'c'), ('d', 'e', np.nan),
                                           ('f', np.nan, np.nan)])
        tm.assert_index_equal(idx.str.split(expand=True), expected)
        # This is invalid behavior
add the issue number here
pandas/tests/indexes/test_multi.py
Outdated
@@ -1640,6 +1640,21 @@ def test_from_tuples(self):
        idx = MultiIndex.from_tuples(((1, 2), (3, 4)), names=['a', 'b'])
        self.assertEqual(len(idx), 2)

    def test_equal_length(self):
        # Test _check_equal_length
issue number here
pandas/tests/test_strings.py
Outdated
                                           'is', 'not')])
        tm.assert_index_equal(result, exp)
        self.assertEqual(result.nlevels, 6)
        with self.assertRaises(ValueError):
why is this changing?
This test now constructs an invalid multi-index. What should I be doing instead?
how is this invalid?
The test constructs a multi-index with unbalanced tuples which now throws a ValueError. The split strings have different lengths and because expand=True, a MultiIndex is constructed; see a few lines below. Therefore, I modified the test to check for a ValueError.
hmm, then this is actually a bug in that method then. So on construction it needs to make sure to send in balanced tuples.
pandas/indexes/multi.py
Outdated
    Return True if all sequences are the same length, otherwise False
    If seq_of_seqs is empty return True as well.
    """
do a
    if not is_list_like(seq_of_seqs):
        return True
at the top, this will prevent things like strings & Timestamps from passing through here (though they shouldn't be passed in the first place, it's possible)
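A sketch of how that guard could sit at the top of the helper. To stay dependency-free, `is_list_like_standin` below is a simplified, hypothetical substitute for pandas' `is_list_like` (iterable, but not a string or bytes); the real function covers more cases. `check_equal_length` is likewise an illustrative name and not the PR's code.

```python
from collections.abc import Iterable


def is_list_like_standin(obj):
    # Simplified stand-in for pandas' is_list_like (assumption for this sketch).
    return isinstance(obj, Iterable) and not isinstance(obj, (str, bytes))


def check_equal_length(seq_of_seqs):
    """Return True if all inner sequences have the same length."""
    if not is_list_like_standin(seq_of_seqs):
        # Scalars such as strings fall through without being inspected.
        return True
    seqs = list(seq_of_seqs)
    if not seqs:
        return True
    len_0 = len(seqs[0])
    return all(len(s) == len_0 for s in seqs)


print(check_equal_length("abc"))            # True (guard short-circuits)
print(check_equal_length([(1,), (2, 3)]))   # False
```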
pandas/indexes/multi.py
Outdated
@@ -1007,6 +1008,9 @@ def from_tuples(cls, tuples, sortorder=None, names=None):
            # I think this is right? Not quite sure...
            raise TypeError('Cannot infer number of levels from empty list')

        if not _check_equal_length(tuples):
I would definitely consider doing this check in Cython at the same time as constructing the arrays in the lib.tuples_to_object_array and lib.to_object_array_tuples functions. Otherwise I would expect a serious impact on performance in the typical case.
@shoyer makes a good point. checking cost is about the same as the actual construction (in this trivial example)
In [1]: tuples = [(1,2)]*1000000
In [2]: len(tuples[0])
Out[2]: 2
In [3]: %timeit any([len(x) != 2 for x in tuples])
10 loops, best of 3: 88.5 ms per loop
In [5]: %timeit pd.MultiIndex.from_tuples(tuples)
10 loops, best of 3: 108 ms per loop
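Outside IPython, the cost of the check alone can be measured with the standard-library `timeit` module. This is a rough sketch: the function names are hypothetical, the input is smaller than in the session above, and the numbers will vary by machine, so no timings are quoted here.

```python
import timeit


def has_mismatch_listcomp(tuples, expected_len):
    # Builds the full list of booleans before `any` inspects it.
    return any([len(x) != expected_len for x in tuples])


def has_mismatch_loop(tuples, expected_len):
    # Plain loop; returns at the first mismatch.
    for x in tuples:
        if len(x) != expected_len:
            return True
    return False


tuples = [(1, 2)] * 100_000
runs = 5
for check in (has_mismatch_listcomp, has_mismatch_loop):
    elapsed = timeit.timeit(lambda: check(tuples, 2), number=runs)
    print(f"{check.__name__}: {elapsed / runs * 1e3:.2f} ms per call")
```

On balanced input both variants still touch every tuple, which is why the check costs about as much as the construction regardless of how the loop is written.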
I'm fine with implementing the method in Cython. Would it make sense to move _check_equal_length to lib.pyx?
EDIT: NM, I think I understand what you mean with checking during construction. Would _check_equal_length be a useful function elsewhere, or is it specific to these situations?
For performance, the check probably needs to be integrated into each of the existing Cython functions separately, avoiding a second loop.
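The idea of folding the check into the construction loop can be illustrated in pure Python. This is only a sketch of the single-pass pattern, with a hypothetical function name; the actual Cython routines in pandas build object arrays and are not reproduced here.

```python
def tuples_to_columns(tuples):
    """Transpose equal-length tuples into per-level lists in one pass.

    Hypothetical pure-Python illustration: the length check happens inside
    the same loop that copies the values, so there is no second scan.
    """
    n = len(tuples)
    if n == 0:
        return []
    k = len(tuples[0])
    columns = [[None] * n for _ in range(k)]
    for i, tup in enumerate(tuples):
        if len(tup) != k:
            raise ValueError(
                "tuple at position %d has length %d, expected %d"
                % (i, len(tup), k)
            )
        for j in range(k):
            columns[j][i] = tup[j]
    return columns


print(tuples_to_columns([(1, "a"), (2, "b")]))  # [[1, 2], ['a', 'b']]
```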
to_object_array_tuples seems to deliberately find the length of the longest inner sequence to pad everything to. It is used indirectly by DataFrame. If to_object_array_tuples were to throw an error on unbalanced input, would that affect DataFrame's behavior of padding things with NaN in some cases?
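The padding behavior described here (shorter rows filled out to the longest one with NaN) can be mimicked with `itertools.zip_longest`. This is a sketch of the effect only and does not reproduce how `to_object_array_tuples` is implemented.

```python
from itertools import zip_longest
import math

rows = [(1,), (2, 3), (4, 5, 6)]

# Transpose while padding short rows with NaN, then transpose back.
columns = list(zip_longest(*rows, fillvalue=float("nan")))
padded = [tuple(row) for row in zip(*columns)]

for row in padded:
    print(row)
```

After padding, every row has the length of the longest input row, which is the behavior a strict equal-length check would replace with an error.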
I would suggest grepping for uses of to_object_array_tuples to understand the impact
can you rebase / update
closing as stale. would like to have this fixed though :>
try something like this
|
# Conflicts: # pandas/core/indexes/multi.py # pandas/tests/test_strings.py
Hello @groutr! Thanks for updating the PR.
|
closing as stale, but would still like the fix. ping if you want to reopen.
git diff upstream/master | flake8 --diff