BUG: MutliIndex variable length tuples #14823

groutr · 2016-12-08T05:15:34Z

closes Incorrect sequence handling with MultiIndex.from_tuples #14794
tests added / passed
passes git diff upstream/master | flake8 --diff
whatsnew entry

jreback · 2016-12-08T11:52:46Z

pandas/indexes/multi.py

@@ -5,6 +5,11 @@
 from functools import partial
 from sys import getsizeof

+try:
+    from itertools import zip_longest
+except ImportError:


this should be in pandas.compat (might be there already actually) so put / import from there

jreback · 2016-12-08T11:53:07Z

pandas/tests/indexes/test_multi.py

@@ -1601,6 +1601,19 @@ def test_from_tuples(self):
        idx = MultiIndex.from_tuples(((1, 2), (3, 4)), names=['a', 'b'])
        self.assertEqual(len(idx), 2)

+    def test_from_tuples_variable_length(self):
+        # check that len(MultiIndex) == max(len(iterables))
+        T = ((1,), (2, 3), (4, 5, 6))


add the issue number as a comment

actually on 2nd thought i think this is an invalid index construction and should raise an error - we require fully balanaced tuples

If that's the case then

min(map(len, tuples)) == max(map(len, tuples))

might be a good check.

Edit: Actually this might be a slightly more efficient check.

len_0 = tuples[0] all(len_0 == x for x in map(len, tuples))

I know from_arrays required balanced inputs (there's a check for it). It seems that from_tuples attempts to balance the input before passing it to from_arrays.

yes not sure how to do this efficiently, this could potentially be hit in a lot of places, so pls check perf.

(for my own reference) a quick timing shows that

len_0 = len(tuples[0]) for i in tuples: if len_0 != len(i): break

is the fastest check so far. This is similar to the check that from_arrays does, but moves half of the len calls outside the loop.

jreback · 2016-12-08T11:53:33Z

pandas/tests/indexes/test_multi.py

+        T = ((1,), (2, 3), (4, 5, 6))
+
+        idx = MultiIndex.from_tuples(T)
+        self.assertEqual(len(idx), 3)


use an expected index and assert_index_equal

I just followed the pattern from the other test case for MultiIndex.from_tuples. I'll change it though. Do you want me to update the other test case?

jreback · 2016-12-08T11:53:55Z

doc/source/whatsnew/v0.20.0.txt

@@ -108,3 +108,5 @@ Performance Improvements

 Bug Fixes
 ~~~~~~~~~
+
+- ``MultiIndex.from_tuples`` correctly handles sequences of variable length tuples (:issue:`14794`)


this can be in 0.19.2

jreback · 2016-12-08T11:58:40Z

this also could impact performance
so i am inclined to raise an error (may need to modify some cython to detect though)

groutr · 2016-12-09T02:49:30Z

Are unbalanced indexes invalid? When passing in an unbalanced list, the holes get filled with NaN. My goal was trying to make the behavior consistent between lists and other containers. I'm fine with raising an error if unbalanced input is invalid.

jreback · 2016-12-10T15:02:08Z

@groutr unbalanced are invalid, its not normal to do this and causes all kinds of indexing issues, they are barely supported, so a helpful error up front is useful.

jreback · 2016-12-30T21:24:57Z

can you rebase / update

groutr · 2017-01-19T19:59:36Z

Yeah, I can rebase and update to throw an error.

I went through the test suite as well, and saw that a couple of tests construct these invalid multi-indexes. I will try to update those tests too.

jreback · 2017-01-19T20:06:58Z

@groutr thanks!

groutr · 2017-02-01T19:17:56Z

@jreback I was working on updating this PR the other day and before changing too much like I did in #14806, I wanted to get your ideas on a particular issue.

The distinction in my mind between from_tuples and from_arrays is vanishing very quickly. There seems to be very little distinction between from_arrays and from_tuples almost to the point where I think they could literally be the same function. The only difference before seemed that from_tuples didn't enforce a uniform length requirement whereas from_arrays does. I can't see the justification for casting the tuples to numpy object arrays in from_tuples. Is that for performance reasons? from_arrays seems to only require iterable containers of equal length.

The from_tuples and from_arrays functions seem to be two special cases of a more general from_sequences idea. The issue I'm running into is from_tuples calls from_arrays, but for ideal input, a relatively expensive check would be performed twice, once in from_tuples and then again in from_arrays. from_tuples doesn't seem to be performing any major meaningful work on top of from_arrays, or am I missing something here?

Of course, we don't want to change an API willy-nilly. My thought is to have a new "private" _from_even_sequences (or similar named) function that combines the work of from_tuples and from_arrays. from_tuples and from_arrays would simply have their first len() check then call _from_even_sequences to do the remainder of the work in from_arrays for generating the MultiIndex. The API would be intact, but then performance of _from_even_sequences could be addressed more directly.

jreback · 2017-02-01T19:21:22Z

@groutr actually .from_tuples is quite useful, its exactly what it says, take a list-of-tuples and create a MI. while .from_arrays is almost lower level, take a list of arrays that are already essentially the levels (dtyped) and create a MI, its is much more efficient.

.from_tuples is a high-level interface, so of course it should be implemented in as .from_arrays.

the check in .from_arrays is pretty cheap. .from_tuples needs to do a check (as seen in the issue).

codecov-io · 2017-02-02T03:07:27Z

Codecov Report

Merging #14823 into master will decrease coverage by 5.7%.
The diff coverage is 60%.

@@            Coverage Diff             @@
##           master   #14823      +/-   ##
==========================================
- Coverage   90.98%   85.28%   -5.71%     
==========================================
  Files         161      144      -17     
  Lines       49288    50972    +1684     
==========================================
- Hits        44846    43469    -1377     
- Misses       4442     7503    +3061

Flag	Coverage Δ
#multiple	`?`
#single	`?`

Impacted Files	Coverage Δ
pandas/indexes/multi.py	`96.28% <60%> (ø)`
pandas/io/sas/sas_constants.py	`0% <0%> (-100%)`	⬇️
pandas/io/sas/sas7bdat.py	`0% <0%> (-90.67%)`	⬇️
pandas/io/sas/sas_xport.py	`0% <0%> (-90.28%)`	⬇️
pandas/io/sas/sasreader.py	`4.34% <0%> (-81.86%)`	⬇️
pandas/io/gbq.py	`17.77% <0%> (-65.57%)`	⬇️
pandas/io/html.py	`64.35% <0%> (-20.51%)`	⬇️
pandas/util/testing.py	`82.5% <0%> (-17.5%)`	⬇️
pandas/tools/plotting.py	`68.8% <0%> (-13.01%)`	⬇️
pandas/io/pickle.py	`74.28% <0%> (-6.97%)`	⬇️
... and 220 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3955261...6cf5286. Read the comment docs.

groutr · 2017-02-02T03:33:00Z

Tests coming soon.

jreback · 2017-02-02T13:46:21Z

pandas/tests/indexes/test_multi.py

+        # check that len(MultiIndex) == max(len(iterables))
+        T = ((1,), (2, 3), (4, 5, 6))
+
+        idx = MultiIndex.from_tuples(T)


shouldn't this raise?

jreback · 2017-02-02T13:47:48Z

We should be raising on variable length tuples. have the whatsnew reflect that.

Factor out equal length check into separate method.

No need to import zip_longest anymore

jreback · 2017-02-10T14:24:48Z

pandas/indexes/multi.py

@@ -983,6 +983,7 @@ def from_tuples(cls, tuples, sortorder=None, names=None):
        ----------
        tuples : list / sequence of tuple-likes
            Each tuple is the index of one row/column.
+            A ValueError will be raised if all tuples are not the same length.


make this a separate Raises section after Returns (for the raising conditions)

jreback · 2017-02-10T14:25:11Z

pandas/tests/indexes/test_base.py

-        expected = MultiIndex.from_tuples([('a', 'b', 'c'), ('d', 'e', np.nan),
-                                           ('f', np.nan, np.nan)])
-        tm.assert_index_equal(idx.str.split(expand=True), expected)
+        # This is invalid behavior


add the issue number here

jreback · 2017-02-10T14:25:19Z

pandas/tests/indexes/test_multi.py

@@ -1640,6 +1640,21 @@ def test_from_tuples(self):
        idx = MultiIndex.from_tuples(((1, 2), (3, 4)), names=['a', 'b'])
        self.assertEqual(len(idx), 2)

+    def test_equal_length(self):
+        # Test _check_equal_length


issue number here

jreback · 2017-02-10T14:25:46Z

pandas/tests/test_strings.py

-                                           'is', 'not')])
-        tm.assert_index_equal(result, exp)
-        self.assertEqual(result.nlevels, 6)
+        with self.assertRaises(ValueError):


why is this changing?

This test now constructs an invalid multi-index. What should I be doing instead?

how is this invalid?

The test constructs a multi-index with unbalanced tuples which now throws a ValueError. The split strings have different lengths and because expand=True, a MultiIndex is constructed; see a few lines below. Therefore, I modified the test check for a ValueError.

hmm, then this is actually a bug in that method then. So on construction it needs to make sure to send in balanced tuples.

jreback · 2017-02-10T14:28:26Z

pandas/indexes/multi.py

+
+    Return True if all sequences are the same length, otherwise False
+    If seq_of_seqs is empty return True as well.
+    """


do a

if not is_list_like(seq_of_seqs): return True

at the top, this will prevent things like: strings & Timestamps from pass thru here (though they shouldn't be passed in the first place, its possible)

shoyer · 2017-02-15T15:28:17Z

pandas/indexes/multi.py

@@ -1007,6 +1008,9 @@ def from_tuples(cls, tuples, sortorder=None, names=None):
            # I think this is right? Not quite sure...
            raise TypeError('Cannot infer number of levels from empty list')

+        if not _check_equal_length(tuples):


I would definitely consider doing this check in Cython at the same time as constructing the arrays in the lib.tuples_to_object_array and lib.to_object_array_tuples functions. Otherwise I would expect a serious impact on performance in the typical case.

@shoyer makes a good point. checking cost is about the same as the actual construction (in this trivial example)

In [1]: tuples = [(1,2)]*1000000 In [2]: len(tuples[0]) Out[2]: 2 In [3]: %timeit any([len(x) != 2 for x in tuples]) 10 loops, best of 3: 88.5 ms per loop In [5]: %timeit pd.MultiIndex.from_tuples(tuples) 10 loops, best of 3: 108 ms per loop

I'm fine with implementing the method in Cython. Would it make sense to move _check_equal_length to lib.pyx?

EDIT: NM, I think I understand what you mean with checking during construction. Would _check_equal_length be a useful function elsewhere, or is it specific to these situations?

For performance, the check probably needs to be integrated into each of the existing Cython functions separately, avoiding a second loop.

to_object_array_tuples seems to deliberately find the length of the longest inner sequence to pad everything to. It is used indirectly by DataFrame. If to_object_array_tuples were to throw an error on unbalanced input, would that affect DataFrame's behavior of padding things with NaN in some cases?

I would suggest grepping for uses of to_object_array_tuples to understand the impact

jreback · 2017-03-28T18:02:07Z

can you rebase / update

jreback · 2017-05-07T14:04:44Z

closing as stale. would like to have this fixed though :>

jreback · 2017-07-14T20:49:06Z

try something like this

diff --git a/pandas/core/frame.py b/pandas/core/frame.py
index 6559fc4..a64bf61 100644
--- a/pandas/core/frame.py
+++ b/pandas/core/frame.py
@@ -5828,7 +5828,7 @@ def _reorder_arrays(arrays, arr_columns, columns):
 
 def _list_to_arrays(data, columns, coerce_float=False, dtype=None):
     if len(data) > 0 and isinstance(data[0], tuple):
-        content = list(lib.to_object_array_tuples(data).T)
+        content = list(lib.tuples_to_object_array(np.array(data, dtype=object)).T)
     else:
         # list of lists
         content = list(lib.to_object_array(data).T)
diff --git a/pandas/core/indexes/multi.py b/pandas/core/indexes/multi.py
index 81eac0a..4b4a820 100644
--- a/pandas/core/indexes/multi.py
+++ b/pandas/core/indexes/multi.py
@@ -1128,20 +1128,21 @@ class MultiIndex(Index):
         MultiIndex.from_product : Make a MultiIndex from cartesian product
                                   of iterables
         """
+        if not is_list_like(tuples):
+            raise
+
         if len(tuples) == 0:
             if names is None:
                 msg = 'Cannot infer number of levels from empty list'
                 raise TypeError(msg)
             arrays = [[]] * len(names)
-        elif isinstance(tuples, (np.ndarray, Index)):
+
+        if isinstance(tuples, (list, tuple)):
+            arrays = list(lib.to_object_array_tuples(list(tuples)).T)
+        else:
             if isinstance(tuples, Index):
                 tuples = tuples._values
-
             arrays = list(lib.tuples_to_object_array(tuples).T)
-        elif isinstance(tuples, list):
-            arrays = list(lib.to_object_array_tuples(tuples).T)
-        else:
-            arrays = lzip(*tuples)
 
         return MultiIndex.from_arrays(arrays, sortorder=sortorder, names=names)

# Conflicts: # pandas/core/indexes/multi.py # pandas/tests/test_strings.py

pep8speaks · 2017-07-15T20:10:56Z

Hello @groutr! Thanks for updating the PR.

In the file pandas/tests/indexes/test_base.py, following are the PEP8 issues :

Line 1463:80: E501 line too long (83 > 79 characters)
Line 1464:45: E128 continuation line under-indented for visual indent

In the file pandas/tests/test_strings.py, following are the PEP8 issues :

Line 2031:80: E501 line too long (82 > 79 characters)
Line 2032:41: E124 closing bracket does not match visual indentation

jreback · 2017-09-07T00:42:45Z

closing as stale, but would still like the fix. ping if you want to reopen.

jreback requested changes Dec 8, 2016

View reviewed changes

jreback added Error Reporting Incorrect or improved errors from pandas MultiIndex labels Dec 10, 2016

jreback reviewed Feb 2, 2017

View reviewed changes

groutr added 8 commits February 9, 2017 22:38

add tests for variable length sequences.

41aa759

use zip_longest to avoid truncating tuples.

14da53c

Check that tuples are all same length.

0c118bc

Factor out equal length check into separate method.

remove extra )

0c2ca79

No need to import zip_longest anymore

use iterators rather than indexing

6593c26

catch ValueErrors in now invalid test cases.

d1d26ff

test _check_equal_length

62cbc27

If we have an empty seq, then we should return True.

927d439

groutr force-pushed the multi-index-zip branch from b0060c5 to 927d439 Compare February 10, 2017 04:39

jreback reviewed Feb 10, 2017

View reviewed changes

Merge branch 'master' into multi-index-zip

75fe081

shoyer reviewed Feb 15, 2017

View reviewed changes

jreback closed this May 7, 2017

jreback reopened this Jul 14, 2017

Merge branch 'master' into multi-index-zip

6cf5286

# Conflicts: # pandas/core/indexes/multi.py # pandas/tests/test_strings.py

jreback mentioned this pull request Jul 17, 2017

BUG: MultiIndex.set_levels with emtpy levels #16987

Closed

3 tasks

jreback mentioned this pull request Aug 6, 2017

MultiIndex.from_tuples returns an Index when all tuples are length 1 #17178

Closed

jreback closed this Sep 7, 2017

Uh oh!

BUG: MutliIndex variable length tuples #14823

BUG: MutliIndex variable length tuples #14823

Uh oh!

Conversation

groutr commented Dec 8, 2016

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

groutr Dec 9, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

groutr Dec 9, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jreback commented Dec 8, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

groutr commented Dec 9, 2016

Uh oh!

jreback commented Dec 10, 2016

Uh oh!

jreback commented Dec 30, 2016

Uh oh!

groutr commented Jan 19, 2017

Uh oh!

jreback commented Jan 19, 2017

Uh oh!

groutr commented Feb 1, 2017

Uh oh!

jreback commented Feb 1, 2017

Uh oh!

codecov-io commented Feb 2, 2017 • edited by codecov bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

groutr commented Feb 2, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jreback commented Feb 2, 2017

Uh oh!

jreback Feb 10, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

groutr Dec 9, 2016 •

edited

Loading

groutr Dec 9, 2016 •

edited

Loading

jreback commented Dec 8, 2016 •

edited

Loading

codecov-io commented Feb 2, 2017 •

edited by codecov bot

Loading

jreback Feb 10, 2017 •

edited

Loading

groutr Feb 15, 2017 •

edited

Loading