Skip to content

BUG: MutliIndex variable length tuples #14823

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 10 commits into from

Conversation

groutr
Copy link
Contributor

@groutr groutr commented Dec 8, 2016

@@ -5,6 +5,11 @@
from functools import partial
from sys import getsizeof

try:
from itertools import zip_longest
except ImportError:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be in pandas.compat (might be there already actually) so put / import from there

@@ -1601,6 +1601,19 @@ def test_from_tuples(self):
idx = MultiIndex.from_tuples(((1, 2), (3, 4)), names=['a', 'b'])
self.assertEqual(len(idx), 2)

def test_from_tuples_variable_length(self):
# check that len(MultiIndex) == max(len(iterables))
T = ((1,), (2, 3), (4, 5, 6))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add the issue number as a comment

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually on 2nd thought i think this is an invalid index construction and should raise an error - we require fully balanaced tuples

Copy link
Contributor Author

@groutr groutr Dec 9, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that's the case then

min(map(len, tuples)) == max(map(len, tuples))

might be a good check.

Edit: Actually this might be a slightly more efficient check.

len_0 = tuples[0]
all(len_0 == x for x in map(len, tuples))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know from_arrays required balanced inputs (there's a check for it). It seems that from_tuples attempts to balance the input before passing it to from_arrays.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes not sure how to do this efficiently, this could potentially be hit in a lot of places, so pls check perf.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(for my own reference) a quick timing shows that

len_0 = len(tuples[0])
for i in tuples:
    if len_0 != len(i):
        break

is the fastest check so far. This is similar to the check that from_arrays does, but moves half of the len calls outside the loop.

T = ((1,), (2, 3), (4, 5, 6))

idx = MultiIndex.from_tuples(T)
self.assertEqual(len(idx), 3)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use an expected index and assert_index_equal

Copy link
Contributor Author

@groutr groutr Dec 9, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just followed the pattern from the other test case for MultiIndex.from_tuples. I'll change it though. Do you want me to update the other test case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

@@ -108,3 +108,5 @@ Performance Improvements

Bug Fixes
~~~~~~~~~

- ``MultiIndex.from_tuples`` correctly handles sequences of variable length tuples (:issue:`14794`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be in 0.19.2

@jreback
Copy link
Contributor

jreback commented Dec 8, 2016

this also could impact performance
so i am inclined to raise an error (may need to modify some cython to detect though)

@groutr
Copy link
Contributor Author

groutr commented Dec 9, 2016

Are unbalanced indexes invalid? When passing in an unbalanced list, the holes get filled with NaN. My goal was trying to make the behavior consistent between lists and other containers. I'm fine with raising an error if unbalanced input is invalid.

@jreback jreback added Error Reporting Incorrect or improved errors from pandas MultiIndex labels Dec 10, 2016
@jreback
Copy link
Contributor

jreback commented Dec 10, 2016

@groutr unbalanced are invalid, its not normal to do this and causes all kinds of indexing issues, they are barely supported, so a helpful error up front is useful.

@jreback
Copy link
Contributor

jreback commented Dec 30, 2016

can you rebase / update

@groutr
Copy link
Contributor Author

groutr commented Jan 19, 2017

Yeah, I can rebase and update to throw an error.

I went through the test suite as well, and saw that a couple of tests construct these invalid multi-indexes. I will try to update those tests too.

@jreback
Copy link
Contributor

jreback commented Jan 19, 2017

@groutr thanks!

@groutr
Copy link
Contributor Author

groutr commented Feb 1, 2017

@jreback I was working on updating this PR the other day and before changing too much like I did in #14806, I wanted to get your ideas on a particular issue.

The distinction in my mind between from_tuples and from_arrays is vanishing very quickly. There seems to be very little distinction between from_arrays and from_tuples almost to the point where I think they could literally be the same function. The only difference before seemed that from_tuples didn't enforce a uniform length requirement whereas from_arrays does. I can't see the justification for casting the tuples to numpy object arrays in from_tuples. Is that for performance reasons? from_arrays seems to only require iterable containers of equal length.

The from_tuples and from_arrays functions seem to be two special cases of a more general from_sequences idea. The issue I'm running into is from_tuples calls from_arrays, but for ideal input, a relatively expensive check would be performed twice, once in from_tuples and then again in from_arrays. from_tuples doesn't seem to be performing any major meaningful work on top of from_arrays, or am I missing something here?

Of course, we don't want to change an API willy-nilly. My thought is to have a new "private" _from_even_sequences (or similar named) function that combines the work of from_tuples and from_arrays. from_tuples and from_arrays would simply have their first len() check then call _from_even_sequences to do the remainder of the work in from_arrays for generating the MultiIndex. The API would be intact, but then performance of _from_even_sequences could be addressed more directly.

@jreback
Copy link
Contributor

jreback commented Feb 1, 2017

@groutr actually .from_tuples is quite useful, its exactly what it says, take a list-of-tuples and create a MI. while .from_arrays is almost lower level, take a list of arrays that are already essentially the levels (dtyped) and create a MI, its is much more efficient.

.from_tuples is a high-level interface, so of course it should be implemented in as .from_arrays.

the check in .from_arrays is pretty cheap. .from_tuples needs to do a check (as seen in the issue).

@codecov-io
Copy link

codecov-io commented Feb 2, 2017

Codecov Report

Merging #14823 into master will decrease coverage by 5.7%.
The diff coverage is 60%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #14823      +/-   ##
==========================================
- Coverage   90.98%   85.28%   -5.71%     
==========================================
  Files         161      144      -17     
  Lines       49288    50972    +1684     
==========================================
- Hits        44846    43469    -1377     
- Misses       4442     7503    +3061
Flag Coverage Δ
#multiple ?
#single ?
Impacted Files Coverage Δ
pandas/indexes/multi.py 96.28% <60%> (ø)
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-90.67%) ⬇️
pandas/io/sas/sas_xport.py 0% <0%> (-90.28%) ⬇️
pandas/io/sas/sasreader.py 4.34% <0%> (-81.86%) ⬇️
pandas/io/gbq.py 17.77% <0%> (-65.57%) ⬇️
pandas/io/html.py 64.35% <0%> (-20.51%) ⬇️
pandas/util/testing.py 82.5% <0%> (-17.5%) ⬇️
pandas/tools/plotting.py 68.8% <0%> (-13.01%) ⬇️
pandas/io/pickle.py 74.28% <0%> (-6.97%) ⬇️
... and 220 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3955261...6cf5286. Read the comment docs.

@groutr
Copy link
Contributor Author

groutr commented Feb 2, 2017

Tests coming soon.

# check that len(MultiIndex) == max(len(iterables))
T = ((1,), (2, 3), (4, 5, 6))

idx = MultiIndex.from_tuples(T)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this raise?

@jreback
Copy link
Contributor

jreback commented Feb 2, 2017

We should be raising on variable length tuples. have the whatsnew reflect that.

@@ -983,6 +983,7 @@ def from_tuples(cls, tuples, sortorder=None, names=None):
----------
tuples : list / sequence of tuple-likes
Each tuple is the index of one row/column.
A ValueError will be raised if all tuples are not the same length.
Copy link
Contributor

@jreback jreback Feb 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make this a separate Raises section after Returns (for the raising conditions)

expected = MultiIndex.from_tuples([('a', 'b', 'c'), ('d', 'e', np.nan),
('f', np.nan, np.nan)])
tm.assert_index_equal(idx.str.split(expand=True), expected)
# This is invalid behavior
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add the issue number here

@@ -1640,6 +1640,21 @@ def test_from_tuples(self):
idx = MultiIndex.from_tuples(((1, 2), (3, 4)), names=['a', 'b'])
self.assertEqual(len(idx), 2)

def test_equal_length(self):
# Test _check_equal_length
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue number here

'is', 'not')])
tm.assert_index_equal(result, exp)
self.assertEqual(result.nlevels, 6)
with self.assertRaises(ValueError):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this changing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test now constructs an invalid multi-index. What should I be doing instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is this invalid?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test constructs a multi-index with unbalanced tuples which now throws a ValueError. The split strings have different lengths and because expand=True, a MultiIndex is constructed; see a few lines below. Therefore, I modified the test check for a ValueError.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, then this is actually a bug in that method then. So on construction it needs to make sure to send in balanced tuples.


Return True if all sequences are the same length, otherwise False
If seq_of_seqs is empty return True as well.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do a

if not is_list_like(seq_of_seqs):
     return True

at the top, this will prevent things like: strings & Timestamps from pass thru here (though they shouldn't be passed in the first place, its possible)

@@ -1007,6 +1008,9 @@ def from_tuples(cls, tuples, sortorder=None, names=None):
# I think this is right? Not quite sure...
raise TypeError('Cannot infer number of levels from empty list')

if not _check_equal_length(tuples):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would definitely consider doing this check in Cython at the same time as constructing the arrays in the lib.tuples_to_object_array and lib.to_object_array_tuples functions. Otherwise I would expect a serious impact on performance in the typical case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shoyer makes a good point. checking cost is about the same as the actual construction (in this trivial example)

In [1]: tuples = [(1,2)]*1000000

In [2]: len(tuples[0])
Out[2]: 2

In [3]: %timeit any([len(x) != 2 for x in tuples])
10 loops, best of 3: 88.5 ms per loop

In [5]: %timeit pd.MultiIndex.from_tuples(tuples)
10 loops, best of 3: 108 ms per loop

Copy link
Contributor Author

@groutr groutr Feb 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with implementing the method in Cython. Would it make sense to move _check_equal_length to lib.pyx?

EDIT: NM, I think I understand what you mean with checking during construction. Would _check_equal_length be a useful function elsewhere, or is it specific to these situations?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For performance, the check probably needs to be integrated into each of the existing Cython functions separately, avoiding a second loop.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to_object_array_tuples seems to deliberately find the length of the longest inner sequence to pad everything to. It is used indirectly by DataFrame. If to_object_array_tuples were to throw an error on unbalanced input, would that affect DataFrame's behavior of padding things with NaN in some cases?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest grepping for uses of to_object_array_tuples to understand the impact

@jreback
Copy link
Contributor

jreback commented Mar 28, 2017

can you rebase / update

@jreback
Copy link
Contributor

jreback commented May 7, 2017

closing as stale. would like to have this fixed though :>

@jreback jreback closed this May 7, 2017
@jreback jreback reopened this Jul 14, 2017
@jreback
Copy link
Contributor

jreback commented Jul 14, 2017

try something like this

diff --git a/pandas/core/frame.py b/pandas/core/frame.py
index 6559fc4..a64bf61 100644
--- a/pandas/core/frame.py
+++ b/pandas/core/frame.py
@@ -5828,7 +5828,7 @@ def _reorder_arrays(arrays, arr_columns, columns):
 
 def _list_to_arrays(data, columns, coerce_float=False, dtype=None):
     if len(data) > 0 and isinstance(data[0], tuple):
-        content = list(lib.to_object_array_tuples(data).T)
+        content = list(lib.tuples_to_object_array(np.array(data, dtype=object)).T)
     else:
         # list of lists
         content = list(lib.to_object_array(data).T)
diff --git a/pandas/core/indexes/multi.py b/pandas/core/indexes/multi.py
index 81eac0a..4b4a820 100644
--- a/pandas/core/indexes/multi.py
+++ b/pandas/core/indexes/multi.py
@@ -1128,20 +1128,21 @@ class MultiIndex(Index):
         MultiIndex.from_product : Make a MultiIndex from cartesian product
                                   of iterables
         """
+        if not is_list_like(tuples):
+            raise
+
         if len(tuples) == 0:
             if names is None:
                 msg = 'Cannot infer number of levels from empty list'
                 raise TypeError(msg)
             arrays = [[]] * len(names)
-        elif isinstance(tuples, (np.ndarray, Index)):
+
+        if isinstance(tuples, (list, tuple)):
+            arrays = list(lib.to_object_array_tuples(list(tuples)).T)
+        else:
             if isinstance(tuples, Index):
                 tuples = tuples._values
-
             arrays = list(lib.tuples_to_object_array(tuples).T)
-        elif isinstance(tuples, list):
-            arrays = list(lib.to_object_array_tuples(tuples).T)
-        else:
-            arrays = lzip(*tuples)
 
         return MultiIndex.from_arrays(arrays, sortorder=sortorder, names=names)

# Conflicts:
#	pandas/core/indexes/multi.py
#	pandas/tests/test_strings.py
@pep8speaks
Copy link

Hello @groutr! Thanks for updating the PR.

Line 1463:80: E501 line too long (83 > 79 characters)
Line 1464:45: E128 continuation line under-indented for visual indent

Line 2031:80: E501 line too long (82 > 79 characters)
Line 2032:41: E124 closing bracket does not match visual indentation

@jreback
Copy link
Contributor

jreback commented Sep 7, 2017

closing as stale, but would still like the fix. ping if you want to reopen.

@jreback jreback closed this Sep 7, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Error Reporting Incorrect or improved errors from pandas MultiIndex
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Incorrect sequence handling with MultiIndex.from_tuples
5 participants