
Multi-Index .loc with nlevels > 2 fails when nlevels - i > 1 levels are specified in a tuple. #22151


Closed
bfollinprm opened this issue Aug 1, 2018 · 9 comments


bfollinprm commented Aug 1, 2018

Code Sample, a copy-pastable example if possible

import pandas as pd
tuples = [
    ('2011', '01', 1),
    ('2011', '01', 2),
    ('2011', '02', 1),
    ('2011', '02', 2),
    ('2012', '01', 1),
    ('2012', '01', 2),
    ('2012', '02', 1),
    ('2012', '02', 2),    
]
index = pd.MultiIndex.from_tuples(tuples)
df = pd.DataFrame({'a': list(range(8))}, index = index)
df.loc[[('2011', '01')], :] # raises ValueError

Problem description

In the spirit of the Multi-Index missile example in the docstring for pd.DataFrame.loc, I query a DataFrame with a 3-level MultiIndex, supplying tuples of varying length. The following calls all work as expected, generalizing from that example:

* df.loc[['2011'], :]
* df.loc[[('2011', '01', 1)], :]
* df.loc['2011', :]
* df.loc[('2011', '01', 1), :]
* df.loc[('2011', '01'), :]

The requested query fails because of a mismatch between the supplied MultiIndex tuple and the pre-computed offsets in the new MultiIndexUIntEngine class introduced in 0.23.0:

ValueError Traceback (most recent call last) in ()
----> 1 df.loc[[('2011', '01')], :]

pandas/core/indexing.py in __getitem__(self, key)
1508
1509 maybe_callable = com._apply_if_callable(key, self.obj)
-> 1510 return self._getitem_axis(maybe_callable, axis=axis)
1511
1512 def _is_scalar_access(self, key):

pandas/core/indexing.py in _getitem_axis(self, key, axis)
1910 raise ValueError('Cannot index with multidimensional key')
1911
-> 1912 return self._getitem_iterable(key, axis=axis)
1913
1914 # nested tuple slicing

pandas/core/indexing.py in _getitem_iterable(self, key, axis)
1211 # A collection of keys
1212 keyarr, indexer = self._get_listlike_indexer(key, axis,
-> 1213 raise_missing=False)
1214 return self.obj._reindex_with_indexers({axis: [keyarr, indexer]},
1215 copy=True, allow_dups=True)

pandas/core/indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
1160 if len(ax) or not len(key):
1161 key = self._convert_for_reindex(key, axis)
-> 1162 indexer = ax.get_indexer_for(key)
1163 keyarr = ax.reindex(keyarr)[0]
1164 else:

pandas/core/indexes/base.py in get_indexer_for(self, target, **kwargs)
3420 """
3421 if self.is_unique:
-> 3422 return self.get_indexer(target, **kwargs)
3423 indexer, _ = self.get_indexer_non_unique(target, **kwargs)
3424 return indexer

pandas/core/indexes/multi.py in get_indexer(self, target, method, limit, tolerance)
1972 'for MultiIndex; see GitHub issue 9365')
1973 else:
-> 1974 indexer = self._engine.get_indexer(target)
1975
1976 return ensure_platform_int(indexer)

pandas/_libs/index.pyx in pandas._libs.index.BaseMultiIndexCodesEngine.get_indexer()

pandas/_libs/index.pyx in pandas._libs.index.BaseMultiIndexCodesEngine._extract_level_codes()

pandas/core/indexes/multi.py in _codes_to_ints(self, codes)
75 # Shift the representation of each level by the pre-calculated number
76 # of bits:
---> 77 codes <<= self.offsets
78
79 # Now sum and OR are in fact interchangeable. This is a simple

ValueError: operands could not be broadcast together with shapes (1,2) (3,) (1,2)
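
The final error can be reproduced with NumPy alone. The following is a minimal sketch, assuming illustrative offsets (the values are hypothetical, not the ones pandas computes): a partial key yields a codes array with two columns, while the engine holds one offset per level, so the in-place shift cannot broadcast.

```python
import numpy as np

# Hypothetical per-level bit shifts for a 3-level index (illustrative values).
offsets = np.array([4, 2, 0], dtype="uint64")

# A partial key such as ('2011', '01') supplies codes for only 2 of the 3 levels.
codes = np.array([[1, 1]], dtype="uint64")

try:
    codes <<= offsets  # shapes (1, 2) and (3,) cannot be broadcast together
except ValueError as exc:
    message = str(exc)
    print(message)
```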

Expected Output

                a
2011    01  1   0
            2   1

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.0.dev0+346.gb97545563
pytest: 3.6.3
pip: 10.0.1
setuptools: 39.2.0
Cython: 0.28.4
numpy: 1.14.0
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.1.6
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Member

WillAyd commented Aug 1, 2018

Does seem strange, though I'm fairly certain the preferred method of slicing here is:

df.loc[('2011', '01'), :]

which works. The docs are pretty explicit about the need to specify both axes.

That said there's likely a bug hidden here somewhere - investigation and PRs are always welcome.

cc @toobaz

WillAyd added the Bug label Aug 1, 2018
Author

bfollinprm commented Aug 1, 2018

The construction df.loc[('2011', '01'), :] is equivalent to df.loc[('2011', '01')], which works but returns a "series"-like object, i.e. it has lost the index information. That is not equivalent to what .loc[[idx]] returns, and df.loc[[('2011', '01')], :] raises the same ValueError. I'll update the description to include the column axis placeholder.
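
For a single partial key, one way to keep the full index is DataFrame.xs with drop_level=False. The sketch below only illustrates the difference in the returned index; it rebuilds the example frame with from_product for brevity and does not cover the list-of-tuples case this issue is about.

```python
import pandas as pd

# Same 8 rows as the example above, built with from_product for brevity.
index = pd.MultiIndex.from_product([["2011", "2012"], ["01", "02"], [1, 2]])
df = pd.DataFrame({"a": list(range(8))}, index=index)

# Partial tuple key: the two matched levels are dropped from the result.
dropped = df.loc[("2011", "01"), :]

# Same rows, but all three index levels are retained.
kept = df.xs(("2011", "01"), drop_level=False)

print(dropped.index.nlevels, kept.index.nlevels)
```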

The solution is actually pretty simple (at least to get back to 0.22.0's functionality): in pandas/core/indexes/multi.py, the offsets need to be masked to account for the missing element(s) in the tuple(s). For the example above, replacing line 77 with

if len(codes.shape) > 1:
    # Querying with partial keys: keep only the offsets of the supplied levels
    offsets = self.offsets[:codes.shape[1]]
else:
    offsets = self.offsets
codes <<= offsets

is sufficient (of course, a similar change should propagate to the PyInt-version of the same class).

I would raise as a PR, but work has strict rules around contributions to open source, and I've already tried unsuccessfully to get permission to contribute to pandas.

Member

toobaz commented Aug 1, 2018

@bfollinprm Thanks for the report. I definitely agree this should work.

I don't think the proposed fix is optimal, however. _codes_to_ints expects only complete combinations of codes, and I think it should not need to be more complicated than this. After all, when nlevels == 1, or when you run the example by @WillAyd, it works correctly, and the operations the engine does should be exactly the same. So this can and should be fixed somewhere else.
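
To make the "complete combinations" point concrete, here is a toy model of the shift-and-OR packing described in the traceback comments. It is a sketch under assumptions, not the pandas implementation: the function name, the offsets, and the 1-based codes are all hypothetical. In this toy model a partial key packed with truncated offsets matches none of the packed full keys, so avoiding the broadcast error is not by itself enough for a correct lookup.

```python
import numpy as np

def pack_codes(codes, offsets):
    """Toy packing: shift each level's code into its own bit range, then OR."""
    shifted = codes << offsets
    return np.bitwise_or.reduce(shifted, axis=1)

# Hypothetical layout: 3 levels, 2 bits each, codes starting at 1.
offsets = np.array([4, 2, 0], dtype="uint64")

# Full keys ('2011', '01', 1) and ('2011', '01', 2) as per-level codes.
full_codes = np.array([[1, 1, 1], [1, 1, 2]], dtype="uint64")
full = pack_codes(full_codes, offsets)

# Partial key ('2011', '01') packed with the offsets truncated to 2 levels.
partial_codes = np.array([[1, 1]], dtype="uint64")
partial = pack_codes(partial_codes, offsets[:2])

print(full.tolist(), partial.tolist())
```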

Member

toobaz commented Aug 2, 2018

> After all, when nlevels == 1, or when you run the example by @WillAyd, it works correctly

Correcting myself: nlevels == 1 does not work correctly:

import pandas as pd
tuples = [
    ('2011', '01'),
    ('2011', '01'),
    ('2011', '02'),
    ('2011', '02'),
    ('2012', '01'),
    ('2012', '01'),
    ('2012', '02'),
    ('2012', '02'),    
]
index = pd.MultiIndex.from_tuples(tuples)
df = pd.DataFrame({'a': list(range(8))}, index = index)
df.loc[[('2011', )], ] # raises ValueError

... but the example by @WillAyd does, so my argument should still stand.

Member

toobaz commented Aug 2, 2018

@bfollinprm did your example really work with 0.22.0? With 0.19.2, I get

In [2]: df.loc[[('2011', '01')], :]
Out[2]: 
          a
2011 01 NaN

... which is certainly not the desired behavior.

Author

bfollinprm commented Aug 3, 2018

Ah, you are right, this was buggy before--it gave the wrong result, but was silent about it.

I didn't notice because the test code that was checking this was looping over all tuple lengths, but was only checking the values of the length 1 and length index queries. So I guess ValueError is an improvement, but I'm still of the opinion this should work.

Author

bfollinprm commented Aug 3, 2018

The problem with the other signatures, e.g. @WillAyd's df.loc[('2011', '01'), :], is that it returns a series and does not generalize to e.g. a list of indexes.

The following seems to give what I want:

target_indexes = [('2011', '02'), ('2011', '01')]
target_to_position = df.index.get_locs(list(zip(*target_indexes)))
df.iloc[target_to_position]

So there definitely is a way, but that's ugly.

Member

toobaz commented Aug 3, 2018

> The problem with the other signatures, e.g. @WillAyd's df.loc[('2011', '01'), :], is that it returns a series and does not generalize to e.g. a list of indexes.

Sure, I agree, my comment was only concerning the implementation. Since the current engine can be used for the desired lookup, the problem is not there.

> The following seems to give what I want:

Please correct me if I'm wrong, but given target_indexes = [('2011', '01'), ('2012', '02')], your snippet will also return ('2011', '02', 1), which is not what you desire, and certainly not what df.loc[target_indexes, :] should return.
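
The cross-product behaviour can be checked directly. The sketch below rebuilds the example frame with from_product for brevity and contrasts the get_locs approach with one possible workaround, a boolean mask built from the first two index levels. The mask is only an illustration, not a proposal for how .loc should be implemented.

```python
import pandas as pd

# Same 8 rows as the original example, built with from_product for brevity.
index = pd.MultiIndex.from_product([["2011", "2012"], ["01", "02"], [1, 2]])
df = pd.DataFrame({"a": list(range(8))}, index=index)
targets = [("2011", "01"), ("2012", "02")]

# Per-level label lists select the cross product of the labels,
# so ('2011', '02', ...) and ('2012', '01', ...) are included as well.
per_level = [list(labels) for labels in zip(*targets)]
cross = df.iloc[df.index.get_locs(per_level)]

# Matching on the first two levels only keeps exactly the requested pairs.
mask = df.index.droplevel(2).isin(targets)
exact = df[mask]

print(len(cross), len(exact))
```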

This said, I think the solution will be something similar to your snippet, the main difficulty being that it needs to account for both lists and dataframes to be returned by get_loc or similar, see #9519 .

Member

toobaz commented Aug 3, 2018

Duplicate of #16083. @bfollinprm, feel free to add any further comments there.
