Multi-Index .loc with nlevels > 2 fails when nlevels - i > 1 levels are specified in a tuple. #22151

bfollinprm · 2018-08-01T00:57:44Z

Code Sample, a copy-pastable example if possible

import pandas as pd
tuples = [
    ('2011', '01', 1),
    ('2011', '01', 2),
    ('2011', '02', 1),
    ('2011', '02', 2),
    ('2012', '01', 1),
    ('2012', '01', 2),
    ('2012', '02', 1),
    ('2012', '02', 2),    
]
index = pd.MultiIndex.from_tuples(tuples)
df = pd.DataFrame({'a': list(range(8))}, index = index)
df.loc[[('2011', '01')], :] # raises ValueError

Problem description

In the spirit of the Multi-Index missile example in the docstring for pd.DataFrame.loc, I query a data frame with a 3-level multi-index, supplying tuples of varying length. The following calls all work as expected from the generalization of that example:

* df.loc[['2011'], :]
* df.loc[[('2011', '01', 1)], :]
* df.loc['2011', :]
* df.loc[('2011', '01', 1), :]
* df.loc[('2011', '01'), :]

The requested query fails because of a mismatch between the supplied multi-index tuple and the pre-computed offsets in the new MultiIndexUIntEngine class introduced in 0.23.0:

ValueError                                Traceback (most recent call last)
in ()
----> 1 df.loc[[('2011', '01')], :]

pandas/core/indexing.py in __getitem__(self, key)
1508
1509 maybe_callable = com._apply_if_callable(key, self.obj)
-> 1510 return self._getitem_axis(maybe_callable, axis=axis)
1511
1512 def _is_scalar_access(self, key):

pandas/core/indexing.py in _getitem_axis(self, key, axis)
1910 raise ValueError('Cannot index with multidimensional key')
1911
-> 1912 return self._getitem_iterable(key, axis=axis)
1913
1914 # nested tuple slicing

pandas/core/indexing.py in _getitem_iterable(self, key, axis)
1211 # A collection of keys
1212 keyarr, indexer = self._get_listlike_indexer(key, axis,
-> 1213 raise_missing=False)
1214 return self.obj._reindex_with_indexers({axis: [keyarr, indexer]},
1215 copy=True, allow_dups=True)

pandas/core/indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
1160 if len(ax) or not len(key):
1161 key = self._convert_for_reindex(key, axis)
-> 1162 indexer = ax.get_indexer_for(key)
1163 keyarr = ax.reindex(keyarr)[0]
1164 else:

pandas/core/indexes/base.py in get_indexer_for(self, target, **kwargs)
3420 """
3421 if self.is_unique:
-> 3422 return self.get_indexer(target, **kwargs)
3423 indexer, _ = self.get_indexer_non_unique(target, **kwargs)
3424 return indexer

pandas/core/indexes/multi.py in get_indexer(self, target, method, limit, tolerance)
1972 'for MultiIndex; see GitHub issue 9365')
1973 else:
-> 1974 indexer = self._engine.get_indexer(target)
1975
1976 return ensure_platform_int(indexer)

pandas/_libs/index.pyx in pandas._libs.index.BaseMultiIndexCodesEngine.get_indexer()

pandas/_libs/index.pyx in pandas._libs.index.BaseMultiIndexCodesEngine._extract_level_codes()

pandas/core/indexes/multi.py in _codes_to_ints(self, codes)
75 # Shift the representation of each level by the pre-calculated number
76 # of bits:
---> 77 codes <<= self.offsets
78
79 # Now sum and OR are in fact interchangeable. This is a simple

ValueError: operands could not be broadcast together with shapes (1,2) (3,) (1,2)
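The broadcasting failure at the bottom of the traceback can be reproduced with NumPy alone. This is a minimal sketch, not the engine's code: the `codes` and `offsets` values below are made up for illustration (the real offsets depend on the size of each level), and only the shapes match the report, a (1, 2) array of codes for a two-element key against three per-level offsets.

```python
import numpy as np

# Illustrative values only: one partial key covering 2 of the 3 levels,
# and one bit offset per level (assumed numbers, not pandas' real ones).
codes = np.array([[1, 1]], dtype="uint64")     # shape (1, 2)
offsets = np.array([2, 1, 0], dtype="uint64")  # shape (3,)

message = None
try:
    # Same in-place shift as in _codes_to_ints; the trailing dimensions
    # (2 versus 3) cannot be broadcast against each other.
    codes <<= offsets
except ValueError as exc:
    message = str(exc)
    print(message)
```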

Expected Output

                a
2011    01  1   0
            2   1

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.0.dev0+346.gb97545563
pytest: 3.6.3
pip: 10.0.1
setuptools: 39.2.0
Cython: 0.28.4
numpy: 1.14.0
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.1.6
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None


WillAyd · 2018-08-01T05:19:53Z

Does seem strange, though I'm fairly certain the preferred method of slicing here is:

df.loc[('2011', '01'), :]

which works. The docs are pretty explicit about the need to specify both axes.

That said there's likely a bug hidden here somewhere - investigation and PRs are always welcome.

cc @toobaz

bfollinprm · 2018-08-01T05:54:15Z

The construction df.loc[('2011', '01'), :] is equivalent to df.loc[('2011', '01')], which works but returns a "series"-like object, i.e. it has lost the index information. This is not quite equivalent to what .loc[[idx]] returns; for example, df.loc[[('2011', '01')], :] raises the same ValueError. I'll update the description to include the column axis placeholder.

The solution is actually pretty simple (at least to get back to 0.22.0's functionality): in pandas/core/indexes/multi.py, the offsets need to be masked to account for the missing element(s) in the tuple(s). For the example above, replacing line 77 with

if len(codes.shape) > 1:
    # Check if querying multi-index, and reduce the offsets
    # to the number of levels actually supplied
    offsets = self.offsets[:codes.shape[1]]
else:
    offsets = self.offsets
codes <<= offsets

is sufficient (of course, a similar change should propagate to the PyInt-version of the same class).
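To make the effect of truncating the offsets concrete, here is a small standalone sketch. It assumes made-up offsets for a 3-level index, and `pack_codes` is a hypothetical helper written for this illustration, not a pandas function; it only mimics the shift-and-OR step visible in the traceback.

```python
import numpy as np

# Assumed bit offsets for a 3-level index (illustration only).
OFFSETS = np.array([2, 1, 0], dtype="uint64")


def pack_codes(codes, offsets=OFFSETS):
    """Hypothetical helper: shift each level's code, then OR them together."""
    codes = np.array(codes, dtype="uint64", ndmin=2)  # copy and force 2-D
    used = offsets[: codes.shape[1]]  # keep offsets for the supplied levels only
    codes <<= used
    return np.bitwise_or.reduce(codes, axis=1)


full = pack_codes([[1, 1, 1]])  # complete key: all three levels
partial = pack_codes([[1, 1]])  # partial key: first two levels only
print(int(full[0]), int(partial[0]))
```

With these assumed values the complete key packs to 7 and the partial key to 6. The truncated shift therefore avoids the broadcasting error, but the packed value of a prefix is not the packed value of any complete key, so avoiding the error is not, on its own, the same as matching every row under that prefix.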

I would raise as a PR, but work has strict rules around contributions to open source, and I've already tried unsuccessfully to get permission to contribute to pandas.

toobaz · 2018-08-01T14:57:59Z

@bfollinprm Thanks for the report. I definitely agree this should work.

I don't think the proposed fix is optimal, however. _codes_to_ints expects only complete combinations of codes, and I think it should not need to be more complicated than this. After all, when nlevels == 1, or when you run the example by @WillAyd, it works correctly, and the operations the engine does should be exactly the same. So this can and should be fixed somewhere else.

toobaz · 2018-08-02T07:13:27Z

> After all, when nlevels == 1, or when you run the example by @WillAyd, it works correctly

Correcting myself: nlevels == 1 does not work correctly:

import pandas as pd
tuples = [
    ('2011', '01'),
    ('2011', '01'),
    ('2011', '02'),
    ('2011', '02'),
    ('2012', '01'),
    ('2012', '01'),
    ('2012', '02'),
    ('2012', '02'),    
]
index = pd.MultiIndex.from_tuples(tuples)
df = pd.DataFrame({'a': list(range(8))}, index = index)
df.loc[[('2011', )], ] # raises ValueError

... but the example by @WillAyd does, so my argument should still stand.

toobaz · 2018-08-02T17:22:03Z

@bfollinprm did your example really work with 0.22.0? With 0.19.2, I get

In [2]: df.loc[[('2011', '01')], :]
Out[2]: 
          a
2011 01 NaN

... which is certainly not the desired behavior.

bfollinprm · 2018-08-03T06:52:37Z

Ah, you are right, this was buggy before--it gave the wrong result, but was silent about it.

I didn't notice because the test code that was checking this was looping over all tuple lengths, but was only checking the values of the length-1 and full-length index queries. So I guess ValueError is an improvement, but I'm still of the opinion this should work.

bfollinprm · 2018-08-03T07:04:27Z

The problem with the other signatures, e.g. @WillAyd's df.loc[('2011', '01'), :], is that it returns a series and does not generalize to e.g. a list of indexes.

The following seems to give what I want:

target_indexes = [('2011', '02'), ('2011', '01')]
target_to_position = df.index.get_locs(list(zip(*target_indexes)))
df.iloc[target_to_position]

So there definitely is a way, but that's ugly.

toobaz · 2018-08-03T08:07:29Z

> The problem with the other signatures, e.g. @WillAyd's df.loc[('2011', '01'), :], is that it returns a series and does not generalize to e.g. a list of indexes.

Sure, I agree, my comment was only concerning the implementation. Since the current engine can be used for the desired lookup, the problem is not there.

> The following seems to give what I want:

Please correct me if I'm wrong, but given target_indexes = [('2011', '01'), ('2012', '02')], your snippet will also return ('2011', '02', 1), which is not what you desire, and certainly not what df.loc[target_indexes, :] should return.

This said, I think the solution will be something similar to your snippet, the main difficulty being that it needs to account for both lists and dataframes to be returned by get_loc or similar, see #9519 .
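The cross-product concern can be checked on the 3-level frame from the original report. The sketch below passes per-level label lists to `get_locs` (the zipped tuples converted to lists), then contrasts that with one possible prefix match built from `droplevel` and `isin`. The second approach is only an illustration of the desired result, not the fix discussed in this thread, and the variable names are chosen for this example.

```python
import pandas as pd

# Same index layout as the original report: year, month, day-like counter.
tuples = [(y, m, d) for y in ("2011", "2012") for m in ("01", "02") for d in (1, 2)]
df = pd.DataFrame({"a": list(range(8))}, index=pd.MultiIndex.from_tuples(tuples))

target_indexes = [("2011", "01"), ("2012", "02")]

# Per-level label lists, as produced by zip(*target_indexes). get_locs treats
# each list independently, so every combination of the labels matches.
per_level = [list(labels) for labels in zip(*target_indexes)]
cross_positions = df.index.get_locs(per_level)

# Prefix match: drop the unspecified last level and test (year, month) pairs.
prefix_mask = df.index.droplevel(2).isin(target_indexes)
prefix_rows = df[prefix_mask]

print(len(cross_positions), prefix_rows["a"].tolist())
```

Here the per-level lookup selects all eight rows, including ('2011', '02', 1), while the prefix mask keeps only the four rows under the two requested pairs. Note that a boolean mask returns rows in the frame's order, not in the order of `target_indexes`.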

toobaz · 2018-08-03T17:02:41Z

Duplicate of #16083. @bfollinprm, feel free to add any further comments there.

WillAyd added the MultiIndex label Aug 1, 2018

WillAyd added the Bug label Aug 1, 2018

toobaz closed this as completed Aug 3, 2018

toobaz mentioned this issue Aug 3, 2018

.loc with list of incomplete labels misbehaves or raises #16083

Open
