Skip to content

Incoherence between Index and MultiIndex when labels in list are not found #15452

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
toobaz opened this issue Feb 18, 2017 · 14 comments · Fixed by #41607
Closed

Incoherence between Index and MultiIndex when labels in list are not found #15452

toobaz opened this issue Feb 18, 2017 · 14 comments · Fixed by #41607
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions
Milestone

Comments

@toobaz
Copy link
Member

toobaz commented Feb 18, 2017

Code Sample, a copy-pastable example if possible

In [2]: s = pd.Series(range(4), index=pd.MultiIndex.from_product([[1,2], ['a', 'b']]))

In [3]: s.loc[['not', 'found']]
Out[3]: Series([], dtype: int64)

Problem description

With regular Indexes, looking for a list of labels of which none is present will raise KeyError (see below). We should be coherent (and while I would prefer the empty result to the exception, this was already discussed in #7999).

Expected Output

Compare to

In [4]: s.reset_index().loc[['not', 'found']]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-4-e13655430320> in <module>()
----> 1 s.reset_index().loc[['not', 'found']]

/home/nobackup/repo/pandas/pandas/core/indexing.py in __getitem__(self, key)
   1339         else:
   1340             key = com._apply_if_callable(key, self.obj)
-> 1341             return self._getitem_axis(key, axis=0)
   1342 
   1343     def _is_scalar_access(self, key):

/home/nobackup/repo/pandas/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1539                     raise ValueError('Cannot index with multidimensional key')
   1540 
-> 1541                 return self._getitem_iterable(key, axis=axis)
   1542 
   1543             # nested tuple slicing

/home/nobackup/repo/pandas/pandas/core/indexing.py in _getitem_iterable(self, key, axis)
   1049     def _getitem_iterable(self, key, axis=0):
   1050         if self._should_validate_iterable(axis):
-> 1051             self._has_valid_type(key, axis)
   1052 
   1053         labels = self.obj._get_axis(axis)

/home/nobackup/repo/pandas/pandas/core/indexing.py in _has_valid_type(self, key, axis)
   1429 
   1430                 raise KeyError("None of [%s] are in the [%s]" %
-> 1431                                (key, self.obj._get_axis_name(axis)))
   1432 
   1433             return True

KeyError: "None of [['not', 'found']] are in the [index]"

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: f65a641 python: 3.5.2.final.0 python-bits: 64 OS: Linux OS-release: 4.7.0-1-amd64 machine: x86_64 processor: byteorder: little LC_ALL: None LANG: it_IT.utf8 LOCALE: it_IT.UTF-8

pandas: 0.19.0+473.gf65a641
pytest: 3.0.6
pip: 8.1.2
setuptools: 28.0.0
Cython: 0.23.4
numpy: 1.12.0
scipy: 0.18.1
xarray: None
IPython: 5.1.0.dev
sphinx: 1.4.8
patsy: 0.3.0-dev
dateutil: 2.5.3
pytz: 2015.7
blosc: None
bottleneck: 1.2.0
tables: 3.2.2
numexpr: 2.6.0
feather: None
matplotlib: 2.0.0rc2
openpyxl: 2.3.0
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: 0.999
httplib2: 0.9.1
apiclient: 1.5.2
sqlalchemy: 1.0.15
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
pandas_datareader: 0.2.1

@jreback
Copy link
Contributor

jreback commented Feb 18, 2017

these are not equivalent, asking for a scalar, e.g. s.loc[('not', 'found')] is the equivalent operation and this does raise KeyError.

@jreback jreback closed this as completed Feb 18, 2017
@jreback jreback added this to the No action milestone Feb 18, 2017
@toobaz
Copy link
Member Author

toobaz commented Feb 18, 2017

@jreback , while indeed the most strict equivalent to

# flat Index, looking for a list of values
s.reset_index().loc[['not', 'found']] 

is not really

# MultiIndex, looking for a list of values (in first level)
s.loc[['not', 'found']]

but rather

# MultiIndex, looking for a list of (complete) keys
s.loc[[('not', 'found1'), ('not', 'found2')]] 

, your example is not related to any of those (compare to s.reset_index().loc[[1, 2]], which does not return a scalar, and neither s.reset_index().loc[[1, 'a']] does!). And by the way, the last example above is even more incoherent:

In [6]: s.loc[[('not', 'found1'), ('not', 'found2')]]
Out[6]: 
not  found1   NaN
     found2   NaN
dtype: float64

... but at least there seems to be some reasoning behind (not clear to me).

@toobaz
Copy link
Member Author

toobaz commented Feb 18, 2017

Or if you want a more concise example: s.iloc[:2].loc[[s.index[-1]]] will raise KeyError if s has a flat Index, and not if it has a MultiIndex

@jorisvandenbossche
Copy link
Member

asking for a scalar, e.g. s.loc[('not', 'found')] is the equivalent operation and this does raise KeyError.

that is asking for a scalar label, but @toobaz issue was a list of labels, so the equivalent would indeed rather be (but of course it difficult to pinpoint an exact equivalent):

In [97]: s.loc[[('not', 'found'), ('also', 'not')]]
Out[97]: 
not   found   NaN
also  not     NaN
dtype: float64

which does a reindex. Thus, different from single Index indexing, where it only does a reindex when at least one label of the list is included in the Index.

@jorisvandenbossche jorisvandenbossche removed this from the No action milestone Feb 19, 2017
@toobaz
Copy link
Member Author

toobaz commented Feb 22, 2017

@jreback @jorisvandenbossche to be honest, I don't get why it was decided in the first place to make .loc behave like .reindex at all (e.g. change the structure of the indexed index). It seems to me completely incoherent with regular .loc behavior, and also undocumented. In my view, we should enforce the following general rules:

  • s.loc[list_of_things] for a Series or df.loc[list_of_things, :] for a DataFrame should always (except when they raise some error, obviously) return an object of the same kind and with an index of the same depth (including 1) as s and df, respectively (where list_of_things is not necessarily a list, but rather anything that we interpret as a list of labels/levels, including Indexes)
  • such index must contain a subset of the labels of the index of s, or df (possibly repeated, however, and/or in another order)

And I know it's probably late to say this, but concerning this bug, the cleanest solution would seem to me to abandon the rule (documented, this time) At least 1 of the labels for which you ask, must be in the index or a KeyError will be raised!, and allow an empty DataFrame/Series to be returned when looking for a list of missing labels (even in a flat Index). By the way, pd.Series(range(2)).loc[[]] is currently incoherent with the docs, since it doesn't raise.

Or alternatively, we could decide to be strict and raise a KeyError whenever any of the provided labels (or partial labels, when indexing only on some levels) is missing.

I understand that either of the two would break existing code. But the current compromise, in which we raise only if we do not find at least one label, seems to me a source of headaches both for the pandas user and for the pandas developer.

(sorry if this was already discussed somewhere I couldn't find)

toobaz added a commit to toobaz/pandas that referenced this issue Mar 8, 2017
toobaz added a commit to toobaz/pandas that referenced this issue Mar 8, 2017
@toobaz toobaz mentioned this issue Mar 8, 2017
4 tasks
@jorisvandenbossche
Copy link
Member

To come back to the original issue of s.loc[['not', 'found']], the question is maybe also how this exactly should be interpreted.
It's a list of labels. But, is it a list of full labels (spanning all levels) (like s.loc[[('not', 'found'), ('also', 'not')]] but where you give wrong labels)? Or is it a list of labels for the first level (like s.loc[(['not', 'found'], ), :])

Because the desired result may be different depending on how it is interpreted (or at least the current behaviour). And s.loc[['not', 'found']] is a bit ambiguous.

@jorisvandenbossche
Copy link
Member

@toobaz there is also a difference in the 'reindexing' behaviour of loc between single Index and MultiIndex.
For an Index, it is 'real' reindex:

In [69]: s.reset_index(level=1, drop=True).loc[[1, 3]]
Out[69]: 
1    0.0
1    1.0
3    NaN
dtype: float64

so introducing NaN for the labels that did not exist in the original index.
For a MultiIndex, this is not done (following from the fact that pandas cannot know how to fill in the other levels), so here it is a 'reindex but only with the existing labels':

In [70]: s.loc[[1, 3], :]
Out[70]: 
1  a    0
   b    1
dtype: int64

So following that idea, you could see the original case here as equivalent to indexing with an empty list:

In [71]: s.reset_index(level=1, drop=True).loc[[]]
Out[71]: Series([], dtype: int64)

In [73]: s.loc[[3, 4],:]
Out[73]: Series([], dtype: int64)

In [74]: s.loc[[],:]
Out[74]: Series([], dtype: int64)

and then the output could be interpreted as consistent ...

@toobaz
Copy link
Member Author

toobaz commented Mar 9, 2017

It's a list of labels. But, is it a list of full labels (spanning all levels) (like s.loc[[('not', 'found'), ('also', 'not')]] but where you give wrong labels)? Or is it a list of labels for the first level (like s.loc[(['not', 'found'], ), :])

I gave it as granted that every time we ask s.something_involving_our(given_list) we want a_series_having_as_content([s.something_involving_our(x) for x in given_list]).

In your first option, you have ('not', 'found') as tuples, not lists. My understanding is that pandas draws a sharp distinction between lists and tuples (the former being collection, the latter being possibly composed keys), and that this is very useful! So I gave it as granted that we wanted the second.

@toobaz
Copy link
Member Author

toobaz commented Mar 9, 2017

So following that idea, you could see the original case here as equivalent to indexing with an empty list:
[...]
and then the output could be interpreted as consistent ...

Sure! And this reveals, strictly speaking, another very perverse (although well known) bug: since the docs say At least 1 of the labels for which you ask, must be in the index or a KeyError will be raised!, and this is false when you pass an empty list, then the empty list should raise too.

Now, if I will be able to convince you that the current behaviour is perverse (I mean my proposal above about managing missing labels - but we can discuss it elsewhere), I will take care of required changes, and the problem will disappear, everything will be coherent, and I will be happy. But assuming this does not happen, we will update the docs stating "except when passing an empty list", and then again the expected behaviour for this bug will be clear (since I am not passing an empty list).

@jorisvandenbossche
Copy link
Member

In your first option, you have ('not', 'found') as tuples, not lists.

No, it is a list of tuples ([('not', 'found'), ('also', 'not')]), or a list of scalars (['not', 'found']), but both are a list.

I gave it as granted that every time we ask s.something_involving_our(given_list) we want a_series_having_as_content([s.something_involving_our(x) for x in given_list]).

That's maybe a good analogy to reason about the behaviour of a single list, yes. And to justify that s.loc[[1, 2]] works how it works (because you could say that this is actually interpreted as s.loc[([1,2], slice(None))]

Sure! And this reveals, strictly speaking, another very perverse (although well known) bug: since the docs say At least 1 of the labels for which you ask, must be in the index or aKeyErrorwill be raised!, and this is false when you pass an empty list, then the empty list should raise too.

I am not sure we should call it a bug, as it could also be a design choice (although maybe a questionable one). But I would open a separate issue for that (focusing on the plain Index case).

@toobaz
Copy link
Member Author

toobaz commented Mar 9, 2017

In your first option, you have ('not', 'found') as tuples, not lists.

No, it is a list of tuples ([('not', 'found'), ('also', 'not')]), or a list of scalars (['not', 'found']), but both are a list.

OK: you have tuples, I don't. Tuples (can) mean "pieces of keys" (one per level), lists never do. Tuples can also contain lists (as in your example below), but they again mean "pieces of keys", except that for each level you have several pieces rather than only one. It is a beautiful convention without which I would never know what to expect from .loc[] on MultiIndex: tuples go horizontally, lists go vertically.

That's maybe a good analogy to reason about the behaviour of a single list, yes. And to justify that s.loc[[1, 2]] works how it works (because you could say that this is actually interpreted as s.loc[([1,2], slice(None))]

Exactly!

I am not sure we should call it a bug, as it could also be a design choice (although maybe a questionable one).

Sorry, you are right, I know it's a design choice... I meant that as of now, it is just a documentation bug (except that I would like to fix the code rather than the docs, but I agree to talk about this in a separate issue).

toobaz added a commit to toobaz/pandas that referenced this issue Mar 10, 2017
toobaz added a commit to toobaz/pandas that referenced this issue Mar 11, 2017
@jorisvandenbossche
Copy link
Member

OK: you have tuples I don't. Tuples (can) mean "pieces of keys" (one per level), lists never do. Tuples can also contain lists (as in your example below), but they again mean "pieces of keys", except that for each level you have several pieces rather than only one. It is a beautiful convention without which I would never know what to expect from .loc[] on MultiIndex.

This part I don't fully understand. But we may be confused by each other terminology.
I also don't have tuples in the actual issue we are discussing (apart from the tuples within the list), I only use them (if you mean the s.loc[([1,2], slice(None))]) to clarify how the list can be interpreted. As to me, that form is the 'fully specified' form that is non-ambiguous.

Tuples (can) mean "pieces of keys" (one per level), lists never do.

Well, that is my question. Do lists never mean that? Because the result of s.loc[[1, 2]] also looks like the list is interpreted as a specification of the first level (is this what you mean with "pieces of keys"?)

By the way, the

I gave it as granted that every time we ask s.something_involving_our(given_list) we want a_series_having_as_content([s.something_involving_our(x) for x in given_list]).

That's maybe a good analogy to reason about the behaviour of a single list,

does not fully hold if you have more than two levels:


In [7]: midx = pd.MultiIndex.from_product([['A0', 'A1', 'A2'], ['B0', 'B1'], ['C0', 'C1']])

In [8]: s = pd.Series(range(len(midx)), index=midx)

# a non-fully specified tuple works as a single element
In [9]: s[('A0', 'B0')]
Out[9]: 
C0    0
C1    1
dtype: int64

# but not as a list of those
In [10]: s[[('A0', 'B0'), ('A1', 'B1')]]
Out[10]: 
A0  B0   NaN
A1  B1   NaN
dtype: float64

@toobaz
Copy link
Member Author

toobaz commented Mar 13, 2017

This part I don't fully understand. But we may be confused by each other terminology.
I also don't have tuples in the actual issue we are discussing (apart from the tuples within the list)

I was referring precisely to the tuples within the list

Tuples (can) mean "pieces of keys" (one per level), lists never do.

Well, that is my question. Do lists never mean that? Because the result of s.loc[[1, 2]] also looks like the list is interpreted as a specification of the first level (is this what you mean with "pieces of keys"?)

OK, seems like my wording was confusing. What I meant was: tuples broadcast across levels, lists never do. (By "pieces of keys" I meant "different pieces of one same key", or of multiple keys, but with emphasis that different components refer to different levels.)

That's maybe a good analogy to reason about the behaviour of a single list,

does not fully hold if you have more than two levels:

Good catch... I'm pretty sure we want to consider Out[10] as a bug. And more precisely:

  1. the current behaviour with missing labels is, in general, weird (and we will discuss separately)
  2. ... and it is the reason why .loc was made to act like .reindex when it does not find labels on MultiIndexes (and I think this should be fixed too - it should at least act like .loc for those labels which it does find, and I think it is more in general documented - but we should also discuss this separately)
  3. ... but since in your example there are no missing labels (they are incomplete, but not missing, precisely as if you did s[['A0', 'A1']]), Out[10] is a bug even if you don't agree with me on fixing 1. and 2.

You don't even need the three levels to contradict me, try the following ;-)

In [2]: s = pd.Series(range(6), index=pd.MultiIndex.from_product([['A0', 'A1', 'A2'], ['B0', 'B1']]))

In [3]: s.loc[[('A0',), ('A1',)]]

(clearly another bug - or the same, for what it matters)

(By the way, before you take me more literally than I intended: I meant s.something_involving_our(given_list) limited to the case in which something_involving_our is a call to an indexer... the API is so big that any general statement about it should be easy to falsify ;-) )

@mroeschke
Copy link
Member

Looks like this reasonably raises a KeyError now. Supposed could use a test

In [65]: In [2]: s = pd.Series(range(4), index=pd.MultiIndex.from_product([[1,2], ['a', 'b']]))
    ...:
    ...: In [3]: s.loc[['not', 'found']]
KeyError: "['not' 'found'] not in index"

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed API Design MultiIndex labels May 8, 2021
@mroeschke mroeschke mentioned this issue May 21, 2021
10 tasks
@jreback jreback modified the milestones: Contributions Welcome, 1.3 May 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants