Proposal: Deprecating support of incomplete indexing on MultiIndexes #10574

@tgarc

First, here's the example DataFrame:

                    0  1  2  3  4  5  6  7
first second third                        
bar   one    three  4  9  7  8  5  0  7  8
             four   6  8  1  5  9  9  1  7
      two    three  7  5  2  6  7  8  5  9
             four   0  8  8  5  3  5  8  3
baz   one    three  6  0  8  0  0  9  8  8
             four   9  3  2  0  2  7  4  9
      two    three  0  3  7  7  4  3  7  0
             four   6  3  2  8  3  9  7  8
foo   one    three  6  7  3  7  3  0  3  6
             four   5  8  0  8  1  5  1  5
      two    three  2  0  8  2  8  1  8  3
             four   9  0  2  7  0  6  8  3
qux   one    three  1  2  5  5  0  7  0  1
             four   6  6  7  0  0  4  5  3
      two    three  1  8  2  8  7  5  7  5
             four   1  1  3  8  8  6  0  3
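
For anyone who wants to reproduce this, here is a minimal sketch that builds a frame with the same shape and row index. The values are an assumption: they are random digits from an arbitrary seed, so they will not match the table above.

```python
import numpy as np
import pandas as pd

# Three row levels, listed in the same order as the printed frame.
index = pd.MultiIndex.from_product(
    [["bar", "baz", "foo", "qux"], ["one", "two"], ["three", "four"]],
    names=["first", "second", "third"],
)

# Assumption: random digits with an arbitrary seed stand in for the
# original values, which cannot be regenerated from the printout.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 10, size=(len(index), 8)), index=index)
print(df)
```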

For DataFrames with MultiIndexed rows, pandas allows this type of indexing:

df.loc[('foo','bar'), ('one','two'), ('three','four')]

to be taken to mean:

df.loc[(('foo','bar'), ('one','two'), ('three','four')), :]

But this type of indexing is ambiguous when exactly two indexing tuples are given, since

df.loc[('foo','bar'), ('one','two')]

could mean incomplete indexing as in

df.loc[(('foo','bar'), ('one','two')),:]

or row,column indexing as in

df.loc[(('foo','bar'),), (('one','two'),)]
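
A simplified sketch of the two readings, using single labels instead of tuples of labels. The frame `df2` is hypothetical: its column labels are chosen to coincide with the second-level row labels so that both readings are valid. The sketch only spells out each reading explicitly; it does not show which one pandas picks for the bare `df2.loc['foo', 'one']`.

```python
import pandas as pd

# Hypothetical frame: the column labels "one"/"two" coincide with the
# second-level row labels, so a bare pair of labels can be read two ways.
index = pd.MultiIndex.from_product(
    [["bar", "foo"], ["one", "two"]], names=["first", "second"]
)
df2 = pd.DataFrame(
    [[0, 1], [2, 3], [4, 5], [6, 7]], index=index, columns=["one", "two"]
)

# Reading 1: both labels address the row MultiIndex (one row, all columns).
as_row_key = df2.loc[("foo", "one"), :]

# Reading 2: the first label addresses rows, the second addresses columns.
as_row_and_column = df2["one"].loc["foo"]

print(as_row_key)
print(as_row_and_column)
```

The two selections return different data, which is why leaving the column indexer implicit can be misread.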

I appreciate that there is already a warning for this in the documentation, but I wonder if the functionality is worth the complications it adds to the code/docs.

Personally, I would suggest making the user responsible for fully specifying both axes when indexing a MultiIndex DataFrame (this doesn't apply to Series, which are one-dimensional). This would take away the minor syntactic convenience of omitting the column indexer, but it would simplify the code and give the user only one way to index a MultiIndex DataFrame, which makes usage less confusing.

For the user, the consequence when selecting on multiple levels of a row MultiIndex of a DataFrame is that instead of writing

df.loc['foo','one']

they would have to write

df.loc[('foo','one'), :]
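
A quick check that the two spellings select the same data, as a sketch on a frame with the same index as the example. The `np.arange` values are an assumption made so the selected rows are easy to recognize.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["bar", "baz", "foo", "qux"], ["one", "two"], ["three", "four"]],
    names=["first", "second", "third"],
)
# Assumption: sequential integers instead of the original random digits.
df = pd.DataFrame(np.arange(16 * 8).reshape(16, 8), index=index)

shorthand = df.loc["foo", "one"]      # column indexer left implicit
explicit = df.loc[("foo", "one"), :]  # row key and column slice spelled out

# Compare the selected values of the two spellings.
print(np.array_equal(shorthand.to_numpy(), explicit.to_numpy()))
```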

And, in the syntactically worst case, instead of writing

df.loc[('foo','bar'), ('one','two'), ('three','four')]

they would have to write

df.loc[(('foo','bar'), ('one','two'), ('three','four')), :]
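
A sketch of the fully explicit form. One caveat, stated as an assumption about the pandas version in use: the pandas indexing documentation treats a tuple as a single multi-level key and a list as several labels, so this sketch writes each per-level selection as a list rather than as the inner tuples used above.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["bar", "baz", "foo", "qux"], ["one", "two"], ["three", "four"]],
    names=["first", "second", "third"],
)
# Assumption: sequential integers instead of the original random digits.
df = pd.DataFrame(np.arange(16 * 8).reshape(16, 8), index=index)

# One list of labels per row level, wrapped in a tuple, then the column slice.
subset = df.loc[(["foo", "bar"], ["one", "two"], ["three", "four"]), :]
print(subset.shape)
```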

I'm fairly new to pandas (don't think I started using it until v0.16), so I realize I may be missing the bigger picture. If so, enlighten me!
