API/BUG: allow .str-accessor for 1-level MultiIndex? Return what? #23679

Closed
h-vetinari opened this issue Nov 14, 2018 · 1 comment · Fixed by #26608
Labels
API Design, MultiIndex, Strings (String extension data type and string data)
Comments

@h-vetinari
Contributor

h-vetinari commented Nov 14, 2018

Following from #23167...

The current check in the .str constructor for MultiIndex is (roughly):

if isinstance(data, MultiIndex) and data.nlevels > 1:
    raise ...

Meaning the constructor passes for MultiIndex with a single level, but essentially all methods fail or produce garbage:

>>> import pandas as pd
>>> idx = pd.Index(['aaa', 'bb', 'c'])
>>> mi = pd.MultiIndex.from_arrays([idx])
>>> mi.str.len()
Int64Index([1, 1, 1], dtype='int64')  # compare idx.str.len() == Int64Index([3, 2, 1], dtype='int64')
>>> mi.str.cat()
[...]
NotImplementedError: initializing a Series from a MultiIndex is not supported
>>> mi.str.startswith('a')
Float64Index([nan, nan, nan], dtype='float64')
>>> mi.str.upper()
Float64Index([nan, nan, nan], dtype='float64')
>>> mi.str.islower()
Float64Index([nan, nan, nan], dtype='float64')
>>> mi.str.split()
Float64Index([nan, nan, nan], dtype='float64')
>>> mi.str.find('a')
Float64Index([nan, nan, nan], dtype='float64')
>>> mi.str.ljust(10)
Float64Index([nan, nan, nan], dtype='float64')
>>> mi.str.repeat(3)
Float64Index([nan, nan, nan], dtype='float64')
>>> mi.str.slice(1, 2)
Index([(), (), ()], dtype='object')  # compare idx.str.slice(1, 2) == Index(['a', 'b', ''], dtype='object')
>>> mi.str.zfill(10)
Float64Index([nan, nan, nan], dtype='float64')
>>> mi.str.wrap(2)
Float64Index([nan, nan, nan], dtype='float64')
>>> mi.str.normalize('NFC')
Float64Index([nan, nan, nan], dtype='float64')
>>> mi.str.index('')
[...]
ValueError: tuple.index(x): x not in tuple
>>> mi.str.get(1)
Float64Index([nan, nan, nan], dtype='float64')
>>> mi.str.contains('a')
Float64Index([nan, nan, nan], dtype='float64')

My original plan in #23167 was just to disable MultiIndex.str regardless of the number of levels, but @toobaz brought up the point (in a side discussion in #23670) that:

> Sorry, naive question, but what is the problem with just running .str on the result of self.get_level_values(0)?

This would, of course, work without problems. The main questions that arise from this issue are:

  • should the .str-accessor be enabled for MultiIndex at all?
  • if yes, should it return an Index or a 1-level MultiIndex?
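The two return-type options can be illustrated with a minimal sketch. This is not how the accessor is implemented; it only shows what delegating to `get_level_values(0)` would yield, and the variable names are illustrative:

```python
import pandas as pd

idx = pd.Index(['aaa', 'bb', 'c'])
mi = pd.MultiIndex.from_arrays([idx])

# Unwrap the single level into a plain Index, then use .str as usual.
level = mi.get_level_values(0)

# Option A: return a plain Index.
lengths = level.str.len()
upper = level.str.upper()

# Option B (illustrative only): re-wrap the result into a 1-level MultiIndex.
upper_mi = pd.MultiIndex.from_arrays([upper])
```

With option A the results match those of `idx.str` on the original Index; option B keeps the input type at the cost of an extra wrapping step.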

PS. As another link to #23670, one could maybe consider enabling .str for all MultiIndex objects by operating on MultiIndex.to_flat_index() in those cases. This might be interesting, for example, for easily joining the MultiIndex levels with .str.join.
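A rough sketch of that idea, assuming a pandas version that provides `MultiIndex.to_flat_index()` and using a made-up two-level index. The flattening is done explicitly here, since the proposal to do it implicitly inside `.str` is only a suggestion:

```python
import pandas as pd

# Hypothetical two-level MultiIndex, for illustration only.
mi = pd.MultiIndex.from_arrays([['a', 'b'], ['x', 'y']])

# Flatten to a plain Index whose elements are tuples: ('a', 'x'), ('b', 'y').
flat = mi.to_flat_index()

# .str.join concatenates the strings inside each tuple with the separator.
joined = flat.str.join('_')
```

Here `joined` is a plain Index holding one joined label per original row.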

gfyoung added the Strings, MultiIndex, and API Design labels Nov 14, 2018
@toobaz
Member

toobaz commented Nov 14, 2018

> should the .str-accessor be enabled for MultiIndex at all?

As mentioned on the other issue, I think that we need to clearly decide on one of the following policies:

  1. Index and 1-level MultiIndex are different implementations of precisely the same thing - and so behave identically, including with the .str accessor
  2. Index and 1-level MultiIndex are different by design because they achieve different purposes (and by "different" I don't mean "one has a subset of features of the other")
  3. we deprecate 1-level MultiIndexes and immediately stop producing them in our code

I think there is no evidence/support for 2. So it is just a matter of understanding if 1. is feasible. I always thought it is (and it might have the advantage of avoiding some conversions), but if it is not (e.g. we do not want to implement .str for 1-level MultiIndex) then I think we should go with 3.

> if yes, should it return an Index or a 1-level MultiIndex?

Assuming we follow policy 1., Index I think: simpler to implement, and I see no advantage for the alternative.
