Weird MultiIndex bug #3714

Closed
gerigk opened this issue May 30, 2013 · 14 comments
Labels: Bug, Indexing (Related to indexing on series/frames, not to indexes themselves)


gerigk commented May 30, 2013

Bug: Assigning levels/labels to a multiindex (or really any fields in Index) should raise (if done externally)

This is (for me) one of the weirdest things I have found so far (still on 0.10.1).
tl;dr
It seems that df.copy() creates a shallow copy of the MultiIndex with respect to its levels.
Also, setting index.levels does not seem to have any effect on the index itself.

Running the same code twice gives different results the second time,
although I do not alter the original object anywhere.
Also note the line "print data.index.levels[1]", whose output changes after data2 is changed.

My original goal (concatenating) fails: the result always contains the original date in all rows.

Code

import pandas as pd
from datetime import date, timedelta
import datetime
ind = pd.MultiIndex.from_tuples([(pd.Timestamp('2013-04-30 00:00:00'),
   datetime.date(2013, 5, 30)), (pd.Timestamp('2013-05-01 00:00:00'),datetime.date(2013, 5, 30))])
data = pd.DataFrame({'orders_ga': {0: 10.0,
  1: 15.0},
 'revenues_ga': {0: 5.0,
 1: 7.0}})
data.index = ind


data2 = data.copy()
print 'original index'
print data.index
print 'copied index levels'
print data2.index.levels[1]
data2.index.levels[1] = pd.Index([date.today() + timedelta(1)])
print 'index levels of original dataframe after assinging new index to the copied df'
print data.index.levels[1]
print 'new dfs index levels after assigning new levels'
print data2.index.levels[1]
#print ##################
#print data.index
#data2['recorded_at'] = date.today() + timedelta(1)
print '##################################################  output of concat'
print pd.concat([data, data2])
print '##########################################################################'
print 'one more time'
data2 = data.copy()
print 'original index'
print data.index
print 'copied index levels'
print data2.index.levels[1]
data2.index.levels[1] = pd.Index([date.today() + timedelta(1)])
print 'index levels of original dataframe after assinging new index to the copied df'
print data.index.levels[1]
print 'new dfs index levels after assigning new levels'
print data2.index.levels[1]
#print ##################
#print data.index
#data2['recorded_at'] = date.today() + timedelta(1)
print '##################################################  output of concat'
print pd.concat([data, data2])

Output:

original index
MultiIndex
[(2013-04-30 00:00:00, 2013-05-30), (2013-05-01 00:00:00, 2013-05-30)]
copied index levels
Index([2013-05-30], dtype=object)
index levels of original dataframe after assinging new index to the copied df
Index([2013-05-31], dtype=object)
new dfs index levels after assigning new levels
Index([2013-05-31], dtype=object)
##################################################  output of concat
                       orders_ga  revenues_ga
2013-04-30 2013-05-30         10            5
2013-05-01 2013-05-30         15            7
2013-04-30 2013-05-30         10            5
2013-05-01 2013-05-30         15            7
##########################################################################
one more time
original index
MultiIndex
[(2013-04-30 00:00:00, 2013-05-30), (2013-05-01 00:00:00, 2013-05-30)]
copied index levels
Index([2013-05-31], dtype=object)
index levels of original dataframe after assinging new index to the copied df
Index([2013-05-31], dtype=object)
new dfs index levels after assigning new levels
Index([2013-05-31], dtype=object)
##################################################  output of concat
                       orders_ga  revenues_ga
2013-04-30 2013-05-30         10            5
2013-05-01 2013-05-30         15            7
2013-04-30 2013-05-30         10            5
2013-05-01 2013-05-30         15            7
Contributor

jreback commented Jun 3, 2013

@gerigk

When you copy the data, both frames end up with the identical index object. You can check this with

id(data.index) == id(data2.index)

The problem is that you are changing the levels on a MultiIndex, which is meant to be an immutable object.
A user of the index should not be allowed to do this; if you need to change it, you need to create a new object.

This is a bug in that assigning to levels should raise
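
For readers on a later pandas release, here is a minimal sketch of what "should raise" looks like in practice. It assumes a pandas version that includes the change referenced at the end of this thread, where `levels` is exposed as a frozen, list-like container; the tiny two-row index is made up for illustration.

```python
import pandas as pd

# Hypothetical two-level index, only for illustration.
index = pd.MultiIndex.from_tuples([("a", 1), ("b", 1)])

try:
    # Item assignment on the frozen `levels` container is rejected.
    index.levels[1] = pd.Index([2])
    raised = False
except TypeError:
    raised = True

print(raised)
```

Because the assignment is rejected, two frames that happen to share an index can no longer be changed behind each other's back this way.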

Contributor

jreback commented Jun 3, 2013

@wesm we do allow assigning to the names, which is ok (though in theory it could run into the same problem), but clearly levels/labels assignment should be disallowed?

Author

gerigk commented Jun 3, 2013

it definitely is confusing the way pandas behaves right now.

so I would have to do something like

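def set_index_level(multi_index, level, k):
    # copy the list of levels so the index passed in is left untouched
    levels = list(multi_index.levels)
    levels[k] = level
    # use the index passed in, not the global data2
    # (newer pandas releases call the `labels` argument `codes`)
    new_index = pd.MultiIndex(levels=levels, labels=multi_index.labels, names=multi_index.names)
    return new_index
data2.index = set_index_level(data2.index, pd.Index([date.today() + timedelta(1)]), 1)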

or is there a more convenient way to do this?

Contributor

jreback commented Jun 3, 2013

Use reset_index/set_index:

In [16]: data
Out[16]: 
                       orders_ga  revenues_ga
2013-04-30 2013-05-30         10            5
2013-05-01 2013-05-30         15            7

In [17]: df = data.reset_index()

In [18]: df
Out[18]: 
              level_0     level_1  orders_ga  revenues_ga
0 2013-04-30 00:00:00  2013-05-30         10            5
1 2013-05-01 00:00:00  2013-05-30         15            7

In [19]: df['new_level'] = df['level_1']+timedelta(1)

In [20]: df.set_index(['level_0','new_level'])
Out[20]: 
                          level_1  orders_ga  revenues_ga
level_0    new_level                                     
2013-04-30 2013-05-31  2013-05-30         10            5
2013-05-01 2013-05-31  2013-05-30         15            7

Author

gerigk commented Jun 3, 2013

I don't know the internals but this sounds pretty expensive given that I
actually want to change only one value inside one index level.

Doesn't this first construct new columns, then I operate on a whole column
and then I construct a complete new index?

Contributor

jreback commented Jun 3, 2013

this is pretty cheap to do; check whether it is actually a bottleneck for you

what is your end goal?

Contributor

jreback commented Jun 3, 2013

these ops can also be done inplace, fyi (pass inplace=True)

Contributor

jreback commented Jun 3, 2013

reorder_levels might also be useful to you

Author

gerigk commented Jun 3, 2013

I have some situations where I want to change some values in the levels and in the labels; the latter is especially interesting.
Assume I did a groupby and want to output the data (millions of entries).
If I want to change the labels according to a mapping, your reset_index() method would have to be followed by a Series.apply() to map the labels to different labels, which is slow.
If there are only 10 distinct labels, it is (I think) much faster to replace the .labels entry with my new mapping.
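
A rough sketch of that idea in a later pandas release, where `set_levels` exists: map only the distinct values stored in one level and build a new index from the result, so the per-row data is never touched. The country codes, the `mapping` dict and the level names are made up for illustration, and the sketch assumes the mapping is one-to-one, since level values have to stay unique.

```python
import pandas as pd

# Hypothetical data: two distinct values in the first level, four rows.
index = pd.MultiIndex.from_arrays(
    [["de", "de", "fr", "fr"], [1, 2, 1, 2]], names=["country", "week"]
)
sales = pd.Series([10.0, 15.0, 5.0, 7.0], index=index)

# Hypothetical one-to-one mapping applied to the distinct level values only.
mapping = {"de": "Germany", "fr": "France"}
renamed_level = sales.index.levels[0].map(mapping)

# set_levels returns a new MultiIndex; the integer codes per row are reused.
sales.index = sales.index.set_levels(renamed_level, level="country")

print(sales)
```

The work done by `map` scales with the number of distinct values in the level rather than with the number of rows.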

Contributor

jreback commented Jun 3, 2013

you can try to copy the index, then assign to the levels, then set the index

e.g.

index = index.copy()
index.levels[2] = 'foo'
df.index = index

to avoid the aliasing issue (this way you make sure that the index you have is ONLY attached where you want it)

Author

gerigk commented Jun 3, 2013

much better, thanks!

Contributor

jreback commented Jun 3, 2013

great!

this is still a 'bug' in any event, so will keep this issue open

Author

gerigk commented Jun 3, 2013

ok.

just for me to understand:
is your solution


index = index.copy()
index.levels[2] = 'foo'
df.index = index

supposed to fail in the future since we still access the index.levels ?

in this case my previously suggested (clumsy) method would be safer to use

Contributor

jreback commented Jun 3, 2013

At some point I think we might raise if you try to set level directly, so your solution bypasses that, but in the meantime go ahead and use it

jreback pushed a commit that referenced this issue Aug 12, 2013
* `names` is now a property *and* is set up as an immutable tuple.
* `levels` are always (shallow) copied now and it is deprecated to set directly
* `labels` are set up as a property now, moving all the processing of
  labels out of `__new__` + shallow-copied.
* `levels` and `labels` are immutable.
* Add names tests, motivating example from #3742, reflect tuple-ish
  output from names, and level names check to reindex test.
* Add set_levels, set_labels, set_names and rename to index
* Deprecate setting labels and levels directly

Similar to other set_* methods...allows mutation if necessary but
otherwise returns same object.

Labels are now converted to `FrozenNDArray` and wrapped in a
`FrozenList`. Should mostly resolve #3714 because you have to work to
actually make assignments to an `Index`.

BUG: Give MultiIndex its own astype method

Fixes issue with set_value forgetting names.
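
As a hedged illustration of the `set_levels` method named in the commit message above, the sketch below redoes the original report's goal (a copy of the frame with a different date in the second level) on a later pandas release. The fixed date `new_day` stands in for `date.today() + timedelta(1)` so the result is reproducible; this is one possible usage, not necessarily the exact workflow the maintainers had in mind.

```python
import datetime

import pandas as pd

index = pd.MultiIndex.from_tuples([
    (pd.Timestamp("2013-04-30"), datetime.date(2013, 5, 30)),
    (pd.Timestamp("2013-05-01"), datetime.date(2013, 5, 30)),
])
data = pd.DataFrame(
    {"orders_ga": [10.0, 15.0], "revenues_ga": [5.0, 7.0]}, index=index
)

data2 = data.copy()
new_day = datetime.date(2013, 5, 31)  # stand-in for "tomorrow"

# set_levels returns a new MultiIndex, so data.index is left alone.
data2.index = data2.index.set_levels(pd.Index([new_day]), level=1)

print(pd.concat([data, data2]))
```

With this approach the concatenated frame carries the original date in the rows from `data` and the new date in the rows from `data2`.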