BUG: Indexes still include values that have been deleted #2770
related #2655 (maybe) |
This is kind of a tricky problem, e.g. when should you "recompute" the levels? Have to table this until I have a chance to look a bit more deeply. Another solution is to exclude levels with no observations in such a groupby |
Well, is there any easy workaround I can use? Like if I know I have this problem, can I manually call a .rebuild_index() or something? I've played around with all the obvious possibilities (short of creating a brand new dataframe) and can't find any workaround. EDIT: better clarification. I have one function that builds the dataset and drops the rows. At that point, I know I'm in the situation described in this issue, and I'd like to do my workaround there. But then the .groupby().sum() happens much much later in a different function. I could easily hack that second function as you say (exclude levels with no observations) but it makes more sense to keep my workaround code in the first function. Any ideas? |
How about the workaround that I proposed for #2655? In your case, maybe x.groupby(x.index.get_level_values(1)).sum() should do the correct thing, if I'm not mistaken. I don't know why, but the result of this call reflects the updated values. |
Yes, that works; but the code that does .groupby().sum() is in one function and the code that removes the value from the table is in another function. It would be much clearer to use a workaround that cleans up the problem with the dataframe in the function that creates it -- that way any other function could use the dataframe without having to do your trick. |
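One way to keep the cleanup in the function that builds the dataframe (a sketch, not taken from this thread; on later pandas versions the dedicated method mentioned near the end of this thread does the same job) is to rebuild the MultiIndex from the remaining row tuples right after dropping the rows, so that the levels are recomputed from what is actually left:

import pandas as pd

x = pd.DataFrame([['deleteMe', 1, 9], ['keepMe', 2, 9], ['keepMeToo', 3, 9]],
                 columns=['first', 'second', 'third'])
x = x.set_index(['first', 'second'], drop=False)
x = x[x['first'] != 'deleteMe']   # drop rows; x.index.levels still lists 'deleteMe'

# Rebuild the index from the surviving tuples so the levels are recomputed.
x.index = pd.MultiIndex.from_tuples(list(x.index), names=x.index.names)

print(x.index.levels)             # only 'keepMe' and 'keepMeToo' remain

Any later x.groupby(level='first').sum(), in whichever function it happens, then only sees the remaining levels.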
Ehm, can you confirm that this problem still exists with 0.10.1?
In [9]: print x.index.levels
[Index([deleteMe, keepMe, keepMeToo], dtype=object), Int64Index([1, 2, 3], dtype=int64)]
In [10]: x.groupby(level='first').sum()
Out[10]:
second third
first
keepMe 2 9
keepMeToo 3 9 |
is this closable? @tavistmorph does this exist in 0.11-dev? |
It's still an issue. It still happens for me in 0.10.1 and 0.11 (as of the last time I pulled, at least). Just run the code snippet in my original post and you can see it. Michael -- the deleted value appears in your In [9] output above ('deleteMe' should not be there since we deleted it), and it then appears in the In [10] output ('deleteMe' should not appear since all the rows with that value were deleted). |
This isn't really a bug. Perhaps an option should be added to return an array of observed values in a particular level in the index (which is what you're after)? |
Can you clarify what you mean by "observed"? Do you mean that the object is a view into the original object (I don't know whether it is), and that's why it still contains the 'deleteMe' index? |
The levels are not computed from the actual observed values. For example, in R you can have a factor (categorical variable) in which some distinct values are not observed:
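The analogous behaviour can be sketched with a pandas Categorical, which, like an R factor, keeps its full set of declared categories even when some are never observed in the data (a minimal illustration, not the original R snippet):

import pandas as pd

# Three declared categories, but 'deleteMe' never occurs in the data.
cat = pd.Categorical(['keepMe', 'keepMeToo'],
                     categories=['deleteMe', 'keepMe', 'keepMeToo'])
print(cat.categories)   # all three categories, including the unobserved 'deleteMe'
print(list(cat))        # only the observed values: ['keepMe', 'keepMeToo']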
|
Version: '0.12.0-1184-gc73b957'
In [10]: x.index
Out[10]:
MultiIndex(levels=[[u'deleteMe', u'keepMe', u'keepMeToo'], [1, 2, 3]],
labels=[[1, 2], [1, 2]],
names=[u'first', u'second'])
But at least the groupby does not create an empty row anymore for previously existing indices:
In [12]: x.groupby(level='first').sum()
Out[12]:
second third
first
keepMe 2 9
keepMeToo 3 9
[2 rows x 2 columns]
So the discussion now boils down to the confusion caused by looking at df.index. I would argue that, since I often look at the index to see what I am working with, I would still be very puzzled by the index showing old values, and from that point on I would not trust the results anymore. |
If you print that MultiIndex, it looks like what you want:
In [7]: mi
Out[7]:
MultiIndex(levels=[[u'deleteMe', u'keepMe', u'keepMeToo'], [1, 2, 3]],
labels=[[1, 2], [1, 2]],
names=[u'first', u'second'])
In [8]: print mi
first second
keepMe 2
keepMeToo 3
Thus, a simple way to handle this is to examine your indices with print rather than the repr that IPython shows you. The MultiIndex repr isn't really intuitive in any case, unless you understand that it's a categorical and that the labels represent the integer positions of the levels at each location. You shouldn't need to care about that as a consumer of a MultiIndex. And if you understand the internal representation, you can then also understand why it doesn't matter whether there are extra levels. The issue becomes clearer with a more complicated MI:
In [2]: ind = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b', 'c'], ['d', 'e', 'f', 'g', 'h']])
In [3]: ind
Out[3]:
MultiIndex(levels=[[u'a', u'b', u'c'], [u'd', u'e', u'f', u'g', u'h']],
labels=[[0, 0, 1, 1, 2], [0, 1, 2, 3, 4]])
In [4]: print ind
a d
e
b f
g
c h |
Based on your previous comment, it seems like the key issue here (groupby showing unused levels) is now resolved. Can we close this, or edit this issue to be a feature request (e.g., a method to allow a MultiIndex to consolidate its levels)? As an aside, my perspective is that it's more intuitive to have the entire level set remain, because it makes slices very clear (and the sliced indexes can share the memory for storing levels):
In [15]: ind
Out[15]:
MultiIndex(levels=[[u'a', u'b', u'c'], [u'd', u'e', u'f', u'g', u'h']],
labels=[[0, 0, 1, 1, 2], [0, 1, 2, 3, 4]])
In [16]: ind[:2]
Out[16]:
MultiIndex(levels=[[u'a', u'b', u'c'], [u'd', u'e', u'f', u'g', u'h']],
labels=[[0, 0], [0, 1]])
In [17]: ind[2:4]
Out[17]:
MultiIndex(levels=[[u'a', u'b', u'c'], [u'd', u'e', u'f', u'g', u'h']],
labels=[[1, 1], [2, 3]])
In [18]: ind[4:5]
Out[18]:
MultiIndex(levels=[[u'a', u'b', u'c'], [u'd', u'e', u'f', u'g', u'h']],
labels=[[2], [4]]) |
I would argue that using repr to examine pandas objects is the default and advertised use case, as the pandas docs are full of it, so I don't find it really satisfying, and a tad inconsistent, that I have to resort to printing an object for clarity while repr works for all (most?) other cases. |
I also would like to point out that the pandas core team does not seem to have come to a consistent conclusion on how to handle this: we have three issues related to this, and in one (#2655) the claim is made that it is not a bug, while the other two (#3686 and this one) have been marked as bugs. Maybe you guys should have an internal discussion about it. |
@michaelaye, I think you (legitimately) missed the point Wes was making. My guess is that you're under a misconception. Test yourself with this example:
In [10]: MultiIndex.from_tuples([[0,1],[0,2]])
Out[10]:
MultiIndex(levels=[[0], [1, 2]],
labels=[[0, 0], [0, 1]])
Do you understand why the first element in levels only has one item? You may find it counter-intuitive (I did in the past), but then the problem to be addressed is your mental model of what the levels mean, not the data structure itself. The fact that a groupby emitted a group for entries that appear in the levels but not in the data was a real problem, and that has since been fixed. I would venture a guess that the reason this non-bug issue has lingered for so long is precisely this confusion about what the levels represent. |
I agree with @jtratner, we can close this. |
Thank you for your efforts. I indeed was puzzled by the meaning of 'unobserved' and 'observed' and finally understand Wes' comment. Still, there are API calls that take levels as an argument, e.g. groupby(). If other users don't find it confusing to have a list of levels that does not represent the current state, then it must just be me. |
If you're finding something wrong with groupby (i.e. you end up with spurious groups for levels that have no observations), that would be a separate bug worth reporting with a reproducible example. |
I don't have anything showing up wrong, and I didn't mean to imply that. My work style relies very much on looking at indices and columns via their repr. |
@michaelaye unlikely to happen - the full level set is kept by design; as shown above, it keeps slicing simple and lets sliced indexes share the memory for the levels. |
Understood. Thanks for your patience. |
The pandas API doesn't fit in my head anymore. For reference:
In [1]: import pandas
...: x = pandas.DataFrame([['deleteMe',1, 9],['keepMe',2, 9],['keepMeToo',3, 9]], columns=['first','second', 'third'])
...: x = x.set_index(['first','second'], drop=False)
...:
...: print x.index.get_level_values(0)
...: x = x[x['first'] != 'deleteMe'] #Chop off all the 'deleteMe' rows
...: print x.index.get_level_values(0)
...:
Index([u'deleteMe', u'keepMe', u'keepMeToo'], dtype='object')
Index([u'keepMe', u'keepMeToo'], dtype='object') |
I think this can be closed: the default behavior is as intended, and the method MultiIndex.remove_unused_levels() covers the case where you want the levels recomputed from the remaining data. |
yep this is now the accepted soln. |
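A minimal sketch of that solution, assuming pandas 0.20 or later where MultiIndex.remove_unused_levels() is available:

import pandas as pd

x = pd.DataFrame([['deleteMe', 1, 9], ['keepMe', 2, 9], ['keepMeToo', 3, 9]],
                 columns=['first', 'second', 'third'])
x = x.set_index(['first', 'second'], drop=False)
x = x[x['first'] != 'deleteMe']            # drop the 'deleteMe' rows

print(x.index.levels)                      # 'deleteMe' is still listed in the levels
x.index = x.index.remove_unused_levels()   # returns a new index without unused levels
print(x.index.levels)                      # only 'keepMe' and 'keepMeToo' remain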
Using pandas 0.10. If we create a DataFrame with a multi-index, then delete all the rows with value X, we'd expect the index to no longer show value X. But it does.
Note the apparent inconsistency between "index" and "index.levels" -- one shows the values have been deleted but the other doesn't.
We don't want the deleted values to show up in that groupby. Can we eliminate them?
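A self-contained reproduction along the lines of the snippet quoted earlier in the thread (a sketch; per the comments above, the extra groupby row was the pandas 0.10 behaviour and later versions no longer show it):

import pandas as pd

x = pd.DataFrame([['deleteMe', 1, 9], ['keepMe', 2, 9], ['keepMeToo', 3, 9]],
                 columns=['first', 'second', 'third'])
x = x.set_index(['first', 'second'], drop=False)
x = x[x['first'] != 'deleteMe']            # delete all the rows with value 'deleteMe'

print(x.index.get_level_values('first'))   # 'deleteMe' is gone here ...
print(x.index.levels)                      # ... but still listed in index.levels

# Per the original report, pandas 0.10 also emitted an empty 'deleteMe' group here;
# later versions group only over the observed values.
print(x.groupby(level='first').sum())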