
Resetting Index on slice #15930

Closed
jadolfbr opened this issue Apr 6, 2017 · 7 comments
Labels: Indexing (related to indexing on series/frames, not to indexes themselves), Usage Question

Comments

@jadolfbr

jadolfbr commented Apr 6, 2017

Code Sample, a copy-pastable example if possible

# Your code here
df = data_index[data_index['ntorsions'] == 2]

Problem description

When slicing a dataframe, the index is not reset by default. This becomes an issue if you want to write that dataframe out without an extra index column, or combine it with other dataframes (good luck with that).

Fixing this will not break code in the wild.

Expected Output

Index being correct - without the need to manually call reset_index over and over again. This is much more intuitive to end users.

-> At end of slice, call reset_index(drop = True) on the returned dataframe or current dataframe if you are slicing in-place.
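
For readers following along, here is a minimal sketch of the behavior being described. The small frame is a hypothetical stand-in for data_index.tsv (only the `ntorsions` column name comes from the issue). It shows that a boolean filter keeps the original row labels, and that `reset_index(drop=True)` renumbers them.

```python
import pandas as pd

# Hypothetical stand-in for data_index.tsv; only 'ntorsions' is from the issue.
data_index = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "ntorsions": [2, 3, 2, 3],
})

# Boolean filtering keeps the labels of the rows that survive the filter.
filtered = data_index[data_index["ntorsions"] == 2]
print(list(filtered.index))    # labels of the matching rows: [0, 2]

# reset_index(drop=True) discards the old labels and renumbers from 0.
renumbered = filtered.reset_index(drop=True)
print(list(renumbered.index))  # [0, 1]
```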

Output of pd.show_versions()

loaded rc file /Users/jadolfbr/.matplotlib/matplotlibrc
matplotlib version 1.5.1
verbose.level helpful
interactive is False
platform is darwin

INSTALLED VERSIONS

commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 9.0.1
setuptools: 20.3.1
Cython: None
numpy: 1.11.1
scipy: 0.13.0b1
statsmodels: 0.6.1
xarray: None
IPython: 4.1.2
sphinx: None
patsy: 0.4.0
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

@TomAugspurger
Contributor

Can you make a full copy-pastable example (including constructing data_index)?

@jadolfbr
Author

jadolfbr commented Apr 6, 2017

Sure.

import pandas
data_index = pandas.read_table("data_index.tsv")
# Filter on the 'ntorsions' column (the name used in the data_index.tsv header)
df = data_index[data_index['ntorsions'] == 2]
print(df)

Top Contents of data_index.tsv (sugar molecule linkages):

name	fname	ntorsions	torsion_bins
b-D-GlcpNAc ->4)-D-GlcpNAc	b-D-GlcpNAc_-_4_-D-GlcpNAc.tsv	2	2.3
a-D-Glcp ->4)-D-Glcp	a-D-Glcp_-_4_-D-Glcp.tsv	2	2.2
b-D-Manp ->4)-D-GlcpNAc	b-D-Manp_-_4_-D-GlcpNAc.tsv	2	5.5
a-L-Fucp ->3)-D-GlcpNAc	a-L-Fucp_-_3_-D-GlcpNAc.tsv	2	2.3
b-D-Xylp ->2)-D-Manp	b-D-Xylp_-_2_-D-Manp.tsv	2	3.4
a-D-Manp ->3)-D-Manp	a-D-Manp_-_3_-D-Manp.tsv	2	4.3
a-D-Manp ->6)-D-Manp	a-D-Manp_-_6_-D-Manp.tsv	3	2.4.4

@jorisvandenbossche
Member

As you mention yourself, you can use reset_index(drop = True) to get the desired result.

Changing this (to do this resetting automatically when doing a slice) would fundamentally change how pandas currently works.
It would also certainly break a huge amount of code (much pandas code relies on the index holding specific values, especially with specialized indexes like a DatetimeIndex).

There are some ideas for a future release of pandas to allow dataframes without an explicit index (see wesm/pandas2#17), but that is currently just a discussion, with no code or commitment that this will actually happen.

@jreback
Contributor

jreback commented Apr 6, 2017

@jadolfbr this is fundamental pandas behavior. The index is preserved thru virtually all operations. That's the point.

And combining is actually quite easy with .join or assignment. Again, that's the fundamental thing that pandas does: it aligns on the index.
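
A small sketch of what "aligns on the index" means for .join and for column assignment. The frames, labels, and values below are made up for illustration and do not come from the issue.

```python
import pandas as pd

# Hypothetical data: the two frames share labels but list them in different order.
left = pd.DataFrame({"a": [1, 2, 3]}, index=[10, 20, 30])
right = pd.DataFrame({"b": [300, 100]}, index=[30, 10])

# join matches rows by index label, not by position; label 20 has no match.
joined = left.join(right)
print(joined)

# Assigning a Series to a column aligns on the index in the same way.
left["c"] = pd.Series([0.3, 0.1], index=[30, 10])
print(left)
```

Rows are matched by label in both cases, so the row order of the right-hand side does not matter, and labels with no partner end up as NaN.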

jreback closed this as completed Apr 6, 2017
jreback added the Indexing and Usage Question labels Apr 6, 2017
jreback added this to the No action milestone Apr 6, 2017
@jadolfbr
Author

jadolfbr commented Apr 6, 2017

I think options to allow dataframes without indexes would be great. They are extremely unwieldy without resetting that index. Maybe if you have multiple indexes with layers, etc, they would be good. However, these easily run into tons of problems in pandas as it stands now, so most of us in our lab shy away from that.

Here is a simple example. Maybe you can suggest a better way and say that I'm using pandas wrong. That's fine too. This just divides two values that come from different experiments (and yes, in this case the row order does matter):

length_data['length_rr'] = (rr_data[rr_data['exp'] == 'mw']['length_rr'].reset_index(drop=True)\
                            /rr_data[rr_data['exp'] == 'mo']['length_rr'].reset_index(drop=True))

length_data2['length_rr'] = (rr_data[rr_data['exp'] == 'rmw']['length_rr'].reset_index(drop=True)\
                             /rr_data[rr_data['exp'] == 'rmo']['length_rr'].reset_index(drop=True))

length_enrich = pandas.concat([length_data, length_data2]).reset_index(drop=True)

Note that for the concat, pandas throws a duplicate index error if you do not reset the index with drop.
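
As a side note on the concat step: by default `pandas.concat` keeps each input's row labels, which is where duplicate labels come from. A minimal sketch with made-up frames (only the `length_rr` column name is from the post) shows the `ignore_index=True` option, which renumbers the result without a separate reset_index call.

```python
import pandas as pd

# Hypothetical frames standing in for length_data and length_data2.
length_data = pd.DataFrame({"length_rr": [0.5, 2.0]})
length_data2 = pd.DataFrame({"length_rr": [1.5, 3.0]})

# Default behavior: the original labels are kept, so they repeat.
stacked = pd.concat([length_data, length_data2])
print(list(stacked.index))        # [0, 1, 0, 1]

# ignore_index=True discards the input labels and numbers the rows 0..n-1.
length_enrich = pd.concat([length_data, length_data2], ignore_index=True)
print(list(length_enrich.index))  # [0, 1, 2, 3]
```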

For joining, etc. the indexes can again get in the way. Many times you want to be joining based on some operation of the data, so we use merge. But I guess that might be preference.

@jreback
Contributor

jreback commented Apr 6, 2017

you are not using pandas power at all. you are in fact making a big assumption that the data that you are dividing is exactly the same length and perfectly lines up. maybe that's always true for you.

I would probably do something like this. In fact this is quite general and deals with missing labeled data.

In [34]: df = DataFrame({'l': [1, 1, 1, 2, 2, 2], 'obs': [1, 2, 3, 1, 2, 3], 'value': [1, 2, 3, 4, 5, 6]})

In [35]: df
Out[35]: 
   l  obs  value
0  1    1      1
1  1    2      2
2  1    3      3
3  2    1      4
4  2    2      5
5  2    3      6

In [36]: df = df.set_index(['l', 'obs'])

In [37]: df
Out[37]: 
       value
l obs       
1 1        1
  2        2
  3        3
2 1        4
  2        5
  3        6

In [38]: df.value.loc[1] / df.value.loc[2]
Out[38]: 
obs
1    0.25
2    0.40
3    0.50
Name: value, dtype: float64

This is a slightly different and IMHO better way of organizing things.

In [39]: df.unstack()
Out[39]: 
    value      
obs     1  2  3
l              
1       1  2  3
2       4  5  6

In [40]: u = df.unstack()

In [41]: u.loc[1] / u.loc[2]
Out[41]: 
       obs
value  1      0.25
       2      0.40
       3      0.50
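
For comparison, a sketch of how the same idea might apply to the experiment-ratio case from earlier in the thread. The `sample` column and all values are hypothetical (the original rr_data has no stated key column), so this is one possible layout, not the poster's actual data.

```python
import pandas as pd

# Hypothetical data: 'sample' is an assumed key identifying matching rows
# across experiments; sample 3 is deliberately missing from 'mo'.
rr_data = pd.DataFrame({
    "exp": ["mw", "mw", "mw", "mo", "mo"],
    "sample": [1, 2, 3, 1, 2],
    "length_rr": [2.0, 6.0, 9.0, 4.0, 3.0],
})

# Index by (exp, sample) so each experiment can be selected by label.
indexed = rr_data.set_index(["exp", "sample"])["length_rr"].sort_index()

# Division aligns on 'sample'; a sample missing from one side gives NaN
# instead of silently pairing unrelated rows.
ratio = indexed.loc["mw"] / indexed.loc["mo"]
print(ratio)
```

Compared with resetting the index on both sides, this does not depend on the two experiments having the same number of rows in the same order.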

@jadolfbr
Author

jadolfbr commented Apr 6, 2017

Thanks for the suggestion. Yes, this seems much better than what I was trying to do - use the indexes instead of fighting with them and trying to go around them. Makes sense. I guess this would make joining a whole lot more straightforward too. Awesome. Thanks for taking the time to write back.
