
Issue Using Chained Accessors with Multiple dtypes & Performance Tips #4546


Closed
dirkbike opened this issue Aug 13, 2013 · 14 comments

Comments

@dirkbike

The code below should print 99.9, but instead prints 10.5. The bug goes away if column b of the DataFrame holds all float values instead of int values. I am running pandas 0.12.0 and Python 2.7.3.

import pandas
df = pandas.DataFrame({'a':[7.3,5.1,3.0,5.5,6.4],'b':[5,4,5,1,6]})
df['a'] = df.apply(lambda x: 10.5, axis=1)
df.iloc[0]['a'] = 99.9
print df.iloc[0]['a']
@jreback
Contributor

jreback commented Aug 13, 2013

not a bug; see #4531 for an example

this is a chained accessor: it can sometimes work, depending on whether the frame has multiple dtypes and on the memory layout, but it is neither guaranteed nor recommended syntax

you should set via iloc[row, col] instead
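For reference, a minimal sketch of that fix applied to the snippet above (modern pandas / Python 3 print syntax assumed, not the 0.12-era code):

```python
import pandas as pd

# The reporter's frame: column a is float, column b is int (mixed dtypes).
df = pd.DataFrame({'a': [7.3, 5.1, 3.0, 5.5, 6.4], 'b': [5, 4, 5, 1, 6]})
df['a'] = df.apply(lambda x: 10.5, axis=1)

# df.iloc[0]['a'] = 99.9 would write to an intermediate copy; one indexing
# call carrying both row and column always writes back to the frame:
df.iloc[0, df.columns.get_loc('a')] = 99.9
print(df.iloc[0, df.columns.get_loc('a')])  # 99.9
```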

@dirkbike
Author

OK, so say for performance reasons (http://wiki.python.org/moin/PythonSpeed/PerformanceTips#Avoiding_dots...) I wanted to avoid repeated dot references to df.iloc[0]. Is there a way I can safely bind df.iloc[0] to a local variable and access its contents with another accessor?
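To illustrate why that is unsafe for writes (a minimal sketch, modern pandas / Python 3 assumed): on a mixed-dtype frame, df.iloc[0] has to materialize a brand-new Series, so a cached row works for repeated reads but writes through it are lost.

```python
import pandas as pd

# Mixed dtypes: the row df.iloc[0] spans a float block and an int block,
# so pandas must build a new Series for it (a copy).
df = pd.DataFrame({'a': [7.3, 5.1], 'b': [5, 4]})

row = df.iloc[0]       # caching the row is fine for repeated reads...
row['a'] = 99.9        # ...but this writes to the copy, not to df
print(df.loc[0, 'a'])  # still 7.3

# A write that must stick goes through a single accessor on df itself:
df.loc[0, 'a'] = 99.9
print(df.loc[0, 'a'])  # 99.9
```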

@jreback
Contributor

jreback commented Aug 13, 2013

it's not going to be the bottleneck
what are u trying to do?

@dirkbike
Author

I want to iterate through a DataFrame and make changes to a row relative to the values in the previous row. I'm hunting for ways to make each iteration run faster. Ideally I'd use something like a rolling_apply with a window of 2, but rolling_apply only works on a single column.

@jreback
Contributor

jreback commented Aug 13, 2013

much better to create a mask and then assign a new frame

put up a sample frame and what you want the final to look like and I'll show u

@dirkbike
Author

Here's an example of my old code. I'm trying to make the code in the iterate function's for loop run as fast as possible. Essentially, I want every c element to be incremented by the product of the current a element and the previous b element.

import pandas
import numpy as np
import time

length = 10000
d = {'a':np.random.randn(length),'b':np.random.randn(length),'c':np.random.randn(length)}
df = pandas.DataFrame(d)

def iterate(df):
    for i in xrange(1,len(df)):
        prev = df.iloc[i-1]
        curr = df.iloc[i]
        curr['c'] = curr['c'] + prev['b']*curr['a']

print df.head()

start = time.time()
iterate(df)
end = time.time()

print 'duration: %0.3f' % (end-start)

print df.head()

@dirkbike
Author

In my full code, the iterating loop is implemented by a user-defined class, so it isn't necessarily known ahead of time. It could, for example, have conditional statements that change the current row before the next iteration runs.

@jreback
Contributor

jreback commented Aug 13, 2013

This calculation is trivially vectorized

In [5]: def f():
   ...:     df2 = df.copy()
   ...:     for i in xrange(1,len(df2)):
   ...:         prev = df2.iloc[i-1]
   ...:         curr = df2.iloc[i]
   ...:         curr['C'] = curr['C'] + prev['B']*curr['A']
   ...:     return df2
   ...: 

In [6]: %timeit f()
1 loops, best of 3: 600 ms per loop

Try this one

In [9]: def g():
   ...:     df2 = df.copy()
   ...:     df2['C'] = df2['C'] + df2.shift()['B']*df2['A']
   ...:     return df2
   ...: 

In [10]: %timeit g()
1000 loops, best of 3: 453 µs per loop

And they do the same calculation

In [11]: g().head()
Out[11]: 
          A         B         C
0  1.027036 -0.600508       NaN
1 -1.157011  0.758945  2.822018
2  0.030327 -3.206018 -1.763926
3 -0.284152  0.618429  0.049077
4 -1.528973  1.088540 -2.762662

In [12]: f().head()
Out[12]: 
          A         B         C
0  1.027036 -0.600508 -0.350233
1 -1.157011  0.758945  2.822018
2  0.030327 -3.206018 -1.763926
3 -0.284152  0.618429  0.049077
4 -1.528973  1.088540 -2.762662

@jreback
Contributor

jreback commented Aug 13, 2013

in pandas there are very, very few inplace operations by default (actually only setting); this is on purpose. It is almost always faster to construct a new calculated result than to change an existing data structure.

@dirkbike
Author

Yeah, I see your point. Very nice. So how would I reformulate a conditional statement like this?

def iterate(df):
    for i in xrange(1,len(df)):
        prev = df.iloc[i-1]
        curr = df.iloc[i]
        curr['c'] = curr['c'] + prev['b']*curr['a']
        if curr['c'] > 1.0:
            curr['b'] -= 0.1

@jreback
Contributor

jreback commented Aug 13, 2013

if it were not recurrent, then

df.loc[df['c'] > 1, 'b'] -= 0.1

however, since this looks like feedback, you might need to iterate
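A runnable sketch of that non-recurrent masked assignment, with toy values (modern pandas / Python 3 assumed):

```python
import pandas as pd
import numpy as np

# Toy frame with the b and c columns from the example above.
df = pd.DataFrame({'b': [0.5, 0.7, 0.2], 'c': [1.5, 0.3, 2.0]})

# One vectorized statement replaces the per-row `if curr['c'] > 1.0` branch:
# select the rows where c exceeds 1.0 and decrement b there, in place.
df.loc[df['c'] > 1.0, 'b'] -= 0.1

print(df['b'].tolist())  # rows 0 and 2 were decremented
```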

@dirkbike
Author

Understood. Could this be a use case to warrant the development of a rolling_apply that works across entire rows? This situation (non-recurrent conditional operations that depend on previous row contents) shows up a lot for me when I try to use a dataframe as a basis for a time-dependent simulation.

@jreback
Contributor

jreback commented Aug 13, 2013

yes this could be an enhancement

you might want to construct an ewma-like relation (which is what it seems you are after); this is very fast as it's implemented in Cython.

That's the whole problem with rolling_apply/apply: the function evals are slow relative to the looping because you have to keep coming back to Python.

So another option is to look at: http://pandas.pydata.org/pandas-docs/dev/enhancingperf.html

Also, if you don't need the indexing features of iloc, just drop down to values, e.g. df.values gives you back a numpy array which doesn't have some of the indexing niceties but can be faster (you may also want to look at at/iat: http://pandas.pydata.org/pandas-docs/dev/indexing.html#fast-scalar-value-getting-and-setting)

Keep in mind that optimizing your computation should be the last thing you do; profile first!
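As a sketch of the values and at/iat tips, reusing the a/b/c columns and the shift() calculation from earlier in the thread (modern pandas / Python 3 assumed):

```python
import pandas as pd
import numpy as np

length = 10000
df = pd.DataFrame({'a': np.random.randn(length),
                   'b': np.random.randn(length),
                   'c': np.random.randn(length)})

# Drop down to plain numpy arrays: no index alignment, just positions.
a, b, c = df['a'].values, df['b'].values, df['c'].values
out = c.copy()                      # leave df['c'] itself untouched
out[1:] = c[1:] + b[:-1] * a[1:]    # same result as c + b.shift() * a
df['c2'] = out

# at/iat give fast scalar access when you do stay inside pandas:
first_a = df.iat[0, df.columns.get_loc('a')]
```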

@dirkbike
Author

Thanks for the tips!
