feature request: Support for 'named' lambda functions in DataFrame.agg([]) #10100


Closed
jkokorian opened this issue May 11, 2015 · 9 comments

@jkokorian

I often have the situation where I would like to apply multiple aggregation functions to all the columns of a grouped dataframe, like:

grouped = df.groupby('somekey')
dfAggregated = grouped.agg([np.mean, np.std])

That works well, but sometimes (all the time, actually) I would also like to be able to use lambda functions this way, like:

grouped = df.groupby('somekey')
dfAggregated = grouped.agg([np.mean, np.std, lambda v: v.mean()/v.max()])

This works fine, but the resulting column name will now be '<lambda>', which is ugly. This can be resolved with the much more verbose syntax where you specify a dictionary for every column separately, but I would propose allowing the following syntax:

grouped = df.groupby('somekey')
dfAggregated = grouped.agg([np.mean, np.std, {'normalized_mean': lambda v: v.mean()/v.max()}])

The dictionary key should then be used as the resulting column name.
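For reference, the much more verbose per-column syntax I mean looks something like this (a rough sketch; the column name 'value' is just a placeholder):

import numpy as np

grouped = df.groupby('somekey')
grouped.agg({'value': {'mean': np.mean,
                       'std': np.std,
                       'normalized_mean': lambda v: v.mean()/v.max()}})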

Interestingly, using this syntax in version 0.16 does not produce an error, but instead produces a column named 'NaN' that is filled with tuple values: ('n','o','r','m','a','l','i','z','e','d','_','m','e','a','n'), which I don't think is of use to anyone :)

@jorisvandenbossche
Member

At the moment, you can also just define a function, so its name will be used:

def normalized_mean(v):
    return v.mean()/v.max()

grouped.agg([np.mean, np.std, normalized_mean])

As this is quite easy as well, I don't know if we should add that extra complexity.

@jkokorian
Author

I agree, that's how I solve it now, but this adds a function to the global namespace, which is not always desired. If the computation that you require is a big one, then sure, that should be a proper function.

For small inline calculations like the one I used as an example, a lambda would be much nicer. Also, if you need to pass additional arguments to your function, a lambda is just a much more elegant way of coding it than with a new global function.

In the pandas docs, one of the examples on how to use 'agg' actually uses lambdas this way:

grouped.agg({'C' : np.sum,
             'D' : lambda x: np.std(x, ddof=1)})

(http://pandas.pydata.org/pandas-docs/stable/groupby.html#applying-different-functions-to-dataframe-columns)
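For example, to get the same result as that docs snippet without a lambda, you would have to define a one-off helper just to fix the ddof argument (a small sketch; the helper name std_ddof1 is made up):

import numpy as np

def std_ddof1(x):
    return np.std(x, ddof=1)

grouped.agg({'C': np.sum, 'D': std_ddof1})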

I don't think that the feature actually "adds complexity". It gives you an additional, elegant way of doing something without even changing the existing API functionality.

@jreback
Contributor

jreback commented May 11, 2015

see #8593

This API needs unification; if you want to spec out pd.Summary gr8.

@jkokorian
Author

Yet another argument for implementing it: the current behavior when you pass a dict object seems to be a bit arbitrary. Something would need to be done about that anyway...

@shoyer
Member

shoyer commented May 12, 2015

@jreback What exactly is your idea for pd.Summary?

I'm not opposed to improvements in this area, but I agree with @jorisvandenbossche that the value is questionable in this particular case.

@jkokorian
Author

What is pd.Summary?

@jorisvandenbossche
Member

@jkokorian It's an idea of @jreback's that is not implemented yet; see his comment here: #8593

@victorlin

The introduction of named aggregation in 0.25.0 seems to solve this issue.

@rhshadrach
Member

Thanks - agreed @victorlin. One can now also use kwargs with agg:

df = pd.DataFrame({'a': [1, 1, 2, 2], 'b': [1, 2, 3, 4]})
df.groupby('a').agg(normalized_mean=('b', lambda v: v.mean()/v.max()))

produces

   normalized_mean
a                 
1            0.750
2            0.875
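The same can also be spelled with pd.NamedAgg, which makes the column/aggfunc pairing explicit (equivalent to the kwargs form above):

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2], 'b': [1, 2, 3, 4]})
df.groupby('a').agg(
    normalized_mean=pd.NamedAgg(column='b', aggfunc=lambda v: v.mean()/v.max())
)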
