Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataFrameGroupBy.progress_apply not always equal to DataFrameGroupBy.apply #697

Open
3 tasks done
nalepae opened this issue Mar 16, 2019 · 4 comments
Open
3 tasks done
Labels
help wanted 🙏 We need you (discussion or implementation) submodule ⊂ Periphery/subclasses to-fix ⌛ In progress

Comments

@nalepae
Copy link

nalepae commented Mar 16, 2019

  • I have visited the [source website], and in particular
    read the [known issues]
  • I have searched through the [issue tracker] for duplicates
  • I have mentioned version numbers, operating system and
    environment, where applicable:

tqdm version: 4.31.1
Python version: 3.7.1
OS Version: Ubuntu 16.04

Context:

import pandas as pd
import numpy as np
from tqdm import tqdm
tqdm.pandas()

df_size = int(5e6)
df = pd.DataFrame(dict(a=np.random.randint(1, 8, df_size),
                       b=np.random.rand(df_size)))

Observed:

df.groupby('a').apply(max) 
     a         b
a               
1  1.0  0.999999
2  2.0  0.999997
3  3.0  0.999999
4  4.0  0.999997
5  5.0  0.999999
6  6.0  1.000000
7  7.0  1.000000

# but
df.groupby('a').progress_apply(max)
a
1    b
2    b
3    b
4    b
5    b
6    b
7    b
dtype: object

Expected:
df.groupby('a').apply(max) and df.groupby('a').progress_apply(max) return the same value.

Additional information:
Replacing max by sum returns normal result for apply, but throws this error for progress_apply:

TypeError: unsupported operand type(s) for +: 'int' and 'str'

@nalepae nalepae changed the title DataFrameGroupBy.progress_apply not always equals to DataFrameGroupBy.apply DataFrameGroupBy.progress_apply not always equal to DataFrameGroupBy.apply Mar 16, 2019
@mikekutzma
Copy link
Contributor

I believe this is because pandas apply does an internal check for whether the method passed was in a predefined list. If so, the method actually applied is not actually the method passed. For instance, max becomes amax.
In progress_apply, the method being passed is actually the wrapper method, so this is not accurately being changed by pandas to amax.

From https://github.com/tqdm/tqdm/blob/master/tqdm/_tqdm.py#L668

                # Define bar updating wrapper
                def wrapper(*args, **kwargs):
                    # update tbar correctly
                    # it seems `pandas apply` calls `func` twice
                    # on the first column/row to decide whether it can
                    # take a fast or slow code path; so stop when t.total==t.n
                    t.update(n=1 if not t.total or t.n < t.total else 0)
                    return func(*args, **kwargs)

                # Apply the provided function (in **kwargs)
                # on the df using our wrapper (which provides bar updating)
result = getattr(df, df_function)(wrapper, **kwargs)

From https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby/groupby.py#L667

@Appender(_apply_docs['template']
          .format(input="dataframe",
                  examples=_apply_docs['dataframe_examples']))
def apply(self, func, *args, **kwargs):

    func = self._is_builtin_func(func)

    # this is needed so we don't try and wrap strings. If we could
    # resolve functions to their callable functions prior, this
    # wouldn't be needed
    if args or kwargs:
        if callable(func):

            @wraps(func)
            def f(g):
                with np.errstate(all='ignore'):
                    return func(g, *args, **kwargs)
        else:
            raise ValueError('func must be a callable if args or '
                             'kwargs are supplied')
    else:
        f = func

Thus, apply is applying amax to each group, and progress_apply is applying max.

@casperdcl
Copy link
Member

thanks @mikekutzma - is there any way around this? Giving the wrapper function the same name as the wrapped one?

@casperdcl casperdcl added help wanted 🙏 We need you (discussion or implementation) submodule ⊂ Periphery/subclasses to-fix ⌛ In progress labels Apr 22, 2019
@casperdcl casperdcl self-assigned this Apr 22, 2019
@mikekutzma
Copy link
Contributor

@casperdcl You can assign this to me and I'll take a look, unless you've already started work on it.

@casperdcl casperdcl removed their assignment May 9, 2019
@casperdcl
Copy link
Member

@mikekutzma go for it - github won't allow me to assign you though, though of course you can always open a PR.

mikekutzma added a commit to mikekutzma/tqdm that referenced this issue Nov 13, 2019
Applied func will be checked against pandas internal builtins in order
to apply proper transformation, i.e max->amax

Closes:tqdm#697
mikekutzma added a commit to mikekutzma/tqdm that referenced this issue Nov 13, 2019
Applied func will be checked against pandas internal builtins in order
to apply proper transformation, i.e max->amax

Closes:tqdm#697
casperdcl pushed a commit that referenced this issue Nov 21, 2019
Applied func will be checked against pandas internal builtins in order
to apply proper transformation, i.e max->amax

Closes:#697
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted 🙏 We need you (discussion or implementation) submodule ⊂ Periphery/subclasses to-fix ⌛ In progress
Projects
None yet
Development

No branches or pull requests

3 participants