as_index =True for groupby transform inconsistent with agg #15290

Closed
@Kevin-McIsaac

Code Sample, a copy-pastable example if possible

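import pandas as pd
import numpy as np

# Daily time series with two random columns
index = pd.date_range('10/1/1999', periods=1100, name='Date')
ts = pd.DataFrame({'Entry': np.random.normal(300, 50, 1100),
                   'Exit': np.random.normal(300, 50, 1100)}, index)

# Grouping keys derived from the index
ts['Year'] = ts.index.year
# weekday_name exists in pandas 0.19.2 (the version reported below);
# later releases replaced it with ts.index.day_name()
ts['Day'] = ts.index.weekday_name

zscore = lambda x: (x - x.mean()) / x.std()  # defined for reference, not used below
# Identity transform: enough to show which index the result carries
ts.groupby(['Year', 'Day'], as_index=True).transform(lambda g: g).head(5)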

Problem description

Unlike agg, when using transform:

  1. as_index=True does not set the index of the result to the groupby keys (i.e. ['Year', 'Day']).
  2. as_index=False does not prepend the groupby key columns to the output of the transform, as is done with agg.

I found this inconsistency very confusing and hard to work with.
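To make the difference concrete, here is a minimal sketch on a small hypothetical frame (the names `df`, `key`, and `val` are illustrative and not from the report above). It shows what agg does with as_index, what transform returns, and one possible workaround for re-attaching the key column. It only checks transform with as_index=True, since the as_index=False behaviour is the point in question and may differ between pandas versions.

```python
import pandas as pd

# Hypothetical toy data; a non-default index makes the index handling visible
df = pd.DataFrame(
    {"key": ["a", "a", "b", "b"], "val": [1.0, 3.0, 10.0, 30.0]},
    index=[10, 11, 12, 13],
)

# agg: one row per group; as_index decides where the key ends up
agg_indexed = df.groupby("key", as_index=True).agg("mean")   # key becomes the index
agg_flat = df.groupby("key", as_index=False).agg("mean")     # key stays a column

# transform: one row per input row, carrying the original index
transformed = df.groupby("key", as_index=True).transform(lambda x: x - x.mean())

# One possible workaround (an assumption, not a pandas feature):
# put the key column back next to the transformed values
with_keys = pd.concat([df[["key"]], transformed], axis=1)

print(agg_indexed)
print(agg_flat)
print(with_keys)
```

Because transform returns a like-indexed result, the concat aligns row by row on the original index, so no merge on the key is needed.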

This is also noted in #5755.

Expected Output

If as_index=True, the index of the result should be the groupby keys (i.e. ['Year', 'Day']).

If as_index=False, the output should have the groupby key columns (i.e. ['Year', 'Day']) prepended to the result, as is done with agg.

I've really struggled using transform, partly because the semantics of transform were unclear to me. Specifically:

  1. Which columns is the transform applied to? Is it all columns of the df or only those not in the groupby keys? It looks like it's the latter. This seems to be true with agg too. It would have been helpful to me if this were clearer in the documentation.

  2. Is the transform passed a df of columns or is it applied column by column? My guess is it's the former if the transform takes a df; otherwise it's the latter. This seems to be hinted at in some parts of the documentation, but it could be made more explicit.
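Both questions can be probed empirically. The sketch below uses hypothetical names (`df`, `key`, `a`, `b`, `demean`) and only asserts things that are stable: the transform output contains the non-key columns, and selecting a single column first means the function receives a Series per group. What a DataFrame-level transform passes to the function (a whole group frame, individual columns, or both) is an implementation detail that may vary by pandas version, so the same recording trick is best used to inspect it rather than rely on it.

```python
import pandas as pd

# Hypothetical toy data with one key column and two value columns
df = pd.DataFrame({
    "key": ["a", "a", "b", "b"],
    "a": [1.0, 3.0, 10.0, 30.0],
    "b": [2.0, 4.0, 6.0, 10.0],
})

# Question 1: the result holds only the columns that are not grouping keys
out = df.groupby("key").transform(lambda x: x - x.mean())
print(list(out.columns))

# Question 2: record what the function receives when one column is selected
seen_types = []

def demean(x):
    # Remember the type of every object pandas hands to the function
    seen_types.append(type(x).__name__)
    return x - x.mean()

out_a = df.groupby("key")["a"].transform(demean)
print(set(seen_types))
```

Selecting the column before calling transform is a simple way to guarantee column-by-column semantics when the function is written for a Series.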

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.41-36.55.amzn1.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.9.4
boto: 2.45.0
pandas_datareader: None
