ENH: support apply method for ExtensionArray backed Series #28955

Closed
amine-aboufirass opened this issue Oct 13, 2019 · 13 comments
Labels
Apply (Apply, Aggregate, Transform, Map) · Enhancement · ExtensionArray (Extending pandas with custom dtypes or arrays)

Comments

@amine-aboufirass

Code Sample, a copy-pastable example if possible

import pandas as pd
import pint
ureg = pint.UnitRegistry()
pint.PintType.ureg = ureg

def g(x):
    return x + 1 * ureg.day

df = pd.DataFrame({
    'A': pd.Series([1, 2, 3, 4], dtype='pint[day]'),
    'B': pd.Series([5, 6, 7, 8], dtype='pint[day]'),
})
res = df['A'].apply(g)
print(type(df['A'].values))
print(type(res.values))

Problem description

I am experimenting with the pint-pandas project, which builds an ExtensionArray to make it possible to work with units on dataframes. The code sample above shows that the .apply method is not implemented for external extension arrays.

I am using the pint-pandas-plotting branch of the pint-pandas project, as this is the one that is compatible with pandas 0.25. I installed this branch by downloading it, navigating to the root directory and running something along the lines of:

pip install -e .

Expected Output

The expected output for print(type(res.values)) should be a PintArray and not a numpy array of Quantity objects.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.6.7.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None
pandas : 0.25.1
numpy : 1.17.2
pytz : 2019.3
dateutil : 2.8.0
pip : 19.2.3
setuptools : 41.4.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.8.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None

@TomAugspurger
Contributor

I think this primarily comes down to pandas' dtype inference not being able to infer extension types from a list of objects yet.

In [10]: from pandas.tests.extension.decimal import DecimalArray, make_data

In [11]: import pandas as pd

In [12]: ser = pd.Series(DecimalArray(make_data()))
    ...: ser
Out[12]:
0     Decimal: 0.35594459693084556928255324237397871...
1     Decimal: 0.28919229388647194056716216437052935...
2     Decimal: 0.87055683509853509782772107428172603...
3     Decimal: 0.26522013357197371519191619881894439...
4     Decimal: 0.84871717478470365403353525834972970...
                            ...
95    Decimal: 0.68471570624769151347521756179048679...
96    Decimal: 0.08382578509377813791303424295620061...
97    Decimal: 0.09951047765425147240136993787018582...
98    Decimal: 0.69957638105169761555401919395080767...
99    Decimal: 0.83568359682548865041695762556628324...
Length: 100, dtype: decimal

In [13]: ser.apply(lambda x: x + 1)
Out[13]:
0     1.355944596930845569282553242
1     1.289192293886471940567162164
2     1.870556835098535097827721074
3     1.265220133571973715191916199
4     1.848717174784703654033535258
                  ...
95    1.684715706247691513475217562
96    1.083825785093778137913034243
97    1.099510477654251472401369938
98    1.699576381051697615554019194
99    1.835683596825488650416957626
Length: 100, dtype: object

The .apply is done elementwise, so you get back a list (or object-dtype ndarray) of Decimal objects.

Unless we had some kind of meta keyword to control the result shape & dtype (which I think I've suggested in the past @jreback), I think we'll need to do the inference.
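The mechanism described here can be illustrated without pandas internals. The following is only a minimal pure-Python sketch: `ToyDecimalArray` and `elementwise_apply` are hypothetical names, not pandas APIs, and they only mimic the pattern of calling a function per element and collecting the results in a plain list.

```python
from decimal import Decimal


class ToyDecimalArray:
    """Hypothetical stand-in for an extension array of Decimal scalars."""

    def __init__(self, values):
        self._data = [Decimal(v) for v in values]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


def elementwise_apply(arr, func):
    # Calls func on each scalar and collects the results in a plain list,
    # so the container type of the input is forgotten.
    return [func(x) for x in arr]


arr = ToyDecimalArray(["0.5", "1.25"])
result = elementwise_apply(arr, lambda x: x + 1)

print(type(arr).__name__)     # ToyDecimalArray
print(type(result).__name__)  # list
print(result)                 # [Decimal('1.5'), Decimal('2.25')]
```

Every element is still a Decimal, but nothing in the result records that it came from a typed array, which is why some inference step (or an explicit hint) is needed to get the original container type back.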

@jreback
Contributor

jreback commented Oct 14, 2019

we should just fix infer_dtype

@TomAugspurger
Contributor

TomAugspurger commented Oct 14, 2019 via email

@jreback
Contributor

jreback commented Oct 14, 2019

> Probably, but that's a bit difficult since it's in Cython. It's not clear to me how to allow EA authors to register something here.

I don’t think it’s very hard; we already have a function to construct from a sequence on an EA.

We just need to call it: once we have gone down the object path (meaning we have not inferred any other type), we need to check whether we have an EA scalar and call the appropriate constructor.
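One way to picture the "check for an EA scalar and call the appropriate constructor" idea is a lookup from scalar type to constructor. This is only a sketch under stated assumptions: `register_scalar_type`, `infer_from_objects` and `ToyDecimalArray` are hypothetical names made up for illustration, and pandas' actual inference code (which lives in Cython) is not structured like this.

```python
from decimal import Decimal

# Hypothetical registry: scalar type -> constructor that takes a sequence.
_SCALAR_CONSTRUCTORS = {}


def register_scalar_type(scalar_type, constructor):
    _SCALAR_CONSTRUCTORS[scalar_type] = constructor


def infer_from_objects(values):
    """Return a typed container if all values share a registered scalar type."""
    values = list(values)
    if not values:
        return values
    first_type = type(values[0])
    constructor = _SCALAR_CONSTRUCTORS.get(first_type)
    if constructor is not None and all(type(v) is first_type for v in values):
        return constructor(values)
    # Nothing registered (or mixed types): stay on the generic object path.
    return values


class ToyDecimalArray:
    """Hypothetical array type; _from_sequence mirrors the name used above."""

    def __init__(self, data):
        self.data = list(data)

    @classmethod
    def _from_sequence(cls, scalars):
        return cls(scalars)


register_scalar_type(Decimal, ToyDecimalArray._from_sequence)

inferred = infer_from_objects([Decimal("1.5"), Decimal("2.25")])
mixed = infer_from_objects([Decimal("1.5"), "text"])
print(type(inferred).__name__)  # ToyDecimalArray
print(type(mixed).__name__)     # list
```

The open design question raised in the thread remains visible here: someone has to populate the registry, so array authors would need a supported way to register their scalar types.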

@jorisvandenbossche
Member

> I think this primarily comes down to pandas' dtype inference not being able to infer extension types from a list of objects yet.

Or, we could let the ExtensionArray be responsible for inferring (for this specific case). Either by calling ExtensionArray._from_sequence on the result, or by having an EA.map like we actually already have for our internal EAs.

@jreback
Contributor

jreback commented Oct 14, 2019

> I think this primarily comes down to pandas' dtype inference not being able to infer extension types from a list of objects yet.

> Or, we could let the ExtensionArray be responsible for inferring (for this specific case). Either by calling ExtensionArray._from_sequence on the result, or by having an EA.map like we actually already have for our internal EAs.

sure

but this is trickier as we don’t know a priori which EA this belongs to

Actually, I think you are right for this case, @jorisvandenbossche: we need some logic in apply to handle a returned sequence of the same EA type (it would be very tricky to handle a different return type).
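The "same type in, same type out" logic could look roughly like the sketch below. It is a toy model, not pandas code: `ToyDecimalArray` and `apply_keep_type` are hypothetical, and the choice to fall back to a plain list when re-wrapping fails is an assumption made for illustration.

```python
from decimal import Decimal


class ToyDecimalArray:
    """Hypothetical array that only accepts Decimal scalars."""

    def __init__(self, data):
        self.data = list(data)

    @classmethod
    def _from_sequence(cls, scalars):
        scalars = list(scalars)
        if not all(isinstance(s, Decimal) for s in scalars):
            raise TypeError("ToyDecimalArray only holds Decimal values")
        return cls(scalars)


def apply_keep_type(arr, func):
    results = [func(x) for x in arr.data]
    try:
        # Same-type case: hand the results back to the original array class.
        return type(arr)._from_sequence(results)
    except (TypeError, ValueError):
        # Different return type: keep the plain list of objects.
        return results


arr = ToyDecimalArray._from_sequence([Decimal("1"), Decimal("2")])
same = apply_keep_type(arr, lambda x: x * 2)
other = apply_keep_type(arr, lambda x: str(x))
print(type(same).__name__)   # ToyDecimalArray
print(type(other).__name__)  # list
```

Only the original array class is ever tried, which matches the point above: a function that returns scalars belonging to some other array type is not detected by this approach.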

@jorisvandenbossche
Member

jorisvandenbossche commented Oct 14, 2019

We actually have this in our code right now: "Need to figure out if we want ExtensionArray.map first. "

def map(self, mapper):
    # TODO(GH-23179): Add ExtensionArray.map
    # Need to figure out if we want ExtensionArray.map first.
    # If so, then we can refactor IndexOpsMixin._map_values to
    # a standalone function and call from here..
    # Else, just rewrite _map_infer_values to do the right thing.
    from pandas import Index

    return Index(self).map(mapper).array

@jorisvandenbossche changed the title from "apply method not implemented for ExtensionArray" to "ENH: support apply method for ExtensionArray backed Series" on Oct 14, 2019
@jreback
Contributor

jreback commented Oct 14, 2019

hmm maybe close this in favor of that issue then (or consolidate issues as that’s slightly different)

@jbrockmendel
Member

This might be useful in some of the groupby.apply/agg/etc stuff I'm working on. A lot of the corner cases involve ops being done on Categorical or IntegerArray and having to check whether we can/should cast back to the original dtype.

@jbrockmendel added the Apply (Apply, Aggregate, Transform, Map) and ExtensionArray (Extending pandas with custom dtypes or arrays) labels on Oct 16, 2019
@jbrockmendel
Member

@topper-123 does your recent apply/map work do anything to improve this?

@topper-123
Contributor

There is now an ExtensionArray.map method in the main branch, which subclasses can override. That will definitely help, and the example code in the OP can definitely be made to work correctly if the array implements its own map method.
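The override pattern can be sketched in plain Python as follows. This is a toy model and not a description of how pandas dispatches internally: `ToyBaseArray`, `ToyDayArray` and `series_apply` are hypothetical names, and the only assumption is that the caller defers to the array's own `map`.

```python
class ToyBaseArray:
    """Hypothetical base class; its default map returns a plain list."""

    def __init__(self, data):
        self.data = list(data)

    def map(self, mapper):
        return [mapper(x) for x in self.data]


class ToyDayArray(ToyBaseArray):
    """Hypothetical subclass whose values are numbers of days."""

    def map(self, mapper):
        # Override: re-wrap the mapped values so the array type is preserved.
        return type(self)(super().map(mapper))


def series_apply(arr, func):
    # Stand-in for a caller that simply defers to the array's own map.
    return arr.map(func)


base_result = series_apply(ToyBaseArray([1, 2, 3, 4]), lambda x: x + 1)
day_result = series_apply(ToyDayArray([1, 2, 3, 4]), lambda x: x + 1)
print(type(base_result).__name__)  # list
print(type(day_result).__name__)   # ToyDayArray
print(day_result.data)             # [2, 3, 4, 5]
```

Putting the decision in the subclass means the array author, who knows what the scalars look like, decides when a mapped result can be re-wrapped.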

@andrewgsavage

The example in the OP now works as expected using main pandas and master pint-pandas branches.

import pandas as pd
import pint
import pint_pandas
ureg = pint.get_application_registry()


def g(x):
    return x+1*ureg.day
df = pd.DataFrame({'A':pd.Series([1,2,3,4], dtype='pint[day]'),'B':pd.Series([5,6,7,8], dtype='pint[day]')})
res = df['A'].apply(g)
print(type(df['A'].values))
print(type(res.values))


<class 'pint_pandas.pint_array.PintArray'>
<class 'pint_pandas.pint_array.PintArray'>

@topper-123
Contributor

topper-123 commented Aug 15, 2023

Great, so I think this can be closed now. If someone objects to that, just ping this thread and I will open it again.


8 participants