BUG: pandas 1.5.2 segfault with pypy 3.9-7.3.11 #50817

Closed
2 of 3 tasks
nicolaerosia opened this issue Jan 18, 2023 · 11 comments
Labels: Bug, PyPy, Segfault (Non-Recoverable Error)

@nicolaerosia

nicolaerosia commented Jan 18, 2023

Pandas version checks

Reproducible Example

docker run -it --rm \
--entrypoint /bin/bash \
docker.io/pypy:3.9-7.3.11

pypy3.9 -m venv venv-pandas-152-pypy39
source venv-pandas-152-pypy39/bin/activate
pip install pandas==1.5.2
python -c 'import pandas as pd; pd.Timedelta("5min")'

Issue Description

Segmentation fault when using pypy 3.9-7.3.11 with pandas 1.5.2

Expected Behavior

no segfault

Installed Versions

python -c 'import pandas as pd; pd.show_versions()'

INSTALLED VERSIONS
------------------
commit           : 8dab54d6573f7186ff0c3b6364d5e4dd635ff3e7
python           : 3.9.16.final.0
python-bits      : 64
OS               : Linux
OS-release       : 6.1.6-200.fc37.x86_64
Version          : #1 SMP PREEMPT_DYNAMIC Sat Jan 14 16:55:06 UTC 2023
machine          : x86_64
processor        : 
byteorder        : little
LC_ALL           : None
LANG             : C.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.5.2
numpy            : 1.24.1
pytz             : 2022.7.1
dateutil         : 2.8.2
setuptools       : 58.1.0
pip              : 22.3.1
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
brotli           : None
fastparquet      : None
fsspec           : None
gcsfs            : None
matplotlib       : None
numba            : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : None
snappy           : None
sqlalchemy       : 1.4.46
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
zstandard        : 0.19.0
tzdata           : None
@nicolaerosia added the Bug and Needs Triage labels Jan 18, 2023
@lithomas1 added the Segfault (Non-Recoverable Error) and PyPy labels and removed the Needs Triage label Jan 18, 2023
@lithomas1
Member

@nicolaerosia

Thanks for reporting this. I'll try to take a look at this over the weekend.

Do note, though, that PyPy is not one of our officially supported platforms.

I set up the infrastructure to test pandas on PyPy a long while back, but as of right now it basically only tests that compiling pandas on PyPy works (the tests are allowed to fail without failing our CI checks).

@nicolaerosia
Author

Thank you, I’m aware it is best effort. FYI 1.4.4 works.

@jorisvandenbossche
Member

jorisvandenbossche commented Jan 23, 2023

(the tests are allowed to fail without failing our CI checks)

It seems that those tests have been segfaulting as well for a long time. It already fails on pytest initialization (reading of conftest?) or collection:

Run ci/run_tests.sh
PYTHONHASHSEED=2937057230
xvfb-run pytest -r fEs -n auto --dist=loadfile --max-worker-restart 0 pandas -m "not slow and not network and not single_cpu"
Segmentation fault (core dumped)

The failure is visible in the CI logs for as far back as logs are still available. Based on the build durations (which remain visible for older builds), it appears to go back at least to 1.5.0rc0.
Those timings trace it back to #47641, the first PR where the build time drops from ~40min to ~15min.

I can also reproduce this locally, and the gdb backtrace confirms the relationship with Timedelta:

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007fffb2971aec in __pyx_getprop_6pandas_5_libs_6tslibs_10timedeltas_10_Timedelta_days () from /home/joris/scipy/pandas/pandas/_libs/tslibs/timedeltas.pypy39-pp73-x86_64-linux-gnu.so
(gdb) bt
#0  0x00007fffb2971aec in __pyx_getprop_6pandas_5_libs_6tslibs_10timedeltas_10_Timedelta_days () from /home/joris/scipy/pandas/pandas/_libs/tslibs/timedeltas.pypy39-pp73-x86_64-linux-gnu.so
#11 0x00007fffb296e2f5 in __pyx_tp_new_6pandas_5_libs_6tslibs_10timedeltas__Timedelta () from /home/joris/scipy/pandas/pandas/_libs/tslibs/timedeltas.pypy39-pp73-x86_64-linux-gnu.so
#20 0x00007fffb298aa7d in __pyx_f_6pandas_5_libs_6tslibs_10timedeltas__timedelta_from_value_and_reso () from /home/joris/scipy/pandas/pandas/_libs/tslibs/timedeltas.pypy39-pp73-x86_64-linux-gnu.so
#21 0x00007fffb298df6c in __pyx_pw_6pandas_5_libs_6tslibs_10timedeltas_10_Timedelta_25_from_value_and_reso () from /home/joris/scipy/pandas/pandas/_libs/tslibs/timedeltas.pypy39-pp73-x86_64-linux-gnu.so
#29 0x00007fffb29af130 in __pyx_pw_6pandas_5_libs_6tslibs_10timedeltas_9Timedelta_1__new__ () from /home/joris/scipy/pandas/pandas/_libs/tslibs/timedeltas.pypy39-pp73-x86_64-linux-gnu.so
#41 0x00007fffb2a488bb in __pyx_pw_6pandas_5_libs_6tslibs_10timestamps_10_Timestamp_9__add__ () from /home/joris/scipy/pandas/pandas/_libs/tslibs/timestamps.pypy39-pp73-x86_64-linux-gnu.so
#50 0x00007fffb2abe869 in __pyx_pw_6pandas_5_libs_6tslibs_7offsets_11BusinessDay_5_apply () from /home/joris/scipy/pandas/pandas/_libs/tslibs/offsets.pypy39-pp73-x86_64-linux-gnu.so
#55 0x00007fffb2af0fbd in __pyx_pw_6pandas_5_libs_6tslibs_7offsets_11apply_wraps_1wrapper () from /home/joris/scipy/pandas/pandas/_libs/tslibs/offsets.pypy39-pp73-x86_64-linux-gnu.so
#65 0x00007fffb2ac5166 in __pyx_pw_6pandas_5_libs_6tslibs_7offsets_10BaseOffset_11__add__ () from /home/joris/scipy/pandas/pandas/_libs/tslibs/offsets.pypy39-pp73-x86_64-linux-gnu.so
#72 0x00007fffb2aa010c in __pyx_pw_6pandas_5_libs_6tslibs_7offsets_10BaseOffset_13__radd__ () from /home/joris/scipy/pandas/pandas/_libs/tslibs/offsets.pypy39-pp73-x86_64-linux-gnu.so
#81 0x00007fffb2ac12ee in __pyx_pw_6pandas_5_libs_6tslibs_7offsets_10BaseOffset_41rollforward () from /home/joris/scipy/pandas/pandas/_libs/tslibs/offsets.pypy39-pp73-x86_64-linux-gnu.so

cc @jbrockmendel

@jorisvandenbossche added this to the 2.0 milestone Jan 23, 2023
@jorisvandenbossche
Member

And I can also reproduce this with a simple Timedelta call:

>>>> import pandas as pd
>>>> pd.Timedelta(1, unit="D")
Segmentation fault (core dumped)
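
For anyone checking their own interpreter, here is a minimal sketch (not from the thread) that runs a snippet in a child process so a crash does not take down the calling shell or test runner. The helper name `run_isolated` is hypothetical, and the sketch assumes a POSIX system and an interpreter that accepts `-X faulthandler`.

```python
import signal
import subprocess
import sys


def run_isolated(snippet, interpreter=sys.executable):
    """Run `snippet` in a child interpreter; return (segfaulted, stderr).

    `run_isolated` is a hypothetical helper, not part of pandas or PyPy.
    """
    proc = subprocess.run(
        # -X faulthandler asks the child to dump a Python-level traceback
        # to stderr when it receives a fatal signal such as SIGSEGV.
        [interpreter, "-X", "faulthandler", "-c", snippet],
        capture_output=True,
        text=True,
    )
    # On POSIX, a child killed by a signal reports the negated signal number.
    return proc.returncode == -signal.SIGSEGV, proc.stderr
```

Calling `run_isolated('import pandas as pd; pd.Timedelta(1, unit="D")')` with the PyPy virtualenv's interpreter should then report whether that environment is affected, with any traceback from the child available in the second return value.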

@lithomas1
Member

lithomas1 commented Jan 23, 2023

Sorry for the silence here, I was traveling over the past day. Thanks for the assistance in debugging, Joris.

Something is going wrong when calling the function _ensure_components().

__pyx_t_1 = ((struct __pyx_vtabstruct_6pandas_5_libs_6tslibs_10timedeltas__Timedelta *)__pyx_v_self->__pyx_vtab)->_ensure_components(__pyx_v_self); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 1034, __pyx_L1_error)

is the code used to call _ensure_components.

However, (struct __pyx_vtabstruct_6pandas_5_libs_6tslibs_10timedeltas__Timedelta *)__pyx_v_self->__pyx_vtab is NULL, which results in a segfault.

@lithomas1
Member

lithomas1 commented Jan 23, 2023

Maybe the problem is that days is already defined by datetime.timedelta here:
https://github.com/cython/cython/blob/69b6107922efe895f83441b4b9d4dd64ff3d46b6/Cython/Includes/cpython/datetime.pxd#L176-L179

Best guess is that cython is getting confused between the cdef'ed days and the Python property days.

Renaming days to days1, seconds to seconds1, and microseconds to microseconds1 fixed it for me.

@mroeschke modified the milestones: 2.0, 3.0 Feb 8, 2023
@rafaelfliu

Hi! I'm also encountering a similar issue. It happens when I use the resample() function on a DataFrame with a DatetimeIndex.

So I guess there's no timeline yet for when this will be fixed?

python: 3.9.16
pypy: 7.3.11
pandas: 1.5.3
host: ubuntu 22.04.2 lts

@WillAyd
Member

WillAyd commented Apr 12, 2023

However, (struct __pyx_vtabstruct_6pandas_5_libs_6tslibs_10timedeltas__Timedelta *)__pyx_v_self->__pyx_vtab is NULL, which results in a segfault.

This might be a problem with PyPy itself and the order in which it tries to construct the class. I think the property should have access to other attributes / methods on self, so __pyx_v_self->__pyx_vtab being NULL is a red flag

@mattip
Contributor

mattip commented May 7, 2023

TL;DR: the problem is due to PyPy, not Cython nor pandas.

The problem is:

  • pandas overrides the timedelta.days getter with a cython-based one that assumes the object is fully constructed (i.e. it has set the __pyx_vtab, which is done one line after calling tp_new())
  • pypy assumed it could safely call that getter descriptor when converting a pure-python datetime.timedelta into a C PyDateTime_Delta object with a days field in the C struct
  • When calling tp_new() from C, it is not safe to use the descriptor, since __pyx_vtab is not set.

The solution is for PyPy not to assume it can call the days getter from C; rather, it should use the private _days attribute, which is well defined once the Python object is created. Fixed in 267c2f5eca33.
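
The ordering problem described above can be modelled in pure Python. This is only a toy sketch: `BaseDelta`, `ExtDelta`, `_vtab`, `convert_via_getter`, and `convert_via_private` are hypothetical names, not pandas, Cython, or PyPy code, and the real failure is a NULL pointer dereference in C rather than the `AttributeError` raised here.

```python
def convert_via_getter(obj):
    # Models the old behaviour: read `days` through the (overridden) getter.
    return {"days": obj.days}


def convert_via_private(obj):
    # Models the fix: read the private attribute set by the base class.
    return {"days": obj._days}


class BaseDelta:
    """Toy stand-in for a pure-Python timedelta."""

    def __new__(cls, days=0, convert=convert_via_private):
        self = object.__new__(cls)
        self._days = days
        # The conversion to a "C struct" runs inside the base constructor,
        # before any subclass has finished its own setup.
        self.c_struct = convert(self)
        return self

    @property
    def days(self):
        return self._days


class ExtDelta(BaseDelta):
    """Toy stand-in for an extension type that overrides `days`."""

    def __new__(cls, days=0, convert=convert_via_private):
        self = super().__new__(cls, days, convert)
        # Set only *after* the base constructor returns, like __pyx_vtab.
        self._vtab = {"ensure_components": lambda: None}
        return self

    @property
    def days(self):
        # Needs _vtab, so it fails on a half-constructed object.
        self._vtab["ensure_components"]()
        return self._days
```

In this model, `ExtDelta(3, convert=convert_via_getter)` fails during construction because the getter runs before `_vtab` exists, while `ExtDelta(3, convert=convert_via_private)` succeeds and the `days` property works normally afterwards.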

Please ping me on any issues with @pypy

@jorisvandenbossche
Member

Thanks @mattip!

@lithomas1
Member

Closing since this seems to work now.

7 participants