DataFrame.to_json(orient='table') emits data:str instead of data:[dict,] after a number of requests under mod-wsgi #20728

Closed
akrherz opened this issue Apr 18, 2018 · 9 comments
Labels
IO JSON read_json, to_json, json_normalize

Comments


akrherz commented Apr 18, 2018

Sadly, I don't have an SSCCE for this, but the setup seems to reproduce the bug easily for me in production. I am currently using the current conda-forge pandas (0.22.0) on Python 2.7 within a single-threaded mod-wsgi daemon process. My general code is

df.to_json(orient='table', default_handler=str)

This will work for some number of sequential requests underneath mod-wsgi. By work, I mean the emitted JSON object has a data attribute with an array of dict objects, one for each row.

"data": [{"col":"val", "col2": "val2"},{"col":"val3", "col2": "val4"}...]

After some number of requests though, the emitted JSON looks like so

"data":"val    val2     ...\nval3   val4  ...\n"

Restarting Apache or mod-wsgi returns to_json to properly emitting the same data frame with the proper "data":[dict].
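For anyone wanting to catch the bad payload before it is served, here is a minimal sketch of a check on the emitted JSON. The helper name `table_data_is_rows` is hypothetical, and the two sample payloads are simplified stand-ins for the shapes shown above (the `schema` member is left empty for brevity); in production the string would come from `df.to_json(orient='table', default_handler=str)`.

```python
import json


def table_data_is_rows(payload):
    """Return True if the "data" member is a list of dicts, one per row."""
    doc = json.loads(payload)
    if not isinstance(doc, dict):
        return False
    data = doc.get("data")
    return isinstance(data, list) and all(isinstance(row, dict) for row in data)


# Simplified stand-ins for the expected and the broken output shapes.
good = '{"schema": {"fields": []}, "data": [{"col": "val", "col2": "val2"}]}'
bad = '{"schema": {"fields": []}, "data": "val    val2\\nval3   val4\\n"}'

print(table_data_is_rows(good))
print(table_data_is_rows(bad))
```

A check like this only detects the symptom; it could be used to log the failing request or to recycle the process, but it says nothing about the underlying cause.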

Output of pd.show_versions()

>>> pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.14.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-862.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.utf8
LOCALE: None.None

pandas: 0.22.0
pytest: 3.5.0
pip: 9.0.3
setuptools: 39.0.1
Cython: 0.28.2
numpy: 1.14.2
scipy: 1.0.1
pyarrow: None
xarray: None
IPython: 5.6.0
sphinx: 1.7.2
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.2
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.6
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

I have been fighting mod-wsgi for many moons now with various libs like numpy, matplotlib and pandas, so I suspect perhaps this just isn't a good idea. If you have a suggestion of a good long-run web process to run pandas within, I would be grateful to know as well. Thank you for your time!

Member

WillAyd commented Apr 18, 2018

Unfortunately this is tough to work with... Do you have any way of isolating this from mod_wsgi? Otherwise, how do you know it's a pandas issue and not an issue with that package?

Author

akrherz commented Apr 18, 2018

@WillAyd yeah, I understand :( I've been trying to create a reproducer without much luck :( As a comment, if I remove default_handler=str from to_json this issue goes away, but eventually I hit OverflowErrors as in #14256. A workaround is to set the number of requests per mod-wsgi process to a very low number, like 100.

Member

WillAyd commented Apr 18, 2018

Maybe try another serving option like gunicorn? Whether that works or not could offer a clue about any potential issue.

I haven't used pandas extensively with any kind of deployed server before. You may also want to join the Gitter channel to see if anyone out there has expertise to offer

Contributor

chris-b1 commented Apr 19, 2018

In some configurations mod_wsgi uses Python sub-interpreters, which results in C extension global state being shared between them. Something in our JSON code might be hitting that; it's a bit more finicky than threading.

Can you try adding this to your Apache config? If it fixes the problem, that would confirm that's the issue.

WSGIApplicationGroup %{GLOBAL}

http://modwsgi.readthedocs.io/en/develop/configuration-directives/WSGIApplicationGroup.html

For reference, ran into something similar in Arrow:
https://issues.apache.org/jira/browse/ARROW-1327
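To check which interpreter a request actually runs in, a small diagnostic WSGI app can help. This is a minimal sketch, assuming mod_wsgi populates the `mod_wsgi.application_group` key in the WSGI environ and that an empty string there means the main interpreter (which is what `%{GLOBAL}` selects); the response wording is made up for illustration.

```python
def application(environ, start_response):
    """Report which mod_wsgi application group served this request."""
    group = environ.get("mod_wsgi.application_group")
    if group is None:
        # Key absent: assumed to mean the app is not hosted by mod_wsgi.
        where = "not running under mod_wsgi"
    elif group == "":
        # Empty string: assumed to mean the main (first) interpreter.
        where = "main interpreter (%{GLOBAL})"
    else:
        where = "sub-interpreter: " + group
    body = where.encode("utf-8")
    start_response(
        "200 OK",
        [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))],
    )
    return [body]
```

Serving this from the same process group before and after adding the directive should show whether the change took effect.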

@chris-b1 chris-b1 added the IO JSON read_json, to_json, json_normalize label Apr 19, 2018
Author

akrherz commented Apr 19, 2018

Thank you @chris-b1 for the suggestion. I think I have implemented your suggestion and the watched pot has yet to boil for this issue. My apache config now looks like

WSGIDaemonProcess iemwsgi_iws processes=15 threads=1 display-name=%{GROUP} maximum-requests=10000
<VirtualHost...>

  <Directory...>
    SetHandler wsgi-script
    WSGIApplicationGroup %{GLOBAL}
    WSGIProcessGroup iemwsgi_iws
  </Directory>
</VirtualHost>

Will update after running this for a few days without issue; I typically would see it happen once or twice per day.

Author

akrherz commented Apr 20, 2018

Well, so far so good with the WSGIApplicationGroup %{GLOBAL} setting. At the risk of hijacking my own issue and conflating things, it may be useful to note for others reading this thread that if I set WSGIDaemonProcess threads to a number larger than 1, I end up getting "random" deadlocks/hangs within pandas.io.sql.read_sql, which may be an issue with psycopg2. I have not yet tried this new WSGIApplicationGroup setting with more than one thread; I am just giddy that the JSON dumping now consistently works!

Author

akrherz commented May 2, 2018

Zero issues noted since the change to WSGIApplicationGroup %{GLOBAL}

Contributor

chris-b1 commented May 3, 2018

Cool. If you're feeling brave, I would welcome any debugging/investigation to see what the underlying issue is. I'm not entirely sure what the best way to do that is; there may be some helpful pointers here:
https://emptysqua.re/blog/python-c-extensions-and-mod-wsgi/

Member

WillAyd commented Apr 27, 2019

Closing as not reproducible outside of mod_wsgi

@WillAyd WillAyd closed this as completed Apr 27, 2019