
BUG: to_json with objects causing segfault #14256


Closed
tjader opened this issue Sep 20, 2016 · 14 comments · Fixed by #17857
Labels
Bug IO JSON read_json, to_json, json_normalize
Milestone
Next Major Release
Comments

@tjader

tjader commented Sep 20, 2016

Code Sample, a copy-pastable example if possible

Creating a bson ObjectId without giving an ID explicitly is OK:

>>> import bson
>>> import pandas as pd
>>> pd.DataFrame({'A': [bson.objectid.ObjectId()]}).to_json()
'{"A":{"0":{"binary":"W\\u0e32\\u224cug\\u00fcR","generation_time":1474361586000}}}'
>>> pd.DataFrame({'A': [bson.objectid.ObjectId()], 'B': [1]}).to_json()
'{"A":{"0":{"binary":"W\\u0e4e\\u224cug\\u00fcS","generation_time":1474361614000}},"B":{"0":1}}'

However, if you provide an ID explicitly, an exception is raised

>>> pd.DataFrame({'A': [bson.objectid.ObjectId('574b4454ba8c5eb4f98a8f45')]}).to_json()
Traceback (most recent call last):
  File "/auto/energymdl2/anaconda/envs/commod_20160831/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2885, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-7-c9a20090d481>", line 1, in <module>
    pd.DataFrame({'A': [bson.objectid.ObjectId('574b4454ba8c5eb4f98a8f45')]}).to_json()
  File "/auto/energymdl2/anaconda/envs/commod_20160831/lib/python2.7/site-packages/pandas/core/generic.py", line 1056, in to_json
    default_handler=default_handler)
  File "/auto/energymdl2/anaconda/envs/commod_20160831/lib/python2.7/site-packages/pandas/io/json.py", line 36, in to_json
    date_unit=date_unit, default_handler=default_handler).write()
  File "/auto/energymdl2/anaconda/envs/commod_20160831/lib/python2.7/site-packages/pandas/io/json.py", line 79, in write
    default_handler=self.default_handler)
OverflowError: Unsupported UTF-8 sequence length when encoding string

And worse, if that column is not the only one, the entire process dies:

>>> pd.DataFrame({'A': [bson.objectid.ObjectId('574b4454ba8c5eb4f98a8f45')], 'B': [1]}).to_json()
Process finished with exit code 139

Expected Output

Output of pd.show_versions():

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 26.1.1
Cython: 0.24
numpy: 1.10.4
scipy: 0.17.0
statsmodels: 0.6.1
xarray: 0.7.2
IPython: 4.1.2
sphinx: 1.3.5
patsy: 0.4.1
dateutil: 2.5.2
pytz: 2016.6.1
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5.2
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.3.2
html5lib: 0.999
httplib2: 0.9.2
apiclient: 1.5.0
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0
pandas_datareader: None

pymongo version is 3.3.0

@jreback
Contributor

jreback commented Sep 20, 2016

When passing object dtypes which don't actually contain strings (though they could also contain objects that respond well enough to the special methods to work), you must supply a default_handler.

So the first 2 cases above are expected.

The 3rd is handled this way:

In [6]: pd.DataFrame({'A': [bson.objectid.ObjectId('574b4454ba8c5eb4f98a8f45')]}).to_json(default_handler=str)
Out[6]: '{"A":{"0":"574b4454ba8c5eb4f98a8f45"}}'

Segfaulting shouldn't happen, though; we should get an exception saying that a default_handler was not supplied.


@jreback
Contributor

jreback commented Sep 20, 2016

cc @kawochen
cc @Komnomnomnom

@jreback jreback added Bug IO JSON read_json, to_json, json_normalize Difficulty Intermediate labels Sep 20, 2016
@jreback jreback added this to the Next Major Release milestone Sep 20, 2016
@jreback jreback changed the title to_json may kill the python process BUG: to_json with objects causing segfault Sep 20, 2016
@jreback
Contributor

jreback commented Sep 20, 2016

I suppose the 2nd path is also not reporting that a default_handler is missing:

In [10]: pd.DataFrame({'A': [bson.objectid.ObjectId('574b4454ba8c5eb4f98a8f45')]}).to_json(default_handler=str)
Out[10]: '{"A":{"0":"574b4454ba8c5eb4f98a8f45"}}'

@detroitcoder

This impacted us this weekend as well. Our default_handler was only handling specific objects whose JSON serialization we wanted to control, and would otherwise return the object unchanged. We have since changed the logic of the default_handler to serialize everything, but note that just raising an error when a default_handler is not present does not prevent the to_json method from segfaulting: a handler can be supplied and still return objects the encoder cannot handle.
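
For reference, here's a minimal sketch of the pattern described above (the handler and the set example are hypothetical illustrations, not our actual code):

import pandas as pd

def default_handler(obj):
    if isinstance(obj, set):
        return sorted(obj)   # a type whose serialization we control
    # return obj             # the old, unsafe fallback: an unknown object
    #                        # re-enters the encoder and can still segfault
    return str(obj)          # the safe catch-all we switched to

df = pd.DataFrame({'A': [{3, 1, 2}]})
print(df.to_json(default_handler=default_handler))
# '{"A":{"0":[1,2,3]}}'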

@Komnomnomnom
Contributor

@jreback I should have some time this weekend or early next week to dig into these segfaults (if nobody gets to it first)

@DavidCEllis

DavidCEllis commented Mar 31, 2017

This also comes up if you have shapely geometries in a column (came up by accident when a geopandas GeoDataFrame got converted to a regular DataFrame).

If you have a small enough sample the json encoder hits the recursion limit and you get an error.

>>> import pandas as pd
>>> from shapely.geometry import Polygon
>>> geom = Polygon([(0, 0), (1, 1), (1, 0)])
>>> df = pd.DataFrame([('testval {}'.format(i), geom) for i in range(5)], columns=['value', 'geometry'])
>>> df.to_json()

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/david/miniconda3/envs/geotesting/lib/python3.6/site-packages/pandas/core/generic.py", line 1089, in to_json
    lines=lines)
  File "/home/david/miniconda3/envs/geotesting/lib/python3.6/site-packages/pandas/io/json.py", line 39, in to_json
    date_unit=date_unit, default_handler=default_handler).write()
  File "/home/david/miniconda3/envs/geotesting/lib/python3.6/site-packages/pandas/io/json.py", line 85, in write
    default_handler=self.default_handler)
OverflowError: Maximum recursion level reached

Add more rows to the DataFrame and you can get a segfault (doesn't appear to be guaranteed - sometimes you get the OverflowError).

>>> df = pd.DataFrame([('testval {}'.format(i), geom) for i in range(5000)], columns=['value', 'geometry'])
>>> df.to_json()
Segmentation fault (core dumped)

@jreback
Contributor

jreback commented Mar 31, 2017

@DavidCEllis you need to supply a default_handler

@DavidCEllis

@jreback Sorry, I wasn't quite clear - this was just a simple way to reproduce the segfault, not how I ran into the issue. I know how to make it work; I just wouldn't expect a segfault.

The issue was that I expected the object to be a geopandas GeoDataFrame, but it had been converted to a regular DataFrame through some operation. On a GeoDataFrame the method works without needing to specify a default_handler.

On a regular DataFrame I would expect an exception like the OverflowError, but got a segfault:

>>> import pandas as pd
>>> import geopandas as gpd
>>> from shapely.geometry import Polygon
>>> geom = Polygon([(0, 0), (1, 1), (1, 0)])
>>> gdf = gpd.GeoDataFrame([('testval {}'.format(i), geom) for i in range(5000)], columns=['value', 'geometry'])
>>> gdf.to_json()
'Really long GeoJSON string output'
>>> df = pd.DataFrame(gdf)  # GeoDataFrame is a subclass
>>> df.to_json()
Segmentation fault (core dumped)

@jreback
Contributor

jreback commented Mar 31, 2017

@DavidCEllis as you can see from above, this is an open bug; pull requests to fix it are welcome. This should raise that a default_handler was not supplied: you cannot serialize something that is not a standard object or a pandas object (w/o special support of course). But it shouldn't segfault either.

@DavidCEllis

Fair point. Unfortunately I got to the point where the json export methods send the entire DataFrame into a C function, and I'm not a C programmer.

Based on the docs you linked earlier, I think the "default_handler not supplied" error will only come up if you supply an unsupported numpy dtype. It looks like it's falling back on the unsupported-object behaviour, which finishes with:

convert the object to a dict by traversing its contents. However this will often fail with an OverflowError or give unexpected results

It seems that sometimes it ends up segfaulting instead of raising the OverflowError. In testing, the larger the array, the more likely a segfault seemed; an array of the same size would sometimes segfault and sometimes raise the OverflowError. Not sure if this is useful, but it seemed like additional information on how the bug is triggered.
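
For what it's worth, shapely isn't essential to the repro. A hypothetical toy class whose attribute traversal never bottoms out shows the same behaviour, assuming the encoder's fallback walks attributes the way the docs describe:

import pandas as pd

class Nested(object):
    # every attribute access yields a fresh object, so the fallback
    # "convert the object to a dict by traversing its contents" never
    # terminates, much like a Polygon's geometry-valued properties
    @property
    def child(self):
        return Nested()

df = pd.DataFrame({'A': [Nested()]})
df.to_json()  # OverflowError: Maximum recursion level reached, or a segfault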


@matthiashuschle
Contributor

I'm currently preparing a fix for the segfaults.
Both cases above are the result of an overflow of the JSON string buffer, but with different origins.
The first comes from an infinite loop: the column index is not incremented when an error message was set in a previous column (e.g. from illegal, non-UTF-8 symbols), which keeps the nested encode call from ever finishing its iteration. In the example above, the bson ObjectId without an explicit value works just by chance.
The other case has to handle objects with infinite nesting (due to the properties of the Polygon). The overflow occurs before the recursion limit can break the process, because the writing of labels does not check the remaining buffer length. This is a problem that may also cause overflows with large tables without any nesting. It might also be the reason the first case didn't result in an OOM situation.
For the first one I already have a working solution and tests for it; for the second I'm currently in contact with my personal C guru about a clean solution. I'll create a PR when it's done.
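
To illustrate the two mechanisms, here is a toy Python model (not the actual C encoder, just an analogy with a fixed-capacity buffer and a column cursor that must always advance):

def encode_values(values, capacity=32):
    buf = ''
    error = None
    i = 0
    while i < len(values):
        try:
            chunk = str(values[i])
        except Exception as exc:
            error = exc
            # bug 1 analogue: if this branch skipped the 'i += 1' below,
            # the loop would retry the same column forever and the buffer
            # in the C code would grow until it overflowed
        else:
            # bug 2 analogue: remaining capacity must be checked before
            # every write, including the writing of labels
            if len(buf) + len(chunk) > capacity:
                raise OverflowError('buffer exhausted')
            buf += chunk
        i += 1  # always advance, even after an error
    if error is not None:
        raise error
    return buf

print(encode_values(['a', 'bb', 'ccc']))  # abbccc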

@mike-seekwell

mike-seekwell commented Mar 21, 2018

@detroitcoder Do you have an example of a default_handler you use? I'm not clear on how to implement one. I'd be fine with the handler just returning an empty or static string.

Edit - Sorry, I see now you can just use df.to_json(orient='records', default_handler=str)
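
If you want a placeholder instead of the str() form, any callable works, e.g. with a hypothetical static string:

df.to_json(orient='records', default_handler=lambda obj: '<unserializable>')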
