BUG: to_json with objects causing segfault #14256
When passing object dtypes which don't actually contain strings (though they could also contain objects whose special methods respond well enough to work), you must supply a `default_handler`. So the first 2 cases above are expected; the 3rd is handled this way.

Segfaulting shouldn't happen though; we should get an exception that a `default_handler` is required.
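As a minimal sketch of the point above (the `Thing` class and column names here are invented for illustration): supplying a `default_handler` tells the encoder what to do with arbitrary objects in an object column:

```python
import json
import pandas as pd

# A plain object with no JSON-friendly special methods (hypothetical example).
class Thing:
    def __init__(self, name):
        self.name = name

df = pd.DataFrame({"a": [1, 2], "b": [Thing("x"), Thing("y")]})

# Without a default_handler this object column can't be encoded reliably;
# default_handler=str converts each unsupported value to its string form.
out = df.to_json(default_handler=str)
parsed = json.loads(out)  # round-trips as valid JSON
```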
cc @kawochen
I suppose the 2nd path is also not reporting that a `default_handler` is needed.
This impacted us this weekend as well. Our default_handler was only handling the specific objects whose json serialization we wanted to control, but would otherwise return the object unchanged. We have since changed the logic of the default_handler to serialize everything, but just raising an error when a default_handler is not present does not prevent the to_json method from causing a segfault for other objects.
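A sketch of the pitfall described above (the `Point` class and handler name are invented for illustration): a `default_handler` should always return something JSON-serializable, rather than passing unknown objects through unchanged:

```python
import json
import pandas as pd

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

def safe_handler(obj):
    # Handle the types we care about explicitly...
    if isinstance(obj, Point):
        return {"x": obj.x, "y": obj.y}
    # ...and *always* fall back to a serializable representation.
    # Returning `obj` unchanged here is what can recurse and crash.
    return str(obj)

df = pd.DataFrame({"p": [Point(1, 2)], "other": [object()]})
out = df.to_json(default_handler=safe_handler)
parsed = json.loads(out)
```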
@jreback I should have some time this weekend or early next week to dig into these segfaults (if nobody gets to it first)
This also comes up if you have shapely geometries in a column (came up by accident when a geopandas GeoDataFrame got converted to a regular DataFrame). If you have a small enough sample, the json encoder hits the recursion limit and you get an error.

```python
>>> import pandas as pd
>>> from shapely.geometry import Polygon
>>> geom = Polygon([(0, 0), (1, 1), (1, 0)])
>>> df = pd.DataFrame([('testval {}'.format(i), geom) for i in range(5)], columns=['value', 'geometry'])
>>> df.to_json()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/david/miniconda3/envs/geotesting/lib/python3.6/site-packages/pandas/core/generic.py", line 1089, in to_json
    lines=lines)
  File "/home/david/miniconda3/envs/geotesting/lib/python3.6/site-packages/pandas/io/json.py", line 39, in to_json
    date_unit=date_unit, default_handler=default_handler).write()
  File "/home/david/miniconda3/envs/geotesting/lib/python3.6/site-packages/pandas/io/json.py", line 85, in write
    default_handler=self.default_handler)
OverflowError: Maximum recursion level reached
```

Add more rows to the DataFrame and you can get a segfault (it doesn't appear to be guaranteed; sometimes you get the OverflowError instead).

```python
>>> df = pd.DataFrame([('testval {}'.format(i), geom) for i in range(5000)], columns=['value', 'geometry'])
>>> df.to_json()
Segmentation fault (core dumped)
```
@DavidCEllis you need to supply a `default_handler`.
@jreback Sorry, I wasn't quite clear - this was a simple way to reproduce the segfault; it's not how I ran into the issue. I know how to make it work, I just wouldn't expect a segfault. The issue was that I expected the object to be a GeoPandas GeoDataFrame, and it had converted to a regular DataFrame through some operation. On a GeoDataFrame the method works without needing to specify a default_handler. On a regular DataFrame I would expect an exception like the overflow error, but got a segfault.

```python
>>> import pandas as pd
>>> import geopandas as gpd
>>> from shapely.geometry import Polygon
>>> geom = Polygon([(0, 0), (1, 1), (1, 0)])
>>> gdf = gpd.GeoDataFrame([('testval {}'.format(i), geom) for i in range(5000)], columns=['value', 'geometry'])
>>> gdf.to_json()
'Really long GeoJSON string output'
>>> df = pd.DataFrame(gdf)  # GeoDataFrame is a subclass
>>> df.to_json()
Segmentation fault (core dumped)
```
@DavidCEllis as you can see from above, this is an open bug; pull-requests are welcome to fix. This should raise an exception rather than segfault.
Fair point. Unfortunately I got to the point where the json export methods send the entire dataframe into a C function, and I'm not a C programmer. Based on the docs you linked earlier, I think the "default_handler not supplied" error will only come up if you supply an unsupported numpy dtype? It looks like it's falling back on the unsupported-object behaviour, which finishes with:

> convert the object to a dict by traversing its contents. However this will often fail with an OverflowError or give unexpected results

It seems that sometimes it ends up segfaulting instead of raising the OverflowError. On testing, it seemed more likely to segfault the larger the array; sometimes the same-sized array would segfault and sometimes it would raise the OverflowError. Not sure if this is useful, but it seemed to be additional information on how it was being triggered.
When dealing with dataframes that contain exotic datatypes you need a default handler for to_json. This has bitten my team a couple of times now since we first posted the linked issue above. For now, always include a default_handler.
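The advice above can be sketched minimally. Per the fallback behaviour described in the pandas docs linked earlier, values of an unsupported dtype (complex is the documented example) are handed to `default_handler` one by one:

```python
import json
import pandas as pd

# A column of an unsupported dtype (complex128): the encoder falls back
# to calling default_handler for each value.
df = pd.DataFrame({"z": [1 + 2j, 3 - 4j]})

out = df.to_json(default_handler=str)
parsed = json.loads(out)
```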
I'm currently preparing a fix for the segfaults.
@detroitcoder Do you have an example of a `default_handler`? Edit - Sorry, I see now you can just use `str`.
Code Sample, a copy-pastable example if possible

Creating a bson ObjectId without giving an ObjectId explicitly is ok. However, if you provide an ID explicitly, an exception is raised. And worse, if the column is not the only column, the entire process dies.

Expected Output

Output of pd.show_versions()

pymongo version is 3.3.0
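A hedged sketch of the report above, using an invented stand-in class instead of a real `bson.ObjectId` (pymongo is not assumed to be installed here): supplying `default_handler=str` sidesteps the crash when an object column is mixed with other columns:

```python
import json
import pandas as pd

# Hypothetical stand-in for bson.ObjectId: its string form is the only
# sensible JSON representation of the object.
class FakeObjectId:
    def __init__(self, oid):
        self._oid = oid

    def __str__(self):
        return self._oid

df = pd.DataFrame({"_id": [FakeObjectId("507f1f77bcf86cd799439011")],
                   "value": [1]})

# default_handler=str is applied to each unsupported _id object.
out = df.to_json(default_handler=str)
parsed = json.loads(out)
```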