ARROW-1695: [Serialization] Fix reference counting of numpy arrays created in custom serializer #1220
Conversation
Force-pushed from 9d87906 to 7e23bb5
+1. So to confirm I understand what is going on -- the custom serializer was producing a temporary NumPy array which was being decref'd in a ScopedRef or OwnedRef before it was able to get boxed properly in an arrow::Tensor. Right?
Yes, this is correct! It is decref'd explicitly here: https://github.com/pcmoritz/arrow/blob/7e23bb5e7cd666e311595f76d60dbd08bf71920e/cpp/src/arrow/python/python_to_arrow.cc#L674
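The failure mode can be sketched in plain Python (a hypothetical illustration of the reference-counting issue, not the Arrow C++ code itself): a temporary array stays alive only while something holds a reference to it, so dropping the last reference before the data is boxed leaves any raw pointer to it dangling.

```python
import weakref
import numpy as np

def custom_serializer(obj):
    # Hypothetical custom serializer that returns a temporary
    # NumPy array; nothing else references it.
    return np.zeros(4)

arr = custom_serializer(None)
ref = weakref.ref(arr)

# As long as a strong reference is held (what the fix arranges for
# on the C++ side), the buffer behind arr stays valid.
assert ref() is not None

# Dropping the last reference destroys the array immediately in
# CPython -- any raw data pointer taken earlier now dangles.
# This premature decref is the bug pattern the PR fixes.
del arr
assert ref() is None
```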
The test failure is in the Go bindings and unrelated to this PR (cc @kou, are you aware of this problem?). I'll merge this since I'd like to do some follow-up work that needs this fix.
```python
custom_serializer=serialize_dummy_class,
custom_deserializer=deserialize_dummy_class)

pa.serialize(DummyClass())
```
Should also deserialize the object and assert that it is the correct value.
Oops, too late. This is a regression test to catch the segfault, which used to occur already in the serialize call, so testing only that is fine. The actual code path of serializing a NumPy array via a custom serializer will be tested in the PR I'm about to create, which is about serializing PyTorch Tensors.
You want to handle PyTorch tensors in Arrow? As opposed to in Ray?
Yes, I'd like to handle them in Arrow, conditional on PyTorch being installed, much in the same way as we handle pandas. There is precedent for that: the GLib bindings do it for raw Lua Torch tensors, I think.
Ok, sounds good.
I created #1234 to fix it.
Awesome, thanks :)
This uses the NumPyBuffer built into Arrow's Tensor facility to protect the NumPy arrays holding the tensors to be serialized. See also the problem description in https://issues.apache.org/jira/browse/ARROW-1695.