ARROW-1695: [Serialization] Fix reference counting of numpy arrays created in custom serializer #1220

Closed
wants to merge 2 commits into from

Contributor

@pcmoritz pcmoritz commented Oct 20, 2017

This uses the NumPyBuffer built into Arrow's Tensor facility to protect the numpy arrays holding the Tensors to be serialized. See also the problem description in https://issues.apache.org/jira/browse/ARROW-1695.

@pcmoritz pcmoritz changed the title from "ARROW-1695: [Serialization] Fix reference counting of numpy arrays created in custom serialializer" to "ARROW-1695: [Serialization] Fix reference counting of numpy arrays created in custom serializer" on Oct 20, 2017
@pcmoritz pcmoritz force-pushed the fix-serialize-tensors branch from 9d87906 to 7e23bb5 Compare October 20, 2017 18:56
Member

@wesm wesm left a comment

+1. So to confirm I understand what is going on -- the custom serializer was producing a temporary NumPy array which was being decref'd in a ScopedRef or OwnedRef before it was able to get boxed properly in an arrow::Tensor. Right?
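The lifetime problem described in this comment can be illustrated in pure Python, as a rough analogy only. The sketch below assumes nothing about Arrow's C++ internals: `Temp`, `make_temporary`, and `OwningBuffer` are hypothetical names, and `OwningBuffer` merely plays the role that NumPyBuffer plays in this PR by holding a strong reference to the object whose memory it exposes. It also relies on CPython's immediate reference counting.

```python
import gc
import weakref


class Temp:
    """Hypothetical stand-in for the temporary array a custom serializer returns."""

    def __init__(self, data):
        self.data = bytearray(data)


class OwningBuffer:
    """Hypothetical analogue of a buffer that keeps its source object alive."""

    def __init__(self, source):
        self._source = source  # strong reference: this is the whole point
        self.view = memoryview(source.data)


def make_temporary():
    # Mimics a custom serializer that builds a fresh object on every call.
    return Temp(b"\x01\x02\x03")


def unprotected():
    tmp = make_temporary()
    ref = weakref.ref(tmp)
    del tmp  # last strong reference dropped, like a scoped ref going out of scope
    gc.collect()
    return ref() is not None  # is the temporary still alive?


def protected():
    tmp = make_temporary()
    ref = weakref.ref(tmp)
    buf = OwningBuffer(tmp)
    del tmp  # the buffer still owns a reference, so the temporary survives
    gc.collect()
    return ref() is not None, bytes(buf.view)
```

In Python the unprotected case simply frees the object; in C++ the same mistake leaves a pointer to freed memory, which is presumably how the segfault arose.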

pcmoritz commented Oct 20, 2017

The test failure is in the go bindings and unrelated to this PR (cc @kou, are you aware of this problem?). I'll merge this since I'd like to do some follow-up work that needs this fix.

@asfgit asfgit closed this in 971e99d Oct 20, 2017
custom_serializer=serialize_dummy_class,
custom_deserializer=deserialize_dummy_class)

pa.serialize(DummyClass())
Contributor

Should also deserialize the object and assert that it is the correct value.
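A round-trip check of the kind suggested here might look like the following sketch. It deliberately avoids pyarrow so that it runs anywhere: `TinyContext` is a hypothetical registry that only mimics the shape of the `custom_serializer` / `custom_deserializer` registration shown in the diff, and it is not pyarrow's API. The serializer here returns a plain dict, which is also an assumption, since the test in this PR is about a serializer that produces a numpy array.

```python
import pickle


class DummyClass:
    def __init__(self, value=0):
        self.value = value

    def __eq__(self, other):
        return isinstance(other, DummyClass) and self.value == other.value


class TinyContext:
    """Hypothetical registry for illustration; NOT pyarrow's serialization context."""

    def __init__(self):
        self._handlers = {}

    def register_type(self, cls, type_id, custom_serializer, custom_deserializer):
        self._handlers[type_id] = (cls, custom_serializer, custom_deserializer)

    def serialize(self, obj):
        # Find the first registered type that matches and tag the payload with its id.
        for type_id, (cls, ser, _) in self._handlers.items():
            if isinstance(obj, cls):
                return pickle.dumps((type_id, ser(obj)))
        raise TypeError(f"no serializer registered for {type(obj).__name__}")

    def deserialize(self, blob):
        type_id, payload = pickle.loads(blob)
        _, _, deser = self._handlers[type_id]
        return deser(payload)


def serialize_dummy_class(obj):
    return {"value": obj.value}


def deserialize_dummy_class(payload):
    return DummyClass(payload["value"])


def round_trip(obj):
    context = TinyContext()
    context.register_type(
        DummyClass, "DummyClass",
        custom_serializer=serialize_dummy_class,
        custom_deserializer=deserialize_dummy_class)
    # Serialize, then deserialize, so the caller can assert on the restored value.
    return context.deserialize(context.serialize(obj))
```

A test would then assert that `round_trip(DummyClass(7)) == DummyClass(7)` rather than only checking that serialization does not crash.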

Contributor Author

Oops, too late. This is a regression test for the segfault that used to occur in the serialize call itself, so testing only that call is fine. The actual code path of serializing a numpy array via a custom serializer will be tested in the PR I'm about to create, which is about serializing pytorch Tensors.

Contributor

You want to handle pytorch tensors in Arrow? As opposed to in Ray?

Contributor Author

Yes, I'd like to handle them in Arrow conditionally on pytorch being installed, much in the same way as we handle pandas. There is precedent for that: the GLib bindings do it for raw Lua Torch tensors, I think.
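Registering handlers only when an optional dependency is installed is a common pattern, and one way to do it is sketched below. This is not how Arrow implements it: `optional_handlers` and `register_optional` are hypothetical helpers, and the registry is a plain dict used purely for illustration.

```python
import importlib.util


def optional_handlers(module_names=("pandas", "torch")):
    """Return the subset of optional modules that appear to be installed."""
    available = []
    for name in module_names:
        # find_spec returns None when a top-level module cannot be found,
        # and it does not import the module.
        if importlib.util.find_spec(name) is not None:
            available.append(name)
    return available


def register_optional(registry, module_names=("pandas", "torch")):
    """Add a placeholder entry for each installed optional module."""
    for name in optional_handlers(module_names):
        registry[name] = f"handlers for {name} registered"
    return registry
```

Checking with `find_spec` instead of a bare `import` keeps startup cheap, since the heavy module is only imported when one of its objects actually needs to be serialized.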

Contributor

Ok, sounds good.

Member

kou commented Oct 22, 2017

> The test failure is in the go bindings and unrelated to this PR

I created #1234 to fix it.

Contributor Author

Awesome, thanks :)
