Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feather.materialize can easily cause segfaults #3

Closed
ExpandingMan opened this issue Mar 21, 2018 · 6 comments
Closed

Feather.materialize can easily cause segfaults #3

ExpandingMan opened this issue Mar 21, 2018 · 6 comments

Comments

@ExpandingMan
Copy link

Not necessarily directly relevant to this package, but see my comment. Again, I'm not seeing any really elegant solutions to this.

@davidanthoff
Copy link
Member

I think I don't fully understand the issue here. The various array types in Arrow.jl have a reference to the underlying data vector, right? Doesn't that keep the memory alive?

@ExpandingMan
Copy link
Author

It would, but your problem is that you're assuming that those ArrowVectors have to always exist.

When you do a getindex of a contiguous subset of an ArrowVector it does unsafe_wrap. That prevents free from being called on those pointers, but it does not protect the underlying data buffer that they originated from. Therefore, you can still have the underlying buffer GC'd away and those unsafe_wraped arrays wind up with their data swept out from under them and you segfault.

So, here's what happens: I have a bunch of ArrowVectors, I take some slices of them which become unsafe_wraped Vectors, and I pass those Vectors off to some other part of the program. Meanwhile, I'm not using the ArrowVectors anymore so they get GC'd. Then, you try to access data from one of those Vectors you created and you segfault.

Feather.materialize was the perfect scenario for making this happen: I'd create a Feather.Source containing all of the ArrowVectors and convert them into Vectors (that's what materialize did). But this conversion was a lie, it was just a call to unsafe_wrap. So, I return all of these Vectors, but as soon as I exit the materialize function itself the Feather.Source is gone and with it all references to the original buffer. Then, accessing the Vector immediately segfaults (usually immediately, depends on when the GC depends to kick in).

The solution to this is to stop abusing unsafe_wrap to cheat and create fictitious views wherever possible. I will update you with more details here.

@davidanthoff
Copy link
Member

Ah, thanks, got it!

I guess another option would be to return a custom array type in these cases that keeps holding a reference to the underlying data array, right? But that is probably too much hassle. At least from the Query.jl integration point of view your current plan sounds excellent.

@ExpandingMan
Copy link
Author

Nope, not a hassle at all, in fact this is already done for me by Base! This is what SubArray does, so if you take a view, you can propagate that view anywhere you like and still don't have to worry about the data buffer getting GC'd. I've thought pretty carefully about the getindex and view semantics by now, and I've come to realize the wisdom of how things are done in Base.

Like I said, I'm also about to create the arrowview function (name not final) which will return to you views as other ArrowVectors instead of SubArrays. This will be useful in cases in which you have arrow formatted data and want to write only a subset of those arrow formatted arrays into another arrow formatted buffer.

@davidanthoff
Copy link
Member

Ah, yes, of course! Ok, so essentially this issue here can just be closed, right? Everything seems sorted out on that front.

@ExpandingMan
Copy link
Author

Yup, all sorted out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants