
Overhauled to Arrow Back-End and Better Memory Safety #78

Merged: 45 commits into JuliaData:master, Apr 10, 2018

Conversation

@ExpandingMan (Collaborator) commented Feb 20, 2018

I have rewritten Feather.jl to use my new Arrow.jl back-end. The Arrow.jl package provides AbstractVector objects that give access to Arrow-formatted data. Because the existing Feather.jl mostly deals with accessing Arrow data, this rewrite was very extensive. This PR should maintain all existing functionality and expand on it, with the exception of appending DataFrames (more on this below). What follows is an overview of the overhauled package.

Arrow.jl of course needs a tagged release and complete unit tests before this can be merged, but I wanted to put up this PR so we could start figuring out what needs to be done.

New Default Reading Behavior

Creating a Feather.Source, or calling Feather.read, will now only construct ArrowVector objects. In the case of Feather.read, a DataFrame will be created with ArrowVector columns. ArrowVectors simply reference existing data, so, in the case of memory mapping, nothing is actually read in after the file is mapped until the user requests it. This allows users to browse a feather file at their leisure, even performing query operations, while only loading data as necessary. The old default behavior of reading the entire file into memory is now provided by Feather.materialize. This method supports not only the previously requested behavior of reading in particular columns, but reading any arbitrary subset of the full table.
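A rough usage sketch of the new split (Feather.read and Feather.materialize are from this PR; the file name and column name :a are placeholders):

```julia
using Feather, DataFrames

# Lazy by default: columns are ArrowVectors viewing the (possibly memory-mapped)
# file, so bytes are only read as they are accessed.
df = Feather.read("data.feather")
df[:a][1:10]   # loads just the data backing these elements

# Eager: copy everything into ordinary in-memory columns (the old default).
full = Feather.materialize("data.feather")
```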

Better Memory Safety

This has been discussed extensively elsewhere. If reinterpret ever becomes efficient we will have full memory safety, but that seems a long way off.

Dropped Support for Some Non-Standard Formatting

In particular, categorical arrays must now use Int32 reference values, as specified by the Arrow standard. The package also no longer supports the very old version of Feather that didn't use Arrow padding, but since there was already a warning saying that such data would be unreadable anyway, this seems fine.

Less Dependent on DataStreams

@davidanthoff was asking whether we could split off the core functionality of Feather into a separate FeatherBase.jl that doesn't depend on DataStreams. Since a great deal of this package's functionality has been moved to Arrow in this PR anyway, I thought it would be really great if we could keep the package whole. While retaining all DataStreams functionality and the Source/Sink structure, the only place where the core of this package still relies on DataStreams is Data.Schema, which, to my knowledge, has never changed since DataStreams was created. Hopefully everyone will be sufficiently happy with this that we don't need to bother creating a new package? 😉

Appropriate Updates to Unit Tests

The tests are now mostly organized into @testset blocks. In some cases slight adjustments to the tests were needed.
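For reference, this is the standard Base.Test pattern, roughly (testfile is a placeholder path):

```julia
using Base.Test   # `using Test` on 0.7

@testset "Feather.read" begin
    df = Feather.read(testfile)
    @test size(df, 1) > 0   # sanity check on the number of rows
end
```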

@ExpandingMan (Collaborator, Author) commented

I've noticed a big problem with Feather.materialize. It's possible, in fact likely, that if you call this function and nothing else, all references to the original data buffer will disappear, so the buffer gets garbage collected and you segfault. I don't see any really elegant solution to this as things stand. Certainly I can get rid of the method that only takes a filename as an argument, but we'd have to tell users "please keep the Source around somewhere, it can't get gc'd". The only completely reliable alternative I can think of would be to have materialize always do a deepcopy, but ugh, that would be absolutely awful.
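A hypothetical sketch of the failure mode (not code from this PR):

```julia
function load(filename)
    src = Feather.Source(filename)   # owns the memory-mapped buffer
    Feather.materialize(src)         # result may still reference that buffer
end

df = load("data.feather")   # `src` is unreachable once `load` returns
GC.gc()                     # (plain `gc()` on 0.6) buffer can be collected here...
df[1, 1]                    # ...so this may touch freed memory and segfault
```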

My hope was that the pointers would be very temporary anyway, but I still haven't been able to get any news from the core devs on how hard it will be to fix ReinterpretArray. See here. Suggestions, comments welcome!

@KristofferC commented
GC.@preserve is the standard way to keep an object alive.
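For reference, a minimal sketch of that pattern (0.7-style; the Mmap usage and `process` are placeholders of mine):

```julia
using Mmap

buf = Mmap.mmap("data.feather")   # Vector{UInt8} backed by the file
GC.@preserve buf begin
    # `buf` is rooted for the duration of this block, so pointer-based
    # reads into it cannot race with garbage collection.
    process(pointer(buf))         # `process` is a hypothetical consumer
end
```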

@ExpandingMan (Collaborator, Author) commented

Thanks, but I've already thought of doing the equivalent, and it doesn't solve the problem, does it? Then the data will just never get gc'd, even if the things referring to it go out of scope. That might be really bad for a sufficiently big dataset (though I'm very hazy on how exactly memory mapping works, so I don't know if that can help here).

@KristofferC commented
The data is preserved for the scope of the preserve block. AFAIU there is no other way to guarantee that it won't get garbage collected.

@ExpandingMan (Collaborator, Author) commented

Yeah, in that case we'd have to have users call it manually. Or, I suppose we could create a macro. Will have to think on it further, thanks for your help.

@KristofferC commented
The way things "should" work is with ReinterpretArray, but I know it's not overhead-free right now :(

@ExpandingMan (Collaborator, Author) commented

Yeah, I know. I'm learning that pointers are actually much scarier in languages that are not designed to use them. Actually, reinterpret seems fine in some cases, but it's just so unpredictable. Any idea what the thinking is on it, and whether it will be easy or difficult to fix? I'd feel a lot better about using it if I thought there was a reasonable expectation of it getting better in the foreseeable future. So far I haven't heard any kind of assessment of what might be wrong or what might be required to fix it (and I'm afraid it's way above my expertise to look into myself).
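For context, the two access strategies being weighed look roughly like this (illustrative only, not code from the PR):

```julia
bytes = rand(UInt8, 8 * 100)

# Safe: a ReinterpretArray view shares memory and keeps `bytes` rooted, but at
# the time of this discussion every element access paid a noticeable overhead.
v_safe = reinterpret(Float64, bytes)

# Fast but unsafe: wrap a raw pointer in a plain Array; nothing roots `bytes`,
# which is exactly the GC hazard discussed above.
v_unsafe = unsafe_wrap(Array, convert(Ptr{Float64}, pointer(bytes)), 100)
```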

writemetadata(sink::Sink) = writemetadata(sink.io, sink.ctable)


function Data.close!(sink::Sink)
Member:

this is nice how all the writing happens in one place.

Collaborator (Author):

Like I said, I was contemplating breaking it up a little, but I'm not determined to do it. I definitely agree that it has to be done in such a way that it does not obscure where in the IO buffer you are at any point in the write process. Anyway, I'll leave this alone then; I didn't have any particularly inspired ideas about it. I think I just wanted to formalize how we deal with the header and trailer bytes a bit.

Member:

I like it just like this; before it was a little all over the place in terms of where all the IO happened.

@quinnj (Member) left a comment

Overall this looks good; mainly keeping the necessary parts from before, and allowing Arrow to do a lot of the pointer hopping (I assume, I'll take a look at Arrow.jl next).

The DataStreams interface needs a bit of work to make sure we implement it fully, but I can help with that easily enough.

Thanks again for all the work here! I'll start digging into Arrow.jl and once we get that registered, we can merge this!

src/utils.jl Outdated
getoutputlength(version::Int32, x::Integer) = version < FEATHER_VERSION ? x : padding(x)

function checkmagic(filename::AbstractString, data::AbstractVector{UInt8})
header = data[1:4]
Member:

this will throw an out of bounds error if the user mistakenly passes a wrong file name (and the data vector is empty)

Collaborator (Author):

ok, I'll add a check

src/utils.jl Outdated
end
end

function checkfilelength(filename::AbstractString, data::AbstractVector{UInt8})
Member:

this should probably just be combined w/ the checkmagic above.
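Something along these lines might cover both comments (a sketch; the constant name and error style are assumptions, and 12 bytes is the minimum for leading magic, metadata length, and trailing magic):

```julia
# Sketch of a combined check: guard the empty/truncated case before slicing.
# FEATHER_MAGIC_BYTES ("FEA1") is assumed to be defined elsewhere in utils.jl.
function validatefile(filename::AbstractString, data::AbstractVector{UInt8})
    if length(data) < 12   # magic + metadata length + magic
        throw(ArgumentError("'$filename' is too small to be a feather file"))
    end
    if data[1:4] ≠ FEATHER_MAGIC_BYTES || data[end-3:end] ≠ FEATHER_MAGIC_BYTES
        throw(ArgumentError("'$filename' is not a valid feather file"))
    end
end
```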

@ExpandingMan (Collaborator, Author) commented

Great thanks.

I'm hoping you'll be very pleased with how isolated the pointer code is in Arrow.jl; I tried really hard to make it safe. Of course, there's nothing I can do about users giving the wrong array indices, but I hope that's all that can really go wrong.

Like I said, I'm still contemplating improving the Arrow.jl constructors, so that may change some things here. I also have to update it so that we can use other data types for the offsets; I think we'll need to make Int64 the default.

@ExpandingMan ExpandingMan changed the title Overhauled to Arrow Back-End and Full Memory Safety Overhauled to Arrow Back-End and Better Memory Safety Mar 25, 2018
@ExpandingMan (Collaborator, Author) commented

Ok, I've just added a whole lot more unit tests, so this damn thing had better not have any mysterious uncaught errors anymore. I'm still not totally sure how that godawful segfault was getting through those tests. I think 0.6 fails for some reason; will fix eventually.

@quinnj (Member) left a comment

This is looking good; I left a few comments. The biggest work needed is on the DataStreams implementation, which I can do. @ExpandingMan are you ok if I just push to your PR here?


if Base.VERSION < v"0.7.0-DEV.2575"
const Dates = Base.Dates
using Missings
Member:

These imports don't really jive w/ the version check here.

Collaborator (Author):

Sorry, I think I just left in an older existing version check. I don't know how to find the specific versions; do you know them?

src/Feather.jl Outdated
else
import Dates
end

if Base.VERSION >= v"0.7.0-DEV.2009"
if Base.VERSION ≥ v"0.7.0-DEV.2009"
Member:

We should just use Compat for all these stdlib deprecations now
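For reference, the Compat-based version would look something like this (a sketch; Compat provided a Dates shim at the time):

```julia
using Compat
using Compat.Dates   # resolves to Base.Dates on 0.6 and the Dates stdlib on 0.7
```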

src/metadata.jl Outdated
encoding::Encoding
offset::Int64
length::Int64
null_count::Int64
total_bytes::Int64
end

# TODO why are these done this way rather with an abstract type???
Member:

Can you expound on this comment?

Collaborator (Author):

I don't even remember. I think I wrote that comment very early on, probably not realizing that a @UNION was coming from FlatBuffers. Will delete.

src/metadata.jl Outdated
Metadata.BINARY => Vector{UInt8},
Metadata.CATEGORY => Int64,
Metadata.TIMESTAMP => Int64,
Metadata.DATE => Int64,
Metadata.TIME => Int64
)

const julia2Type_ = Dict{DataType,Metadata.Type_}(
const MDATA_TYPE_DICT = Dict{DataType,Metadata.DType}(
Member:

we should just spell out METADATA_TYPE_DICT, MDATA is unclear.

Collaborator (Author):

will change

src/metadata.jl Outdated
)
const julia2TimeUnit = Dict{DataType,Metadata.TimeUnit}([(v, k) for (k,v) in TimeUnit2julia])
const MDATA_TIME_DICT = Dict{DataType,Metadata.TimeUnit}(v=>k for (k,v) in JULIA_TIME_DICT)
Member:

spell out in full here too

@ExpandingMan (Collaborator, Author) commented

Absolutely, feel free to push to my PR. I've never had pushes on a PR before; do I just merge them the same way I would PRs to master?

Also, keep in mind that when we merge this we don't have to tag it immediately, so we can always make additional PRs right after. We'll have to do some documentation PRs before tagging regardless (I intended to wait until this is merged to do that).

@ExpandingMan (Collaborator, Author) commented

I've fixed the broken unit test on 0.6. Note that DataStreams will need to be tagged for us to avoid test errors on 0.7.

@quinnj merged commit a2558d1 into JuliaData:master on Apr 10, 2018
@quinnj (Member) commented Apr 10, 2018

Thanks @ExpandingMan for all the hard work here; awesome to see all the new arrow/feather functionality progressing.
