Overhauled to Arrow Back-End and Better Memory Safety #78
Conversation
…etely fucked on write side.
I've noticed a big problem with this. My hope was that the pointers would be very temporary anyway, but I still haven't been able to get any news from the core devs on how hard it will be to fix.
`GC.@preserve` is the standard way to keep an object alive.
Thanks, but I've already thought of doing the equivalent, and it doesn't solve the problem, does it? Then the data will just never ever get GC'd, even if the things referring to it go out of scope. This might be really bad if it's a sufficiently big dataset (though I'm very hazy on how exactly memory mapping works, so I don't know whether that can help here).
The data is preserved for the scope of the preserve block. AFAIU there is no other way to guarantee it won't get garbage collected.
Yeah, in that case we'd have to have users call it manually. Or, I suppose, we could create a macro. Will have to think on it further, thanks for your help.
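For context, a minimal sketch of how `GC.@preserve` scopes a pointer's validity; the buffer and the pointer work here are illustrative, not this package's actual code:

```julia
buf = Vector{UInt8}(undef, 64)   # illustrative buffer, not this package's data

GC.@preserve buf begin
    p = pointer(buf)             # the raw pointer is only guaranteed valid here
    unsafe_store!(p, 0x01)       # safe: buf cannot be collected inside this block
end
# outside the block, buf is again eligible for collection once unreachable
```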
The way things "should" work is with `ReinterpretArray`, but I know it is not overhead-free right now :(
Yeah, I know. I'm learning that pointers are actually much scarier in languages that are not designed to use them. Actually, …
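For reference, a hedged sketch of the `reinterpret`-based approach being discussed, which avoids raw pointers entirely (the byte buffer here is an illustrative stand-in, and the overhead remark reflects the state of Julia at the time of this PR):

```julia
bytes = rand(UInt8, 8 * 100)          # stand-in for a memory-mapped region
col = reinterpret(Float64, bytes)     # a ReinterpretArray: no copy, no raw
                                      # pointers, fully GC-safe, but with
                                      # runtime overhead at the time of this PR
sum(col)                              # every access goes through the view
```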
```julia
writemetadata(sink::Sink) = writemetadata(sink.io, sink.ctable)

function Data.close!(sink::Sink)
```
It's nice how all the writing happens in one place.
Like I said, I was contemplating breaking it up a little, but I'm not determined to do it. I definitely agree that it has to be done in such a way that it doesn't obscure where in the IO buffer you are at any point in the write process. Anyway, I'll leave this alone then; I didn't have any particularly inspired ideas about it. I think I just wanted to formalize how we deal with the header and trailer bytes a bit.
I like it just like this; before it was a little all over the place in terms of where all the IO happened.
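For context on the header and trailer bytes mentioned above, a hedged sketch of the Feather v1 file framing (leading magic, column data, flatbuffer metadata, metadata length, trailing magic); the function name is illustrative, not this PR's actual code:

```julia
const FEATHER_MAGIC = b"FEA1"   # Feather v1 magic bytes

# Illustrative trailer writer: flatbuffer metadata, its Int32 length,
# then the trailing magic, per the Feather v1 layout.
function writetrailer(io::IO, metadata::Vector{UInt8})
    write(io, metadata)
    write(io, Int32(length(metadata)))
    write(io, FEATHER_MAGIC)
end
```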
Overall this looks good; it mainly keeps the necessary parts from before and lets Arrow do a lot of the pointer hopping (I assume; I'll take a look at Arrow.jl next).
The DataStreams interface needs a bit of work to make sure we fully implement it, but I can help w/ that easily enough.
Thanks again for all the work here! I'll start digging into Arrow.jl, and once we get that registered, we can merge this!
src/utils.jl (outdated)
```julia
getoutputlength(version::Int32, x::Integer) = version < FEATHER_VERSION ? x : padding(x)

function checkmagic(filename::AbstractString, data::AbstractVector{UInt8})
    header = data[1:4]
```
this will throw an out-of-bounds error if the user mistakenly passes a wrong file name (and the `data` vector is empty)
ok, I'll add a check
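A minimal sketch of the kind of guard being suggested; the `FEATHER_MAGIC` constant and the error message wording are assumptions, not necessarily this package's actual identifiers:

```julia
const FEATHER_MAGIC = b"FEA1"   # assumed constant; Feather files start with "FEA1"

function checkmagic(filename::AbstractString, data::AbstractVector{UInt8})
    # length guard first, so an empty or tiny buffer gives a clear error
    # instead of a BoundsError
    length(data) ≥ 4 || throw(ArgumentError("'$filename' is not a valid feather file"))
    data[1:4] == FEATHER_MAGIC ||
        throw(ArgumentError("'$filename' is not a valid feather file"))
end
```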
src/utils.jl (outdated)
```julia
    end
end

function checkfilelength(filename::AbstractString, data::AbstractVector{UInt8})
```
this should probably just be combined w/ the `checkmagic` above.
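A hedged sketch of the combination being suggested; the name `validatefile`, the magic constant, and the 12-byte floor are all illustrative assumptions:

```julia
const FEATHER_MAGIC = b"FEA1"   # assumed constant, as above

# Fold the length and magic checks into one validation pass.
function validatefile(filename::AbstractString, data::AbstractVector{UInt8})
    # 4-byte magic + 4-byte metadata length + 4-byte trailing magic is
    # assumed here as the smallest frame a Feather file can have
    length(data) ≥ 12 ||
        throw(ArgumentError("'$filename' is too small to be a valid feather file"))
    data[1:4] == FEATHER_MAGIC ||
        throw(ArgumentError("'$filename' is not a valid feather file"))
end
```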
Great, thanks. I'm hoping you'll be very pleased with how isolated the pointer code is in Arrow.jl; I tried really hard to make it safe. Of course, there's nothing I can do about users giving the wrong array indices, but I hope that's all that can really go wrong. Like I said, I'm still contemplating improving the Arrow.jl constructors, so that may change some things here. I also have to update it so that we can use other data types for the offsets; I think we'll need to do …
Ok, I've just added a whole lot more unit tests, so this damn thing had better not have any mysterious uncaught errors anymore. Still not totally sure how that godawful segfault was getting through those tests. I think 0.6 fails for some reason; will fix eventually.
This is looking good; I left a few comments. The biggest work needed is on the DataStreams implementation, which I can do. @ExpandingMan, are you ok if I just push to your PR here?
```julia
if Base.VERSION < v"0.7.0-DEV.2575"
    const Dates = Base.Dates
    using Missings
```
These imports don't really jibe w/ the version check here.
Sorry, I think I just left in an older existing version check. I don't know how to find the specific versions; do you know them?
src/Feather.jl (outdated)
```diff
 else
     import Dates
 end

-if Base.VERSION >= v"0.7.0-DEV.2009"
+if Base.VERSION ≥ v"0.7.0-DEV.2009"
```
We should just use Compat for all these stdlib deprecations now
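A sketch of the Compat-based cleanup being suggested here; on 0.6, `Compat.Dates` shims the 0.7 stdlib module, so the manual version checks can go away:

```julia
using Compat
using Compat.Dates   # resolves to Base.Dates on 0.6 and the Dates stdlib on 0.7
```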
src/metadata.jl (outdated)
```julia
    encoding::Encoding
    offset::Int64
    length::Int64
    null_count::Int64
    total_bytes::Int64
end

# TODO why are these done this way rather than with an abstract type???
```
Can you expand on this comment?
I don't even remember. I think I wrote that comment very early on, probably not realizing that a `@UNION` was coming from FlatBuffers. Will delete.
src/metadata.jl (outdated)
```diff
     Metadata.BINARY => Vector{UInt8},
     Metadata.CATEGORY => Int64,
     Metadata.TIMESTAMP => Int64,
     Metadata.DATE => Int64,
     Metadata.TIME => Int64
 )

-const julia2Type_ = Dict{DataType,Metadata.Type_}(
+const MDATA_TYPE_DICT = Dict{DataType,Metadata.DType}(
```
we should just spell out `METADATA_TYPE_DICT`; `MDATA` is unclear.
will change
src/metadata.jl (outdated)
```diff
 )
-const julia2TimeUnit = Dict{DataType,Metadata.TimeUnit}([(v, k) for (k,v) in TimeUnit2julia])
+const MDATA_TIME_DICT = Dict{DataType,Metadata.TimeUnit}(v=>k for (k,v) in JULIA_TIME_DICT)
```
spell out in full here too
Absolutely, feel free to push to my PR. I've never had pushes on a PR before; do I just merge them the same way I would PRs to master? Also keep in mind that when we merge this we don't have to tag it immediately, so we can always make additional PRs immediately after. We'll have to do some documentation PRs before tagging regardless (I intended to wait until this is merged to do that).
I've fixed the broken unit test on 0.6. Note that DataStreams will need to be tagged for us not to get test errors on 0.7.
Thanks @ExpandingMan for all the hard work here; awesome to see all the new arrow/feather functionality progressing.
I have rewritten Feather.jl to use my new Arrow.jl back-end. The Arrow.jl package provides `AbstractVector` objects that give access to Arrow-formatted data. Because the existing Feather.jl mostly deals with accessing Arrow data, this rewrite was very extensive. This PR should maintain all existing functionality and expand on it, with the exception of appending DataFrames (more on this below). What follows is an overview of the overhauled package.

Arrow.jl of course needs a tagged release and complete unit tests for this to be merged, but I wanted to put up this PR so we could start figuring out what would need to be done.
New Default Reading Behavior
Creating a `Feather.Source` or calling `Feather.read` will now only construct `ArrowVector` objects. In the case of `Feather.read`, a `DataFrame` will be created with `ArrowVector` columns. `ArrowVector`s simply reference existing data, so, in the case of memory mapping, once the file is memory-mapped nothing is actually read in until requested by the user. This allows the user to browse a feather file at their leisure, even performing query operations while only loading data as necessary. The old default functionality of reading the entire file into memory is now provided by `Feather.materialize`. This method takes care of not only the requested behavior of reading in only particular columns, but any arbitrary subset of the full table.
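A hedged usage sketch of the behavior described above; the file name is illustrative, and the subset-selecting methods of `Feather.materialize` are only summarized in a comment since their exact signatures aren't shown here:

```julia
using Feather

src = Feather.Source("data.feather")   # memory-maps the file; no data read yet
df  = Feather.read("data.feather")     # DataFrame of lazy ArrowVector columns

full = Feather.materialize("data.feather")   # old behavior: eager, in-memory copy
# per the description above, materialize also accepts column/row subsets,
# loading only the requested portion of the table
```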
Better Memory Safety
This has been discussed extensively elsewhere. If `reinterpret` is ever made more efficient, we will have full memory safety, but that seems a long way off.

Dropped Support for Some Non-Standard Formatting
In particular, categorical arrays now must use `Int32` reference values, as specified by the Arrow standard. This PR also drops support for the really old version of Feather that didn't use Arrow padding, but since there was already a warning saying that such data would be unreadable anyway, this seems fine.

Less Dependent on DataStreams
@davidanthoff was asking if we could split off the core functionality of Feather into a separate FeatherBase.jl that doesn't depend on DataStreams. Since a great deal of the functionality of this package has been moved to Arrow in this PR anyway, I thought it would be really great if we could keep this whole. While retaining all DataStreams functionality and the Source/Sink structure, the only place where the core functionality of this package really relies on DataStreams is now `Data.Schema`, which, to my knowledge, has never changed since DataStreams was created. Hopefully everyone will be sufficiently happy with this that we don't need to bother creating a new package? 😉

Appropriate Updates to Unit Tests
Mostly the tests are now organized into `@testset`s. In some cases slight adjustments to the tests were needed.
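For reference, a minimal sketch of the `@testset` organization; the file name and the checks are illustrative, not this PR's actual tests:

```julia
using Test   # `using Base.Test` on Julia 0.6

@testset "reading" begin
    df = Feather.read("test.feather")   # illustrative test file
    @test size(df, 2) == 4              # illustrative expectation
end
```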