
Overhauled to Arrow Back-End and Better Memory Safety #78

Merged: 45 commits into JuliaData:master, Apr 10, 2018

Conversation

@ExpandingMan (Collaborator) commented Feb 20, 2018

I have rewritten Feather.jl to use my new Arrow.jl back-end. The Arrow.jl package provides AbstractVector objects that give access to Arrow-formatted data. Because the existing Feather.jl mostly deals with accessing Arrow data, this rewrite was very extensive. This PR should maintain all existing functionality and expand on it, with the exception of appending DataFrames (more on this below). What follows is an overview of the overhauled package.

Arrow.jl of course needs a tagged release and complete unit tests before this can be merged, but I wanted to put up this PR so we could start figuring out what needs to be done.

New Default Reading Behavior

Creating a Feather.Source, or calling Feather.read, will now only construct ArrowVector objects. In the case of Feather.read, a DataFrame will be created with ArrowVector columns. ArrowVectors simply reference existing data, so, in the case of memory mapping, nothing is actually read in after the file is mapped until the user requests it. This allows users to browse a feather file at their leisure, even performing query operations, while only loading data as necessary. The old default behavior of reading the entire file into memory is now provided by Feather.materialize. This method supports not only the previously requested behavior of reading in particular columns, but reading any arbitrary subset of the full table.
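A rough usage sketch of the new split (Feather.read and Feather.materialize are from this PR; the file name and column name :a are placeholders):

```julia
using Feather, DataFrames

# Lazy by default: columns are ArrowVectors viewing the (possibly memory-mapped)
# file, so bytes are only read as they are accessed.
df = Feather.read("data.feather")
df[:a][1:10]   # loads just the data backing these elements

# Eager: copy everything into ordinary in-memory columns (the old default).
full = Feather.materialize("data.feather")
```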

Better Memory Safety

This has been discussed extensively elsewhere. If reinterpret ever becomes efficient we will have full memory safety, but that seems a long way off.

Dropped Support for Some Non-Standard Formatting

In particular, categorical arrays must now use Int32 reference values, as specified by the Arrow standard. The package also no longer supports the very old version of Feather that didn't use Arrow padding, but since there was already a warning saying that such data would be unreadable anyway, this seems fine.

Less Dependent on DataStreams

@davidanthoff was asking whether we could split off the core functionality of Feather into a separate FeatherBase.jl that doesn't depend on DataStreams. Since a great deal of this package's functionality has been moved to Arrow in this PR anyway, I thought it would be really great if we could keep the package whole. While retaining all DataStreams functionality and the Source/Sink structure, the only place where the core of this package still relies on DataStreams is Data.Schema, which, to my knowledge, has never changed since DataStreams was created. Hopefully everyone will be sufficiently happy with this that we don't need to bother creating a new package? 😉

Appropriate Updates to Unit Tests

The tests are now mostly organized into @testset blocks. In some cases slight adjustments to the tests were needed.
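For reference, this is the standard Base.Test pattern, roughly (testfile is a placeholder path):

```julia
using Base.Test   # `using Test` on 0.7

@testset "Feather.read" begin
    df = Feather.read(testfile)
    @test size(df, 1) > 0   # sanity check on the number of rows
end
```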

@ExpandingMan (Collaborator, Author) commented

I've noticed a big problem with Feather.materialize. It's possible, in fact likely, that if you call this function and nothing else, all references to the original data buffer will disappear, so the buffer gets garbage collected and you segfault. I don't see any really elegant solution to this as things stand. Certainly I can get rid of the method that only takes a filename as an argument, but we'd have to tell users "please keep the Source around somewhere, it can't get gc'd". The only completely reliable alternative I can think of would be to have materialize always do a deepcopy, but ugh, that would be absolutely awful.
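A hypothetical sketch of the failure mode (not code from this PR):

```julia
function load(filename)
    src = Feather.Source(filename)   # owns the memory-mapped buffer
    Feather.materialize(src)         # result may still reference that buffer
end

df = load("data.feather")   # `src` is unreachable once `load` returns
GC.gc()                     # (plain `gc()` on 0.6) buffer can be collected here...
df[1, 1]                    # ...so this may touch freed memory and segfault
```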

My hope was that the pointers would be very temporary anyway, but I still haven't been able to get any news from the core devs on how hard it will be to fix ReinterpretArray. See here. Suggestions, comments welcome!

@KristofferC commented
GC.@preserve is the standard way to keep an object alive.
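For reference, a minimal sketch of that pattern (0.7-style; the Mmap usage and `process` are placeholders of mine):

```julia
using Mmap

buf = Mmap.mmap("data.feather")   # Vector{UInt8} backed by the file
GC.@preserve buf begin
    # `buf` is rooted for the duration of this block, so pointer-based
    # reads into it cannot race with garbage collection.
    process(pointer(buf))         # `process` is a hypothetical consumer
end
```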

@ExpandingMan (Collaborator, Author) commented

Thanks, but I've already thought of doing the equivalent, and it doesn't solve the problem, does it? Then the data will just never get gc'd, even if the things referring to it go out of scope. That might be really bad for a sufficiently big dataset (though I'm very hazy on how exactly memory mapping works, so I don't know if that can help here).

@KristofferC commented
The data is preserved for the scope of the preserve block. AFAIU there is no other way to guarantee that it won't get garbage collected.

@ExpandingMan (Collaborator, Author) commented

Yeah, in that case we'd have to have users call it manually. Or, I suppose we could create a macro. Will have to think on it further, thanks for your help.

@KristofferC commented
The way things "should" work is with ReinterpretArray, but I know it's not overhead-free right now :(

@ExpandingMan (Collaborator, Author) commented

Yeah, I know. I'm learning that pointers are actually much scarier in languages that are not designed to use them. Actually, reinterpret seems fine in some cases, but it's just so unpredictable. Any idea what the thinking is on it, and whether it will be easy or difficult to fix? I'd feel a lot better about using it if I thought there was a reasonable expectation of it getting better in the foreseeable future. So far I haven't heard any kind of assessment of what might be wrong or what might be required to fix it (and I'm afraid it's way above my expertise to look into myself).
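For context, the two access strategies being weighed look roughly like this (illustrative only, not code from the PR):

```julia
bytes = rand(UInt8, 8 * 100)

# Safe: a ReinterpretArray view shares memory and keeps `bytes` rooted, but at
# the time of this discussion every element access paid a noticeable overhead.
v_safe = reinterpret(Float64, bytes)

# Fast but unsafe: wrap a raw pointer in a plain Array; nothing roots `bytes`,
# which is exactly the GC hazard discussed above.
v_unsafe = unsafe_wrap(Array, convert(Ptr{Float64}, pointer(bytes)), 100)
```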

writemetadata(sink::Sink) = writemetadata(sink.io, sink.ctable)


function Data.close!(sink::Sink)
Member:

this is nice how all the writing happens in one place.

Collaborator (Author):

Like I said, I was contemplating breaking it up a little, but I'm not determined to do it. I definitely agree that it has to be done in such a way that it does not obscure where in the IO buffer you are at any point in the write process. Anyway, I'll leave this alone then; I didn't have any particularly inspired ideas about it. I think I just wanted to formalize how we deal with the header and trailer bytes a bit.

Member:

I like it just like this; before it was a little all over the place in terms of where all the IO happened.

@quinnj (Member) left a comment

Overall this looks good; mainly keeping the necessary parts from before, and allowing Arrow to do a lot of the pointer hopping (I assume, I'll take a look at Arrow.jl next).

The DataStreams interface needs a bit of work to make sure we implement it fully, but I can help with that easily enough.

Thanks again for all the work here! I'll start digging into Arrow.jl and once we get that registered, we can merge this!

src/utils.jl Outdated
getoutputlength(version::Int32, x::Integer) = version < FEATHER_VERSION ? x : padding(x)

function checkmagic(filename::AbstractString, data::AbstractVector{UInt8})
header = data[1:4]
Member:

this will throw an out of bounds error if the user mistakenly passes a wrong file name (and the data vector is empty)

Collaborator (Author):

ok, I'll add a check

src/utils.jl Outdated
end
end

function checkfilelength(filename::AbstractString, data::AbstractVector{UInt8})
Member:

this should probably just be combined w/ the checkmagic above.
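Something along these lines might cover both comments (a sketch; the constant name and error style are assumptions, and 12 bytes is the minimum for leading magic, metadata length, and trailing magic):

```julia
# Sketch of a combined check: guard the empty/truncated case before slicing.
# FEATHER_MAGIC_BYTES ("FEA1") is assumed to be defined elsewhere in utils.jl.
function validatefile(filename::AbstractString, data::AbstractVector{UInt8})
    if length(data) < 12   # magic + metadata length + magic
        throw(ArgumentError("'$filename' is too small to be a feather file"))
    end
    if data[1:4] ≠ FEATHER_MAGIC_BYTES || data[end-3:end] ≠ FEATHER_MAGIC_BYTES
        throw(ArgumentError("'$filename' is not a valid feather file"))
    end
end
```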

@ExpandingMan (Collaborator, Author) commented

Great thanks.

I'm hoping you'll be very pleased with how isolated the pointer code is in Arrow.jl; I tried really hard to make it safe. Of course, there's nothing I can do about users giving the wrong array indices, but I hope that's all that can really go wrong.

Like I said, I'm still contemplating improving the Arrow.jl constructors, so that may change some things here. I also have to update it so that we can use other data types for the offsets; I think we'll need to make Int64 the default.

@ExpandingMan ExpandingMan changed the title Overhauled to Arrow Back-End and Full Memory Safety Overhauled to Arrow Back-End and Better Memory Safety Mar 25, 2018
@ExpandingMan (Collaborator, Author) commented

Ok, I've just added a whole lot more unit tests, so this damn thing had better not have any mysterious uncaught errors anymore. I'm still not totally sure how that godawful segfault was getting through those tests. I think 0.6 fails for some reason; will fix eventually.

@quinnj (Member) left a comment

This is looking good; I left a few comments. The biggest work needed is on the DataStreams implementation, which I can do. @ExpandingMan are you ok if I just push to your PR here?


if Base.VERSION < v"0.7.0-DEV.2575"
const Dates = Base.Dates
using Missings
Member:

These imports don't really jive w/ the version check here.

Collaborator (Author):

Sorry, I think I just left in an older existing version check. I don't know how to find the specific versions; do you know them?

src/Feather.jl Outdated
else
import Dates
end

if Base.VERSION >= v"0.7.0-DEV.2009"
if Base.VERSION ≥ v"0.7.0-DEV.2009"
Member:

We should just use Compat for all these stdlib deprecations now
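For reference, the Compat-based version would look something like this (a sketch; Compat provided a Dates shim at the time):

```julia
using Compat
using Compat.Dates   # resolves to Base.Dates on 0.6 and the Dates stdlib on 0.7
```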

src/metadata.jl Outdated
encoding::Encoding
offset::Int64
length::Int64
null_count::Int64
total_bytes::Int64
end

# TODO why are these done this way rather with an abstract type???
Member:

Can you expound on this comment?

Collaborator (Author):

I don't even remember. I think I wrote that comment very early on, probably not realizing that a @UNION was coming from FlatBuffers. Will delete.

src/metadata.jl Outdated
Metadata.BINARY => Vector{UInt8},
Metadata.CATEGORY => Int64,
Metadata.TIMESTAMP => Int64,
Metadata.DATE => Int64,
Metadata.TIME => Int64
)

const julia2Type_ = Dict{DataType,Metadata.Type_}(
const MDATA_TYPE_DICT = Dict{DataType,Metadata.DType}(
Member:

we should just spell out METADATA_TYPE_DICT, MDATA is unclear.

Collaborator (Author):

will change

src/metadata.jl Outdated
)
const julia2TimeUnit = Dict{DataType,Metadata.TimeUnit}([(v, k) for (k,v) in TimeUnit2julia])
const MDATA_TIME_DICT = Dict{DataType,Metadata.TimeUnit}(v=>k for (k,v) in JULIA_TIME_DICT)
Member:

spell out in full here too

@ExpandingMan (Collaborator, Author) commented

Absolutely, feel free to push to my PR. I've never had pushes on a PR before; do I just merge them the same way I would PRs to master?

Also, keep in mind that when we merge this we don't have to tag it immediately, so we can always make additional PRs right after. We'll have to do some documentation PRs before tagging regardless (I intended to wait until this is merged to do that).

@ExpandingMan (Collaborator, Author) commented

I've fixed the broken unit test on 0.6. Note that DataStreams will need to be tagged for us to avoid test errors on 0.7.

@quinnj merged commit a2558d1 into JuliaData:master on Apr 10, 2018
@quinnj (Member) commented Apr 10, 2018

Thanks @ExpandingMan for all the hard work here; awesome to see all the new arrow/feather functionality progressing.
