code abstraction for Arrow #72

Closed
ExpandingMan opened this issue Jan 11, 2018 · 7 comments

Comments

@ExpandingMan
Collaborator

ExpandingMan commented Jan 11, 2018

This issue is for tracking progress related to re-organizing Feather.jl to rely on an underlying "Arrow" object. Ultimately we will split off the implementation of Arrow into a separate Arrow.jl module.

Initially, the goal will be to create an ArrowBuffer (naming tentative) struct which implements the Arrow design specification. With this done, Feather.jl will essentially contain only the DataStreams Source and Sink for reading from and writing to files.

@davidanthoff
Contributor

I completely agree with this. It would be fantastic to have an Arrow.jl package that doesn't depend on either DataStreams.jl or TableTraits.jl and just provides the underlying types and low-level methods. Feather.jl could then hold the DataStreams.jl implementation and FeatherFiles.jl the TableTraits.jl implementation, with both using the underlying Arrow.jl package.

@ExpandingMan
Collaborator Author

ExpandingMan commented Jan 11, 2018

That's the plan.

@ExpandingMan
Collaborator Author

You can see my progress here. So far I have partial implementations of primitive arrays and lists. My approach is to make these AbstractVectors with all the methods thereof.
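To illustrate the AbstractVector approach, here is a minimal sketch, not the actual Arrow.jl code. The names `PrimitiveView` and `ListView` are hypothetical. It assumes a little-endian host, no null bitmap, and 32-bit list offsets, where the offsets buffer has one more entry than there are lists.

```julia
# Hypothetical names throughout; this is not the Arrow.jl API.

# A fixed-width array: `len` values of type T that start at byte `offset`
# (0-based) of a shared byte buffer.
struct PrimitiveView{T} <: AbstractVector{T}
    buffer::Vector{UInt8}
    offset::Int
    len::Int
end

Base.size(p::PrimitiveView) = (p.len,)
Base.IndexStyle(::Type{<:PrimitiveView}) = IndexLinear()

function Base.getindex(p::PrimitiveView{T}, i::Int) where {T}
    @boundscheck checkbounds(p, i)
    start = p.offset + (i - 1) * sizeof(T) + 1
    # Copy the bytes of element i and reinterpret them as a T
    # (assumes the host is little-endian, like the buffer).
    return reinterpret(T, p.buffer[start:start + sizeof(T) - 1])[1]
end

# A list array: element i is the slice of `values` between
# offsets[i] (exclusive, 0-based) and offsets[i + 1] (inclusive).
struct ListView{T,V<:AbstractVector{T}} <: AbstractVector{Vector{T}}
    offsets::Vector{Int32}
    values::V
end

Base.size(l::ListView) = (length(l.offsets) - 1,)

function Base.getindex(l::ListView, i::Int)
    @boundscheck checkbounds(l, i)
    return l.values[(l.offsets[i] + 1):l.offsets[i + 1]]
end

# Usage: four Int32 values viewed as two lists.
buf = collect(reinterpret(UInt8, Int32[10, 20, 30, 40]))
vals = PrimitiveView{Int32}(buf, 0, 4)
lists = ListView(Int32[0, 1, 4], vals)
```

Because both types only define `size` and `getindex`, everything else (iteration, `sum`, `collect`, printing) comes from the generic AbstractVector fallbacks.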

One thing that I am very confused about is metadata and to what extent it should be standardized. The Feather metadata format seems to have nothing to do with the metadata format described for Arrow. Is metadata entirely up to the implementation? If so, that limits how much can be implemented directly in Arrow.jl. Right now I do not know whether Arrow.jl should contain any code for handling metadata at all.

Right now I only have methods for reading data. Once I am sure that things are basically working (that I can read feather files) I will work on write methods.

@ExpandingMan
Collaborator Author

For those interested, I'm now mostly done with the read side of the implementation. If you want you can check out Arrow.jl and my arrow1 branch of Feather.jl and read in Feather files (DataStreams not implemented yet, but you can create dataframes that are basically just views of the data).

I haven't implemented categorical data, so next I have to do that and writing (writing actually should be pretty easy at this point).

@ExpandingMan
Collaborator Author

I've now completed essentially all of the read side for both Arrow and Feather (except for Arrow structs and the DataStreams interface). Right now everything is read from the data directly in the correct binary format. I'd like to make it easy to do automatic conversions, but I'm still thinking about how to do that (this is important for datetimes, for example, where the underlying binary format in Arrow differs from the one Julia uses).
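One possible shape for such a conversion, as a rough sketch only: wrap the raw integers in a vector type that converts on access. The names `TimestampView`, `UNIX_EPOCH`, and `to_raw` are hypothetical. The sketch assumes the column stores Int64 milliseconds since the Unix epoch (other units would need a different period type) and a Julia version where `Dates` is a standard library.

```julia
using Dates

# Hypothetical wrapper, not the Arrow.jl API: presents raw Int64
# millisecond counts since 1970-01-01 as Julia DateTime values.
struct TimestampView{V<:AbstractVector{Int64}} <: AbstractVector{DateTime}
    raw::V
end

const UNIX_EPOCH = DateTime(1970, 1, 1)

Base.size(t::TimestampView) = size(t.raw)

# Convert lazily, one element at a time, so the underlying buffer is untouched.
Base.getindex(t::TimestampView, i::Int) = UNIX_EPOCH + Millisecond(t.raw[i])

# Inverse direction, for the write side: DateTime -> raw millisecond count.
to_raw(dt::DateTime) = Dates.value(dt - UNIX_EPOCH)

# Usage: the epoch itself and exactly one day later.
ts = TimestampView(Int64[0, 86_400_000])
```

Keeping the conversion in `getindex` means the data can still be read straight from the buffer, and the same wrapper idea would carry over to other types whose binary layout differs between Arrow and Julia.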

Next I'm going to implement writing in both Arrow and Feather.

@ExpandingMan
Collaborator Author

See #78.

@ExpandingMan
Collaborator Author

Now merged to master.


2 participants