ARROW-264: File format #123

Closed
wants to merge 21 commits into from

Conversation

julienledem
Member

This is work in progress

@julienledem julienledem force-pushed the arrow_264_file_format branch from ec856f0 to b426e9f Compare August 20, 2016 17:00
@julienledem julienledem force-pushed the arrow_264_file_format branch from b426e9f to b0bf6bc Compare August 23, 2016 20:32
@julienledem
Member Author

A version that works for flat schemas.
Still WIP.

@julienledem
Member Author

this is getting close.

@wesm
Member

wesm commented Aug 24, 2016

Cool, anything I can do to help? I'll try to find some time to work on the C++ side of this so we can get a passing build with the new metadata and possibly also the file layout. We can use the existing IPC code to make things simpler.

I haven't looked yet, but we should align each of the buffer writes in the file layout at word boundaries -- the data buffers should already be padded / aligned, but I'm not sure if the serialized Flatbuffers are padded at 8-byte boundaries. We just dealt with this in Feather (wesm/feather@002c798).
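
For illustration, a minimal sketch of the kind of padding being discussed: after each block is written, zero bytes are appended until the position is a multiple of 8. The class and method names (`AlignedWriter`, `paddingFor`, `writeAligned`) are hypothetical and not taken from this patch or from Feather; it only assumes an in-memory `ByteArrayOutputStream` as the sink.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class AlignedWriter {
    // Assumed alignment: 8 bytes, as discussed in the comment above.
    private static final int ALIGNMENT = 8;

    /** Number of zero bytes needed to move position up to the next multiple of 8. */
    public static int paddingFor(long position) {
        return (int) ((ALIGNMENT - (position % ALIGNMENT)) % ALIGNMENT);
    }

    /** Appends the block, then zero-pads to the next 8-byte boundary; returns the block's start offset. */
    public static long writeAligned(ByteArrayOutputStream out, byte[] block) {
        long start = out.size();
        out.write(block, 0, block.length);
        int pad = paddingFor(out.size());
        for (int i = 0; i < pad; i++) {
            out.write(0);
        }
        return start;
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // A 3-byte block stands in for serialized metadata of arbitrary length.
        long first = writeAligned(out, "abc".getBytes(StandardCharsets.US_ASCII));
        long second = writeAligned(out, new byte[13]);
        System.out.println(first + " " + second + " " + out.size());
    }
}
```

Because padding is added after every write, each subsequent block starts on an 8-byte boundary as long as the stream itself started on one.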

@julienledem
Member Author

@wesm I haven't done it yet, but it should be easy to pad things to stay on an 8-byte boundary.

@julienledem
Member Author

@wesm added alignment.
Things that don't work yet: UnionVector and Maps are not nullable


- UInt4Vector offsets;
+ final UInt4Vector offsets;// TODO: THis masks the same vector in the parent which is assigned to this in the constructor.
Member

TODO (for JIRA?): lists in the current arrow spec use signed int32

Member Author

I'll remove the TODO. That was more of a personal note. I initially thought that the masked field had different content, but the two fields actually have exactly the same content. I made the field immutable so that there's no uncertainty about whether it stays that way.

I'll open a separate bug for unsigned/signed of the offset vector.
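
To illustrate the masking discussed here, a small sketch of Java field hiding: a subclass field with the same name as a parent field hides it, and declaring the subclass field `final` guarantees it keeps pointing at the object assigned in the constructor. All class names below (`FieldHiding`, `Offsets`, `BaseVector`, `ChildVector`) are hypothetical stand-ins, not the classes from this patch.

```java
public class FieldHiding {
    /** Stand-in for an offsets vector; not the real UInt4Vector. */
    static class Offsets {
    }

    static class BaseVector {
        protected Offsets offsets;

        BaseVector(Offsets offsets) {
            this.offsets = offsets;
        }

        Offsets baseOffsets() {
            return offsets; // resolves to BaseVector.offsets
        }
    }

    static class ChildVector extends BaseVector {
        // Hides BaseVector.offsets. Being final, it cannot be reassigned after construction.
        final Offsets offsets;

        ChildVector(Offsets offsets) {
            super(offsets);
            this.offsets = offsets; // same object as the parent's field
        }

        Offsets ownOffsets() {
            return offsets; // resolves to ChildVector.offsets
        }
    }

    /** True when the hidden parent field and the child field refer to the same object. */
    public static boolean sameObject() {
        ChildVector v = new ChildVector(new Offsets());
        return v.ownOffsets() == v.baseOffsets();
    }

    public static void main(String[] args) {
        System.out.println(sameObject());
    }
}
```

Note that in this sketch only the child's field is `final`; the parent's field could still be reassigned by other code, so the guarantee covers only the child's view.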

Member Author

https://issues.apache.org/jira/browse/ARROW-273 for the type of the offset vector

@wesm
Member

wesm commented Aug 25, 2016

I'm about halfway through skimming the patch, will finish tomorrow and then I think we can merge this quite soon

@julienledem
Member Author

to me this is good to go.

@wesm
Member

wesm commented Aug 26, 2016

+1 -- just finished reviewing, nice that you put this together so quickly! Fine by me to merge; we should try to reconcile the outstanding metadata patches soon. There are other minor things, like the names for things in the Flatbuffer metadata, as discussed elsewhere.

@asfgit asfgit closed this in 803afeb Aug 26, 2016
@julienledem julienledem deleted the arrow_264_file_format branch August 31, 2016 21:44
zhouyuan pushed a commit to zhouyuan/arrow that referenced this pull request Jul 18, 2022
* Add perf. test cases

* Remove unnecessary copy

* Handle zero input case

* Fix bugs
2 participants