ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts #8402
Parquet's dictionary encoding is a complexity of its own. My understanding is that after a certain size, the dictionary no longer grows, and the additional values are stored the normal way. I still need to spend more time on parquet-mr and the format.
Do you want to work on other index types and supporting primitive Arrow dictionaries? We could keep this PR open for longer, as long as it's not blocking any additional unit of work.
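To make the fallback behaviour described above concrete, here is a minimal Rust sketch of one reading of it: values are dictionary-encoded until the dictionary would exceed a size limit, and everything after that point is stored plainly. The names `EncodedColumn`, `encode_with_fallback`, and `max_dict_bytes` are hypothetical and exist only for illustration. This is a toy model, not the writer in the parquet crate or parquet-mr; real writers make this decision with their own limits and granularity.

```rust
use std::collections::HashMap;

/// Toy model of one encoded column chunk (hypothetical type, not a parquet API).
#[derive(Debug, PartialEq)]
struct EncodedColumn {
    /// Distinct values in first-seen order.
    dictionary: Vec<String>,
    /// Indices into `dictionary` for values encoded before the fallback.
    indices: Vec<u32>,
    /// Values stored plainly once the dictionary stopped growing.
    plain: Vec<String>,
}

/// Dictionary-encode `values` until adding a new entry would push the
/// dictionary past `max_dict_bytes`, then store the rest plainly.
fn encode_with_fallback(values: &[&str], max_dict_bytes: usize) -> EncodedColumn {
    let mut lookup: HashMap<&str, u32> = HashMap::new();
    let mut out = EncodedColumn {
        dictionary: Vec::new(),
        indices: Vec::new(),
        plain: Vec::new(),
    };
    let mut dict_bytes = 0usize;
    let mut fell_back = false;

    for &v in values {
        if fell_back {
            // Assumption of this sketch: after the fallback, every value is plain.
            out.plain.push(v.to_string());
            continue;
        }
        let existing = lookup.get(v).copied();
        match existing {
            Some(idx) => out.indices.push(idx),
            None if dict_bytes + v.len() <= max_dict_bytes => {
                let idx = out.dictionary.len() as u32;
                lookup.insert(v, idx);
                out.dictionary.push(v.to_string());
                dict_bytes += v.len();
                out.indices.push(idx);
            }
            None => {
                // The dictionary is full: switch to plain storage from here on.
                fell_back = true;
                out.plain.push(v.to_string());
            }
        }
    }
    out
}

fn main() {
    let encoded = encode_with_fallback(&["a", "b", "a", "c", "a"], 2);
    println!("{:?}", encoded);
}
```

With a two-byte limit, the third distinct value triggers the fallback, so later repeats of already-known values are also stored plainly in this model.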
b27f63a to bd3c714
Yup, I'm happy to do that! I'll be rebasing, addressing feedback, and adding to this on Wednesday.
bd3c714 to f70e6db
5fc3543 to 0dcf149
Status update: The other index types are done, but primitive dictionaries are not yet.
f70e6db to ead5e14
2bf54f9 to 79b78d9
@vertexclique @nevi-me I'm feeling stuck on converting primitive dictionaries... I have a solution that works for one key/value type, but I tried to expand that to all the types and it involves listing out all the possible combinations (😱) and overflows the stack (😱😱😱). I have tried to find a different abstraction, though, and the type checker doesn't like anything I've come up with. Do you have any suggestions?
@carols10cents, one idea I had, which might be less efficient at runtime but possibly less complicated to implement, would be to use the arrow […]. So rather than going directly from […]. So for example, to generate a Dictionary<UInt8, Utf8> from a parquet column of […]. Does that make sense?
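One way to picture an indirect route of this kind, sketched in plain Rust with no Arrow or Parquet dependency: first decode the column into a flat list of optional strings, then convert that list into a dictionary whose key type is chosen separately through a generic parameter. `DictArray` and `cast_to_dictionary` are hypothetical names used only for illustration; they are not types or functions from the arrow crate, and this is an assumption about the general shape of the idea rather than the code proposed in this thread.

```rust
use std::collections::HashMap;
use std::convert::TryFrom;

/// Toy dictionary array: `keys[i]` indexes into `values`, `None` marks a null.
/// Hypothetical type for this sketch only.
#[derive(Debug, PartialEq)]
struct DictArray<K> {
    keys: Vec<Option<K>>,
    values: Vec<String>,
}

/// Second step of the two-step route: turn an already-decoded flat string
/// column into a dictionary with key type `K`. Being generic over `K` avoids
/// writing one conversion per key type.
fn cast_to_dictionary<K>(flat: &[Option<&str>]) -> Result<DictArray<K>, String>
where
    K: TryFrom<usize> + Copy,
{
    let mut lookup: HashMap<&str, K> = HashMap::new();
    let mut values: Vec<String> = Vec::new();
    let mut keys = Vec::with_capacity(flat.len());

    for item in flat {
        match *item {
            None => keys.push(None),
            Some(s) => {
                let existing = lookup.get(s).copied();
                let key = match existing {
                    Some(k) => k,
                    None => {
                        // Fails if `K` cannot represent the next dictionary index.
                        let k = K::try_from(values.len()).map_err(|_| {
                            format!(
                                "key type too small for {} distinct values",
                                values.len() + 1
                            )
                        })?;
                        values.push(s.to_string());
                        lookup.insert(s, k);
                        k
                    }
                };
                keys.push(Some(key));
            }
        }
    }
    Ok(DictArray { keys, values })
}

fn main() {
    // Step one (decoding the Parquet column) is stood in for by a literal here.
    let flat = [Some("a"), None, Some("b"), Some("a")];
    let dict: DictArray<u8> = cast_to_dictionary(&flat).expect("fits in u8 keys");
    println!("{:?}", dict);
}
```

The same function serves `u8`, `i32`, or any other integer key type that implements `TryFrom<usize>`, at the cost of an extra pass over the data.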
Not really, because I am using the […]
That code seems to be using […]. I was trying to suggest, rather than making some sort of generic […]
Where the choice of type for […]
8ccd9c3 to 9ba2179
I've botched this branch a bit with my rebase on the parquet branch. (EDIT: I only see now that you mention the overflow on your last commit.) I've established that my safest path back to the parquet branch is to first rebase against one of my branches by nevi-me/arrow@ARROW-7842-cherry...integer32llc:dict. @carols10cents please let me know what you think.
@carols10cents @alamb I think the whole reader logic needs replumbing ... There's at least a 1:1 mapping between Parquet types and Arrow types, and we can cast from Arrow types to other Arrow types based on the Arrow metadata. This is a less complex path, because one of the things I've been concerned about is that I/we are going to struggle a lot when we get to deeply-nested reads. I previously didn't understand your needs re. dictionary support between Parquet > Arrow > DataFusion. I now have context, so I can make decisions better. My plan was to remove […]. I've now done this in integer32llc#3, but I left a lot of […]. The tests all pass now 🎊
This adds more support for:

- When converting Arrow -> Parquet containing an Arrow Dictionary, materialize the Dictionary values and send to Parquet to be encoded with a dictionary or not according to the Parquet settings (not supported: converting an Arrow Dictionary directly to Parquet DictEncoding; also only supports Int32 index types in this commit, and also removes NULLs)
- When converting Parquet -> Arrow, noticing that the Arrow schema metadata in a Parquet file has a Dictionary type and converting the data to an Arrow dictionary (right now this only supports String dictionaries)
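As a rough illustration of the Arrow -> Parquet direction described in this commit message, the sketch below materializes a dictionary with Int32 keys into flat values before they would be handed to a writer. Null keys are dropped from the value list and recorded as definition level 0, which is one plausible reading of "removes NULLs" and is an assumption of this sketch. `materialize_dictionary` is a hypothetical helper in plain Rust, not a function from the arrow or parquet crates.

```rust
use std::convert::TryFrom;

/// Expand dictionary keys into the values they point at.
///
/// Returns the non-null values in order, plus one definition level per input
/// slot (1 = value present, 0 = null), as for a flat optional column.
fn materialize_dictionary(
    keys: &[Option<i32>],
    dict_values: &[&str],
) -> Result<(Vec<String>, Vec<i16>), String> {
    let mut values = Vec::new();
    let mut def_levels = Vec::with_capacity(keys.len());

    for key in keys {
        match *key {
            // Nulls contribute a definition level but no value.
            None => def_levels.push(0),
            Some(k) => {
                // Reject negative keys and keys past the end of the dictionary.
                let v = usize::try_from(k)
                    .ok()
                    .and_then(|i| dict_values.get(i))
                    .ok_or_else(|| format!("dictionary key {} is out of range", k))?;
                values.push(v.to_string());
                def_levels.push(1);
            }
        }
    }
    Ok((values, def_levels))
}

fn main() {
    let keys = [Some(0), None, Some(1), Some(0)];
    let dict = ["x", "y"];
    let (values, def_levels) = materialize_dictionary(&keys, &dict).expect("valid keys");
    println!("values = {:?}, def_levels = {:?}", values, def_levels);
}
```

Once the values are flat, whether they end up dictionary-encoded in the file is left to the writer's settings, as the commit message describes.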
My pleasure, Carol, and thanks for cleaning up the cast converters.
And yes, as you noted, I cherry-picked the "We need a custom comparison of ArrayData" commit from your ARROW-7842-cherry branch so that more tests would work on this branch. Do you think that commit is ready to go, even if the other commits on that branch aren't?
Yea, I really should get back to #8200. We need to make breaking changes to truly fix it, so I should disable the tests that still fail, so we can merge it in. I'll prioritise that this week.
I'm happy with this PR, subject to removing the C++ comment, and a clean CI (unless it's something minor).
I think after this, we should move work to the main branch for the rest of the Parquet <> Arrow IO 👍🏾
@nevi-me Rebased and fixed the last few things! I pulled the comment fix into its own PR, thanks for the tip on CI. Hoping for all greens now!
This adds more support for:

- When converting Arrow -> Parquet containing an Arrow Dictionary, materialize the Dictionary values and send to Parquet to be encoded with a dictionary or not according to the Parquet settings (deliberately not supporting converting an Arrow Dictionary directly to Parquet DictEncoding, and right now this only supports String dictionaries)
- When converting Parquet -> Arrow, noticing that the Arrow schema metadata in a Parquet file has a Dictionary type and converting the data to an Arrow dictionary (right now this only supports String dictionaries)

I'm not sure if this is in a good enough state to merge or not yet, please let me know @nevi-me!

Closes #8402 from carols10cents/dict

Lead-authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com>
Co-authored-by: Neville Dipale <nevilledips@gmail.com>
Co-authored-by: Jake Goulding <jake.goulding@gmail.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
Merged