
ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts #8402

Closed

Conversation

@carols10cents (Contributor) commented Oct 8, 2020

This adds more support for:

  • When converting Arrow -> Parquet containing an Arrow Dictionary,
    materialize the Dictionary values and send to Parquet to be encoded with
    a dictionary or not according to the Parquet settings (deliberately not supporting
    converting an Arrow Dictionary directly to Parquet DictEncoding, and right now this only supports String dictionaries)
  • When converting Parquet -> Arrow, noticing that the Arrow schema
    metadata in a Parquet file has a Dictionary type and converting the data
    to an Arrow dictionary (right now this only supports String dictionaries)

I'm not sure if this is in a good enough state to merge or not yet, please let me know @nevi-me !
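For illustration only (a sketch, not the code in this PR): "materializing" the dictionary here means expanding the keys and values into a plain array before handing the data to the Parquet writer, something arrow's cast kernel is assumed to handle for string dictionaries.

```rust
use std::sync::Arc;

use arrow::array::{Array, DictionaryArray, StringArray};
use arrow::compute::kernels::cast::cast;
use arrow::datatypes::{DataType, Int32Type};

fn main() -> arrow::error::Result<()> {
    // A Dictionary<Int32, Utf8> array: keys index into a small set of values.
    let dict: DictionaryArray<Int32Type> =
        vec!["foo", "bar", "foo", "baz"].into_iter().collect();
    let dict: Arc<dyn Array> = Arc::new(dict);

    // "Materialize" the dictionary: expand keys + values into a plain Utf8
    // array. The Parquet writer can then dictionary-encode it (or not) based
    // on its own settings, independent of the Arrow-side encoding.
    let materialized = cast(&dict, &DataType::Utf8)?;
    let strings = materialized
        .as_any()
        .downcast_ref::<StringArray>()
        .expect("cast produced a StringArray");
    assert_eq!(strings.value(2), "foo");
    Ok(())
}
```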


@nevi-me (Contributor) commented Oct 10, 2020

Parquet's dictionary encoding is a complexity on its own. My understanding is that after a certain size the dictionary no longer grows, but the additional values are stored the normal way. I still need to spend more time on parquet-mr and the format.
I think the approach of not forcing Arrow dictionaries to have Parquet dictionary encoding is good.

> also only supports Int32 index types in this commit, also removes NULLs

Do you want to work on other index types and supporting primitive Arrow dictionaries? We could keep this PR open for longer, as long as it's not blocking any additional unit of work.
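For reference, the fallback described above is governed by the Parquet writer settings rather than the Arrow-side representation. A minimal sketch using the parquet crate's WriterProperties (the limit value is arbitrary, and the exact fallback behaviour is as I understand it, not verified against this PR):

```rust
use parquet::file::properties::WriterProperties;

fn main() {
    // The Parquet writer, not the Arrow encoding, decides whether a column is
    // dictionary-encoded; once the dictionary grows past the configured page
    // size limit, the writer is expected to fall back to plain encoding for
    // the remaining values.
    let props = WriterProperties::builder()
        .set_dictionary_enabled(true)
        .set_dictionary_pagesize_limit(1024 * 1024) // 1 MiB cap (arbitrary)
        .build();

    println!(
        "dictionary pages capped at {} bytes",
        props.dictionary_pagesize_limit()
    );
}
```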

nevi-me force-pushed the rust-parquet-arrow-writer branch from b27f63a to bd3c714 on October 12, 2020
@carols10cents (Contributor, Author) commented:

> Do you want to work on other index types and supporting primitive Arrow dictionaries? We could keep this PR open for longer, as long as it's not blocking any additional unit of work.

Yup, I'm happy to do that! I'll be rebasing, addressing feedback, and adding to this on Wednesday.

nevi-me force-pushed the rust-parquet-arrow-writer branch from bd3c714 to f70e6db on October 16, 2020
carols10cents force-pushed the dict branch 2 times, most recently from 5fc3543 to 0dcf149 on October 16, 2020
carols10cents marked this pull request as draft on October 16, 2020
@carols10cents (Contributor, Author) commented:

Status update: The other index types are done, but primitive dictionaries are not yet.

nevi-me force-pushed the rust-parquet-arrow-writer branch from f70e6db to ead5e14 on October 17, 2020
carols10cents force-pushed the dict branch 3 times, most recently from 2bf54f9 to 79b78d9 on October 22, 2020
@carols10cents (Contributor, Author) commented:

@vertexclique @nevi-me I'm feeling stuck on converting primitive dictionaries...

I have a solution that works for one key/value type, but I tried to expand that to all the types and it involves listing out all the possible combinations (😱) and overflows the stack (😱😱😱).

I have tried to find a different abstraction, though, and the type checker doesn't like anything I've come up with. Do you have any suggestions?

@alamb (Contributor) commented Oct 23, 2020

> @vertexclique @nevi-me I'm feeling stuck on converting primitive dictionaries...
>
> I have a solution that works for one key/value type, but I tried to expand that to all the types and it involves listing out all the possible combinations (😱) and overflows the stack (😱😱😱).
>
> I have tried to find a different abstraction, though, and the type checker doesn't like anything I've come up with. Do you have any suggestions?

@carols10cents -- one idea I had, which might be less efficient at runtime but possibly less complicated to implement, would be to use the arrow cast kernels here: https://github.com/apache/arrow/blob/master/rust/arrow/src/compute/kernels/cast.rs

So rather than going directly from ParquetType to DesiredArrowType, we could go from ParquetType --> CanonicalArrowType and then from CanonicalArrowType --> DesiredArrowType.

So for example, to generate a Dictionary<UInt8, Utf8> from a parquet column of Utf8, you could always create Dictionary<UInt64, Utf8> and then use cast to go to the desired arrow type.

Does that make sense?
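A minimal sketch of that suggestion (illustrative values and types, not code from this PR, and assuming the cast kernel supports Utf8-to-dictionary casts): read the column as a plain Utf8 array and let cast build the dictionary-encoded version.

```rust
use std::sync::Arc;

use arrow::array::{Array, DictionaryArray, StringArray};
use arrow::compute::kernels::cast::cast;
use arrow::datatypes::{DataType, UInt8Type};

fn main() -> arrow::error::Result<()> {
    // Stand-in for what the Parquet reader produced: a plain Utf8 column.
    let from_parquet: Arc<dyn Array> =
        Arc::new(StringArray::from(vec!["a", "b", "a", "c"]));

    // Cast the canonical array to the dictionary type the Arrow schema asks
    // for, instead of building the dictionary by hand in the reader.
    let target = DataType::Dictionary(Box::new(DataType::UInt8), Box::new(DataType::Utf8));
    let dict = cast(&from_parquet, &target)?;

    let dict = dict
        .as_any()
        .downcast_ref::<DictionaryArray<UInt8Type>>()
        .expect("cast produced a Dictionary<UInt8, Utf8> array");
    assert_eq!(dict.len(), 4);
    Ok(())
}
```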

@carols10cents (Contributor, Author) commented:

> @carols10cents -- one idea I had, which might be less efficient at runtime but possibly less complicated to implement, would be to use the arrow cast kernels here: https://github.com/apache/arrow/blob/master/rust/arrow/src/compute/kernels/cast.rs
>
> So rather than going directly from ParquetType to DesiredArrowType, we could go from ParquetType --> CanonicalArrowType and then from CanonicalArrowType --> DesiredArrowType.
>
> So for example, to generate a Dictionary<UInt8, Utf8> from a parquet column of Utf8, you could always create Dictionary<UInt64, Utf8> and then use cast to go to the desired arrow type.
>
> Does that make sense?

Not really, because I am using the cast kernels in the Converter: 4b59fc9 (#8402) in the style of the other converters in that file, so I'm not sure how to rearrange that to reduce complexity :-/ Could you possibly put together a code sketch of what you mean?

@alamb (Contributor) commented Oct 23, 2020

> Not really, because I am using the cast kernels in the Converter: 4b59fc9 (#8402) in the style of the other converters in that file

That code seems to be using cast as part of the implementation of casting dictionary keys, very similar to what cast does internally: https://github.com/apache/arrow/blob/master/rust/arrow/src/compute/kernels/cast.rs#L979-L982

I was trying to suggest that, rather than making some sort of generic Convert<> function, there could be a function like:

convert_parquet_column(col_iter) --> ArrayRef

Where the choice of type for ArrayRef is entirely based on the type of col_iter (the parquet type, not the desired arrow type), and the cast kernel is then used to go to the specific Arrow type that is needed.
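A sketch of that shape (the helper name and the byte-array-to-Utf8 mapping here are illustrative, not the crate's API): the canonical array is chosen from the Parquet type alone, and the requested Arrow type is reached afterwards via cast.

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, StringArray};
use arrow::compute::kernels::cast::cast;
use arrow::datatypes::DataType;
use arrow::error::Result;

// Hypothetical helper: the output array type is driven purely by the Parquet
// physical type (byte array -> Utf8 here), never by the desired Arrow type.
fn convert_parquet_byte_array_column(values: Vec<Option<&str>>) -> ArrayRef {
    Arc::new(StringArray::from(values))
}

fn main() -> Result<()> {
    // Step 1: build the canonical Arrow array for this Parquet column.
    let canonical = convert_parquet_byte_array_column(vec![Some("x"), None, Some("y")]);

    // Step 2: reach the Arrow type requested by the schema metadata with the
    // cast kernel instead of a dedicated converter per type combination.
    let desired = DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8));
    let as_dictionary = cast(&canonical, &desired)?;
    assert_eq!(as_dictionary.data_type(), &desired);
    Ok(())
}
```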

nevi-me force-pushed the rust-parquet-arrow-writer branch from 8ccd9c3 to 9ba2179 on October 25, 2020
@nevi-me (Contributor) commented Oct 25, 2020

I've botched this branch a bit with my rebase. After rebasing it against the parquet branch, I started getting stack overflows in datafusion and parquet.

(EDIT: I only now see that you mentioned the overflow in your last commit.)

I've established that my safest path back to the parquet branch is to first rebase against one of my branches: nevi-me/arrow@ARROW-7842-cherry...integer32llc:dict

@carols10cents please let me know what you think.
I was trying to get everything up to date before I look at this PR locally (I want to try Andrew's cast approach).

@nevi-me (Contributor) commented Oct 25, 2020

@carols10cents @alamb I think the whole reader logic needs replumbing ... There's at least a 1:1 mapping between Parquet types and Arrow types, and we can cast from Arrow types to other Arrow types based on the Arrow metadata. This is a less complex path, because one of the things I've been concerned about is that I/we are going to struggle a lot when we get to deeply-nested reads.

I previously didn't understand your needs re. dictionary support between Parquet > Arrow > DataFusion. I now have context, so I can make decisions better.

My plan was to remove the CastRecordReader trait altogether, and instead use Arrow casts.
I prefer Arrow casts because they transparently handle any dyn Array and DataType, instead of the combinatorial CastRecordReader.

I've now done this in integer32llc#3, but I left a lot of TODOs which I'd love for us to address so we don't carry the tech debt of cast converters.

The tests all pass now 🎊
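As a small illustration of why the cast-based path avoids combinatorial converters (a sketch under the assumption that arrow's cast supports dictionary-to-dictionary key casts, which is the code path linked earlier): re-keying a dictionary only needs a target DataType, not a converter per key/value pair.

```rust
use std::sync::Arc;

use arrow::array::{Array, DictionaryArray};
use arrow::compute::kernels::cast::cast;
use arrow::datatypes::{DataType, Int32Type};

fn main() -> arrow::error::Result<()> {
    // The "canonical" dictionary a reader might produce (Int32 keys).
    let dict: DictionaryArray<Int32Type> =
        vec!["low", "high", "low"].into_iter().collect();
    let dict: Arc<dyn Array> = Arc::new(dict);

    // cast only needs a dyn Array and a target DataType, so switching the key
    // type (Int32 -> UInt8) requires no dedicated converter.
    let target = DataType::Dictionary(Box::new(DataType::UInt8), Box::new(DataType::Utf8));
    let rekeyed = cast(&dict, &target)?;
    assert_eq!(rekeyed.data_type(), &target);
    Ok(())
}
```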

This adds more support for:

- When converting Arrow -> Parquet containing an Arrow Dictionary,
materialize the Dictionary values and send to Parquet to be encoded with
a dictionary or not according to the Parquet settings (not supported:
converting an Arrow Dictionary directly to Parquet DictEncoding, also
only supports Int32 index types in this commit, also removes NULLs)
- When converting Parquet -> Arrow, noticing that the Arrow schema
metadata in a Parquet file has a Dictionary type and converting the data
to an Arrow dictionary (right now this only supports String dictionaries)
@nevi-me (Contributor) left a comment:


My pleasure Carol, and thanks for cleaning up the cast converters

> And yes, as you noted, I cherry-picked the "We need a custom comparison of ArrayData" commit from your ARROW-7842-cherry branch so that more tests would work on this branch. Do you think that commit is ready to go, even if the other commits on that branch aren't?

Yeah, I really should get back to #8200. We need to make breaking changes to truly fix it, so I should disable the tests that still fail so that we can merge it in. I'll prioritise that this week.

I'm happy with this PR, subject to removing the C++ comment, and a clean CI (unless it's something minor).

I think after this, we should move work to the main branch for the rest of the Parquet <> Arrow IO 👍🏾

@carols10cents (Contributor, Author) commented:

@nevi-me Rebased and fixed the last few things! I pulled the comment fix into its own PR, thanks for the tip on CI. Hoping for all greens now!

nevi-me added a commit that referenced this pull request on Oct 27, 2020
This adds more support for:

- When converting Arrow -> Parquet containing an Arrow Dictionary,
materialize the Dictionary values and send to Parquet to be encoded with
a dictionary or not according to the Parquet settings (deliberately not supporting
converting an Arrow Dictionary directly to Parquet DictEncoding, and right now this only supports String dictionaries)
- When converting Parquet -> Arrow, noticing that the Arrow schema
metadata in a Parquet file has a Dictionary type and converting the data
to an Arrow dictionary (right now this only supports String dictionaries)

I'm not sure if this is in a good enough state to merge or not yet, please let me know @nevi-me !

Closes #8402 from carols10cents/dict

Lead-authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com>
Co-authored-by: Neville Dipale <nevilledips@gmail.com>
Co-authored-by: Jake Goulding <jake.goulding@gmail.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
@nevi-me (Contributor) commented Oct 27, 2020

Merged

nevi-me closed this on Oct 27, 2020
nevi-me added a commit that referenced this pull request on Oct 27, 2020
carols10cents deleted the dict branch on October 28, 2020
nevi-me added a commit that referenced this pull request on Oct 28, 2020
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request on Jun 7, 2021