-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for dictionary stripes #68
Conversation
In Arrow, dictionary columns are a separate data type, while in ORC they are a per-stripe encoding. This means we cannot get an Arrow schema for a whole ORC file, the Arrow schema is only valid per-stripe
Oh this is a very good pickup 👀 Thanks for this, I'll try review soon 👍 |
I'm thinking we'll have to read all String into String type array and forgo copying dictionary encoded string stripe columns directly into Arrow dictionary arrays:
So I think will need to change the logic for decoding dictionary encoded string stripes to just decode to regular StringArray |
So this means datafusion won't be able to use ORC dictionaries for predicate pushdown, right? |
I haven't thought that far ahead yet honestly. The main takeaway is that we'll always decode to StringArrays. As for how that's done internally and how it can affect predicate pushdown, that remains to be seen. |
So I realized there is an arrow kernel for casting/converting from dictionary to primitive so I used it as a quick fix: 7f66552 My understanding of arrow dictionary encoding and how it interacts with datafusion is still immature so I'm continuing to do some reading on this matter (some of my assumptions in above comments are probably incorrect). I noticed parquet had a similar issue which I have been reading through: apache/arrow-rs#171 |
I've created an issue here for tracking: datafusion-contrib/orc-rust#24 I probably won't spend too much more time investigating this until the crate becomes more feature complete (will focus on correctness over performance for now), but will appreciate any further insights/contributions 👍 |
Should be raised against https://github.com/datafusion-contrib/orc-rust now |
In Arrow, dictionary columns are a separate data type, while in ORC they are a per-stripe encoding. This means we cannot get an Arrow schema for a whole ORC file, the Arrow schema is only valid per-stripe
Unfortunately, this breaks a feature of this crate (which I assume is important for
datafusion
), and I don't see a way out. Thoughts?This changes the error in
test1
from "Incorrect datatype" error to a difference in serialized output.(Note:
test1.orc
has a binary column, so you should apply #67 first if you want to see the change.)