-
Notifications
You must be signed in to change notification settings - Fork 3.9k
ARROW-17524: [C++] Correction for fields included when reading an ORC table #13962
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-17524: [C++] Correction for fields included when reading an ORC table #13962
Conversation
|
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename pull request title in the following format? or See also: |
|
@LouisClt Thanks for the report and attempt at fixing! Could you open a JIRA ticket as described above? |
|
@pitrou Yes I can. |
|
|
|
C Glib tests do not pass because it test the old behaviour. I think the 2 behaviours could be understandable but the one with "include" is more coherent with the other imports. Furthermore, I do not know if there is a way in Arrow to get the list of internal ORC types, which makes selecting fields with "includeTypes" much more unreliable. |
|
So I changed the tests to reflect the new behaviour of selecting the fields. The job failure that remains seems to be unrelated. |
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
|
@LouisClt Feel free to say when this is ready for another review. |
|
Yes, this should be good now. I added a test in the ORC adapter concerning the selection of fields. Some tests did not pass, but I don't think it is related. |
| std::vector<int> selected_indices = {1, 3}; | ||
| AssertTableWriteReadEqual(table, table_selected, kDefaultSmallMemStreamSize / 16, | ||
| &selected_indices); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the test but:
- can we also test with non-empty data?
- can we test selecting a field that's after the struct (to ensure field numbering is as expected)?
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update @LouisClt . This looks good to me now.
|
CI failures are unrelated. |
|
Good to know, thanks @pitrou |
… table (#13962) I think there is a bug in the ORC reader : when we specify the fields indexes that we want to keep, it does not work correctly. Looking at the code, it seems to be because we do "includeTypes" in lieue of "include" when setting the ORC options. It can be problematic when we want to import an ORC table containing Union types as it will do an error at the import, even if we try not to import these specific fields. The definitions of the corresponding ORC methods are here : https://github.com/apache/orc/blob/72220851cbde164a22706f8d47741fd1ad3db190/c%2B%2B/src/Options.hh#L185-L191 and https://github.com/apache/orc/blob/72220851cbde164a22706f8d47741fd1ad3db190/c%2B%2B/src/Options.hh#L201-L207 Lead-authored-by: LouisClt <louis1110@hotmail.fr> Co-authored-by: Antoine Pitrou <antoine@python.org> Co-authored-by: Antoine Pitrou <pitrou@free.fr> Signed-off-by: Antoine Pitrou <antoine@python.org>
I think there is a bug in the ORC reader : when we specify the fields indexes that we want to keep, it does not work correctly. Looking at the code, it seems to be because we do "includeTypes" in lieue of "include" when setting the ORC options.
It can be problematic when we want to import an ORC table containing Union types as it will do an error at the import, even if we try not to import these specific fields.
The definitions of the corresponding ORC methods are here :
https://github.com/apache/orc/blob/72220851cbde164a22706f8d47741fd1ad3db190/c%2B%2B/src/Options.hh#L185-L191
and
https://github.com/apache/orc/blob/72220851cbde164a22706f8d47741fd1ad3db190/c%2B%2B/src/Options.hh#L201-L207