Skip to content

Conversation

@LouisClt
Copy link
Contributor

I think there is a bug in the ORC reader : when we specify the fields indexes that we want to keep, it does not work correctly. Looking at the code, it seems to be because we do "includeTypes" in lieue of "include" when setting the ORC options.
It can be problematic when we want to import an ORC table containing Union types as it will do an error at the import, even if we try not to import these specific fields.

The definitions of the corresponding ORC methods are here :
https://github.com/apache/orc/blob/72220851cbde164a22706f8d47741fd1ad3db190/c%2B%2B/src/Options.hh#L185-L191

and
https://github.com/apache/orc/blob/72220851cbde164a22706f8d47741fd1ad3db190/c%2B%2B/src/Options.hh#L201-L207

@github-actions
Copy link

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@pitrou
Copy link
Member

pitrou commented Aug 25, 2022

@LouisClt Thanks for the report and attempt at fixing! Could you open a JIRA ticket as described above?

@LouisClt LouisClt changed the title Correction for fields included when reading an ORC table ARROW-17524: [C++] Correction for fields included when reading an ORC table Aug 25, 2022
@LouisClt
Copy link
Contributor Author

@pitrou Yes I can.
Here is the new Jira https://issues.apache.org/jira/browse/ARROW-17524

@LouisClt LouisClt marked this pull request as ready for review August 25, 2022 13:50
@github-actions
Copy link

@github-actions
Copy link

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@LouisClt
Copy link
Contributor Author

C Glib tests do not pass because it test the old behaviour.
Looking a bit more in detail at the difference between "include" and "includeTypes", it appears that "includeTypes" is based on indices of the tables types. The root of the tree (meaning the whole table) is type of index 0, and when there are complex types such as structs would give other "child" types. See https://orc.apache.org/docs/types.html for more information.
"include" select the fields in the same way than the other imports : each table has a certain number of fields, (we do not take into account the "children" types), if we want to select the first field, we put "0" in the list.

I think the 2 behaviours could be understandable but the one with "include" is more coherent with the other imports. Furthermore, I do not know if there is a way in Arrow to get the list of internal ORC types, which makes selecting fields with "includeTypes" much more unreliable.

@LouisClt
Copy link
Contributor Author

So I changed the tests to reflect the new behaviour of selecting the fields. The job failure that remains seems to be unrelated.
There is a decision to be made whether or not the intended behaviour of SelectFields is the previous one or this one.

@pitrou
Copy link
Member

pitrou commented Oct 5, 2022

@LouisClt Feel free to say when this is ready for another review.

@LouisClt LouisClt requested a review from pitrou October 6, 2022 07:29
@LouisClt
Copy link
Contributor Author

LouisClt commented Oct 6, 2022

Yes, this should be good now. I added a test in the ORC adapter concerning the selection of fields. Some tests did not pass, but I don't think it is related.

std::vector<int> selected_indices = {1, 3};
AssertTableWriteReadEqual(table, table_selected, kDefaultSmallMemStreamSize / 16,
&selected_indices);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the test but:

  • can we also test with non-empty data?
  • can we test selecting a field that's after the struct (to ensure field numbering is as expected)?

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update @LouisClt . This looks good to me now.

@pitrou
Copy link
Member

pitrou commented Oct 18, 2022

CI failures are unrelated.

@pitrou pitrou merged commit d35bf87 into apache:master Oct 18, 2022
@LouisClt
Copy link
Contributor Author

Good to know, thanks @pitrou

kou pushed a commit that referenced this pull request Oct 20, 2022
… table (#13962)

I think there is a bug in the ORC reader : when we specify the fields indexes that we want to keep, it does not work correctly. Looking at the code, it seems to be because we do "includeTypes" in lieue of "include" when setting the ORC options.
It can be problematic when we want to import an ORC table containing Union types as it will do an error at the import, even if we try not to import these specific fields.

The definitions of the corresponding ORC methods are here : 
https://github.com/apache/orc/blob/72220851cbde164a22706f8d47741fd1ad3db190/c%2B%2B/src/Options.hh#L185-L191

and 
https://github.com/apache/orc/blob/72220851cbde164a22706f8d47741fd1ad3db190/c%2B%2B/src/Options.hh#L201-L207

Lead-authored-by: LouisClt <louis1110@hotmail.fr>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants