Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Set the officially assigned id for the ORC writer implemented by cuDF #13977

Closed
guiyanakuang opened this issue Aug 27, 2023 · 5 comments
Closed
Assignees
Labels
0 - Backlog In queue waiting for assignment cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

Comments

@guiyanakuang
Copy link

Is your feature request related to a problem? Please describe.
The Apache ORC community is planning to assign a writer id to cuDF in version 1.8.5.

cuDF has a standalone ORC writer implementation. If cuDF sets this id, then the official ORC reader can distinguish writers based on the file, and even provide independent compatibility methods for different writers to solve the write errors that have been persisted in the file.

Currently we have assigned ids for Java/C++/Presto/Scritchley/Trino writers.

Describe the solution you'd like
Add writerVersion field for PostScript, set value to 6.
Add writer field for FileFooter, set value to 5.

Additional context
It should be noted that if cuDF sets this id, it must be read with version 1.8.5 or higher. Lower versions will consider this a future impl and cannot support file merging.

@guiyanakuang guiyanakuang added Needs Triage Need team to review and classify feature request New feature or request labels Aug 27, 2023
@GregoryKimball
Copy link
Contributor

Thank you @guiyanakuang for sharing the assigned ids for libcudf's ORC writer.

It should be noted that if cuDF sets this id, it must be read with version 1.8.5 or higher. Lower versions will consider this a future impl and cannot support file merging.

Would you please help me learn more about this? I'm concerned that we might create compatibility issues for our users if we start setting this id. Would you please let me know if there are any known issues in libcudf's ORC writer where the official ORC reader would make use of the assigned ids?

@GregoryKimball GregoryKimball added 0 - Backlog In queue waiting for assignment libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue and removed Needs Triage Need team to review and classify labels Sep 27, 2023
@guiyanakuang
Copy link
Author

guiyanakuang commented Sep 27, 2023

cuDF(No id set) +-----write-----> ORC <------read--------+ ORC reader

This is also the current situation. Since orc 1.4 (2017), reading of the orc writer id has been supported, which means that the orc in the current production environment almost certainly supports reading of the writer id. cuDF does not set the id, and the orc reader defaults It will be regarded as created by orc java writer. This is why in the fuzzy area of the orc specification, although the files written by cuDF do not violate the specification, they are different from the official Java writer implementation, which may also cause exceptions for the official readers.

#9964
#13793

cuDF(set id) +-----write-----> ORC <------read--------+ ORC reader(>= 1.8.5 or >=1.9.2 or higher)

Recognize cuDF id and everything is fine

cuDF(set id) +-----write-----> ORC <------read--------+ ORC reader(<= 1.8.4 or <=1.9.1 or lower)

The read writer id is beyond the known range and is treated as a future id. There is no problem in reading, but the file cannot be merged. (Merge involves reading and writing)

https://github.com/apache/orc/blob/873e48f74da544b21bd0c5011f29ec8b40e0e4f9/java/core/src/java/org/apache/orc/OrcFile.java#L753-L759

https://github.com/apache/orc/blob/873e48f74da544b21bd0c5011f29ec8b40e0e4f9/java/core/src/java/org/apache/orc/OrcFile.java#L1055-L1065

@GregoryKimball
Copy link
Contributor

Thank you @guiyanakuang for adding more information here, but I'm afraid I still don't understand the situation.

If we start setting the id, would that mean that users with the official ORC reader (<= 1.8.4 or <=1.9.1 or lower) would no longer be able to merge ORC files written by cuDF with older files written by cuDF?

@guiyanakuang
Copy link
Author

Thank you @guiyanakuang for adding more information here, but I'm afraid I still don't understand the situation.

If we start setting the id, would that mean that users with the official ORC reader (<= 1.8.4 or <=1.9.1 or lower) would no longer be able to merge ORC files written by cuDF with older files written by cuDF?

Yes, for lower version readers (<= 1.8.4 or <=1.9.1 or lower), the cuDF writer id is unknown and is considered a future implementation for them, so they cannot guarantee the ability to merge these files. Apache Spark 3.4.2 started using ORC 1.8.5. If there are concerns about users being unable to merge files, unfortunately, we can only wait for users to upgrade to supported versions.

@vuule
Copy link
Contributor

vuule commented Dec 20, 2023

Implemented in #14458

@vuule vuule closed this as completed Dec 20, 2023
@github-project-automation github-project-automation bot moved this from In Progress to Done in cuDF/Dask/Numba/UCX Dec 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0 - Backlog In queue waiting for assignment cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.
Projects
Archived in project
Archived in project
Development

No branches or pull requests

3 participants