-
Notifications
You must be signed in to change notification settings - Fork 908
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEA] Set the officially assigned id for the ORC writer implemented by cuDF #13977
Comments
Thank you @guiyanakuang for sharing the assigned ids for libcudf's ORC writer.
Would you please help me learn more about this? I'm concerned that we might create compatibility issues for our users if we start setting this id. Would you please let me know if there are any known issues in libcudf's ORC writer where the official ORC reader would make use of the assigned ids? |
cuDF(No id set) +-----write-----> ORC <------read--------+ ORC reader This is also the current situation. Since orc 1.4 (2017), reading of the orc writer id has been supported, which means that the orc in the current production environment almost certainly supports reading of the writer id. cuDF does not set the id, and the orc reader defaults It will be regarded as created by orc java writer. This is why in the fuzzy area of the orc specification, although the files written by cuDF do not violate the specification, they are different from the official Java writer implementation, which may also cause exceptions for the official readers. cuDF(set id) +-----write-----> ORC <------read--------+ ORC reader(>= 1.8.5 or >=1.9.2 or higher) Recognize cuDF id and everything is fine cuDF(set id) +-----write-----> ORC <------read--------+ ORC reader(<= 1.8.4 or <=1.9.1 or lower) The read writer id is beyond the known range and is treated as a future id. There is no problem in reading, but the file cannot be merged. (Merge involves reading and writing) |
Thank you @guiyanakuang for adding more information here, but I'm afraid I still don't understand the situation. If we start setting the id, would that mean that users with the official ORC reader |
Yes, for lower version readers (<= 1.8.4 or <=1.9.1 or lower), the cuDF writer id is unknown and is considered a future implementation for them, so they cannot guarantee the ability to merge these files. Apache Spark 3.4.2 started using ORC 1.8.5. If there are concerns about users being unable to merge files, unfortunately, we can only wait for users to upgrade to supported versions. |
Implemented in #14458 |
Is your feature request related to a problem? Please describe.
The Apache ORC community is planning to assign a writer id to cuDF in version 1.8.5.
cuDF has a standalone ORC writer implementation. If cuDF sets this id, then the official ORC reader can distinguish writers based on the file, and even provide independent compatibility methods for different writers to solve the write errors that have been persisted in the file.
Currently we have assigned ids for Java/C++/Presto/Scritchley/Trino writers.
Describe the solution you'd like
Add writerVersion field for PostScript, set value to 6.
Add writer field for FileFooter, set value to 5.
Additional context
It should be noted that if cuDF sets this id, it must be read with version 1.8.5 or higher. Lower versions will consider this a future impl and cannot support file merging.
The text was updated successfully, but these errors were encountered: