[FEA] Set the officially assigned id for the ORC writer implemented by cuDF #13977

guiyanakuang · 2023-08-27T07:16:38Z

Is your feature request related to a problem? Please describe.
The Apache ORC community is planning to assign a writer id to cuDF in version 1.8.5.

cuDF has a standalone ORC writer implementation. If cuDF sets this id, the official ORC reader can identify which writer produced a file, and can even apply writer-specific compatibility handling to work around write errors that have already been persisted in files.

Currently we have assigned ids for Java/C++/Presto/Scritchley/Trino writers.

Describe the solution you'd like
Add writerVersion field for PostScript, set value to 6.
Add writer field for FileFooter, set value to 5.
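
For reference, the two values requested above can be sketched as a small lookup. This is only an illustration: `WRITER_IDS`, `WriterIdentity`, and `describe` are hypothetical names, not part of cuDF or ORC, and the numeric ids 0 to 4 reflect my reading of the comments in ORC's protobuf definition, so they are worth checking against the ORC source. Only the cuDF values (writer 5, writerVersion 6) come from this issue.

```python
from dataclasses import dataclass

# Footer.writer ids. Entries 0-4 are an assumption based on the ORC protobuf
# comments; entry 5 is the id requested for cuDF in this issue.
WRITER_IDS = {
    0: "ORC Java",
    1: "ORC C++",
    2: "Presto",
    3: "Scritchley Go",
    4: "Trino",
    5: "cuDF",
}


@dataclass(frozen=True)
class WriterIdentity:
    """Hypothetical container for the two fields named in this issue."""

    footer_writer: int              # value of the Footer `writer` field
    postscript_writer_version: int  # value of the PostScript `writerVersion` field


# The values this issue asks cuDF to write.
CUDF_IDENTITY = WriterIdentity(footer_writer=5, postscript_writer_version=6)


def describe(identity: WriterIdentity) -> str:
    """Return a human-readable label for a writer identity."""
    name = WRITER_IDS.get(identity.footer_writer, "unknown (future) writer")
    return f"{name}, writerVersion={identity.postscript_writer_version}"


print(describe(CUDF_IDENTITY))
```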

Additional context
It should be noted that if cuDF sets this id, it must be read with version 1.8.5 or higher. Lower versions will consider this a future impl and cannot support file merging.

GregoryKimball · 2023-09-27T02:29:50Z

Thank you @guiyanakuang for sharing the assigned ids for libcudf's ORC writer.

> It should be noted that if cuDF sets this id, it must be read with version 1.8.5 or higher. Lower versions will consider this a future impl and cannot support file merging.

Would you please help me learn more about this? I'm concerned that we might create compatibility issues for our users if we start setting this id. Would you please let me know if there are any known issues in libcudf's ORC writer where the official ORC reader would make use of the assigned ids?

guiyanakuang · 2023-09-27T09:26:11Z

cuDF(No id set) +-----write-----> ORC <------read--------+ ORC reader

This is the current situation. Reading the writer id has been supported since ORC 1.4 (2017), so the ORC readers in today's production environments almost certainly support it. Because cuDF does not set the id, the ORC reader by default treats the file as if it were created by the ORC Java writer. This matters in the ambiguous areas of the ORC specification: files written by cuDF do not violate the specification, but where they differ from the official Java writer implementation they may cause exceptions in the official readers.

#9964
#13793

cuDF(set id) +-----write-----> ORC <------read--------+ ORC reader(>= 1.8.5 or >=1.9.2 or higher)

The reader recognizes the cuDF id and everything works as expected.

cuDF(set id) +-----write-----> ORC <------read--------+ ORC reader(<= 1.8.4 or <=1.9.1 or lower)

The writer id read from the file is outside the known range, so it is treated as a future id. Reading works without problems, but the file cannot be merged (merging involves both reading and writing).

https://github.com/apache/orc/blob/873e48f74da544b21bd0c5011f29ec8b40e0e4f9/java/core/src/java/org/apache/orc/OrcFile.java#L753-L759

https://github.com/apache/orc/blob/873e48f74da544b21bd0c5011f29ec8b40e0e4f9/java/core/src/java/org/apache/orc/OrcFile.java#L1055-L1065
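
The three scenarios above can be modeled with a small sketch. It is a simplified Python model of the behavior described in this comment, not a port of the linked `OrcFile.java` code. The names `classify_writer`, `can_merge`, `OLD_READER_IDS`, and `NEW_READER_IDS` are hypothetical, and the id-to-writer mapping for ids 0 to 4 is an assumption.

```python
from typing import Dict, Optional

JAVA = "ORC Java"
FUTURE = "future"

# Writer ids known to a reader without cuDF support (assumed mapping for 0-4).
OLD_READER_IDS: Dict[int, str] = {
    0: JAVA, 1: "ORC C++", 2: "Presto", 3: "Scritchley Go", 4: "Trino",
}
# A reader that recognizes the cuDF id additionally knows id 5.
NEW_READER_IDS: Dict[int, str] = {**OLD_READER_IDS, 5: "cuDF"}


def classify_writer(footer_writer: Optional[int], known_ids: Dict[int, str]) -> str:
    """Map the writer id stored in a file to what the reader believes wrote it."""
    if footer_writer is None:
        # No id set: the reader falls back to the Java writer.
        return JAVA
    # Ids outside the known range are treated as a future implementation.
    return known_ids.get(footer_writer, FUTURE)


def can_merge(writer: str) -> bool:
    """Reading always works in this model; merging is refused for future writers."""
    return writer != FUTURE


for label, writer_id, ids in [
    ("cuDF (no id set), any reader", None, OLD_READER_IDS),
    ("cuDF (id set), new reader", 5, NEW_READER_IDS),
    ("cuDF (id set), old reader", 5, OLD_READER_IDS),
]:
    writer = classify_writer(writer_id, ids)
    print(f"{label}: seen as {writer!r}, mergeable={can_merge(writer)}")
```

The first case shows why files written by cuDF without an id are handled as Java-written files, and the last case shows the merge limitation on older readers.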

GregoryKimball · 2023-10-05T16:34:35Z

Thank you @guiyanakuang for adding more information here, but I'm afraid I still don't understand the situation.

If we start setting the id, would that mean that users with the official ORC reader (<= 1.8.4 or <=1.9.1 or lower) would no longer be able to merge ORC files written by cuDF with older files written by cuDF?

guiyanakuang · 2023-10-06T03:22:57Z

> Thank you @guiyanakuang for adding more information here, but I'm afraid I still don't understand the situation.
>
> If we start setting the id, would that mean that users with the official ORC reader (<= 1.8.4 or <=1.9.1 or lower) would no longer be able to merge ORC files written by cuDF with older files written by cuDF?

Yes. For older readers (<= 1.8.4 or <= 1.9.1), the cuDF writer id is unknown and is treated as a future implementation, so they cannot guarantee the ability to merge these files. Apache Spark 3.4.2 started using ORC 1.8.5. If there are concerns about users being unable to merge files, unfortunately the only option is to wait for users to upgrade to supported versions.
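
As a rough aid for checking a deployment, the version ranges quoted in this thread can be written as a small predicate. This is a sketch under assumptions: `reader_knows_cudf_id` is a hypothetical helper, it only accepts plain `X.Y.Z` version strings, and it treats every release line older than 1.8 as unsupported, which is how I read "or lower" above.

```python
def reader_knows_cudf_id(version: str) -> bool:
    """Return True if an ORC reader of this version should recognize the cuDF id.

    Encodes the ranges from this thread: >= 1.8.5 within the 1.8 line,
    or >= 1.9.2 otherwise. Assumes a plain "X.Y.Z" version string.
    """
    major, minor, patch = (int(part) for part in version.split(".")[:3])
    if (major, minor) == (1, 8):
        return patch >= 5
    return (major, minor, patch) >= (1, 9, 2)


for v in ["1.8.4", "1.8.5", "1.9.1", "1.9.2"]:
    print(v, reader_knows_cudf_id(v))
```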

vuule · 2023-12-20T21:59:44Z

Implemented in #14458

guiyanakuang added the Needs Triage and feature request labels Aug 27, 2023

github-project-automation bot added this to cuDF/Dask/Numba/UCX Aug 27, 2023

github-project-automation bot moved this to In Progress in cuDF/Dask/Numba/UCX Aug 27, 2023

GregoryKimball added the 0 - Backlog, libcudf, and cuIO labels and removed the Needs Triage label Sep 27, 2023

GregoryKimball added this to the ORC continuous improvement milestone Nov 13, 2023

GregoryKimball added this to libcudf Nov 13, 2023

GregoryKimball moved this to Needs owner in libcudf Nov 13, 2023

GregoryKimball assigned vuule Nov 14, 2023

vuule closed this as completed Dec 20, 2023

github-project-automation bot moved this from In Progress to Done in cuDF/Dask/Numba/UCX Dec 20, 2023
