[FEA] orc file written by cudf doesn't include Column Statistics in RowIndex #9964
Comments
Is there a bug filed against Spark for this? It is one thing to work around it in CUDF, which would be nice, but if the field really is optional then it is a bug in Spark itself.
I think it's an ORC issue, not a Spark issue. I just filed an ORC issue: https://issues.apache.org/jira/browse/ORC-1075
@wbo4958 can you confirm that Since |
@vuule, yes, columnStatistics is not present in RowIndex. I have filed an issue against the ORC reader, since this appears to be an ORC reader issue. Thanks.
Closing this issue, since the ORC file written by cudf follows the ORC format; cudf doesn't have to add statistics to the RowIndex. An ORC maintainer has confirmed it's an ORC Java issue, and there is a PR pending review.
I think we can keep this open as a feature request. @wbo4958 are you okay with this option?
sure.
Scoped out the feature. Changes required:
Closes #9964
Encodes row group level stats with the rest and writes the encoded blobs into the protobuf, at the start of each stripe (other stats are in the file footer).
Adds `put_bytes` to `ProtobufWriter` to optimize writing of buffers.
Adds new struct to represent the encoded ORC statistics so they are separated by granularity level (instead of using a single vector).
Authors:
- Vukasin Milovanovic (https://github.com/vuule)
Approvers:
- Mike Wilson (https://github.com/hyperbolic2346)
- https://github.com/nvdbaranec
URL: #10041
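For readers unfamiliar with the wire format, here is a minimal Python sketch of what writing a pre-encoded statistics blob into a protobuf message involves. It is not the cudf implementation (which is C++): `ToyProtobufWriter`, `put_uint`, `put_blob_field`, and the field number used below are hypothetical names and values chosen for illustration. Only the general protobuf rule is relied on: a length-delimited field is a varint tag (`field_number << 3 | 2`), a varint length, and then the raw bytes, which a bulk `put_bytes` can append in one call instead of byte by byte.

```python
class ToyProtobufWriter:
    """Toy stand-in for a protobuf writer; not cudf's ProtobufWriter."""

    def __init__(self):
        self.buf = bytearray()

    def put_byte(self, value):
        # Append a single byte.
        self.buf.append(value & 0xFF)

    def put_uint(self, value):
        # Base-128 varint: 7 payload bits per byte, high bit set on all but the last.
        while value > 0x7F:
            self.put_byte((value & 0x7F) | 0x80)
            value >>= 7
        self.put_byte(value)

    def put_bytes(self, data):
        # Bulk append of an already-encoded buffer.
        self.buf.extend(data)

    def put_blob_field(self, field_number, blob):
        # Length-delimited field (wire type 2): tag, length, payload.
        self.put_uint((field_number << 3) | 2)
        self.put_uint(len(blob))
        self.put_bytes(blob)


# Illustrative only: the blob stands in for an already-encoded statistics message.
writer = ToyProtobufWriter()
writer.put_blob_field(2, b"\x08\x01")
encoded = bytes(writer.buf)
```

The point of the sketch is that the statistics are encoded once, and the writer only has to frame the resulting bytes.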
Both PRs are merged, closing.
Spark 3.2 has changed its ORC dependency to 1.6.11, which behaves differently from ORC 1.5.10 (the version shaded by spark-plugins) when selecting row groups with a pushed-down filter.
In short, Spark 3.2 returns an empty result when reading an ORC file written by cudf with a filter pushed down, because the Column Statistics are missing from the RowIndex.
According to the ORC spec, Column Statistics in the RowIndex do not appear to be a required field. But if an ORC file does not include them, Spark returns incorrect results.
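The behavior described above can be illustrated with a toy model of row-group pruning. This is a sketch under stated assumptions, not the ORC Java reader's code: `RowGroup`, `may_contain`, `read_equal`, and the `missing_means_skip` flag are all hypothetical. It only shows why a reader that treats absent statistics as "no row can match" returns an empty result, while a reader that treats them as "unknown" has to scan the row group.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RowGroup:
    rows: List[int]
    # None models a row index entry written without column statistics.
    min_value: Optional[int] = None
    max_value: Optional[int] = None


def may_contain(group, target, missing_means_skip):
    """Decide whether a row group has to be read for the filter `col == target`."""
    if group.min_value is None or group.max_value is None:
        # Optional statistics are absent: a conforming reader cannot prune here.
        # The hypothetical faulty reader prunes anyway.
        return not missing_means_skip
    return group.min_value <= target <= group.max_value


def read_equal(groups, target, missing_means_skip=False):
    """Return the rows equal to `target`, skipping row groups that were pruned."""
    result = []
    for group in groups:
        if may_contain(group, target, missing_means_skip):
            result.extend(v for v in group.rows if v == target)
    return result


with_stats = [RowGroup([1, 2, 3], 1, 3), RowGroup([4, 5, 6], 4, 6)]
without_stats = [RowGroup([1, 2, 3]), RowGroup([4, 5, 6])]

found_with_stats = read_equal(with_stats, 5)
found_without_stats = read_equal(without_stats, 5)
found_by_faulty_reader = read_equal(without_stats, 5, missing_means_skip=True)
```

In this model the file without statistics is still valid; only the reader that prunes on missing statistics loses the matching row.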