[BUG] ORC statistics are wrong when a double column is all NULL. #13793
Comments
Just to clarify a little after reading the Java code: it appears that a default ColumnStatistics object is returned when there is no DoubleStatistics in the ColumnStatistics protocol buffers message. I am not sure whether Spark incorrectly assumes it will always be there when doing predicate push-down, but the fallback does look to be explicit in the ORC Java code.
From what I observe in the C++ repro code, the double statistics are simply missing when there are no values in the column. This makes sense: there is no min/max, and we (for some reason) don't include the sum for floating-point columns.
Yes, Spark just throws a warning message and skips the PPD. But from the ORC Java code, it does assume
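To illustrate the behavior described above, here is a minimal Python sketch (not Spark's actual implementation; the function name and stats layout are hypothetical) of a reader that fails open: when stripe statistics are incomplete, it logs a warning, skips predicate push-down, and scans the stripe anyway, so the query result stays correct and only the pruning optimization is lost.

```python
# Hedged sketch of fail-open predicate push-down; not Spark's real code.
import logging

log = logging.getLogger("orc.ppd")

def stripe_might_match(stats: dict, threshold: float) -> bool:
    """Return False only when the statistics PROVE no row can satisfy
    `value >= threshold`. On missing or broken stats, warn and return
    True, i.e. scan the stripe (fail open: correctness over pruning)."""
    try:
        return stats["maximum"] >= threshold
    except (KeyError, TypeError):
        log.warning("Skipping ORC PPD: incomplete double statistics")
        return True

# An all-NULL column produces stats with no min/max, so pruning is skipped:
# stripe_might_match({}, 1.0) -> True (with a warning logged)
```

The design point is that push-down is an optimization: a reader may safely degrade to a full scan on bad metadata, which is exactly why the query result is correct even though the "Skipping ORC PPD" warning appears.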
Yes; to clarify, this is not so much an issue with Spark as it is with the way the ORC Java code works. Any application using ORC's Java libraries will crash in a similar manner when trying to read these files.
Do you know if ORC Java accepts partial DoubleColumnStatistics (no min and max, but sum is present)? This is the change I made in the PR; to me, this is the only content that makes sense for a column without valid elements.
It looks like the crash is specific to Doubles and was introduced by ORC-629. I filed ORC-1482 to report the bug, however that doesn't change the fact that many ORC readers are broken until this is both fixed and adopted in those data frameworks.
The code appears to handle missing fields; see https://github.com/apache/orc/blob/v1.7.4/java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java#L522-L532.
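The fallback discussed in this thread can be sketched in a few lines of Python (hypothetical names; this mimics the shape of ORC Java's deserialization, not its actual API): a factory returns a generic base-class statistics object when the typed DoubleStatistics sub-message is absent, and any consumer that unconditionally assumes the typed class then fails, analogous to the ClassCastException an ORC Java reader would hit.

```python
# Hedged sketch of the generic-fallback pattern; not ORC Java's real classes.

class ColumnStatistics:
    """Generic stats: only the value count and hasNull flag are known."""
    def __init__(self, number_of_values, has_null):
        self.number_of_values = number_of_values
        self.has_null = has_null

class DoubleColumnStatistics(ColumnStatistics):
    """Typed stats, available only when doubleStatistics was serialized."""
    def __init__(self, number_of_values, has_null, minimum, maximum, total):
        super().__init__(number_of_values, has_null)
        self.minimum, self.maximum, self.sum = minimum, maximum, total

def deserialize_stats(proto: dict) -> ColumnStatistics:
    """Return typed stats when the doubleStatistics field is present;
    otherwise fall back to the generic base class."""
    ds = proto.get("doubleStatistics")
    if ds is not None:
        return DoubleColumnStatistics(proto["numberOfValues"], proto["hasNull"],
                                      ds["minimum"], ds["maximum"], ds["sum"])
    return ColumnStatistics(proto["numberOfValues"], proto["hasNull"])

def evaluate_min_predicate(stats: ColumnStatistics, threshold: float) -> bool:
    """A PPD check that assumes typed stats are always present: it raises
    AttributeError (the Java analogue is a ClassCastException) when handed
    the generic fallback object for an all-NULL column."""
    return stats.maximum >= threshold
```

Passing the fallback object from an all-NULL column into `evaluate_min_predicate` raises `AttributeError`, which mirrors why readers that assume the typed statistics crash on these files.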
Closes #7087, closes #13793, closes #13899

This PR adds support for several cases and statistics types:
- sum statistics are included even when all elements are null (no min/max);
- sum statistics are included in double stats;
- minimum/maximum and minimumNanos/maximumNanos are included in timestamp stats;
- hasNull field is written for all columns;
- decimal statistics.

Added tests for all supported stats.

Authors:
- Vukasin Milovanovic (https://github.com/vuule)
- Karthikeyan (https://github.com/karthikeyann)

Approvers:
- Lawrence Mitchell (https://github.com/wence-)
- Robert (Bobby) Evans (https://github.com/revans2)
- Vyas Ramasubramani (https://github.com/vyasr)
- Karthikeyan (https://github.com/karthikeyann)

URL: #13848
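The statistics content described in the fix can be illustrated with a small Python sketch (hypothetical helper, not cuDF's internal writer API): for a double column with no valid elements, the writer emits the sum (0.0 over an empty set) and the hasNull flag, but omits minimum/maximum, since they are undefined without at least one value.

```python
# Hedged sketch of the per-column statistics content; names are illustrative.
import math

def double_column_stats(values):
    """Compute ORC-style double statistics for a list where None means null."""
    valid = [v for v in values if v is not None]
    stats = {
        "numberOfValues": len(valid),
        "hasNull": any(v is None for v in values),
        # Sum is well defined even for an empty set, so it is always emitted.
        "sum": math.fsum(valid),
    }
    # min/max are only meaningful when at least one valid value exists.
    if valid:
        stats["minimum"] = min(valid)
        stats["maximum"] = max(valid)
    return stats
```

An all-NULL column thus yields `{"numberOfValues": 0, "hasNull": True, "sum": 0.0}` with no min/max keys, which is the "partial DoubleColumnStatistics" shape discussed earlier in the thread.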
Describe the bug
Spark reports a "Skipping ORC PPD" warning when reading an ORC file in which a double column is all NULL.
The statistics in the ORC file are wrong; refer to the following sections.
Note that the query result is correct, but the PPD was skipped due to this error.
Steps/Code to reproduce bug
Generate an ORC file on GPU:
Read the ORC file from Spark.
Expected behavior
Fix the "Skipping ORC PPD" warning.
Check other types besides the double type.
Environment overview (please complete the following information)
Environment details
cuDF 23.08 branch
Spark 3.3.0
orc-core-1.7.4.jar
Additional context
The error is:
Seems the ORC file does not contain DoubleColumnStatistics, so by default it's a ColumnStatisticsImpl.