Incompatibility with ORC files written in version 0.12 due to missing hasNull field in C++ Reader #2079
Comments
We are temporarily fixing this problem by applying a patch.
Hi @suxiaogang223, there is a patch #2055 to disable PPD when
Done, please review this PR.
close: apache#2079
relate pr: apache#2055

Introduce fallback logic in the C++ reader to set hasNull to true when the field is missing, similar to the Java implementation.

The Java implementation includes the following logic:

```java
if (stats.hasHasNull()) {
  hasNull = stats.getHasNull();
} else {
  hasNull = true;
}
```

In contrast, the C++ implementation directly uses the has_null value without any fallback logic:

```c++
ColumnStatisticsImpl::ColumnStatisticsImpl(const proto::ColumnStatistics& pb) {
  stats_.setNumberOfValues(pb.number_of_values());
  stats_.setHasNull(pb.has_null());
}
```

We encountered an issue with the C++ implementation of the ORC reader when handling ORC files written with version 0.12. Specifically, files written in this version do not include the hasNull field in the column statistics metadata. While the Java implementation of the ORC reader handles this gracefully by defaulting hasNull to true when the field is absent, the C++ implementation does not handle this scenario correctly.

**This issue prevents predicates like IS NULL from being pushed down to the ORC reader!!! As a result, all rows in the file are filtered out, leading to incorrect query results :(**

I have tested this using [Doris](https://github.com/apache/doris) external pipeline: apache/doris#45104 apache/doris-thirdparty#259

No

Closes apache#2082 from suxiaogang223/fix_has_null.

Authored-by: Socrates <suxiaogang223@icloud.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Issue Description:
We encountered an issue with the C++ implementation of the ORC reader when handling ORC files written with version 0.12. Specifically, files written in this version do not include the hasNull field in the column statistics metadata. While the Java implementation of the ORC reader handles this gracefully by defaulting hasNull to true when the field is absent, the C++ implementation does not handle this scenario correctly.
This issue breaks pushdown of predicates such as IS NULL. Because the missing hasNull field is read as false, the reader concludes that no row can match, so all rows in the file are filtered out, leading to incorrect query results.
Steps to Reproduce:
Expected Behavior:
The C++ ORC reader should default the hasNull field to true when it is absent, ensuring compatibility with older file versions.
Observed Behavior:
The C++ ORC reader defaults the hasNull field to false, resulting in incorrect metadata interpretation.
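To illustrate why a false default is harmful, here is a minimal sketch of how a statistics-based check for IS NULL might behave. The names `ColumnStats`, `Verdict`, and `evaluateIsNull` are hypothetical simplifications for illustration and are not the ORC library's actual types.

```cpp
#include <cstdint>

// Hypothetical, simplified stand-in for column statistics (not the ORC API).
struct ColumnStats {
  uint64_t numberOfValues;  // number of non-null values recorded
  bool hasNull;             // whether the statistics admit any null value
};

// Outcome of testing a predicate against statistics alone.
enum class Verdict {
  No,     // no row can match, so the data may be skipped
  Maybe   // some row might match, so the data must be read
};

// IS NULL can only match if the statistics say a null may be present.
Verdict evaluateIsNull(const ColumnStats& stats) {
  return stats.hasNull ? Verdict::Maybe : Verdict::No;
}
```

Under this sketch, a file whose statistics omit hasNull and are read as `hasNull == false` yields `Verdict::No` for every check, so rows that actually contain nulls are never read. Reading the missing field as true yields `Verdict::Maybe`, which is slower but cannot drop matching rows.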
Comparison with Java Implementation:
The Java implementation includes the following logic:
In contrast, the C++ implementation directly uses the has_null value without any fallback logic:
Suggested Fix:
Introduce fallback logic in the C++ reader to set hasNull to true when the field is missing, similar to the Java implementation.
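A minimal sketch of what such a fallback could look like, mirroring the Java logic. `FakeColumnStatistics` is a hypothetical stand-in for the generated `proto::ColumnStatistics` message, and `resolveHasNull` is a hypothetical helper. The presence accessor `has_has_null()` is assumed from the usual protobuf naming for an optional field called `has_null`; it is not confirmed by this issue.

```cpp
// Hypothetical stand-in for the protobuf-generated message (assumption).
struct FakeColumnStatistics {
  bool hasNullPresent = false;  // was the field written to the file?
  bool hasNullValue = false;    // stored value, meaningful only if present

  // Presence check, named after the assumed protobuf accessor.
  bool has_has_null() const { return hasNullPresent; }

  // Value accessor: returns false when the field is absent,
  // which is exactly the behavior that causes the bug.
  bool has_null() const { return hasNullPresent && hasNullValue; }
};

// Fallback: trust the stored value only when the field exists,
// otherwise conservatively assume nulls may be present.
bool resolveHasNull(const FakeColumnStatistics& pb) {
  return pb.has_has_null() ? pb.has_null() : true;
}
```

In the constructor quoted in this issue, the equivalent change would be to pass the result of such a check to `stats_.setHasNull(...)` instead of `pb.has_null()` directly.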