[BUG] ORC writer produces wrong timestamp metrics, which causes Spark not to do predicate push down #14325
Comments
@thirtiseven can you attach a minimal sample file that demonstrates the problem, and commands from the orc command line tool showing the error?
@sameerz ok, a sample file: ORC_PPD_FAILED_GPU.zip. Run:
will get:
older orc-tools works fine (maybe no nanos support yet?)
FWIW, I'm also able to read the correct statistics in libcudf.
Yes, nanosecond support was added after ORC 1.5.2.
Opened #14367 with a fix for the nanosecond statistics. @thirtiseven can you please run the test with this branch and see if it affects the repro?
Hi @vuule, I can still repro for both orc-tools and Spark PPD with this branch.
@thirtiseven is there any isolation regarding which timestamp values trigger the issue?
Found that the nanoseconds are encoded as value + 1; that's why the CPU reader complained about the range - zero would become -1.
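For context, the comment above refers to the nanosecond bounds being stored with a +1 adjustment (a stored 1 means 0 nanoseconds). A minimal Python sketch, with hypothetical helper names, of why a writer that skips the adjustment makes zero decode as -1 on the reader side:

```python
def encode_nanos(nanos: int) -> int:
    """Store timestamp nanosecond stats as value + 1 (the adjustment
    described above; helper names here are illustrative, not cuDF's)."""
    return nanos + 1

def decode_nanos(stored: int) -> int:
    """Readers undo the adjustment, then validate the 0..999999999 range,
    which is where orc-tools raised its IllegalArgumentException."""
    nanos = stored - 1
    if nanos < 0 or nanos > 999_999_999:
        raise ValueError("nanos > 999999999 or < 0")
    return nanos

# Round trip works when both sides apply the adjustment.
assert decode_nanos(encode_nanos(0)) == 0

# A writer that stores the raw value makes a zero-nanosecond bound
# decode as -1, tripping the reader's range check.
try:
    decode_nanos(0)  # raw 0 written without the +1 adjustment
except ValueError:
    pass
```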
@vuule Thanks! The new commit no longer crashes orc-tools, and the nanosecond values look the same! However, the predicate pushdown still somehow does not work for GPU files; there seems to still be some mismatch with CPU. Any ideas? Some new result files from CPU/GPU: GPU meta:
CPU meta:
I did some digging and it is because the ORC reader is trying really hard to be cautious. Our writer version shows up as Original, which does not include a fix for timestamps (ORC-135).
So I have been looking a bit more at the writer version, code, and numbers that we were assigned by ORC, and I think we might be able to make this work. The writer version is a little confusing because the C++ code and the Java code use similar names in slightly different ways, but I am going to go with the Java code here, and then call out the C++ code when it appears to be different. The C++ code is different: it has explicit disallow lists for different versions of the C++ code around bloom filters, but for the part we care about it is similar to the Java code. The main difference is that the check for ORC-135 does not look at the
So just from this it looks like we should be able to write out the
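The reader-side caution described in the analysis above can be sketched roughly as follows. This is a minimal illustration with hypothetical names, not the actual ORC Java code; the threshold of 6 is an assumption taken from the "6+ required" note in the eventual fix:

```python
from typing import Optional

# Illustrative sketch: the Java reader distrusts timestamp min/max
# statistics from writers whose version predates the ORC-135 fix.
# A file with no writerVersion reads as the "Original" writer, so its
# timestamp stats are ignored for predicate pushdown.
ORC_135_FIXED_VERSION = 6  # assumption: first version with ORC-135 fixed

def timestamp_stats_usable(writer_version: Optional[int]) -> bool:
    """Return True only if timestamp stats may drive predicate pushdown."""
    if writer_version is None:  # no version written -> "Original" writer
        return False
    return writer_version >= ORC_135_FIXED_VERSION
```

This mirrors why cuDF-written files were rejected: without a writerVersion in the metadata, the reader assumed the oldest, buggiest writer and ignored the timestamp statistics entirely.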
Thank you for the analysis @revans2!
@thirtiseven I've updated #14367 to exclude nanoseconds; your tests should be passing now. Please verify and I'll make the PR ready for review.
@vuule I'm afraid the push down tests still fail. Maybe it is blocked by the writer version issue?
In which way does it fail?
The related test cases in spark-rapids failed in the same way as before, with results indicating that predicate push down is not happening when reading GPU files.
Issue #14325: Use uint when reading/writing nano stats, because nanoseconds have int32 encoding (different from both uint32 and sint32, _obviously_), which does not use zigzag. sint32 uses zigzag, and uint32 does not allow negative numbers, so we can use uint since we'll never have negative nanoseconds. Also disabled the nanosecond statistics because they should only be written after ORC-135; we don't write the version, so readers get confused if the nanoseconds are there. Planning to re-enable once we start writing the version. Authors: - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) - Nghia Truong (https://github.com/ttnghia) URL: #14367
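The encoding distinction in the fix above can be illustrated without protobuf: sint32 zigzag-maps signed values to unsigned integers before varint-encoding, while uint32/int32 write the value directly. A minimal sketch of the zigzag mapping, showing how reading a plain-encoded value as if it were zigzag corrupts small numbers:

```python
def zigzag_encode(n: int) -> int:
    """sint32-style mapping: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ..."""
    return (n << 1) ^ (n >> 31)

def zigzag_decode(z: int) -> int:
    """Inverse of the zigzag mapping."""
    return (z >> 1) ^ -(z & 1)

# uint32-style encoding writes the value as-is (no negatives allowed).
# If one side writes plain values and the other zigzag-decodes them,
# every odd stored value turns negative:
assert zigzag_decode(1) == -1   # a plain-encoded 1 read back as sint32
assert zigzag_decode(2) == 1
assert zigzag_encode(-1) == 1
```

Since nanoseconds are always in [0, 999999999], a plain unsigned encoding is sufficient and avoids this writer/reader mismatch entirely, which is the rationale the commit message gives for switching to uint.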
Opened #14458 to include the writer code and the correct version.
My test complains that:
It also fails many other ORC integration tests with a similar message. |
Updated the branch to write 0.6 instead of 0.7. I think that's in line with the reader's expectation.
Similar results:
@thirtiseven would you mind running the tests again with the latest branch? I was working off of incorrect specs. Sorry to pull you into this so many times.
The predicate pushdown works well with your new changes, and it doesn't break the other ORC tests in spark-rapids.
Finally!
Closes #14325. Changes some of the metadata written to ORC files: - Include the (cuDF) writer code (5). - Include writerVersion with the value 7; this value means that bugs up to ORC-517 are fixed. This version (6+ is required) allows us to write the nanosecond statistics. This change can have unexpected impact, depending on how readers use these fields. Authors: - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Robert (Bobby) Evans (https://github.com/revans2) - Nghia Truong (https://github.com/ttnghia) - Mike Wilson (https://github.com/hyperbolic2346) URL: #14458
Describe the bug
PR #13848 added minimum/maximum and minimumNanos/maximumNanos for ORC writer timestamp statistics. It was intended to fix #13899, where Spark does not do predicate push down for GPU-generated timestamp files. However, the predicate push down test still fails after the above PR was merged; see NVIDIA/spark-rapids#9075.
When trying to view the metadata of the related files with orc-tools, it throws:
`Exception in thread "main" java.lang.IllegalArgumentException: nanos > 999999999 or < 0`
The `min`/`max` values are also mismatched with a CPU-generated file containing the same data, which I think causes Spark to fail to do the pushdown.
Steps/Code to reproduce bug
spark-shell with spark-rapids:
orc-tools:
Related test cases in spark-rapids:
Support for pushing down filters for timestamp types
Expected behavior
The statistics for ORC files should be reasonable, and Spark should be able to do predicate push down on GPU-generated ORC files.
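The expected behavior above relies on stripe-level min/max statistics: with correct stats, a reader can skip whole stripes whose value range cannot match the predicate. A minimal Python sketch of that pruning logic (an illustration of the idea, not Spark's or ORC's actual implementation):

```python
from dataclasses import dataclass

@dataclass
class StripeStats:
    """Hypothetical per-stripe column statistics (e.g. timestamp micros)."""
    minimum: int
    maximum: int

def stripes_to_read(stripes, pred_min, pred_max):
    """Keep only stripes whose [min, max] range overlaps the predicate
    range [pred_min, pred_max]; non-overlapping stripes are skipped."""
    return [i for i, s in enumerate(stripes)
            if s.maximum >= pred_min and s.minimum <= pred_max]

stripes = [StripeStats(0, 100), StripeStats(200, 300), StripeStats(400, 500)]

# Predicate: value BETWEEN 250 AND 450 -> only stripes 1 and 2 are read.
assert stripes_to_read(stripes, 250, 450) == [1, 2]
```

This is also why wrong statistics are worse than no statistics: if the stored min/max are off (as with the unadjusted nanosecond bounds in this bug), a cautious reader must either ignore them or risk skipping stripes that actually contain matching rows.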
Environment overview (please complete the following information)