[BUG] ORC write does not include full timestamp metrics which causes spark to not do predicate push down #13899
Labels: 0 - Backlog (in queue waiting for assignment), bug, cuIO, libcudf (affects libcudf C++/CUDA code), Spark (functionality that helps Spark RAPIDS)
Describe the bug
I am not totally sure if this is a bug or a new feature, but it ends up being a performance problem for Spark.
ORC statistics for timestamps are confusing. They include a minimum, a maximum, a minimum UTC, a maximum UTC, and min/max nanoseconds.
CUDF only sets the minimum UTC and maximum UTC fields, but Spark and Java ORC by default consult only the minimum and maximum fields (plus min/max nanos) for predicate push down. As a result, predicate push down does not work at all for timestamp columns in ORC files written by CUDF.
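The mismatch above can be illustrated with a minimal sketch of the reader side. This is not Spark's actual code; it is a hypothetical `can_skip_stripe` helper that mimics how a Spark/Java-ORC-style reader eliminates a stripe for a `ts > bound` filter: it consults only the `minimum`/`maximum` statistics fields, so stats that carry only the UTC variants can never justify skipping.

```python
from datetime import datetime

def can_skip_stripe(stats, lower_bound):
    """Return True only if the statistics *prove* no row in the stripe
    can satisfy `ts > lower_bound`. Like Spark / Java ORC by default,
    this consults only `minimum`/`maximum`, not the UTC variants."""
    maximum = stats.get("maximum")
    if maximum is None:
        # Incomplete statistics: the reader must conservatively read the stripe.
        return False
    return maximum <= lower_bound

bound = datetime(2021, 6, 1)  # later than every value in the stripe

# Fully populated statistics (Spark CPU writer): the stripe is eliminated.
print(can_skip_stripe({"minimum": datetime(2020, 1, 1),
                       "maximum": datetime(2020, 1, 31)}, bound))

# Only the UTC fields set (current CUDF output): no elimination is possible.
print(can_skip_stripe({"minimumUtc": datetime(2020, 1, 1),
                       "maximumUtc": datetime(2020, 1, 31)}, bound))
```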
NVIDIA/spark-rapids#9068
The Spark CPU writer fills in all of the fields. It would really be great if CUDF could do the same. The nanosecond fields are probably not that important, but the minimum and maximum timestamps really are.
Steps/Code to reproduce bug
Write a file out with CUDF in ORC format. Read the data back with Spark, using a filter whose value is outside the range of every value in a timestamp column.
When CUDF wrote the file, all of the data is read; if the CPU had written the file, none of it would have been.
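The end-to-end effect can be simulated without a GPU. The sketch below builds hypothetical per-stripe timestamp statistics for a 12-stripe file (field names follow the ORC spec's timestamp statistics), strips them down to the UTC-only variants CUDF currently writes, and counts how many stripes a Spark-like reader would have to scan for an out-of-range filter. `stripes_to_read` is an illustrative helper, not a real Spark API.

```python
from datetime import datetime

# Per-stripe timestamp statistics for a hypothetical 12-stripe file,
# as a CPU writer would emit them (all fields populated).
cpu_stripes = [
    {"minimum": datetime(2020, m, 1), "maximum": datetime(2020, m, 28),
     "minimumUtc": datetime(2020, m, 1), "maximumUtc": datetime(2020, m, 28)}
    for m in range(1, 13)
]
# CUDF currently writes only the UTC variants.
gpu_stripes = [{k: v for k, v in s.items() if k.endswith("Utc")}
               for s in cpu_stripes]

def stripes_to_read(stripes, lower):
    """Count stripes a Spark-like reader must scan for `ts > lower`,
    using only the `maximum` field (the one Spark consults)."""
    read = 0
    for s in stripes:
        mx = s.get("maximum")
        if mx is None or mx > lower:  # stats missing or possibly matching
            read += 1
    return read

bound = datetime(2021, 1, 1)  # later than every value in the file
print(stripes_to_read(cpu_stripes, bound))  # full pushdown: 0 stripes read
print(stripes_to_read(gpu_stripes, bound))  # no pushdown: all 12 stripes read
```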
Expected behavior
Predicate push down works no matter who wrote the data.