[BUG] ORC write does not include full timestamp metrics which causes spark to not do predicate push down #13899

revans2 · 2023-08-17T16:53:23Z

Describe the bug
I am not totally sure if this is a bug or a new feature, but it ends up being a performance problem for Spark.

ORC statistics for timestamps are confusing. They include a minimum, maximum, minimum UTC, maximum UTC, and then min and max nanoseconds.

CUDF only sets minimum and maximum UTC. Spark and java ORC by default use only minimum and maximum +min/max nanos for predicate push down. This results in predicate push down not working at all for timestamps in ORC.

NVIDIA/spark-rapids#9068

The Spark CPU writer fills in all of the fields. It would really be great if CUDF could do the same. The nano seconds is probably not that important, but timestamp really is important.

Steps/Code to reproduce bug
Write a file out with CUDF in ORC. Try to read the data in with spark along with a filter where the value is outside of the range for any value in a timestamp column.

This will result in all of the data being read when CUDF wrote the file, but if the CPU wrote the file none of it would have been read.

Expected behavior
Predicate push down works no matter who wrote the data.

vuule · 2023-08-24T22:01:38Z

@revans2 could you share more info about the nanosecond fields? I don't see anything about that in the specs.
I sill add min/max in #13848; IIUC, the change is trivial because we always write files in UTC.

revans2 · 2023-08-30T19:47:54Z

@vuule I don't know about the spec. i was just reading through the code and looking at the dump of the metrics form the command line tool.

https://github.com/apache/orc/blob/a9e49656558e54f3f5c1b1eb0b9d70e2ad80bc21/proto/orc_proto.proto#L70-L72

is the protocol buffer code for the nanoseconds. If they are not provided, then the default for min is 0, and the default for max is 999999. The only reason I mentioned it is because when I dumped the footer metrics I saw that the max for our file had the 999999 at the end for nanoseconds and I wanted to understand why it was different from what Spark had written out. I don't think it would change anything functionally

Closes #7087, closes #13793, closes #13899 This PR adds support for several cases and statistics types: - sum statistics are included even when all elements are null (no minmax); - sum statistics are included in double stats; - minimum/maximum and minimumNanos/maximumNanos are included in timestamp stats; - hasNull field is written for all columns. - decimal statistics Added tests for all supported stats. Authors: - Vukasin Milovanovic (https://github.com/vuule) - Karthikeyan (https://github.com/karthikeyann) Approvers: - Lawrence Mitchell (https://github.com/wence-) - Robert (Bobby) Evans (https://github.com/revans2) - Vyas Ramasubramani (https://github.com/vyasr) - Karthikeyan (https://github.com/karthikeyann) URL: #13848

revans2 added bug Something isn't working Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels Aug 17, 2023

github-project-automation bot added this to cuDF/Dask/Numba/UCX Aug 17, 2023

github-project-automation bot moved this to In Progress in cuDF/Dask/Numba/UCX Aug 17, 2023

revans2 mentioned this issue Aug 17, 2023

Test ORC predicate pushdown (PPD) with timestamps decimals booleans NVIDIA/spark-rapids#9068

Merged

thirtiseven mentioned this issue Aug 18, 2023

[FEA] Enable ORC timestamp and decimal predicate push down tests NVIDIA/spark-rapids#9075

Closed

2 tasks

GregoryKimball added 0 - Backlog In queue waiting for assignment libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue and removed Needs Triage Need team to review and classify labels Aug 18, 2023

GregoryKimball added this to the ORC continuous improvement milestone Aug 18, 2023

vuule mentioned this issue Aug 24, 2023

Expand statistics support in ORC writer #13848

Merged

3 tasks

rapids-bot bot closed this as completed in #13848 Sep 18, 2023

github-project-automation bot moved this from In Progress to Done in cuDF/Dask/Numba/UCX Sep 18, 2023

thirtiseven mentioned this issue Oct 25, 2023

[BUG] ORC writer produce wrong timestamp metrics which causes spark not to do predicate push down #14325

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] ORC write does not include full timestamp metrics which causes spark to not do predicate push down #13899

[BUG] ORC write does not include full timestamp metrics which causes spark to not do predicate push down #13899

revans2 commented Aug 17, 2023

vuule commented Aug 24, 2023

revans2 commented Aug 30, 2023

[BUG] ORC write does not include full timestamp metrics which causes spark to not do predicate push down #13899

[BUG] ORC write does not include full timestamp metrics which causes spark to not do predicate push down #13899

Comments

revans2 commented Aug 17, 2023

vuule commented Aug 24, 2023

revans2 commented Aug 30, 2023