[BUG] ORC statistics are wrong when a double column is all NULL. #13793

res-life · 2023-08-01T06:36:07Z

Describe the bug
Report "Skipping ORC PPD" error when reading a ORC file with a double column is all NULL.
The statistic in the ORC file is wrong, refer to the following sections.
Note the query result is correct, but the PPD was skipped due to this error.

Steps/Code to reproduce bug

Generate a ORC file on GPU:

TEST_F(OrcWriterTest, TestAllNulls)
{
  bool mask1[] = {false, false, false};  // all are null
  
  std::vector<char const*> h_strings1{"1", "2", "3"};
  str_col strings1(h_strings1.begin(), h_strings1.end());
  
  std::vector<double> hd{1.1, 2.2, 3.3};
  float64_col d2(hd.begin(), hd.end(), mask1);  // values of this double column are all null

  cudf::table_view expected({strings1, d2});

  cudf::io::table_input_metadata expected_metadata(expected);
  expected_metadata.column_metadata[0].set_name("c1");
  expected_metadata.column_metadata[1].set_name("c2");

  auto filepath = "/tmp/TestAllNulls.orc";
  cudf::io::orc_writer_options out_opts =
    cudf::io::orc_writer_options::builder(cudf::io::sink_info{filepath}, expected)
      .metadata(expected_metadata);
  cudf::io::write_orc(out_opts);
}

Read the ORC file from Spark.

SPARK_HOME/bin/pyspark 
df = spark.read.orc("/path/to/TestAllNulls.orc")
df.createOrReplaceTempView("tab")
spark.sql("select * from tab").show()
+---+----+
| c1|  c2|
+---+----+
|  1|null|
|  2|null|
|  3|null|
+---+----+

spark.sql("select * from tab where c2 is NOT NULL").show()
23/08/01 14:14:08 WARN RecordReaderImpl: ClassCastException when evaluating predicate. Skipping ORC PPD. Stats: numberOfValues: 0
 Predicate: (IS_NULL c2)
java.lang.ClassCastException: org.apache.orc.impl.ColumnStatisticsImpl cannot be cast to org.apache.orc.DoubleColumnStatistics
	at org.apache.orc.impl.RecordReaderImpl.evaluatePredicateProto(RecordReaderImpl.java:655)
	at org.apache.orc.impl.RecordReaderImpl$SargApplier.pickRowGroups(RecordReaderImpl.java:1144)
	at org.apache.orc.impl.RecordReaderImpl.pickRowGroups(RecordReaderImpl.java:1219)
	at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1239)
	at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1291)
	at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1334)
	at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:355)
	at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:879)
	at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:130)
	at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$1(OrcFileFormat.scala:185)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:209)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:270)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
	at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:553)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
+---+---+ 
| c1| c2|
+---+---+   
+---+---+

Expected behavior
Fix "Skipping ORC PPD" warning.
Check other types besides the double type.

Environment overview (please complete the following information)

Environment details
cuDF 23.08 branch
Spark 3.3.0
orc-core-1.7.4.jar

Additional context
The error is:

java.lang.ClassCastException: org.apache.orc.impl.ColumnStatisticsImpl cannot be cast to org.apache.orc.DoubleColumnStatistics

Seems the ORC file does not contain DoubleColumnStatistics, so by default it's a ColumnStatisticsImpl.

The text was updated successfully, but these errors were encountered:

revans2 · 2023-08-07T13:59:15Z

Just to clarify a little after reading the java code. It appears that a default ColumnStatistics class is returned if there is no DoubleStatistics in the ColumnStatistics protocol buffers message. I am not sure if Spark just assumes it will be there incorrectly when doing predicate push down, but it does look to be explicit in the ORC java code

https://github.com/apache/orc/blob/7010351ebe341a465dc6fa5645f1c07014f3ec90/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L698-L706

res-life · 2023-08-10T01:50:53Z

Another 2 issues about ORC statistics:
#13817
#13837
Maybe can be handled togather.

vuule · 2023-08-10T19:59:05Z

From what I observe in the C++ repro code, it looks like the double statistics are just missing when there are no values in the column. This makes sense, as there is no min/max, and we (for some reason) don't include the sum of floating point columns.
Statistics are, in theory, optional, so Spark should not crash when they are omitted.
I will also look into why the sum is not included. I suspect that it's about the possible overflow.

res-life · 2023-08-16T07:39:45Z

Spark should not crash when they are omitted

Yes, Spark just throws a warning message and skip the PPD.

But from the ORC java code, it does assume DoubleColumnStatistics should exist for double column.

jlowe · 2023-08-16T14:35:43Z

Yes, to clarify this is not so much an issue with Spark as it is the way the ORC Java code works. Any application using ORC's Java libraries will crash in a similar manner trying to read these files.

vuule · 2023-08-16T18:07:25Z

Do you know if ORC Java accepts partial DoubleColumnStatistics - no min and max, but sum is present. This is the change I made in the PR, to me this was the only content that makes sense for a column without valid elements.

vuule · 2023-08-16T18:33:44Z

@jlowe @res-life do you know if statistics are required to be present for all data types?
In #13848 I removed bucketStatistics (bool columns) since they were incorrect, now I realize that it might break ORC java.

jlowe · 2023-08-16T20:19:50Z

do you know if statistics are required to be present for all data types?

It looks like the crash is specific to Doubles and was introduced by ORC-629. I filed ORC-1482 to report the bug, however that doesn't change the fact that many ORC readers are broken until this is both fixed and adopted in those data frameworks.

Do you know if ORC Java accepts partial DoubleColumnStatistics - no min and max, but sum is present.

The code appears to handle missing fields, see https://github.com/apache/orc/blob/v1.7.4/java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java#L522-L532.

Closes #7087, closes #13793, closes #13899 This PR adds support for several cases and statistics types: - sum statistics are included even when all elements are null (no minmax); - sum statistics are included in double stats; - minimum/maximum and minimumNanos/maximumNanos are included in timestamp stats; - hasNull field is written for all columns. - decimal statistics Added tests for all supported stats. Authors: - Vukasin Milovanovic (https://github.com/vuule) - Karthikeyan (https://github.com/karthikeyann) Approvers: - Lawrence Mitchell (https://github.com/wence-) - Robert (Bobby) Evans (https://github.com/revans2) - Vyas Ramasubramani (https://github.com/vyasr) - Karthikeyan (https://github.com/karthikeyann) URL: #13848

res-life added bug Something isn't working Needs Triage Need team to review and classify cuIO cuIO issue labels Aug 1, 2023

github-project-automation bot added this to cuDF/Dask/Numba/UCX Aug 1, 2023

github-project-automation bot moved this to In Progress in cuDF/Dask/Numba/UCX Aug 1, 2023

res-life mentioned this issue Aug 1, 2023

[BUG] Report "Skipping ORC PPD" when reading GPU ORC file NVIDIA/spark-rapids#8826

Open

vuule self-assigned this Aug 1, 2023

res-life mentioned this issue Aug 4, 2023

[BUG] The hasNull in ORC Statistics is incorrect #13817

Open

res-life mentioned this issue Aug 9, 2023

Statistics tests for ORC files written by GPU NVIDIA/spark-rapids#8763

Closed

vuule mentioned this issue Aug 10, 2023

Expand statistics support in ORC writer #13848

Merged

3 tasks

GregoryKimball added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Aug 18, 2023

GregoryKimball added this to the ORC continuous improvement milestone Aug 18, 2023

res-life mentioned this issue Aug 23, 2023

[FEA] Create Decimal Statistics for decimal columns to enable predicate push down for spark #13933

Closed

rapids-bot bot closed this as completed in #13848 Sep 18, 2023

github-project-automation bot moved this from In Progress to Done in cuDF/Dask/Numba/UCX Sep 18, 2023

guiyanakuang mentioned this issue Sep 27, 2023

[FEA] Set the officially assigned id for the ORC writer implemented by cuDF #13977

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] ORC statistics are wrong when a double column is all NULL. #13793

[BUG] ORC statistics are wrong when a double column is all NULL. #13793

res-life commented Aug 1, 2023 •

edited

Loading

revans2 commented Aug 7, 2023

res-life commented Aug 10, 2023

vuule commented Aug 10, 2023

res-life commented Aug 16, 2023 •

edited

Loading

jlowe commented Aug 16, 2023

vuule commented Aug 16, 2023

vuule commented Aug 16, 2023

jlowe commented Aug 16, 2023

[BUG] ORC statistics are wrong when a double column is all NULL. #13793

[BUG] ORC statistics are wrong when a double column is all NULL. #13793

Comments

res-life commented Aug 1, 2023 • edited Loading

revans2 commented Aug 7, 2023

res-life commented Aug 10, 2023

vuule commented Aug 10, 2023

res-life commented Aug 16, 2023 • edited Loading

jlowe commented Aug 16, 2023

vuule commented Aug 16, 2023

vuule commented Aug 16, 2023

jlowe commented Aug 16, 2023

res-life commented Aug 1, 2023 •

edited

Loading

res-life commented Aug 16, 2023 •

edited

Loading