Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] ORC statistics are wrong when a double column is all NULL. #13793

Closed
res-life opened this issue Aug 1, 2023 · 8 comments · Fixed by #13848
Closed

[BUG] ORC statistics are wrong when a double column is all NULL. #13793

res-life opened this issue Aug 1, 2023 · 8 comments · Fixed by #13848
Assignees
Labels
2 - In Progress Currently a work in progress bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code.

Comments

@res-life
Copy link
Contributor

res-life commented Aug 1, 2023

Describe the bug
Report "Skipping ORC PPD" error when reading a ORC file with a double column is all NULL.
The statistic in the ORC file is wrong, refer to the following sections.
Note the query result is correct, but the PPD was skipped due to this error.

Steps/Code to reproduce bug

Generate a ORC file on GPU:

TEST_F(OrcWriterTest, TestAllNulls)
{
  bool mask1[] = {false, false, false};  // all are null
  
  std::vector<char const*> h_strings1{"1", "2", "3"};
  str_col strings1(h_strings1.begin(), h_strings1.end());
  
  std::vector<double> hd{1.1, 2.2, 3.3};
  float64_col d2(hd.begin(), hd.end(), mask1);  // values of this double column are all null

  cudf::table_view expected({strings1, d2});

  cudf::io::table_input_metadata expected_metadata(expected);
  expected_metadata.column_metadata[0].set_name("c1");
  expected_metadata.column_metadata[1].set_name("c2");

  auto filepath = "/tmp/TestAllNulls.orc";
  cudf::io::orc_writer_options out_opts =
    cudf::io::orc_writer_options::builder(cudf::io::sink_info{filepath}, expected)
      .metadata(expected_metadata);
  cudf::io::write_orc(out_opts);
}

Read the ORC file from Spark.

SPARK_HOME/bin/pyspark 
df = spark.read.orc("/path/to/TestAllNulls.orc")
df.createOrReplaceTempView("tab")
spark.sql("select * from tab").show()
+---+----+
| c1|  c2|
+---+----+
|  1|null|
|  2|null|
|  3|null|
+---+----+

spark.sql("select * from tab where c2 is NOT NULL").show()
23/08/01 14:14:08 WARN RecordReaderImpl: ClassCastException when evaluating predicate. Skipping ORC PPD. Stats: numberOfValues: 0
 Predicate: (IS_NULL c2)
java.lang.ClassCastException: org.apache.orc.impl.ColumnStatisticsImpl cannot be cast to org.apache.orc.DoubleColumnStatistics
	at org.apache.orc.impl.RecordReaderImpl.evaluatePredicateProto(RecordReaderImpl.java:655)
	at org.apache.orc.impl.RecordReaderImpl$SargApplier.pickRowGroups(RecordReaderImpl.java:1144)
	at org.apache.orc.impl.RecordReaderImpl.pickRowGroups(RecordReaderImpl.java:1219)
	at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1239)
	at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1291)
	at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1334)
	at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:355)
	at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:879)
	at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:130)
	at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$1(OrcFileFormat.scala:185)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:209)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:270)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
	at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:553)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
+---+---+ 
| c1| c2|
+---+---+   
+---+---+   

Expected behavior
Fix "Skipping ORC PPD" warning.
Check other types besides the double type.

Environment overview (please complete the following information)

Environment details
cuDF 23.08 branch
Spark 3.3.0
orc-core-1.7.4.jar

Additional context
The error is:

java.lang.ClassCastException: org.apache.orc.impl.ColumnStatisticsImpl cannot be cast to org.apache.orc.DoubleColumnStatistics

Seems the ORC file does not contain DoubleColumnStatistics, so by default it's a ColumnStatisticsImpl.

@revans2
Copy link
Contributor

revans2 commented Aug 7, 2023

Just to clarify a little after reading the java code. It appears that a default ColumnStatistics class is returned if there is no DoubleStatistics in the ColumnStatistics protocol buffers message. I am not sure if Spark just assumes it will be there incorrectly when doing predicate push down, but it does look to be explicit in the ORC java code

https://github.com/apache/orc/blob/7010351ebe341a465dc6fa5645f1c07014f3ec90/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L698-L706

@res-life
Copy link
Contributor Author

Another 2 issues about ORC statistics:
#13817
#13837
Maybe can be handled togather.

@vuule
Copy link
Contributor

vuule commented Aug 10, 2023

From what I observe in the C++ repro code, it looks like the double statistics are just missing when there are no values in the column. This makes sense, as there is no min/max, and we (for some reason) don't include the sum of floating point columns.
Statistics are, in theory, optional, so Spark should not crash when they are omitted.
I will also look into why the sum is not included. I suspect that it's about the possible overflow.

@res-life
Copy link
Contributor Author

res-life commented Aug 16, 2023

Spark should not crash when they are omitted

Yes, Spark just throws a warning message and skip the PPD.

But from the ORC java code, it does assume DoubleColumnStatistics should exist for double column.

@jlowe
Copy link
Member

jlowe commented Aug 16, 2023

Yes, to clarify this is not so much an issue with Spark as it is the way the ORC Java code works. Any application using ORC's Java libraries will crash in a similar manner trying to read these files.

@vuule
Copy link
Contributor

vuule commented Aug 16, 2023

Do you know if ORC Java accepts partial DoubleColumnStatistics - no min and max, but sum is present. This is the change I made in the PR, to me this was the only content that makes sense for a column without valid elements.

@vuule
Copy link
Contributor

vuule commented Aug 16, 2023

@jlowe @res-life do you know if statistics are required to be present for all data types?
In #13848 I removed bucketStatistics (bool columns) since they were incorrect, now I realize that it might break ORC java.

@jlowe
Copy link
Member

jlowe commented Aug 16, 2023

do you know if statistics are required to be present for all data types?

It looks like the crash is specific to Doubles and was introduced by ORC-629. I filed ORC-1482 to report the bug, however that doesn't change the fact that many ORC readers are broken until this is both fixed and adopted in those data frameworks.

Do you know if ORC Java accepts partial DoubleColumnStatistics - no min and max, but sum is present.

The code appears to handle missing fields, see https://github.com/apache/orc/blob/v1.7.4/java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java#L522-L532.

@GregoryKimball GregoryKimball added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Aug 18, 2023
rapids-bot bot pushed a commit that referenced this issue Sep 18, 2023
Closes #7087, closes #13793, closes #13899

This PR adds support for several cases and statistics types:
- sum statistics are included even when all elements are null (no minmax);
- sum statistics are included in double stats;
- minimum/maximum and minimumNanos/maximumNanos are included in timestamp stats;
- hasNull field is written for all columns.
- decimal statistics

Added tests for all supported stats.

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #13848
@github-project-automation github-project-automation bot moved this from In Progress to Done in cuDF/Dask/Numba/UCX Sep 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2 - In Progress Currently a work in progress bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code.
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

5 participants