
[BUG] ORC writer produces wrong timestamp metrics, which causes Spark not to do predicate push down #14325

Closed
thirtiseven opened this issue Oct 25, 2023 · 25 comments · Fixed by #14458
Labels
2 - In Progress (Currently a work in progress) · bug (Something isn't working) · cuIO (cuIO issue) · libcudf (Affects libcudf (C++/CUDA) code)

Comments

@thirtiseven
Contributor

Describe the bug
PR #13848 added minimum/maximum and minimumNanos/maximumNanos for ORC writer timestamp statistics. It was intended to fix #13899, where Spark does not do predicate push down for GPU-generated timestamp files. However, the predicate push down test still fails after the above PR was merged; see NVIDIA/spark-rapids#9075.

When trying to view the metadata of the related files with orc-tools, it throws Exception in thread "main" java.lang.IllegalArgumentException: nanos > 999999999 or < 0. The min/max values also do not match those of a CPU-generated file with the same data. I think this is what causes Spark to fail to do the pushdown.
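
For reference, here is a minimal Scala sketch of the statistics fields involved; the field names follow the description above, while the layout and comments are an illustration only (the real definitions live in ORC's protobuf schema):

// Illustration only: the timestamp statistics fields that PR #13848 started writing.
// minimum/maximum carry millisecond precision; the *Nanos fields add the
// sub-millisecond part so readers can recover full nanosecond precision.
case class TimestampStats(
  minimum: Long,      // smallest timestamp in the stripe/file, milliseconds since epoch
  maximum: Long,      // largest timestamp, milliseconds since epoch
  minimumNanos: Int,  // sub-millisecond (nanosecond) part of the minimum
  maximumNanos: Int   // sub-millisecond (nanosecond) part of the maximum
)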

Steps/Code to reproduce bug

spark-shell with spark-rapids:

scala> import java.sql.{Date, Timestamp}
import java.sql.{Date, Timestamp}

scala> val timeString = "2015-08-20 14:57:00"
timeString: String = 2015-08-20 14:57:00

scala> val data = (0 until 10).map { i =>
     |           val milliseconds = Timestamp.valueOf(timeString).getTime + i * 3600
     |           Tuple1(new Timestamp(milliseconds))
     |         }
data: scala.collection.immutable.IndexedSeq[(java.sql.Timestamp,)] = Vector((2015-08-20 14:57:00.0,), (2015-08-20 14:57:03.6,), (2015-08-20 14:57:07.2,), (2015-08-20 14:57:10.8,), (2015-08-20 14:57:14.4,), (2015-08-20 14:57:18.0,), (2015-08-20 14:57:21.6,), (2015-08-20 14:57:25.2,), (2015-08-20 14:57:28.8,), (2015-08-20 14:57:32.4,))

scala> val df = spark.createDataFrame(data).toDF("a")
df: org.apache.spark.sql.DataFrame = [a: timestamp]

scala> df.write.orc("ORC_PPD_GPU")

orc-tools:

java -jar orc-tools-1.9.1-uber.jar meta ORC_PPD_GPU/
[main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
log4j:WARN No appenders could be found for logger (org.apache.htrace.core.Tracer).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Processing data file file:/home/haoyangl/ORC_PPD_GPU/part-00007-087a3f38-7d6f-4ddc-9cba-66fd28b24ae0-c000.snappy.orc [length: 304]
Structure for file:/home/haoyangl/ORC_PPD_GPU/part-00007-087a3f38-7d6f-4ddc-9cba-66fd28b24ae0-c000.snappy.orc
File Version: 0.12 with ORIGINAL by ORC Java
[main] INFO org.apache.orc.impl.ReaderImpl - Reading ORC rows from file:/home/haoyangl/ORC_PPD_GPU/part-00007-087a3f38-7d6f-4ddc-9cba-66fd28b24ae0-c000.snappy.orc with {include: null, offset: 0, length: 9223372036854775807, includeAcidColumns: true, allowSARGToFilter: false, useSelected: false}
[main] INFO org.apache.orc.impl.RecordReaderImpl - Reader schema not provided -- using file schema struct<a:timestamp>
Rows: 2
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<a:timestamp>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 2 hasNull: true
    Column 1: count: 2 hasNull: false min: 2015-08-20 14:57:28.799999999 max: 2015-08-20 14:57:32.399999999

File Statistics:
  Column 0: count: 2 hasNull: true
  Column 1: count: 2 hasNull: false min: 2015-08-20 14:57:28.799999999 max: 2015-08-20 14:57:32.399999999

Stripes:
  Stripe: offset: 3 data: 25 rows: 2 tail: 56 index: 64
    Stream: column 0 section ROW_INDEX start: 3 length 7
    Stream: column 1 section ROW_INDEX start: 10 length 57
    Stream: column 1 section PRESENT start: 67 length 5
    Stream: column 1 section DATA start: 72 length 13
    Stream: column 1 section SECONDARY start: 85 length 7
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 304 bytes
Padding length: 0 bytes
Padding ratio: 0%
________________________________________________________________________________________________________________________

Processing data file file:/home/haoyangl/ORC_PPD_GPU/part-00005-087a3f38-7d6f-4ddc-9cba-66fd28b24ae0-c000.snappy.orc [length: 300]
Structure for file:/home/haoyangl/ORC_PPD_GPU/part-00005-087a3f38-7d6f-4ddc-9cba-66fd28b24ae0-c000.snappy.orc
File Version: 0.12 with ORIGINAL by ORC Java
[main] INFO org.apache.orc.impl.ReaderImpl - Reading ORC rows from file:/home/haoyangl/ORC_PPD_GPU/part-00005-087a3f38-7d6f-4ddc-9cba-66fd28b24ae0-c000.snappy.orc with {include: null, offset: 0, length: 9223372036854775807, includeAcidColumns: true, allowSARGToFilter: false, useSelected: false}
[main] INFO org.apache.orc.impl.RecordReaderImpl - Reader schema not provided -- using file schema struct<a:timestamp>
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<a:timestamp>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: true
    Column 1: count: 1 hasNull: false min: 2015-08-20 14:57:21.599999999 max: 2015-08-20 14:57:21.599999999

File Statistics:
  Column 0: count: 1 hasNull: true
  Column 1: count: 1 hasNull: false min: 2015-08-20 14:57:21.599999999 max: 2015-08-20 14:57:21.599999999

Stripes:
  Stripe: offset: 3 data: 21 rows: 1 tail: 56 index: 64
    Stream: column 0 section ROW_INDEX start: 3 length 7
    Stream: column 1 section ROW_INDEX start: 10 length 57
    Stream: column 1 section PRESENT start: 67 length 5
    Stream: column 1 section DATA start: 72 length 10
    Stream: column 1 section SECONDARY start: 82 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 300 bytes
Padding length: 0 bytes
Padding ratio: 0%
________________________________________________________________________________________________________________________

Processing data file file:/home/haoyangl/ORC_PPD_GPU/part-00004-087a3f38-7d6f-4ddc-9cba-66fd28b24ae0-c000.snappy.orc [length: 300]
Structure for file:/home/haoyangl/ORC_PPD_GPU/part-00004-087a3f38-7d6f-4ddc-9cba-66fd28b24ae0-c000.snappy.orc
File Version: 0.12 with ORIGINAL by ORC Java
[main] INFO org.apache.orc.impl.ReaderImpl - Reading ORC rows from file:/home/haoyangl/ORC_PPD_GPU/part-00004-087a3f38-7d6f-4ddc-9cba-66fd28b24ae0-c000.snappy.orc with {include: null, offset: 0, length: 9223372036854775807, includeAcidColumns: true, allowSARGToFilter: false, useSelected: false}
[main] INFO org.apache.orc.impl.RecordReaderImpl - Reader schema not provided -- using file schema struct<a:timestamp>
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<a:timestamp>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: true
Exception in thread "main" java.lang.IllegalArgumentException: nanos > 999999999 or < 0
	at java.sql/java.sql.Timestamp.setNanos(Timestamp.java:336)
	at org.apache.orc.impl.ColumnStatisticsImpl$TimestampStatisticsImpl.getMinimum(ColumnStatisticsImpl.java:1764)
	at org.apache.orc.impl.ColumnStatisticsImpl$TimestampStatisticsImpl.toString(ColumnStatisticsImpl.java:1808)
	at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:363)
	at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:276)
	at org.apache.orc.tools.FileDump.main(FileDump.java:137)
	at org.apache.orc.tools.Driver.main(Driver.java:124)

Related test cases in spark-rapids:
Support for pushing down filters for timestamp types

Expected behavior
The statistics in ORC files should be correct, and Spark should be able to do predicate push down on GPU-generated ORC files.

Environment overview (please complete the following information)

  • Environment location: Bare-metal
  • Method of cuDF install: from source
@thirtiseven thirtiseven added the Needs Triage and bug labels Oct 25, 2023
@sameerz
Contributor

sameerz commented Oct 30, 2023

@thirtiseven can you attach a minimal sample file that demonstrates the problem, and commands from the orc command line tool showing the error?

@thirtiseven
Contributor Author

@sameerz ok, a sample file: ORC_PPD_FAILED_GPU.zip

run:

java -jar orc-tools-1.9.1-uber.jar meta ORC_PPD_FAILED_GPU/

will get:

[main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
log4j:WARN No appenders could be found for logger (org.apache.htrace.core.Tracer).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Processing data file file:/home/haoyangl/ORC_PPD_FAILED_GPU/part-00000-0c687a13-7e91-422d-af76-ce177d66dd94-c000.snappy.orc [length: 333]
Structure for file:/home/haoyangl/ORC_PPD_FAILED_GPU/part-00000-0c687a13-7e91-422d-af76-ce177d66dd94-c000.snappy.orc
File Version: 0.12 with ORIGINAL by ORC Java
[main] INFO org.apache.orc.impl.ReaderImpl - Reading ORC rows from file:/home/haoyangl/ORC_PPD_FAILED_GPU/part-00000-0c687a13-7e91-422d-af76-ce177d66dd94-c000.snappy.orc with {include: null, offset: 0, length: 9223372036854775807, includeAcidColumns: true, allowSARGToFilter: false, useSelected: false}
[main] INFO org.apache.orc.impl.RecordReaderImpl - Reader schema not provided -- using file schema struct<a:timestamp>
Rows: 10
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<a:timestamp>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 10 hasNull: true
Exception in thread "main" java.lang.IllegalArgumentException: nanos > 999999999 or < 0
	at java.sql/java.sql.Timestamp.setNanos(Timestamp.java:336)
	at org.apache.orc.impl.ColumnStatisticsImpl$TimestampStatisticsImpl.getMinimum(ColumnStatisticsImpl.java:1764)
	at org.apache.orc.impl.ColumnStatisticsImpl$TimestampStatisticsImpl.toString(ColumnStatisticsImpl.java:1808)
	at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:363)
	at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:276)
	at org.apache.orc.tools.FileDump.main(FileDump.java:137)
	at org.apache.orc.tools.Driver.main(Driver.java:124)

@vuule
Contributor

vuule commented Nov 6, 2023

@thirtiseven do you have the CPU version of the invalid file? I'd like to debug the writer as it creates the invalid statistics.
Disregard, I can read the GPU file and write it back out.

@vuule vuule self-assigned this Nov 6, 2023
@vuule
Contributor

vuule commented Nov 7, 2023

older orc-tools works fine (maybe no nanos support yet?)

$ java -jar orc-tools-1.5.2-uber.jar meta /home/vukasin/cudf/stats_fresh.orc 
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/snap/orc/2/share/orc-tools-1.5.2-uber.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Processing data file /home/vukasin/cudf/stats_fresh.orc [length: 333]
Structure for /home/vukasin/cudf/stats_fresh.orc
File Version: 0.12 with ORIGINAL
Rows: 10
Compression: SNAPPY
Compression size: 262144
Type: struct<a:timestamp>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 10 hasNull: true
    Column 1: count: 10 hasNull: false min: 2015-08-20 14:57:00.0 max: 2015-08-20 14:57:32.4

File Statistics:
  Column 0: count: 10 hasNull: true
  Column 1: count: 10 hasNull: false min: 2015-08-20 14:57:00.0 max: 2015-08-20 14:57:32.4

Stripes:
  Stripe: offset: 3 data: 54 rows: 10 tail: 56 index: 64
    Stream: column 0 section ROW_INDEX start: 3 length 7
    Stream: column 1 section ROW_INDEX start: 10 length 57
    Stream: column 1 section PRESENT start: 67 length 6
    Stream: column 1 section DATA start: 73 length 34
    Stream: column 1 section SECONDARY start: 107 length 14
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 333 bytes
Padding length: 0 bytes
Padding ratio: 0%

@vuule
Contributor

vuule commented Nov 7, 2023

FWIW, I'm also able to read the correct statistics in libcudf.
So, no local repro yet. Will try newer orc-tools next.

@thirtiseven
Contributor Author

older orc-tools works fine (maybe no nanos support yet?)

Yes, the nanosecond support was added later than ORC 1.5.2.

@vuule
Contributor

vuule commented Nov 7, 2023

Opened #14367 with a fix for the nanosecond statistics. @thirtiseven can you please run the test with this branch and see if it affects the repro?

@thirtiseven
Contributor Author

thirtiseven commented Nov 7, 2023

Hi @vuule , I can still repro for both orc-tools and spark PPD with this branch.

@vuule
Contributor

vuule commented Nov 7, 2023

@thirtiseven is there any isolation regarding which timestamp values trigger the issue?

@vuule
Contributor

vuule commented Nov 7, 2023

Found that the nanoseconds are encoded as value + 1; that's why the CPU reader complained about the range - zero would become -1.
Pushed a fix for the off-by-one to #14367, @thirtiseven please verify whether it fixes the issue.
Statistics should now be correct for any nanosecond value.
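
To make the off-by-one concrete, here is a hypothetical Scala sketch of the mismatch as I understand it; this is not the libcudf or Apache ORC code, just an illustration of why a raw zero ends up decoding to -1:

// The reader treats the stored stat as "nanos + 1" (so the protobuf default of 0
// can mean "absent") and subtracts one when decoding. A writer that stores the
// raw nanoseconds therefore makes every value decode one too low.
def readNanosStat(stored: Int): Int = {
  val nanos = stored - 1
  // mirrors the range check in java.sql.Timestamp.setNanos that threw above
  require(nanos >= 0 && nanos <= 999999999, "nanos > 999999999 or < 0")
  nanos
}

val buggyWrite = (trueNanos: Int) => trueNanos      // what was happening
val fixedWrite = (trueNanos: Int) => trueNanos + 1  // the off-by-one fix in #14367

readNanosStat(fixedWrite(0)) // 0, as expected
readNanosStat(buggyWrite(0)) // throws: nanos > 999999999 or < 0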

@thirtiseven
Contributor Author

@vuule Thanks! The new commit no longer crashes orc-tools, and the nanosecond values look the same!

However, predicate pushdown still somehow does not work for GPU files; there seems to still be some mismatch with the CPU output. Any ideas?

Some new result files from cpu/gpu:
ORC_PPD_FAILED.zip

GPU meta:

java -jar orc-tools-1.9.1-uber.jar meta ORC_PPD_FAILED_GPU/
[main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
log4j:WARN No appenders could be found for logger (org.apache.htrace.core.Tracer).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Processing data file file:/home/haoyangl/ORC_PPD_FAILED_GPU/part-00000-68ec6cc3-036b-4e78-945a-f542e184914d-c000.snappy.orc [length: 327]
Structure for file:/home/haoyangl/ORC_PPD_FAILED_GPU/part-00000-68ec6cc3-036b-4e78-945a-f542e184914d-c000.snappy.orc
File Version: 0.12 with ORIGINAL by ORC Java
[main] INFO org.apache.orc.impl.ReaderImpl - Reading ORC rows from file:/home/haoyangl/ORC_PPD_FAILED_GPU/part-00000-68ec6cc3-036b-4e78-945a-f542e184914d-c000.snappy.orc with {include: null, offset: 0, length: 9223372036854775807, includeAcidColumns: true, allowSARGToFilter: false, useSelected: false}
[main] INFO org.apache.orc.impl.RecordReaderImpl - Reader schema not provided -- using file schema struct<a:timestamp>
Rows: 10
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<a:timestamp>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 10 hasNull: true
    Column 1: count: 10 hasNull: false min: 2015-08-20 14:57:00.0 max: 2015-08-20 14:57:32.4

File Statistics:
  Column 0: count: 10 hasNull: true
  Column 1: count: 10 hasNull: false min: 2015-08-20 14:57:00.0 max: 2015-08-20 14:57:32.4

Stripes:
  Stripe: offset: 3 data: 54 rows: 10 tail: 56 index: 62
    Stream: column 0 section ROW_INDEX start: 3 length 7
    Stream: column 1 section ROW_INDEX start: 10 length 55
    Stream: column 1 section PRESENT start: 65 length 6
    Stream: column 1 section DATA start: 71 length 34
    Stream: column 1 section SECONDARY start: 105 length 14
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 327 bytes
Padding length: 0 bytes
Padding ratio: 0%
________________________________________________________________________________________________________________________

CPU meta:

java -jar orc-tools-1.9.1-uber.jar meta ORC_PPD_FAILED_CPU/
[main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
log4j:WARN No appenders could be found for logger (org.apache.htrace.core.Tracer).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Processing data file file:/home/haoyangl/ORC_PPD_FAILED_CPU/part-00000-f0d75466-fca0-400e-b6dc-08e6f9c570a9-c000.snappy.orc [length: 345]
Structure for file:/home/haoyangl/ORC_PPD_FAILED_CPU/part-00000-f0d75466-fca0-400e-b6dc-08e6f9c570a9-c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.8.4
[main] INFO org.apache.orc.impl.ReaderImpl - Reading ORC rows from file:/home/haoyangl/ORC_PPD_FAILED_CPU/part-00000-f0d75466-fca0-400e-b6dc-08e6f9c570a9-c000.snappy.orc with {include: null, offset: 0, length: 9223372036854775807, includeAcidColumns: true, allowSARGToFilter: false, useSelected: false}
[main] INFO org.apache.orc.impl.RecordReaderImpl - Reader schema not provided -- using file schema struct<a:timestamp>
Rows: 10
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<a:timestamp>
Attributes on root.a
  spark.sql.catalyst.type: timestamp

Stripe Statistics:
  Stripe 1:
    Column 0: count: 10 hasNull: false
    Column 1: count: 10 hasNull: false bytesOnDisk: 29 min: 2015-08-20 14:57:00.0 max: 2015-08-20 14:57:32.4

File Statistics:
  Column 0: count: 10 hasNull: false
  Column 1: count: 10 hasNull: false bytesOnDisk: 29 min: 2015-08-20 14:57:00.0 max: 2015-08-20 14:57:32.4

Stripes:
  Stripe: offset: 3 data: 29 rows: 10 tail: 48 index: 48
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 37
    Stream: column 1 section DATA start: 51 length 14
    Stream: column 1 section SECONDARY start: 65 length 15
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 345 bytes
Padding length: 0 bytes
Padding ratio: 0%

User Metadata:
  org.apache.spark.version=3.4.1
________________________________________________________________________________________________________________________

@revans2
Contributor

revans2 commented Nov 8, 2023

I did some digging and it is because the ORC reader is trying really hard to be cautious.

https://github.com/apache/orc/blob/7c839256470690b6b1a415a784bd924236c426a4/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L678-L683

Our writer version shows up as Original, which does not include a fix for timestamps (ORC-135).
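
For illustration, a self-contained Scala sketch of the kind of guard being applied; this paraphrases the check linked above rather than quoting the Apache ORC source:

// Writer versions in ascending "fixes included" order; ORIGINAL is what cuDF files
// currently report because no writer version is written out at all.
object WriterVersion extends Enumeration {
  val Original, Hive8732, Hive4243, Hive12055, Hive13083, Orc101, Orc135, Orc517 = Value
}

// Timestamp statistics from writers that predate the ORC-135 fix are not trusted,
// so no stripe can be skipped and predicate push down effectively does nothing.
def canUseTimestampStats(version: WriterVersion.Value): Boolean =
  version >= WriterVersion.Orc135

canUseTimestampStats(WriterVersion.Original) // false: every row gets read
canUseTimestampStats(WriterVersion.Orc517)   // true: min/max can prune stripes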

@revans2
Contributor

revans2 commented Nov 8, 2023

So I have been looking a bit more at the writer version, the code, and the numbers that we were assigned by ORC, and I think we might be able to make this work.

https://github.com/apache/orc/blob/7c839256470690b6b1a415a784bd924236c426a4/java/core/src/java/org/apache/orc/OrcFile.java#L165-L197

and

https://github.com/apache/orc/blob/7c839256470690b6b1a415a784bd924236c426a4/c%2B%2B/include/orc/Common.hh#L94-L106

are the code that holds a lot of this version information.

The writer version is a little confusing because the C++ code and the Java code use similar names in slightly different ways; I am going to go with the Java code here and call out the C++ code when it appears to be different. The WriterVersion is made up of two parts. One is FileTail.postscript.writerVersion in the protocol buffers, which is really a capability number more than anything. The other is FileTail.footer.writer in the protocol buffers; in Java this is the WriterImplementation class, and the numeric value says which piece of code did the write. Looking at the Java code, there are not a lot of places where the writerVersion is used. HIVE_12055 and ORC_135 are the only ones in the Java code base that are not just for tests. In both of those cases, the check to see if the bug is present assumes that the bug is only in the Java WriterImplementation. So if we just say that we are CUDF, we should be good to go from a Java perspective.

The C++ code is different. It has explicit disallow lists for different versions of the C++ code around bloom filters, but for the part we care about it is similar to the Java code. The main difference is that the check for ORC_135 does not look at the WriterImplementation/writer at all; it only looks at the writerVersion/capabilities number. So as long as we write a writerVersion that is at least 6 (which is the one that we were assigned), we are good to go. The main thing we need to do is make sure that we don't have any of the bugs in our code that this writerVersion number encodes, both because the C++ code will not handle them properly and because we just don't want bugs in our code.

  • Version 0 is what we get today by not writing anything out. It says we have all kinds of bugs in our code and some features are not supported.
  • Version 1 says that HIVE-8732 was fixed, which from the comments means that we write string min/max out as UTF-8, which we already do. Oddly, no reader appears to check for this.
  • Version 2 says that HIVE-4243 was fixed, which adds real column names to the ORC files. We already do this too, and there are no checks for it because it is more of a feature than a bug.
  • Version 3 is for HIVE-12055, a vectorized writer. It is a new feature, but it is used to check for a bug in bloom filter reads for string data. We don't do bloom filter writes, so we don't really care about it.
  • Version 4 is for HIVE-13083, a bug in writing out some decimals, but there is no check for it that I found, and it looks like there is no good way to work around the bug on read.
  • Version 5 is for ORC-101, which normalizes string data for bloom filters to be UTF-8. Again, no checks for this in any of the readers that I saw, and we don't do bloom filters.
  • Version 6 is the one we care about and is for ORC-135, where min/max timestamps are written out in UTC.
  • Version 7 is a fix for decimal64 min and max (ORC-517), where if all the numbers are negative a 0 was stored as the max. I don't think we have that problem.
  • Versions 8 and 9 are features we don't support, and I don't think we need to worry about them.

So just from this it looks like we should be able to write out the writer/writerVersion info for CUDF and version 6 and get away with it. No need to worry about breaking existing readers. But if we want to run some tests I am happy to do that.
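
As a hedged sketch of what that would mean concretely, here is the shape of the two fields in Scala; the names follow the prose above and the values are the ones discussed in this thread (cuDF writer code 5, writerVersion 6), not the exact orc_proto definitions:

// The two identity fields discussed above, kept separate on purpose: `writer` says
// which implementation produced the file, `writerVersion` says which known bugs
// that implementation has fixed.
case class OrcWriterIdentity(
  writer: Int,       // FileTail.footer.writer (WriterImplementation in Java)
  writerVersion: Int // FileTail.postscript.writerVersion (capability number)
)

// What a cuDF-written file would advertise: the assigned writer code with a
// version high enough (>= 6) for readers to trust the timestamp statistics.
val cudfIdentity = OrcWriterIdentity(writer = 5, writerVersion = 6)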

@vuule
Contributor

vuule commented Nov 8, 2023

Thank you for the analysis @revans2!
We decided in an offline discussion to disable nanosecond statistics in 23.12 and look into writing the correct version starting from 24.02, so that we can re-enable nanoseconds.

@vuule
Contributor

vuule commented Nov 8, 2023

@thirtiseven I've updated #14367 to exclude nanoseconds, your tests should be passing now; please verify and I'll make the PR ready for review.

@thirtiseven
Contributor Author

@vuule I'm afraid the push down tests still fail. Maybe it is blocked by the writer version issue?

@vuule
Contributor

vuule commented Nov 9, 2023

In which way does it fail?

@thirtiseven
Contributor Author

thirtiseven commented Nov 9, 2023

The related test cases in spark-rapids failed in the same way as before; the results indicate that predicate push down is not happening when reading GPU files.

OrcFilterSuite:
- Support for pushing down filters for boolean types gpu write gpu read
- Support for pushing down filters for boolean types gpu write cpu read
- Support for pushing down filters for boolean types cpu write gpu read
- Support for pushing down filters for decimal types gpu write gpu read !!! CANCELED !!!
  https://github.com/rapidsai/cudf/issues/13933 (OrcFilterSuite.scala:78)
- Support for pushing down filters for decimal types gpu write cpu read !!! CANCELED !!!
  https://github.com/rapidsai/cudf/issues/13933 (OrcFilterSuite.scala:90)
- Support for pushing down filters for decimal types cpu write gpu read
- Support for pushing down filters for timestamp types cpu write gpu read
- Support for pushing down filters for timestamp types gpu write cpu read *** FAILED ***
  0 was less than 10, but 10 was not less than 10 (OrcFilterSuite.scala:37)
- Support for pushing down filters for timestamp types gpu write gpu read *** FAILED ***
  0 was less than 10, but 10 was not less than 10 (OrcFilterSuite.scala:37)

@GregoryKimball GregoryKimball added the 2 - In Progress, libcudf, and cuIO labels and removed the Needs Triage label Nov 9, 2023
rapids-bot bot pushed a commit that referenced this issue Nov 15, 2023

Issue #14325

Use uint when reading/writing nano stats because nanoseconds have int32 encoding (different from both uint32 and sint32, _obviously_), which does not use zigzag.
sint32 uses zigzag, and uint32 does not allow negative numbers, so we can use uint since we'll never have negative nanoseconds.

Also disabled the nanoseconds because they should only be written after ORC-135; we don't write the version, so readers get confused if nanoseconds are there. Planning to re-enable once we start writing the version.

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Nghia Truong (https://github.com/ttnghia)

URL: #14367
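
The varint/zigzag distinction in the commit message above can be illustrated with a small Scala sketch; these are simplified encoders written for this issue, not taken from any protobuf library:

// Plain protobuf varint: 7 bits per byte, the high bit is the continuation flag.
def encodeVarint(value: Long): Seq[Int] = {
  val out = scala.collection.mutable.ArrayBuffer[Int]()
  var v = value
  do {
    var b = (v & 0x7f).toInt
    v >>>= 7
    if (v != 0) b |= 0x80
    out += b
  } while (v != 0)
  out.toSeq
}

// zigzag maps signed to unsigned values: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ...
def zigzag(value: Long): Long = (value << 1) ^ (value >> 63)

// int32/uint32 fields use the plain varint; only sint32 applies zigzag first.
// Writing nanoseconds with zigzag but reading them back as int32/uint32
// therefore doubles every non-negative value:
val nanos = 400000000L
encodeVarint(nanos)          // the bytes an int32/uint32 reader expects
encodeVarint(zigzag(nanos))  // the bytes a sint32-style writer emits: reads back as 800000000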
@vuule
Contributor

vuule commented Nov 20, 2023

Opened #14458 to include the writer code and the correct version.
@thirtiseven can you please test with this branch?

@thirtiseven
Contributor Author

My test complains that:

  java.io.IOException: file:/home/haoyangl/spark-rapids/tests/target/spark341/tmp/spark-test-6c5d822b-f6bf-42f5-a08b-524746fba019/part-00002-88031666-3d16-4756-8fba-68c42a781d26-c000.snappy.orc was written by a future ORC version 0.7. This file is not readable by this version of ORC.
Postscript: footerLength: 85 compression: SNAPPY compressionBlockSize: 262144 version: 0 version: 7 metadataLength: 47 magic: "ORC"
  at org.apache.orc.impl.ReaderImpl.checkOrcVersion(ReaderImpl.java:525)
  at org.apache.orc.impl.ReaderImpl.extractPostScript(ReaderImpl.java:645)
  at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:814)
  at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:567)
  at org.apache.orc.OrcFile.createReader(OrcFile.java:385)
  at org.apache.spark.sql.execution.datasources.orc.OrcUtils$.$anonfun$readSchema$1(OrcUtils.scala:77)
  at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2785)
  at org.apache.spark.sql.execution.datasources.orc.OrcUtils$.readSchema(OrcUtils.scala:77)
  at org.apache.spark.sql.execution.datasources.orc.OrcUtils$.$anonfun$readSchema$4(OrcUtils.scala:147)
  at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
  ...

It also fails many other ORC integration tests with a similar message.

@vuule
Contributor

vuule commented Nov 21, 2023

Updated the branch to write 0.6 instead of 0.7. I think that's in line with the reader's expectations.
Please try again if you get a chance.
I just don't understand how everything worked before, when we wrote 0.12.
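
For clarity, a hedged Scala sketch of the two distinct version fields that seem to be getting mixed up here; the names follow the Postscript dumps in this thread rather than the exact protobuf schema:

// The Postscript carries two different "version" notions:
//  - `version`: the file format version, a pair of ints ("0.12" -> Seq(0, 12));
//    readers reject any value they do not recognize ("future ORC version").
//  - `writerVersion`: the capability / bug-fix number discussed earlier
//    (6 means the ORC-135 timestamp fix is present).
case class PostscriptVersions(version: Seq[Int], writerVersion: Int)

// What the failing files above advertised (writerVersion is not shown in the dump,
// so 0 is assumed) versus what a reader-friendly file would carry:
val rejected = PostscriptVersions(version = Seq(0, 7), writerVersion = 0)
val expected = PostscriptVersions(version = Seq(0, 12), writerVersion = 6)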

@thirtiseven
Contributor Author

Similar results:

- Support for pushing down filters for timestamp types gpu write cpu read *** FAILED ***
  java.io.IOException: file:/home/haoyangl/spark-rapids/tests/target/spark341/tmp/spark-test-df3a6c8c-0a18-484d-9551-c1b2c6fc5225/part-00006-9c4bbfd8-bc1e-414d-986f-2ccd1c54962e-c000.snappy.orc was written by a future ORC version 0.6. This file is not readable by this version of ORC.
Postscript: footerLength: 85 compression: SNAPPY compressionBlockSize: 262144 version: 0 version: 6 metadataLength: 47 magic: "ORC"
  at org.apache.orc.impl.ReaderImpl.checkOrcVersion(ReaderImpl.java:525)
  at org.apache.orc.impl.ReaderImpl.extractPostScript(ReaderImpl.java:645)
  at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:814)
  at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:567)
  at org.apache.orc.OrcFile.createReader(OrcFile.java:385)
  at org.apache.spark.sql.execution.datasources.orc.OrcUtils$.$anonfun$readSchema$1(OrcUtils.scala:77)
  at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2785)
  at org.apache.spark.sql.execution.datasources.orc.OrcUtils$.readSchema(OrcUtils.scala:77)
  at org.apache.spark.sql.execution.datasources.orc.OrcUtils$.$anonfun$readSchema$4(OrcUtils.scala:147)
  at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
  ...

@vuule
Contributor

vuule commented Nov 22, 2023

@thirtiseven would you mind running the tests again with latest branch? I was working off of incorrect specs. Sorry to pull you into this so many times.

@thirtiseven
Contributor Author

The predicate pushdown works well with your new changes, and it doesn't break the other ORC tests in spark-rapids.
Thank you!

@vuule
Contributor

vuule commented Nov 22, 2023

Finally!
Thank you for running these tests again and again :)

rapids-bot bot pushed a commit that referenced this issue Dec 7, 2023
Closes #14325
Changes some of the metadata written to the ORC file:

- Include the (cuDF) writer code (5).
- Include writerVersion, with the value of 7; this value means that bugs up to ORC-517 are fixed. This version (6+ is required) allows us to write the nanosecond statistics.

This change can have an unexpected impact, depending on how readers use these fields.

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Nghia Truong (https://github.com/ttnghia)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: #14458
@github-project-automation github-project-automation bot moved this from In Progress to Done in cuDF/Dask/Numba/UCX Dec 7, 2023
karthikeyann pushed a commit to karthikeyann/cudf that referenced this issue Dec 12, 2023