
Use the ORC version that corresponds to the Spark version [databricks] #4408

Merged
merged 26 commits on Feb 8, 2022

Conversation

res-life
Collaborator

@res-life res-life commented Dec 21, 2021

[BUG] Spark 3.3.0 test failure: NoSuchMethodError org.apache.orc.TypeDescription.getAttributeValue #4031

Root cause
The RAPIDS plugin pins a constant ORC version, 1.5.10, while Spark 3.3.0 moves to ORC 1.7.x, which is not compatible with 1.5.10.
The unit tests use ORC 1.5.10 because the dependency is specified directly, which overrides the transitive ORC 1.7.x dependency; Spark code that invokes getAttributeValue, which only exists in ORC 1.7.x, therefore fails.
The integration tests, however, run against the jars in $SPARK_HOME/jars, which include ORC 1.7.x, so they do not fail.
The result is different behavior between integration tests and unit tests.
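For illustration, the pinned dependency looked roughly like the following (a sketch, not the exact pom entry); a direct declaration like this wins over the ORC 1.7.x that Spark 3.3.0 pulls in transitively:

<!-- Hard-coded ORC version in the plugin pom (illustrative).
     A direct dependency overrides the transitive one from Spark,
     so unit tests ran against 1.5.10 even on Spark 3.3.0. -->
<dependency>
  <groupId>org.apache.orc</groupId>
  <artifactId>orc-core</artifactId>
  <version>1.5.10</version>
</dependency>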

Solution

  1. Use the ORC version that corresponds to the Spark version and stop shading ORC (see the sketch after this list).
    This makes the behavior of IT (integration tests) and UT (unit tests) consistent.
    It also makes the aggregator jar smaller, because ORC is no longer shaded.
    ORC will be upgraded in lockstep with Spark.
  2. Move the failing test case into IT and keep shading ORC 1.5.10.
    This would leave it confusing why the case cannot live in UT.

Here we use option 1.
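Concretely, option 1 amounts to resolving the ORC version per Spark profile instead of pinning it. A minimal sketch, assuming the usual per-shim Maven profiles (profile ids, property names, and exact ORC versions are illustrative):

<!-- Each Spark shim profile selects the ORC version matching the
     Spark release it builds against. -->
<profile>
  <id>release301</id>
  <properties>
    <spark.version>3.0.1</spark.version>
    <orc.version>1.5.10</orc.version>
  </properties>
</profile>
<profile>
  <id>release330</id>
  <properties>
    <spark.version>3.3.0</spark.version>
    <orc.version>1.7.2</orc.version>
  </properties>
</profile>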

Another change: this PR adds a common module.
It currently contains a utility class, ThreadFactoryBuilder, which replaces the Guava dependency; Guava was removed because it is a messy jar to manage in practice.
ThreadFactoryBuilder is used by both the sql-plugin and tools modules, so it lives in the new common module.
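For context, a minimal sketch of what such a builder might look like (illustrative only; the actual class in the common module may differ):

import java.util.concurrent.ThreadFactory
import java.util.concurrent.atomic.AtomicLong

// Small stand-in for Guava's ThreadFactoryBuilder: produces a
// ThreadFactory that names threads from a printf-style pattern
// and can mark them as daemon threads.
class ThreadFactoryBuilder {
  private var nameFormat: Option[String] = None
  private var daemon: Option[Boolean] = None

  def setNameFormat(format: String): ThreadFactoryBuilder = {
    nameFormat = Some(format)
    this
  }

  def setDaemon(isDaemon: Boolean): ThreadFactoryBuilder = {
    daemon = Some(isDaemon)
    this
  }

  def build(): ThreadFactory = {
    val count = new AtomicLong(0)
    val format = nameFormat
    val isDaemon = daemon
    new ThreadFactory {
      override def newThread(r: Runnable): Thread = {
        val t = new Thread(r)
        // Substitute the running counter into the name pattern, if set.
        format.foreach(f => t.setName(f.format(count.getAndIncrement())))
        isDaemon.foreach(t.setDaemon)
        t
      }
    }
  }
}

Usage would look like new ThreadFactoryBuilder().setNameFormat("orc-reader-%d").setDaemon(true).build().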

This fixes #4031

@res-life
Collaborator Author

res-life commented Dec 21, 2021

Use the ORC version that corresponds to the Spark version.
Make the test module depend on the aggregator jar instead of the dist jar.
Temporarily commented out a test case in OrcScanSuite; will revisit it later.
For now we keep shading Guava, ORC, etc.; the shade scope will be updated under issue #1398.
Moved the dependency management of ORC to sql-plugin and omitted the version so that the ORC version can be set dynamically.
Will add some comments later.

@res-life res-life changed the title Use the ORC version that corresponds to the Spark version Use the ORC version that corresponds to the Spark version [databricks] Dec 21, 2021
@res-life
Collaborator Author

build

2 similar comments
@res-life
Collaborator Author

build

@res-life
Collaborator Author

build

Member

@jlowe jlowe left a comment

I'm not clear on the end goal of this PR. It looks like it is going to compile against different versions of ORC yet still pull ORC into the dist jar. How is that going to work in practice -- won't the different ORC versions from different aggregator jars conflict when we try to pull it all together into the dist jar?

@jlowe
Member

jlowe commented Dec 21, 2021

Additionally, how are the concerns about varying ORC classifiers pulled in by different Spark builds, as detailed in #4031 (comment), being addressed?

@res-life
Collaborator Author

res-life commented Dec 22, 2021

> I'm not clear on the end goal of this PR. It looks like it is going to compile against different versions of ORC yet still pull ORC into the dist jar. How is that going to work in practice -- won't the different ORC versions from different aggregator jars conflict when we try to pull it all together into the dist jar?

binary-dedupe.sh compares the class binaries, so there is no conflict across multiple ORC versions. Of course, there will be more class files.

> If the intent is to use the same version of ORC and Hive that Spark is using, why are we pulling in these dependencies explicitly rather than getting them transitively from the Spark artifacts? The risk of putting them explicitly here is that they change versions for Spark snapshot artifacts and we fail to notice.

The Spark core jar has provided scope, and the maven-shade-plugin only shades compile-scope jars, so the ORC jar is explicitly declared with compile scope.
Also, the transitive jar is sometimes not the same as the Spark runtime jar, which is another reason to declare it explicitly.
You are right that for Spark snapshots it is better to rely on the transitive jar.
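In pom terms, that is something like the following (a sketch; the version property is assumed to come from the per-Spark-version profiles):

<!-- Declared with compile scope so the maven-shade-plugin picks it up;
     the version is a property set per Spark version rather than a
     hard-coded value. -->
<dependency>
  <groupId>org.apache.orc</groupId>
  <artifactId>orc-core</artifactId>
  <version>${orc.version}</version>
  <scope>compile</scope>
</dependency>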

> Additionally, how are the concerns about varying ORC classifiers pulled in by different Spark builds, as detailed in #4031 (comment), being addressed?

I have no good way to address this concern; the current strategy of shading ORC together with the Hive code is better there.
PR #1398 will create a new module that collects only the necessary ORC and Hive classes, to minimize the shading scope.
It seems the RAPIDS plugin only uses orc-core, hive-storage-api and protobuf-java; I'll shade the ORC input/output classes.

@res-life res-life mentioned this pull request Dec 28, 2021
@sameerz sameerz added the audit_3.3.0 Audit related tasks for 3.3.0 label Dec 29, 2021
@res-life
Collaborator Author

build

1 similar comment
@res-life
Collaborator Author

build

@sameerz
Collaborator

sameerz commented Jan 11, 2022

Can we retarget this to 22.04? Reason being we ought to wait for the fix for rapidsai/cudf#9964 to go in before we merge this.

@res-life
Collaborator Author

Updated, but still a draft; I need to find a way to compute the version by examining the Spark jar dependencies.
Sent a mail to discuss whether shading is needed.

@res-life
Collaborator Author

build

@tgravescs
Collaborator

tgravescs commented Jan 12, 2022

@res-life please put the issue description for the PR in the first post, i.e. where you put "This fixes #4031".

@res-life
Collaborator Author

build

@res-life
Collaborator Author

res-life commented Jan 15, 2022

Changed the approach of this ORC-version-alignment PR after investigating the history of shading ORC.
ORC is no longer shaded, and the change now covers all the Spark versions: 3.0.1 to 3.3.0, 301db, 312db and 311cdh.
After this change, the aggregator jar shrank from 13M to 6.8M.
We also no longer need to handle the follow-up issue of refining the ORC shade scope.

@res-life
Collaborator Author

build

@res-life
Collaborator Author

res-life commented Jan 17, 2022

After 11/24/2019, starting with ORC 1.5.7, Spark no longer uses the "nohive"-classifier ORC uber jar for Hive 2.0+.
If our product targets Spark 3.0.0+ and Hive 2.0+, without Hive 1.x, then it is safe to remove the ORC shade configuration.
So let's try removing it and run some regression tests before merging the code.

Details:
cd spark-source-path   # a local clone of apache/spark
git checkout v3.0.0
# show how the orc-core 1.5.x entries (and their classifier) changed over time
git log -p ./dev/deps | grep -20 orc-core-1.5

@res-life
Collaborator Author

ORC 1.6.11+ fails to prune, when reading in the Proleptic calendar, an ORC file that was written in the Hybrid calendar.
The ORC file in the issue was created by the Spark CPU version, and the ORC 1.5.10 behavior is correct.
Filed an ORC bug: https://issues.apache.org/jira/browse/ORC-1083

@res-life
Collaborator Author

build

andygrove and others added 2 commits January 19, 2022 16:15
Signed-off-by: Chong Gao <res_life@163.com>
Member

@jlowe jlowe left a comment

I think this is pretty close, just a small nit in the test.

@tgravescs it would be good to have you take another look.

@res-life
Collaborator Author

build

@res-life
Collaborator Author

build

Member

@jlowe jlowe left a comment

Just minor nits remain, so this looks good to me. Would like to hear from @tgravescs before merging.

@res-life
Collaborator Author

build

@res-life
Collaborator Author

build

jlowe previously approved these changes Jan 28, 2022

Member

@jlowe jlowe left a comment

Thanks for all the hard work, @res-life! This looks good to me. @tgravescs can you take a look?

Collaborator

@tgravescs tgravescs left a comment

overall looks good, a couple of nits. It would also be nice to update the description to talk about the new common module and why it was added.

ORC_CORE_JAR=----workspace_${SPARK_MAJOR_VERSION_STRING}--maven-trees--hive-2.3__hadoop-2.7--org.apache.orc--orc-core--org.apache.orc__orc-core__1.5.12.jar
ORC_SHIM_JAR=----workspace_${SPARK_MAJOR_VERSION_STRING}--maven-trees--hive-2.3__hadoop-2.7--org.apache.orc--orc-shims--org.apache.orc__orc-shims__1.5.12.jar
ORC_MAPREDUCE_JAR=----workspace_${SPARK_MAJOR_VERSION_STRING}--maven-trees--hive-2.3__hadoop-2.7--org.apache.orc--orc-mapreduce--org.apache.orc__orc-mapreduce__1.5.12.jar
PROTOBUF_JAR=----workspace_${SPARK_MAJOR_VERSION_STRING}--maven-trees--hive-2.3__hadoop-2.7--com.google.protobuf--protobuf-java--com.google.protobuf__protobuf-java__2.6.1.jar
Collaborator

this jar looks the same as the one in the else statement; if so, move it out of the conditional

import org.apache.orc.impl.{DataReaderProperties, OutStream, SchemaEvolution}
import org.apache.orc.impl.RecordReaderImpl.SargApplier

// [301, 320) ORC shims
Collaborator

nit: not sure if the ) is meant as an exclusive upper bound ("until"); I think we can just remove this comment, since the name should be clear enough

@res-life
Collaborator Author

res-life commented Feb 7, 2022

build
