
Conversation

@dongjoon-hyun (Member)

What changes were proposed in this pull request?

Before ORC 1.5.3, orc.dictionary.key.threshold and hive.exec.orc.dictionary.key.size.threshold were applied to all columns. This has been a big hurdle to enabling dictionary encoding. ORC 1.5.3 adds orc.column.encoding.direct to enforce direct encoding selectively in a column-wise manner. This PR aims to add that feature by upgrading ORC from 1.5.2 to 1.5.3.
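
For illustration, a minimal usage sketch (not code from this PR; the column name uniqColumn and the output path are made up, and a running SparkSession named spark is assumed):

```scala
// Sketch: dictionary-encode wherever the threshold allows, but force
// direct encoding on the high-cardinality column `uniqColumn`.
spark.range(100)
  .selectExpr("id", "uuid() AS uniqColumn")
  .write
  .option("orc.dictionary.key.threshold", "1.0")      // prefer dictionary encoding
  .option("orc.column.encoding.direct", "uniqColumn") // except for this column
  .orc("/tmp/orc_selective_demo")
```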

The following are the patches in ORC 1.5.3, and this feature is the only one directly related to Spark.

ORC-406: ORC: Char(n) and Varchar(n) writers truncate to n bytes & corrupts multi-byte data (gopalv)
ORC-403: [C++] Add checks to avoid invalid offsets in InputStream
ORC-405: Remove calcite as a dependency from the benchmarks.
ORC-375: Fix libhdfs on gcc7 by adding #include <functional> two places.
ORC-383: Parallel builds fails with ConcurrentModificationException
ORC-382: Apache rat exclusions + add rat check to travis
ORC-401: Fix incorrect quoting in specification.
ORC-385: Change RecordReader to extend Closeable.
ORC-384: [C++] fix memory leak when loading non-ORC files
ORC-391: [c++] parseType does not accept underscore in the field name
ORC-397: Allow selective disabling of dictionary encoding. Original patch was by Mithun Radhakrishnan.
ORC-389: Add ability to not decode Acid metadata columns

How was this patch tested?

Pass the Jenkins with newly added test cases.

@SparkQA

SparkQA commented Oct 4, 2018

Test build #96907 has finished for PR 22622 at commit 39b7fd6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// Check the kind
val stripe = recordReader.readStripeFooter(reader.getStripes.get(0))
if (isSelective) {
  assert(stripe.getColumns(1).getKind === DICTIONARY_V2)
Member:

@dongjoon-hyun, how about:

assert(stripe.getColumns(1).getKind === DICTIONARY_V2)
assert(stripe.getColumns(3).getKind === DIRECT)
if (isSelective) {
  assert(stripe.getColumns(2).getKind === DIRECT_V2)
} else {
  assert(stripe.getColumns(2).getKind === DICTIONARY_V2)
}

Member Author:

For this, I will update like the following.

          assert(stripe.getColumns(1).getKind === DICTIONARY_V2)
          if (isSelective) {
            assert(stripe.getColumns(2).getKind === DIRECT_V2)
          } else {
            assert(stripe.getColumns(2).getKind === DICTIONARY_V2)
          }
          assert(stripe.getColumns(3).getKind === DIRECT)

}

test("Enforce direct encoding column-wise selectively") {
  testSelectiveDictionaryEncoding(true)
Member:

How about testSelectiveDictionaryEncoding(isSelective = true)?

|OPTIONS (
| path '${dir.toURI}',
| orc.dictionary.key.threshold '1.0',
| orc.column.encoding.direct 'uuid'
Member:

How about changing the column name? I thought it was some kind of enum representing the encoding.

@dongjoon-hyun (Member Author)

Thank you for the review, @HyukjinKwon. Sure, I'll update it like that.

@dongjoon-hyun (Member Author)

Could you review this, @gatorsmile and @cloud-fan ?

@SparkQA

SparkQA commented Oct 4, 2018

Test build #96949 has finished for PR 22622 at commit 65ac786.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author)

Retest this please.


// Check the kind
val stripe = recordReader.readStripeFooter(reader.getStripes.get(0))
assert(stripe.getColumns(1).getKind === DICTIONARY_V2)
Member:

Could you write some comments to explain what DICTIONARY_V2, DIRECT_V2, and DIRECT are?

Member Author:

Sure!
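
(For reference, a paraphrased sketch of the kind of comments requested; the exact wording that landed in the test may differ.)

```scala
// ORC column encoding kinds checked in this test:
//   DIRECT        - values stored directly, without a dictionary (RLE v1)
//   DIRECT_V2     - values stored directly, using RLE v2
//   DICTIONARY_V2 - values stored via a dictionary, using RLE v2
```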

test("Enforce direct encoding column-wise selectively") {
Seq(true, false).foreach { convertMetastore =>
withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> s"$convertMetastore") {
testSelectiveDictionaryEncoding(isSelective = false)
Member:

So even with CONVERT_METASTORE_ORC as true, we still can't use selective direct encoding?

Member Author:

Yep. This is based on the current behavior, which is a little related to your CTAS PR. Only the read path works as expected.

Member Author:

When we change the Spark behavior later, this test will be adapted accordingly.

Member:

Ok. I see. Thanks.

|OPTIONS (
| path '${dir.toURI}',
| orc.dictionary.key.threshold '1.0',
| orc.column.encoding.direct 'uniqColumn'
Member:

This new feature needs a doc update. We need to let our end users know how to use it.

Member Author:

Ur, Apache ORC is an independent Apache project with its own website and documentation. We should respect that. If we introduce new ORC configurations one by one on the Apache Spark website, the Spark documentation will eventually duplicate the Apache ORC documentation.

We had better guide ORC fans to the Apache ORC website. If something is missing there, they can file an ORC JIRA, not a SPARK JIRA.

Member:

I am fine either way. However, our current doc does not explain that we pass data-source-specific options through to the underlying data source:

https://spark.apache.org/docs/latest/sql-programming-guide.html#manually-specifying-options

Could you help improve it?

Member:

Also give an example?

Member Author:

That sounds like a different issue. This PR covers both the TBLPROPERTIES and OPTIONS syntaxes, which were historically designed for that configuration purpose. I mean, this is not a data-source-specific PR. Also, the scope of this PR is only write-side configurations.

In any case, +1 for adding an introduction section with both Parquet and ORC examples there. We had better give both read-side and write-side configuration examples, too. Could you file a JIRA issue for that?

Member Author:

Maybe dictionary encoding could be a good candidate: parquet.enable.dictionary and orc.dictionary.key.threshold, et al.
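
For instance, a hypothetical sketch of what such an example section might show (`df` is an arbitrary DataFrame and the paths are placeholders):

```scala
// Library-specific options pass straight through the generic option() API
// to the underlying Parquet/ORC writers.
df.write.option("parquet.enable.dictionary", "false").parquet("/tmp/parquet_demo")
df.write.option("orc.dictionary.key.threshold", "1.0").orc("/tmp/orc_demo")
```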


@SparkQA

SparkQA commented Oct 5, 2018

Test build #96963 has finished for PR 22622 at commit 70016e4.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author)

The build failure is unrelated to this PR.

```
[error] (hive-thriftserver/compile:compileIncremental) javac returned nonzero exit code
[error] Total time: 587 s, completed Oct 4, 2018 8:11:23 PM
```

@HyukjinKwon (Member)

retest this please

@HyukjinKwon (Member)

Which now looks fixed in 5ae20cf.

@dongjoon-hyun (Member Author)

Thank you, @HyukjinKwon !

@SparkQA

SparkQA commented Oct 5, 2018

Test build #96957 has finished for PR 22622 at commit 65ac786.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 5, 2018

Test build #96964 has finished for PR 22622 at commit 70016e4.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal (Contributor)

retest this please

@SparkQA

SparkQA commented Oct 5, 2018

Test build #96977 has finished for PR 22622 at commit 70016e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author)

dongjoon-hyun commented Oct 5, 2018

@gatorsmile. Could you review this again? For your comment, I filed SPARK-25656, "Add an example section about how to use Parquet/ORC library options".

@gatorsmile (Member)

gatorsmile commented Oct 5, 2018

LGTM

Thanks! Merged to master.

asfgit closed this in 1c9486c on Oct 5, 2018
@dongjoon-hyun (Member Author)

Thank you, @gatorsmile, @HyukjinKwon , @viirya , @dilipbiswal !

asfgit pushed a commit that referenced this pull request Oct 23, 2018
…ata source options

## What changes were proposed in this pull request?

Our current doc does not explain how we are passing the data source specific options to the underlying data source. According to [the review comment](#22622 (comment)), this PR aims to add more detailed information and examples.

## How was this patch tested?

Manual.

Closes #22801 from dongjoon-hyun/SPARK-25656.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
asfgit pushed a commit that referenced this pull request Oct 25, 2018
…bout extra data source options

## What changes were proposed in this pull request?

Our current doc does not explain how we are passing the data source specific options to the underlying data source. According to [the review comment](#22622 (comment)), this PR aims to add more detailed information and examples. This is a backport of #22801. `orc.column.encoding.direct` is removed since it's not supported in ORC 1.5.2.

## How was this patch tested?

Manual.

Closes #22839 from dongjoon-hyun/SPARK-25656-2.4.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
… ORC write

## What changes were proposed in this pull request?

Before ORC 1.5.3, `orc.dictionary.key.threshold` and `hive.exec.orc.dictionary.key.size.threshold` were applied to all columns. This has been a big hurdle to enabling dictionary encoding. ORC 1.5.3 adds `orc.column.encoding.direct` to enforce direct encoding selectively in a column-wise manner. This PR aims to add that feature by upgrading ORC from 1.5.2 to 1.5.3.

The following are the patches in ORC 1.5.3, and this feature is the only one directly related to Spark.
```
ORC-406: ORC: Char(n) and Varchar(n) writers truncate to n bytes & corrupts multi-byte data (gopalv)
ORC-403: [C++] Add checks to avoid invalid offsets in InputStream
ORC-405: Remove calcite as a dependency from the benchmarks.
ORC-375: Fix libhdfs on gcc7 by adding #include <functional> two places.
ORC-383: Parallel builds fails with ConcurrentModificationException
ORC-382: Apache rat exclusions + add rat check to travis
ORC-401: Fix incorrect quoting in specification.
ORC-385: Change RecordReader to extend Closeable.
ORC-384: [C++] fix memory leak when loading non-ORC files
ORC-391: [c++] parseType does not accept underscore in the field name
ORC-397: Allow selective disabling of dictionary encoding. Original patch was by Mithun Radhakrishnan.
ORC-389: Add ability to not decode Acid metadata columns
```

## How was this patch tested?

Pass the Jenkins with newly added test cases.

Closes apache#22622 from dongjoon-hyun/SPARK-25635.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…ata source options

## What changes were proposed in this pull request?

Our current doc does not explain how we are passing the data source specific options to the underlying data source. According to [the review comment](apache#22622 (comment)), this PR aims to add more detailed information and examples.

## How was this patch tested?

Manual.

Closes apache#22801 from dongjoon-hyun/SPARK-25656.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>