
Conversation

@dongjoon-hyun (Member)

What changes were proposed in this pull request?

Before ORC 1.5.3, orc.dictionary.key.threshold and hive.exec.orc.dictionary.key.size.threshold were applied to all columns. This has been a big hurdle to enabling dictionary encoding. ORC 1.5.3 adds orc.column.encoding.direct to enforce direct encoding selectively in a column-wise manner. This PR aims to add that feature by upgrading ORC from 1.5.2 to 1.5.3.
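
For illustration, a minimal usage sketch (not code from this PR; the column name uniqColumn and the output path are made up, and a running SparkSession named spark is assumed):

```scala
// Sketch: dictionary-encode wherever the threshold allows, but force
// direct encoding on the high-cardinality column `uniqColumn`.
spark.range(100)
  .selectExpr("id", "uuid() AS uniqColumn")
  .write
  .option("orc.dictionary.key.threshold", "1.0")      // prefer dictionary encoding
  .option("orc.column.encoding.direct", "uniqColumn") // except for this column
  .orc("/tmp/orc_selective_demo")
```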

The following are the patches in ORC 1.5.3, and this feature is the only one directly related to Spark.

ORC-406: ORC: Char(n) and Varchar(n) writers truncate to n bytes & corrupts multi-byte data (gopalv)
ORC-403: [C++] Add checks to avoid invalid offsets in InputStream
ORC-405: Remove calcite as a dependency from the benchmarks.
ORC-375: Fix libhdfs on gcc7 by adding #include <functional> two places.
ORC-383: Parallel builds fails with ConcurrentModificationException
ORC-382: Apache rat exclusions + add rat check to travis
ORC-401: Fix incorrect quoting in specification.
ORC-385: Change RecordReader to extend Closeable.
ORC-384: [C++] fix memory leak when loading non-ORC files
ORC-391: [c++] parseType does not accept underscore in the field name
ORC-397: Allow selective disabling of dictionary encoding. Original patch was by Mithun Radhakrishnan.
ORC-389: Add ability to not decode Acid metadata columns

How was this patch tested?

Pass the Jenkins with newly added test cases.

@SparkQA

SparkQA commented Oct 4, 2018

Test build #96907 has finished for PR 22622 at commit 39b7fd6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// Check the kind
val stripe = recordReader.readStripeFooter(reader.getStripes.get(0))
if (isSelective) {
  assert(stripe.getColumns(1).getKind === DICTIONARY_V2)
Member:

@dongjoon-hyun, how about:

assert(stripe.getColumns(1).getKind === DICTIONARY_V2)
assert(stripe.getColumns(3).getKind === DIRECT)
if (isSelective) {
  assert(stripe.getColumns(2).getKind === DIRECT_V2)
} else {
  assert(stripe.getColumns(2).getKind === DICTIONARY_V2)
}

Member Author:

For this, I will update like the following.

          assert(stripe.getColumns(1).getKind === DICTIONARY_V2)
          if (isSelective) {
            assert(stripe.getColumns(2).getKind === DIRECT_V2)
          } else {
            assert(stripe.getColumns(2).getKind === DICTIONARY_V2)
          }
          assert(stripe.getColumns(3).getKind === DIRECT)

}

test("Enforce direct encoding column-wise selectively") {
  testSelectiveDictionaryEncoding(true)
Member:

How about testSelectiveDictionaryEncoding(isSelective = true)?

|OPTIONS (
| path '${dir.toURI}',
| orc.dictionary.key.threshold '1.0',
| orc.column.encoding.direct 'uuid'
Member:

How about changing the column name? I thought it was some kind of enum representing the encoding.

@dongjoon-hyun (Member Author)

Thank you for the review, @HyukjinKwon. Sure, I'll update it like that.

@dongjoon-hyun (Member Author)

Could you review this, @gatorsmile and @cloud-fan ?

@SparkQA

SparkQA commented Oct 4, 2018

Test build #96949 has finished for PR 22622 at commit 65ac786.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author)

Retest this please.


// Check the kind
val stripe = recordReader.readStripeFooter(reader.getStripes.get(0))
assert(stripe.getColumns(1).getKind === DICTIONARY_V2)
Member:

Could you write some comments to explain what DICTIONARY_V2, DIRECT_V2, and DIRECT are?

Member Author:

Sure!
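
(For reference, a paraphrased sketch of the kind of comments requested; the exact wording that landed in the test may differ.)

```scala
// ORC column encoding kinds checked in this test:
//   DIRECT        - values stored directly, without a dictionary (RLE v1)
//   DIRECT_V2     - values stored directly, using RLE v2
//   DICTIONARY_V2 - values stored via a dictionary, using RLE v2
```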

test("Enforce direct encoding column-wise selectively") {
Seq(true, false).foreach { convertMetastore =>
withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> s"$convertMetastore") {
testSelectiveDictionaryEncoding(isSelective = false)
Member:

So even with CONVERT_METASTORE_ORC as true, we still can't use selective direct encoding?

Member Author:

Yep. This is based on the current behavior, which is a little related to your CTAS PR. Only the read path works as expected.

Member Author:

When we change the Spark behavior later, this test will be adapted accordingly.

Member:

Ok. I see. Thanks.

|OPTIONS (
| path '${dir.toURI}',
| orc.dictionary.key.threshold '1.0',
| orc.column.encoding.direct 'uniqColumn'
Member:

This new feature needs a doc update. We need to let our end users know how to use it.

Member Author:

Ur, Apache ORC is an independent Apache project with its own website and documentation. We should respect that. If we introduce new ORC configurations one by one on the Apache Spark website, the Spark documentation will eventually duplicate the Apache ORC documentation.

We had better guide ORC fans to the Apache ORC website. If something is missing there, they can file an ORC JIRA, not a SPARK JIRA.

Member:

I am fine either way. However, our current doc does not explain that we pass data-source-specific options through to the underlying data source:

https://spark.apache.org/docs/latest/sql-programming-guide.html#manually-specifying-options

Could you help improve it?

Member:

Also give an example?

Member Author:

That sounds like a different issue. This PR covers both the TBLPROPERTIES and OPTIONS syntaxes, which were historically designed for that configuration purpose. I mean, this is not a data-source-specific PR. Also, the scope of this PR is only write-side configurations.

In any case, +1 for adding an introduction section with both Parquet and ORC examples there. We had better give both read-side and write-side configuration examples, too. Could you file a JIRA issue for that?

Member Author:

Maybe dictionary encoding could be a good candidate: parquet.enable.dictionary and orc.dictionary.key.threshold, et al.
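
For instance, a hypothetical sketch of what such an example section might show (`df` is an arbitrary DataFrame and the paths are placeholders):

```scala
// Library-specific options pass straight through the generic option() API
// to the underlying Parquet/ORC writers.
df.write.option("parquet.enable.dictionary", "false").parquet("/tmp/parquet_demo")
df.write.option("orc.dictionary.key.threshold", "1.0").orc("/tmp/orc_demo")
```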


@SparkQA

SparkQA commented Oct 5, 2018

Test build #96963 has finished for PR 22622 at commit 70016e4.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author)

The build failure is unrelated to this PR.

```
[error] (hive-thriftserver/compile:compileIncremental) javac returned nonzero exit code
[error] Total time: 587 s, completed Oct 4, 2018 8:11:23 PM
```

@HyukjinKwon (Member)

retest this please

@HyukjinKwon (Member)

Which now looks fixed in 5ae20cf.

@dongjoon-hyun (Member Author)

Thank you, @HyukjinKwon !

@SparkQA

SparkQA commented Oct 5, 2018

Test build #96957 has finished for PR 22622 at commit 65ac786.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 5, 2018

Test build #96964 has finished for PR 22622 at commit 70016e4.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal (Contributor)

retest this please

@SparkQA

SparkQA commented Oct 5, 2018

Test build #96977 has finished for PR 22622 at commit 70016e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author)

dongjoon-hyun commented Oct 5, 2018

@gatorsmile. Could you review this again? For your comment, I filed SPARK-25656, "Add an example section about how to use Parquet/ORC library options".

@gatorsmile (Member)

gatorsmile commented Oct 5, 2018

LGTM

Thanks! Merged to master.

asfgit closed this in 1c9486c on Oct 5, 2018
@dongjoon-hyun (Member Author)

Thank you, @gatorsmile, @HyukjinKwon , @viirya , @dilipbiswal !

asfgit pushed a commit that referenced this pull request Oct 23, 2018
…ata source options

## What changes were proposed in this pull request?

Our current doc does not explain how we are passing the data source specific options to the underlying data source. According to [the review comment](#22622 (comment)), this PR aims to add more detailed information and examples.

## How was this patch tested?

Manual.

Closes #22801 from dongjoon-hyun/SPARK-25656.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
asfgit pushed a commit that referenced this pull request Oct 25, 2018
…bout extra data source options

## What changes were proposed in this pull request?

Our current doc does not explain how we are passing the data source specific options to the underlying data source. According to [the review comment](#22622 (comment)), this PR aims to add more detailed information and examples. This is a backport of #22801. `orc.column.encoding.direct` is removed since it's not supported in ORC 1.5.2.

## How was this patch tested?

Manual.

Closes #22839 from dongjoon-hyun/SPARK-25656-2.4.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
… ORC write

## What changes were proposed in this pull request?

Before ORC 1.5.3, `orc.dictionary.key.threshold` and `hive.exec.orc.dictionary.key.size.threshold` were applied to all columns. This has been a big hurdle to enabling dictionary encoding. ORC 1.5.3 adds `orc.column.encoding.direct` to enforce direct encoding selectively in a column-wise manner. This PR aims to add that feature by upgrading ORC from 1.5.2 to 1.5.3.

The following are the patches in ORC 1.5.3, and this feature is the only one directly related to Spark.
```
ORC-406: ORC: Char(n) and Varchar(n) writers truncate to n bytes & corrupts multi-byte data (gopalv)
ORC-403: [C++] Add checks to avoid invalid offsets in InputStream
ORC-405: Remove calcite as a dependency from the benchmarks.
ORC-375: Fix libhdfs on gcc7 by adding #include <functional> two places.
ORC-383: Parallel builds fails with ConcurrentModificationException
ORC-382: Apache rat exclusions + add rat check to travis
ORC-401: Fix incorrect quoting in specification.
ORC-385: Change RecordReader to extend Closeable.
ORC-384: [C++] fix memory leak when loading non-ORC files
ORC-391: [c++] parseType does not accept underscore in the field name
ORC-397: Allow selective disabling of dictionary encoding. Original patch was by Mithun Radhakrishnan.
ORC-389: Add ability to not decode Acid metadata columns
```

## How was this patch tested?

Pass the Jenkins with newly added test cases.

Closes apache#22622 from dongjoon-hyun/SPARK-25635.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…ata source options

## What changes were proposed in this pull request?

Our current doc does not explain how we are passing the data source specific options to the underlying data source. According to [the review comment](apache#22622 (comment)), this PR aims to add more detailed information and examples.

## How was this patch tested?

Manual.

Closes apache#22801 from dongjoon-hyun/SPARK-25656.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>