[SPARK-25635][SQL][BUILD] Support selective direct encoding in native ORC write #22622
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Test build #96907 has finished for PR 22622 at commit
```scala
// Check the kind
val stripe = recordReader.readStripeFooter(reader.getStripes.get(0))
if (isSelective) {
  assert(stripe.getColumns(1).getKind === DICTIONARY_V2)
```
@dongjoon-hyun, how about:

```scala
assert(stripe.getColumns(1).getKind === DICTIONARY_V2)
assert(stripe.getColumns(3).getKind === DIRECT)
if (isSelective) {
  assert(stripe.getColumns(2).getKind === DIRECT_V2)
} else {
  assert(stripe.getColumns(2).getKind === DICTIONARY_V2)
}
```
For this, I will update it like the following.

```scala
assert(stripe.getColumns(1).getKind === DICTIONARY_V2)
if (isSelective) {
  assert(stripe.getColumns(2).getKind === DIRECT_V2)
} else {
  assert(stripe.getColumns(2).getKind === DICTIONARY_V2)
}
assert(stripe.getColumns(3).getKind === DIRECT)
```
| test("Enforce direct encoding column-wise selectively") { | ||
| testSelectiveDictionaryEncoding(true) |
How about `testSelectiveDictionaryEncoding(isSelective = true)`?
```scala
|OPTIONS (
|  path '${dir.toURI}',
|  orc.dictionary.key.threshold '1.0',
|  orc.column.encoding.direct 'uuid'
```
How about changing the column name? I thought it was some kind of enum representing an encoding.
Thank you for the review, @HyukjinKwon. Sure, I'll update it like that.
Could you review this, @gatorsmile and @cloud-fan?
Test build #96949 has finished for PR 22622 at commit
Retest this please.
```scala
// Check the kind
val stripe = recordReader.readStripeFooter(reader.getStripes.get(0))
assert(stripe.getColumns(1).getKind === DICTIONARY_V2)
```
Could you write some comments to explain what `DICTIONARY_V2`, `DIRECT_V2`, and `DIRECT` are?
Sure!
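For reference, a sketch of what such comments could say, based on the definitions of ORC's `ColumnEncoding.Kind` constants (my wording, not necessarily what the PR ended up adding):

```scala
// Sketch of explanatory comments for the encoding-kind assertions.
// The constants are ORC's OrcProto.ColumnEncoding.Kind values:
//   DIRECT        - values are written out directly, using RLE v1.
//   DICTIONARY    - values are de-duplicated through a dictionary, using RLE v1.
//   DIRECT_V2     - direct encoding using the newer RLE v2.
//   DICTIONARY_V2 - dictionary encoding using the newer RLE v2.
// With orc.dictionary.key.threshold = 1.0, every string column stays
// dictionary-encoded, so seeing DIRECT_V2 on column 2 proves that
// orc.column.encoding.direct forced direct encoding for that column.
```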
| test("Enforce direct encoding column-wise selectively") { | ||
| Seq(true, false).foreach { convertMetastore => | ||
| withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> s"$convertMetastore") { | ||
| testSelectiveDictionaryEncoding(isSelective = false) |
So even with CONVERT_METASTORE_ORC set to true, we still can't use selective direct encoding?
Yep. This is based on the current behavior, which is somewhat related to your CTAS PR: only the read path works as expected.
When we change Spark's behavior later, this test will be adapted accordingly.
Ok. I see. Thanks.
```scala
|OPTIONS (
|  path '${dir.toURI}',
|  orc.dictionary.key.threshold '1.0',
|  orc.column.encoding.direct 'uniqColumn'
```
This new feature needs a doc update. We need to let our end users know how to use it.
Ur, Apache ORC is an independent Apache project with its own website and documentation. We should respect that. If we introduce new ORC configurations one by one in the Apache Spark documentation, it will eventually duplicate the Apache ORC documentation inside the Spark docs.
We had better guide ORC fans to the Apache ORC website. If something is missing there, they can file an ORC JIRA, not a SPARK JIRA.
I am fine either way. However, our current doc does not explain that we pass data-source-specific options through to the underlying data source:
https://spark.apache.org/docs/latest/sql-programming-guide.html#manually-specifying-options
Could you help improve it?
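As a rough illustration, here is the kind of example such a doc section could show, assuming an active SparkSession named `spark`; the output path and column names are made up:

```scala
// Hypothetical example of passing format-specific library options through
// the generic option() API; Spark hands keys it does not recognize down to
// the underlying data source (here, the ORC writer).
val df = spark.range(100).selectExpr("id", "cast(id % 10 as string) AS category")

df.write
  .mode("overwrite")
  .option("orc.dictionary.key.threshold", "1.0") // keep dictionary encoding
  .orc("/tmp/orc_options_example")               // made-up path

val reloaded = spark.read.orc("/tmp/orc_options_example")
```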
Also give an example?
That sounds like a different issue. This PR covers both the TBLPROPERTIES and OPTIONS syntaxes, which were historically designed for that configuration purpose. I mean, this is not a data-source-specific PR. Also, the scope of this PR is only write-side configurations.
In any case, +1 for adding an introduction section with both Parquet and ORC examples there. We had better give both read-side and write-side configuration examples, too. Could you file a JIRA issue for that?
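For context, a rough sketch of the two syntaxes being discussed, with hypothetical table names and path; per the comment above, both routes can carry ORC writer configurations:

```scala
// OPTIONS on a data source table (hypothetical names/path):
spark.sql(
  """CREATE TABLE orc_with_options (id INT, str STRING)
    |USING ORC
    |OPTIONS (
    |  path '/tmp/orc_with_options',
    |  orc.dictionary.key.threshold '1.0'
    |)
  """.stripMargin)

// TBLPROPERTIES on a Hive-style table (hypothetical):
spark.sql(
  """CREATE TABLE orc_with_props (id INT, str STRING)
    |STORED AS ORC
    |TBLPROPERTIES ('orc.dictionary.key.threshold' = '1.0')
  """.stripMargin)
```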
Maybe dictionary encoding could be a good candidate; `parquet.enable.dictionary` and `orc.dictionary.key.threshold`, et al.
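For instance, a minimal sketch of a Parquet counterpart, again assuming an active SparkSession named `spark` (the output path is made up):

```scala
// Hypothetical: parquet.enable.dictionary is a Parquet library option that
// Spark forwards into the underlying writer's configuration.
spark.range(100)
  .selectExpr("id", "cast(id % 3 as string) AS category")
  .write
  .mode("overwrite")
  .option("parquet.enable.dictionary", "true")
  .parquet("/tmp/parquet_dictionary_example")
```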
https://issues.apache.org/jira/browse/SPARK-25656 is created for that.
Test build #96963 has finished for PR 22622 at commit
The build failure is unrelated to this PR.
retest this please
Which looks fixed now in 5ae20cf.
Thank you, @HyukjinKwon!
Test build #96957 has finished for PR 22622 at commit

Test build #96964 has finished for PR 22622 at commit
retest this please
Test build #96977 has finished for PR 22622 at commit
@gatorsmile, could you review this again? For your comment, I filed SPARK-25656, "Add an example section about how to use Parquet/ORC library options".
LGTM. Thanks! Merged to master.
Thank you, @gatorsmile, @HyukjinKwon, @viirya, @dilipbiswal!
What changes were proposed in this pull request?

Before ORC 1.5.3, `orc.dictionary.key.threshold` and `hive.exec.orc.dictionary.key.size.threshold` were applied to all columns. This has been a big hurdle to enabling dictionary encoding. Starting with ORC 1.5.3, `orc.column.encoding.direct` can be used to enforce direct encoding selectively, in a column-wise manner. This PR adds that feature by upgrading ORC from 1.5.2 to 1.5.3. The following are the patches in ORC 1.5.3; this feature is the only one directly related to Spark.

```
ORC-406: ORC: Char(n) and Varchar(n) writers truncate to n bytes & corrupts multi-byte data (gopalv)
ORC-403: [C++] Add checks to avoid invalid offsets in InputStream
ORC-405: Remove calcite as a dependency from the benchmarks.
ORC-375: Fix libhdfs on gcc7 by adding #include <functional> two places.
ORC-383: Parallel builds fails with ConcurrentModificationException
ORC-382: Apache rat exclusions + add rat check to travis
ORC-401: Fix incorrect quoting in specification.
ORC-385: Change RecordReader to extend Closeable.
ORC-384: [C++] fix memory leak when loading non-ORC files
ORC-391: [c++] parseType does not accept underscore in the field name
ORC-397: Allow selective disabling of dictionary encoding. Original patch was by Mithun Radhakrishnan.
ORC-389: Add ability to not decode Acid metadata columns
```

How was this patch tested?

Passed Jenkins with newly added test cases.
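To see the feature end to end, here is a minimal, self-contained sketch; the path, schema, and column names are illustrative, and it assumes ORC 1.5.3+ (i.e., this PR) on the classpath:

```scala
import org.apache.spark.sql.SparkSession

object SelectiveDirectEncodingExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("selective-direct-encoding")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Two string columns with identical, highly repetitive data: both would
    // normally be dictionary-encoded under a 1.0 threshold.
    val df = (0 until 1000)
      .map(i => ((i % 10).toString, (i % 10).toString))
      .toDF("dictCol", "uniqCol")

    df.write
      .mode("overwrite")
      .option("orc.dictionary.key.threshold", "1.0")   // always keep dictionaries...
      .option("orc.column.encoding.direct", "uniqCol") // ...except for uniqCol (ORC 1.5.3+)
      .orc("/tmp/selective_direct_encoding_example")

    spark.stop()
  }
}
```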