Conversation

@cloud-fan
Contributor

What changes were proposed in this pull request?

Some improvements:

  1. Point out that we are using both Spark SQL native syntax and HQL syntax in the example.
  2. Avoid giving the temp view the same name as the table, to not confuse users.
  3. Create the external Hive table over a directory that already has data, which is a more common use case (see the sketch after this list).
  4. Remove the usage of spark.sql.parquet.writeLegacyFormat. This config was introduced by [SPARK-10400] [SQL] Renames SQLConf.PARQUET_FOLLOW_PARQUET_FORMAT_SPEC #8566 and has nothing to do with Hive.
  5. Remove the repartition and coalesce example. These two are not Hive-specific; we should put them in a different example file. Besides, they cannot accurately control the number of output files: spark.sql.files.maxRecordsPerFile also controls it.
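
For items 1 and 3, a minimal sketch of the resulting pattern (the path and table name are illustrative placeholders, not the exact code in the example file):

```scala
import java.io.File
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Spark Hive Example")
  .enableHiveSupport() // required for the HQL statement below
  .getOrCreate()

// A directory that already holds Parquet data (illustrative path).
val dataDir = new File("/tmp/parquet_data")
spark.range(10).write.parquet(dataDir.getCanonicalPath)

// HQL syntax: create an external Hive table over the existing directory.
spark.sql(
  s"""CREATE EXTERNAL TABLE hive_bigints(id BIGINT)
     |STORED AS PARQUET
     |LOCATION '${dataDir.toURI}'""".stripMargin)

// Spark SQL native syntax: query it like any other table.
spark.sql("SELECT * FROM hive_bigints WHERE id > 5").show()
```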

How was this patch tested?

N/A

@chetkhatri
Contributor

@cloud-fan Thanks for the PR.
4. spark.sql.parquet.writeLegacyFormat - if you don't use this configuration, Hive external tables won't be able to access the Parquet data.
5. repartition and coalesce are the most common way in industry to control the number of files under a directory when partitioning data (see the sketch below).
If the data volume is very large, every partition will contain many small files, which may harm downstream query performance due to file I/O, bandwidth I/O, network I/O, and disk I/O.
Otherwise, I am good with your approach.
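
A minimal sketch of the pattern being described here (`df`, the path, and the file count are hypothetical; the config and API names are real):

```scala
// Point 4: write Parquet in the legacy format so Hive external tables
// can read the data back.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

// Point 5: fixing the task count before the write caps how many files
// can land under each partition directory.
df.repartition(8) // or df.coalesce(8) to shrink without a full shuffle
  .write
  .mode("overwrite")
  .partitionBy("dt")
  .parquet("/warehouse/events") // hypothetical output path
```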

@chetkhatri
Contributor

@cloud-fan spark.sql.files.maxRecordsPerFile didn't work out when I was working with my 30 TB Spark Hive workload, whereas repartition and coalesce made sense.

@cloud-fan
Contributor Author

spark.sql.parquet.writeLegacyFormat - if you don't use this configuration, Hive external tables won't be able to access the Parquet data.

Well, that's really an undocumented feature... Can you submit a PR to update the description of SQLConf.PARQUET_WRITE_LEGACY_FORMAT and add a test?

repartition and coalesce are the most common way in industry to control the number of files under a directory when partitioning data.

Yeah, I know, but that's not accurate: it assumes each task outputs exactly one file, which is not true if spark.sql.files.maxRecordsPerFile is set to a small number (see the sketch below). Anyway, this is not a Hive feature; we should probably put it in the SQL Programming Guide.
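
To illustrate the interaction (the numbers and path are made up; the config name is real):

```scala
// Each write task rolls over to a new file after this many records, so
// even with repartition(4) the job can emit more than 4 files.
spark.conf.set("spark.sql.files.maxRecordsPerFile", 1000000L)

df.repartition(4)      // 4 write tasks...
  .write
  .parquet("/tmp/out") // ...but potentially many more than 4 files
```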

@HyukjinKwon
Member

FYI, there is a JIRA for a doc about spark.sql.parquet.writeLegacyFormat - https://issues-test.apache.org/jira/plugins/servlet/mobile#issue/SPARK-20937

@SparkQA

SparkQA commented Dec 26, 2017

Test build #85392 has finished for PR 20081 at commit 10a80b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@chetkhatri
Contributor

@cloud-fan @srowen I am good with the proposed changes. Please merge.

@gatorsmile
Member

LGTM

@gatorsmile
Member

Thanks! Merged to master.

@asfgit asfgit closed this in 9348e68 Dec 26, 2017