Conversation

@cloud-fan
Contributor

What changes were proposed in this pull request?

Some improvements:

  1. Point out that we are using both Spark SQL native syntax and HQL syntax in the example.
  2. Avoid giving the temp view the same name as the table, to not confuse users.
  3. Create the external Hive table over a directory that already has data, which is a more common use case (see the sketch after this list).
  4. Remove the usage of spark.sql.parquet.writeLegacyFormat. This config was introduced by [SPARK-10400] [SQL] Renames SQLConf.PARQUET_FOLLOW_PARQUET_FORMAT_SPEC #8566 and has nothing to do with Hive.
  5. Remove the repartition and coalesce example. These two are not Hive-specific; we should put them in a different example file. Besides, they cannot accurately control the number of output files: spark.sql.files.maxRecordsPerFile also controls it.
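
For items 1 and 3, a minimal sketch of the resulting pattern (the path and table name are illustrative placeholders, not the exact code in the example file):

```scala
import java.io.File
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Spark Hive Example")
  .enableHiveSupport() // required for the HQL statement below
  .getOrCreate()

// A directory that already holds Parquet data (illustrative path).
val dataDir = new File("/tmp/parquet_data")
spark.range(10).write.parquet(dataDir.getCanonicalPath)

// HQL syntax: create an external Hive table over the existing directory.
spark.sql(
  s"""CREATE EXTERNAL TABLE hive_bigints(id BIGINT)
     |STORED AS PARQUET
     |LOCATION '${dataDir.toURI}'""".stripMargin)

// Spark SQL native syntax: query it like any other table.
spark.sql("SELECT * FROM hive_bigints WHERE id > 5").show()
```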

How was this patch tested?

N/A

@chetkhatri
Contributor

@cloud-fan Thanks for the PR.
4. spark.sql.parquet.writeLegacyFormat - if you don't use this configuration, Hive external tables won't be able to access the Parquet data.
5. repartition and coalesce are the most common way in industry to control the number of files under a directory when partitioning data (see the sketch below).
If the data volume is very large, every partition will contain many small files, which may harm downstream query performance due to file I/O, bandwidth I/O, network I/O, and disk I/O.
Otherwise, I am good with your approach.
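
A minimal sketch of the pattern being described here (`df`, the path, and the file count are hypothetical; the config and API names are real):

```scala
// Point 4: write Parquet in the legacy format so Hive external tables
// can read the data back.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

// Point 5: fixing the task count before the write caps how many files
// can land under each partition directory.
df.repartition(8) // or df.coalesce(8) to shrink without a full shuffle
  .write
  .mode("overwrite")
  .partitionBy("dt")
  .parquet("/warehouse/events") // hypothetical output path
```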

@chetkhatri
Contributor

@cloud-fan spark.sql.files.maxRecordsPerFile didn't work out when I was working with my 30 TB Spark Hive workload, whereas repartition and coalesce made sense.

@cloud-fan
Contributor Author

spark.sql.parquet.writeLegacyFormat - if you don't use this configuration, Hive external tables won't be able to access the Parquet data.

Well, that's really an undocumented feature... Can you submit a PR to update the description of SQLConf.PARQUET_WRITE_LEGACY_FORMAT and add a test?

repartition and coalesce are the most common way in industry to control the number of files under a directory when partitioning data.

Yeah, I know, but that's not accurate: it assumes each task outputs exactly one file, which is not true if spark.sql.files.maxRecordsPerFile is set to a small number (see the sketch below). Anyway, this is not a Hive feature; we should probably put it in the SQL Programming Guide.
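
To illustrate the interaction (the numbers and path are made up; the config name is real):

```scala
// Each write task rolls over to a new file after this many records, so
// even with repartition(4) the job can emit more than 4 files.
spark.conf.set("spark.sql.files.maxRecordsPerFile", 1000000L)

df.repartition(4)      // 4 write tasks...
  .write
  .parquet("/tmp/out") // ...but potentially many more than 4 files
```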

@HyukjinKwon
Member

FYI, there is a JIRA for a doc about spark.sql.parquet.writeLegacyFormat - https://issues-test.apache.org/jira/plugins/servlet/mobile#issue/SPARK-20937

@SparkQA

SparkQA commented Dec 26, 2017

Test build #85392 has finished for PR 20081 at commit 10a80b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@chetkhatri
Contributor

@cloud-fan @srowen I am good with the proposed changes. Please merge.

@gatorsmile
Member

LGTM

@gatorsmile
Member

Thanks! Merged to master.

@asfgit asfgit closed this in 9348e68 Dec 26, 2017