[SPARK-22833][EXAMPLE] Improvement SparkHive Scala Examples #20081
Conversation
@cloud-fan Thanks for the PR.
@cloud-fan `spark.sql.files.maxRecordsPerFile` didn't work out when I was working with my 30 TB Spark Hive workload, whereas `repartition` and `coalesce` made sense.
Well, that's really an undocumented feature... Can you submit a PR to update the description of
Yeah, I know, but that's not accurate. It assumes each task would output one file, which is not true if
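To make the thread concrete, here is a minimal sketch of the two mechanisms being compared; the session, DataFrame, and output paths are illustrative placeholders, not code from this PR:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative session and data; not taken from the PR.
val spark = SparkSession.builder().appName("OutputFileCountSketch").getOrCreate()
val df = spark.range(10000000L).toDF("id")

// repartition/coalesce set the number of write tasks, so they bound the file
// count only under the assumption that each task emits exactly one file.
df.repartition(100).write.mode("overwrite").parquet("/tmp/out_repartition")
df.coalesce(10).write.mode("overwrite").parquet("/tmp/out_coalesce")

// spark.sql.files.maxRecordsPerFile caps records per file, so a single task
// may roll over and emit several files; that breaks the one-task-one-file
// assumption mentioned above.
spark.conf.set("spark.sql.files.maxRecordsPerFile", 1000000L)
df.write.mode("overwrite").parquet("/tmp/out_capped")
```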
FYI, there is a JIRA for a doc about |
Test build #85392 has finished for PR 20081 at commit
@cloud-fan @srowen I am good with the changes proposed. Please do merge.
gatorsmile left a comment
LGTM
Thanks! Merged to master.
What changes were proposed in this pull request?
Some improvements:
1. Remove `spark.sql.parquet.writeLegacyFormat`. This config was introduced by [SPARK-10400][SQL] Renames SQLConf.PARQUET_FOLLOW_PARQUET_FORMAT_SPEC (#8566) and has nothing to do with Hive; see the sketch after this list.
2. Remove the `repartition` and `coalesce` example. These two are not Hive specific; we should put them in a different example file. BTW, they can't accurately control the number of output files, since `spark.sql.files.maxRecordsPerFile` also controls it.
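For context, a minimal sketch of a Hive-backed write that works without the removed config; the table name and inserted values are hypothetical, not taken from the example file:

```scala
import org.apache.spark.sql.SparkSession

// spark.sql.parquet.writeLegacyFormat is a Parquet compatibility flag;
// a plain Hive table write/read like this one does not need it.
val spark = SparkSession.builder()
  .appName("SparkHiveSketch")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")
spark.sql("INSERT INTO src VALUES (1, 'val_1'), (2, 'val_2')")
spark.sql("SELECT key, value FROM src").show()
```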
How was this patch tested?

N/A