@tejasapatil

What changes were proposed in this pull request?

Currently Spark does not respect bucketing for Hive tables. This PR includes the following changes:

  • HiveClientImpl now extracts the table's bucketing information.
  • When writing table info to the metastore, MetastoreRelation now populates the bucketing information in the Hive Table object.
  • HiveTableScanExec now exposes outputPartitioning and outputOrdering as per the bucketing spec.
  • InsertIntoHiveTable now exposes requiredChildDistribution and requiredChildOrdering based on the target table's bucketing spec.
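
To make the intent of the last two bullets concrete, here is a minimal sketch in Python of how a bucketing spec could drive what a scan reports and what an insert requires. It is an illustrative model only, not Spark's code: `BucketSpec`, `scan_output_properties`, and `insert_requirements` are hypothetical names, and the tuples stand in for Spark's partitioning and distribution classes. The rule that a scan only claims hash partitioning when every bucket column is in its output is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class BucketSpec:
    """Hypothetical stand-in for a table's bucketing metadata."""
    num_buckets: int
    bucket_columns: Tuple[str, ...]
    sort_columns: Tuple[str, ...] = ()


def scan_output_properties(spec: Optional[BucketSpec],
                           output_columns: Sequence[str]):
    """Return (partitioning, ordering) a scan of the table could report.

    Assumption: the partitioning is only claimed when all bucket columns
    are part of the scan output; otherwise nothing is promised.
    """
    if spec is None:
        return None, ()
    available = set(output_columns)
    if not all(c in available for c in spec.bucket_columns):
        return None, ()
    partitioning = ("hash", spec.bucket_columns, spec.num_buckets)
    # Only the leading sort columns that survive projection can be claimed.
    ordering = []
    for column in spec.sort_columns:
        if column not in available:
            break
        ordering.append(column)
    return partitioning, tuple(ordering)


def insert_requirements(spec: Optional[BucketSpec]):
    """Return (required distribution, required ordering) for an insert."""
    if spec is None:
        return None, ()
    return ("clustered", spec.bucket_columns), spec.sort_columns
```

For example, with `BucketSpec(8, ("user_id",), ("user_id", "ts"))`, a scan that projects only `user_id` and `name` would report hash partitioning on `user_id` into 8 buckets and an ordering on `user_id` alone, since `ts` is not in its output.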

TODOs (which will be done in linked PRs and not this one):

  • ClusteredDistribution does not guarantee the number of partitions generated (which corresponds to the number of output bucket files created). Fixing this will require adding strict guarantees to ClusteredDistribution. I think this needs more thought and is better done incrementally rather than packed into this PR.
  • While writing to bucketed files, Hive's hashing function should be used. I have a PR open to implement Hive hashing natively in Spark: [SPARK-17495] [SQL] Add Hash capability semantically equivalent to Hive's #15047
  • Allow creating Hive bucketed tables
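
To show why the hashing function matters for the second TODO, here is a rough Python sketch of Hive-style bucket assignment for integer and string columns. It reflects my understanding of Hive's classic bucketing scheme (a 31-based rolling hash over field hashes, then masking the sign bit before taking the modulus) and should be treated as an assumption to verify against Hive's source, not as a reference implementation. All function names are hypothetical. A writer that uses a different hash would place rows in different bucket files than Hive expects.

```python
def to_int32(x: int) -> int:
    """Wrap a Python int to a signed 32-bit value, like Java int overflow."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x


def hive_hash_int(value: int) -> int:
    """Assumed Hive behaviour: an int column hashes to its own value."""
    return to_int32(value)


def hive_hash_string(value: str) -> int:
    """Assumed Hive behaviour: 31-based rolling hash over the UTF-8 bytes."""
    h = 0
    for b in value.encode("utf-8"):
        signed = b - 256 if b >= 128 else b  # Java bytes are signed
        h = to_int32(h * 31 + signed)
    return h


def hive_hash_row(field_hashes) -> int:
    """Combine per-column hashes of the bucketing columns, in order."""
    h = 0
    for field_hash in field_hashes:
        h = to_int32(h * 31 + field_hash)
    return h


def hive_bucket_id(row_hash: int, num_buckets: int) -> int:
    """Clear the sign bit, then take the modulus to pick a bucket."""
    return (row_hash & 0x7FFFFFFF) % num_buckets


# A row bucketed on a single string column with 8 buckets:
print(hive_bucket_id(hive_hash_string("abc"), 8))  # prints 2
```

Because the bucket id is always in `[0, num_buckets)`, a correct writer must also produce exactly `num_buckets` output partitions, which is the guarantee the first TODO says ClusteredDistribution does not yet provide.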

How was this patch tested?

Tested with Hive tables created locally. Adding a new test case would require implementing bucketed table creation, which is not supported :( Suggestions welcome.

@tejasapatil tejasapatil force-pushed the SPARK-17654_hive_extract_bucketing branch from caef89a to 7c38252 September 24, 2016 04:31
@SparkQA

SparkQA commented Sep 24, 2016

Test build #65857 has finished for PR 15228 at commit caef89a.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@tejasapatil tejasapatil deleted the SPARK-17654_hive_extract_bucketing branch September 24, 2016 05:38