@tejasapatil
Contributor

What changes were proposed in this pull request?

I am looking for early feedback on the approach taken in this change.

Here is what this PR contains:

  • Introduced BucketingInfoExtractor, which has functions to
    • extract the bucket id from a given filename
    • generate the filename, given a bucket id and other info
  • BucketingInfoExtractor is part of BucketSpec, since the same session can process both native Spark tables and pure Hive tables
  • Provided a default impl which adheres to Spark's current naming scheme
  • All codepaths which write a file now use the BucketingInfoExtractor to get a filename as per the bucketing scheme
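
To make the shape of the proposal concrete, here is a minimal sketch of what such an extractor could look like. Only the two class names come from this PR; the method names, their signatures, and the `<prefix>_<5-digit bucket id><extension>` layout are assumptions made for illustration, not the code in the patch.

```scala
// Sketch only: method names, signatures and the file-name layout are
// assumptions for illustration, not the actual code in this PR.
abstract class BucketingInfoExtractor extends Serializable {
  /** Returns the bucket id encoded in `fileName`, if there is one. */
  def getBucketId(fileName: String): Option[Int]

  /** Builds a file name for the given bucket id (hypothetical signature). */
  def bucketFileName(prefix: String, bucketId: Int, extension: String): String
}

class DefaultBucketingInfoExtractor extends BucketingInfoExtractor {
  // Assumed layout: <prefix>_<zero-padded bucket id><optional .extension>
  private val bucketedFileName = """.*_(\d+)(?:\..*)?$""".r

  override def getBucketId(fileName: String): Option[Int] = fileName match {
    case bucketedFileName(id) => Some(id.toInt)
    case _ => None
  }

  override def bucketFileName(prefix: String, bucketId: Int, extension: String): String =
    f"${prefix}_$bucketId%05d$extension"
}
```

With this sketch, generating a name and parsing it back should round-trip: `bucketFileName("part-00000", 3, ".parquet")` gives `part-00000_00003.parquet`, and `getBucketId` on that name gives `Some(3)`. A Hive impl would override both methods with Hive's own naming rules.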

TODO

  • Get rid of BucketingUtils completely
  • Introduce impl for Hive
  • Add tests

How was this patch tested?

TODO

@SparkQA

SparkQA commented Sep 10, 2016

Test build #65189 has finished for PR 15040 at commit fcf37f7.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class BucketingInfoExtractor extends Serializable
    • class DefaultBucketingInfoExtractor extends BucketingInfoExtractor

@tejasapatil changed the title from "[WIP] [SPARK-17487] [SQL] Configuragble bucketing info extraction" to "[WIP] [SPARK-17487] [SQL] Configurable bucketing info extraction" on Sep 10, 2016
@tejasapatil
Contributor Author

@cloud-fan : cc'ing you as you have a lot of context about bucketing in Spark. I am looking for early feedback on the approach taken in this change. I have included details in the PR description.

@cloud-fan
Contributor

BucketingInfoExtractor may be too flexible a concept. We only need a boolean flag to indicate whether it's Spark native bucketing or Hive bucketing, and I'm not sure how soon we will need to support bucketed tables from other systems.

@tejasapatil
Contributor Author

@cloud-fan : Would it be OK to add a field in CatalogTable to indicate whether a table is from Hive? For Hive tables, the hashing function also needs to be different when bucketing, so having such a field would help in that case as well.

@cloud-fan
Contributor

We are trying to remove the Hive dependency in Spark SQL, so I'm not sure if we should do this. cc @yhuai

@tejasapatil
Contributor Author

@cloud-fan : OK. It looks like the "add a field in CatalogTable" option won't be viable then. So should I move on with your advice of a "boolean flag to indicate it's a spark native bucketing or hive bucketing", or stick with "BucketingInfoExtractor"?

cc @yhuai
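
For comparison, the boolean-flag alternative discussed above might look roughly like the sketch below. Every name here (`FlaggedBucketSpec`, `isHiveBucketing`, `FlagBasedNaming`) is hypothetical, and both file-name layouts are assumptions: the Spark-style one appends `_<5-digit bucket id>` to a prefix, and the Hive-style one is assumed to be `<6-digit bucket id>_0`.

```scala
// Sketch only: all names are hypothetical and both naming layouts are
// assumptions; none of this is code from the PR or from Spark itself.
final case class FlaggedBucketSpec(
    numBuckets: Int,
    bucketColumnNames: Seq[String],
    isHiveBucketing: Boolean)

object FlagBasedNaming {
  /** Picks a file name for `bucketId` by branching on the flag. */
  def fileName(spec: FlaggedBucketSpec, prefix: String, bucketId: Int): String =
    if (spec.isHiveBucketing) f"$bucketId%06d_0"   // assumed Hive-style name
    else f"${prefix}_$bucketId%05d"                // assumed Spark-style name
}
```

The trade-off is visible in the sketch: the flag keeps the spec simple, but every place that names files (or hashes rows) has to branch on it, whereas an extractor keeps that logic in one class per scheme.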
