[WIP] [SPARK-17487] [SQL] Configurable bucketing info extraction #15040

tejasapatil · 2016-09-10T03:24:16Z

What changes were proposed in this pull request?

I am looking for early feedback about this change wrt approach.

Here is what this PR contains:

Introduced BucketingInfoExtractor, which has functions to:
- extract the bucket id from a given filename
- generate the filename, given a bucket id and other info
BucketingInfoExtractor is a part of BucketSpec, since the same session can process both native Spark tables and pure Hive tables
Provided a default impl which adheres to Spark's current naming scheme
All codepaths which write a file now use the BucketingInfoExtractor to get a filename as per the bucketing scheme
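
To make the shape of the abstraction concrete, here is a minimal sketch of what such an extractor could look like. The method names, parameters, and the `_NNNNN` suffix convention are assumptions for illustration only; they are not copied from the patch, and the real signatures may differ.

```scala
import scala.util.Try

// Hypothetical interface: the method names and parameters are illustrative assumptions.
abstract class BucketingInfoExtractor extends Serializable {
  /** Returns the bucket id encoded in `fileName`, or None if the name carries none. */
  def getBucketId(fileName: String): Option[Int]

  /** Builds a file name for `bucketId` from a base name and a file extension. */
  def getFileName(baseName: String, bucketId: Int, extension: String): String
}

// Assumed default scheme: the bucket id is a zero-padded "_NNNNN" suffix
// placed just before the file extension.
class DefaultBucketingInfoExtractor extends BucketingInfoExtractor {
  // Captures the digits after the last underscore, allowing an optional extension.
  private val bucketedFileName = """.*_(\d+)(?:\..*)?$""".r

  override def getBucketId(fileName: String): Option[Int] = fileName match {
    case bucketedFileName(id) => Try(id.toInt).toOption
    case _ => None
  }

  override def getFileName(baseName: String, bucketId: Int, extension: String): String =
    f"${baseName}_$bucketId%05d$extension"
}
```

Under these assumptions, `getFileName("part-00000-abc", 3, ".parquet")` and `getBucketId` round-trip: the generated name yields bucket id 3 again, while a name without the suffix yields `None`. A Hive-specific impl would override both methods with Hive's own naming convention.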

TODO

Get rid of BucketingUtils completely
Introduce impl for Hive
Add tests

How was this patch tested?

TODO

SparkQA · 2016-09-10T03:31:59Z

Test build #65189 has finished for PR 15040 at commit fcf37f7.

This patch fails to build.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- abstract class BucketingInfoExtractor extends Serializable
- class DefaultBucketingInfoExtractor extends BucketingInfoExtractor

tejasapatil · 2016-09-10T23:28:19Z

@cloud-fan : cc'ing you as you have a lot of context about bucketing in Spark. I am looking for early feedback about this change wrt approach. I have included details in the PR description.

cloud-fan · 2016-09-12T08:35:50Z

BucketingInfoExtractor may be too flexible a concept. We only need a boolean flag to indicate whether it's Spark native bucketing or Hive bucketing, and I'm not sure how soon we'll need to support bucketed tables from other systems.

tejasapatil · 2016-09-12T16:20:29Z

@cloud-fan : Would it be ok to add a field in CatalogTable to indicate whether a table is from Hive? For Hive tables, the hashing function used for bucketing also needs to be different, so having such a field would help in that case as well.

cloud-fan · 2016-09-13T08:25:35Z

We are trying to remove the Hive dependency in Spark SQL, so I'm not sure if we should do this. cc @yhuai

tejasapatil · 2016-09-13T21:52:42Z

@cloud-fan : Ok. Looks like the "add a field in CatalogTable" option won't be viable then. So should I move ahead with your advice of a "boolean flag to indicate it's a spark native bucketing or hive bucketing", or stick with "BucketingInfoExtractor"?

cc @yhuai

Configuragble bucketing info extraction

fcf37f7

tejasapatil changed the title ~~[WIP] [SPARK-17487] [SQL] Configuragble bucketing info extraction~~ [WIP] [SPARK-17487] [SQL] Configurable bucketing info extraction Sep 10, 2016

tejasapatil mentioned this pull request Sep 29, 2016

[SPARK-17729] [SQL] Enable creating hive bucketed tables #15300

Closed

tejasapatil closed this Jan 22, 2017

tejasapatil mentioned this pull request Apr 15, 2017

[SPARK-17729] [SQL] Enable creating hive bucketed tables #17644

Closed
