@rxin
Contributor

@rxin rxin commented May 2, 2016

What changes were proposed in this pull request?

This patch creates a builder pattern for creating SparkSession. The new code is unused and mostly dead code. I'm putting it up here for feedback.

There are a few TODOs that can be done as follow-up pull requests:

  • Update tests to use this
  • Update examples to use this
  • Clean up SQLContext code w.r.t. this one (i.e. SparkSession shouldn't call into SQLContext.getOrCreate; it should be the other way around)
  • Remove SparkSession.withHiveSupport
  • Disable the old constructor (by making it private) so the only way to start a SparkSession is through this builder pattern

How was this patch tested?

Part of the future pull request is to clean this up and switch existing tests to use this.

@rxin
Contributor Author

rxin commented May 2, 2016

cc @yhuai @andrewor14 and also cc @dongjoon-hyun since you have been working on the example files

@dongjoon-hyun
Member

Thank you for notifying me. It looks good to me. So the three-line pattern will be replaced by one factory statement, right?

Spark 1.x

val conf = new SparkConf().setMaster("local[4]").setAppName("App")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

Spark 2.0

val spark = SparkSession.builder().master("local").config("spark.some.config.option", "some-value").getOrCreate()

@rxin
Contributor Author

rxin commented May 2, 2016

Yes. Technically we don't really reduce the line length, but it definitely reduces the number of concepts people need to use if they are just using DataFrame/Dataset.

@dongjoon-hyun
Member

Yes, right. This also removes the need to import SparkConf and SparkContext for those people. It becomes much simpler. Cool. I will update my PR accordingly.
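
For readers following the exchange above, here is a minimal plain-Scala sketch of the builder plus getOrCreate idea. It does not use Spark: `ToySession`, its `Builder`, and the `toy.*` keys are hypothetical stand-ins, and the real SparkSession builder may behave differently (for example in how it treats options when a session already exists).

```scala
// Plain-Scala sketch: ToySession is a hypothetical stand-in, not Spark's SparkSession.
final class ToySession(val settings: Map[String, String])

object ToySession {
  // The one shared session, created by the first getOrCreate() call.
  private var active: Option[ToySession] = None

  final class Builder {
    private var options = Map.empty[String, String]

    // Each setter records an option and returns the builder so calls can be chained.
    def config(key: String, value: String): Builder = {
      options = options + (key -> value)
      this
    }
    def master(url: String): Builder = config("toy.master", url)
    def appName(name: String): Builder = config("toy.app.name", name)

    // Return the existing session if there is one; otherwise build it from the options.
    def getOrCreate(): ToySession = ToySession.synchronized {
      active.getOrElse {
        val created = new ToySession(options)
        active = Some(created)
        created
      }
    }
  }

  def builder(): Builder = new Builder
}

object ToySessionDemo {
  def main(args: Array[String]): Unit = {
    val first = ToySession.builder().master("local").appName("App").getOrCreate()
    val second = ToySession.builder().getOrCreate()
    println(first eq second) // the second call reuses the first session
    println(first.settings("toy.master"))
  }
}
```

The point of the sketch is the one Reynold makes: the statement is not shorter, but a user touches a single entry point instead of three separate classes.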

*/
class Builder {

private[this] val options = new scala.collection.mutable.HashMap[String, String]
Member

What about using j.u.c.ConcurrentHashMap?

Contributor Author

It creates a lot of garbage for something that's not expected to be concurrent.

Member

Oh, what I meant was moving the locking point from the Builder instance into options. I thought only getOrCreate needs to lock on the Builder instance.

Member

But forget about my comments. The Builder is so simple, and the current implementation is solid, too.
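
To make the locking trade-off in this thread concrete, here is a small sketch of the approach the patch appears to take: a plain `mutable.HashMap` whose every access goes through the builder's own `synchronized` block, rather than a `ConcurrentHashMap`. `LockedBuilder` and `snapshot()` are hypothetical names used only for illustration; they are not part of the patch.

```scala
import scala.collection.mutable

// Hypothetical builder, shown only to illustrate the locking choice discussed above.
final class LockedBuilder {
  // A plain HashMap is enough because every access happens under the builder's lock.
  private[this] val options = new mutable.HashMap[String, String]

  def config(key: String, value: String): LockedBuilder = synchronized {
    options(key) = value
    this
  }

  // Take a consistent immutable copy of the options under the same lock.
  def snapshot(): Map[String, String] = synchronized {
    options.toMap
  }
}

object LockedBuilderDemo {
  def main(args: Array[String]): Unit = {
    val builder = new LockedBuilder
    // Four threads each set a distinct key on the shared builder.
    val threads = (1 to 4).map { i =>
      new Thread(() => { builder.config(s"key.$i", s"value-$i"); () })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
    println(builder.snapshot().size)
  }
}
```

Locking on the builder keeps a single lock for both the setters and any read of the accumulated options, which is simple to reason about for an object that is normally used from one thread.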

@SparkQA

SparkQA commented May 2, 2016

Test build #57501 has finished for PR 12830 at commit 8172d91.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* Enables Hive support, including connectivity to a persistent Hive metastore, support for
* Hive serdes, and Hive user-defined functions.
*
* @return 2.0.0
Contributor

nope!

@andrewor14
Contributor

LGTM. This is beautiful.

@rxin
Contributor Author

rxin commented May 2, 2016

Thanks - going to merge this. I added removing the existing withHiveSupport as a TODO in the PR description.

@asfgit asfgit closed this in ca1b219 May 2, 2016
asfgit pushed a commit that referenced this pull request May 2, 2016

Author: Reynold Xin <rxin@databricks.com>

Closes #12830 from rxin/sparksession-builder.

(cherry picked from commit ca1b219)
Signed-off-by: Reynold Xin <rxin@databricks.com>
*
* @since 2.0.0
*/
def config(key: String, value: Long): Builder = synchronized {
Contributor

What about other primitive types for the value: Int, Float, Short?

Contributor Author

They don't matter as they just map into Long / Double.
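
A short sketch of why the narrower numeric types need no overloads of their own, assuming the builder offers String, Long, and Double variants of `config`. `TypedBuilder` and `get` are hypothetical names for illustration. Scala's numeric widening lets an `Int` or `Short` argument resolve to the `Long` overload and a `Float` argument to the `Double` overload.

```scala
import scala.collection.mutable

// Hypothetical builder with String, Long, and Double overloads of config.
final class TypedBuilder {
  private[this] val options = new mutable.HashMap[String, String]

  def config(key: String, value: String): TypedBuilder = synchronized {
    options(key) = value
    this
  }
  // Numeric values are stored as their string form.
  def config(key: String, value: Long): TypedBuilder = config(key, value.toString)
  def config(key: String, value: Double): TypedBuilder = config(key, value.toString)

  def get(key: String): Option[String] = synchronized { options.get(key) }
}

object TypedBuilderDemo {
  def main(args: Array[String]): Unit = {
    val i: Int = 4
    val s: Short = 2
    val f: Float = 1.5f
    val builder = new TypedBuilder()
      .config("int.value", i)   // Int widens to Long
      .config("short.value", s) // Short widens to Long
      .config("float.value", f) // Float widens to Double
    println(builder.get("int.value"))
    println(builder.get("float.value"))
  }
}
```

When both the `Long` and `Double` overloads are applicable to an integral argument, overload resolution prefers `Long` as the more specific alternative, so integral values keep their integer string form.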

@SparkQA

SparkQA commented May 3, 2016

Test build #57566 has finished for PR 12830 at commit 0005a3d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

/**
* Builder for [[SparkSession]].
*/
class Builder {
Contributor

How about adding a clear() method so that a Builder instance can be reused?
