Conversation

@tdas
Contributor

@tdas tdas commented May 8, 2015

Having a SQLContext singleton would make it easier for applications to use a lazily instantiated, shared instance of SQLContext when needed. It would avoid problems like:

  1. In a REPL/notebook environment, rerunning the line {{val sqlContext = new SQLContext}} multiple times creates a different context each time while overwriting the reference to the previous one, leading to issues like registered temp tables going missing.
  2. In Streaming, creating a SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. To work around this problem I previously had to suggest creating a singleton instance - https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/SqlNetworkWordCount.scala

This can be solved by {{SQLContext.getOrCreate}}, which gets or creates a singleton instance of SQLContext using either a given SparkContext or a given SparkConf.

@rxin @marmbrus
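The getOrCreate behavior described above can be sketched in miniature. This is a self-contained illustration, not Spark's actual source: `SparkContext` and `SQLContext` here are bare stand-in classes, and only the two key ideas from the proposal are shown: the lazily created singleton guarded by a lock, and the constructor registering every new context as the most recent one.

```scala
import java.util.concurrent.atomic.AtomicReference

// Minimal stand-ins; the real Spark classes carry far more state.
class SparkContext(val appName: String)

class SQLContext(val sparkContext: SparkContext) {
  // As proposed in the PR: the constructor registers the new context as the
  // most recently instantiated one, so directly created contexts are also
  // picked up by getOrCreate.
  SQLContext.lastInstantiatedContext.set(this)
}

object SQLContext {
  private val INSTANTIATION_LOCK = new Object()
  private val lastInstantiatedContext = new AtomicReference[SQLContext]()

  /** Get the last created SQLContext, or create one from the given SparkContext. */
  def getOrCreate(sparkContext: SparkContext): SQLContext = {
    INSTANTIATION_LOCK.synchronized {
      if (lastInstantiatedContext.get() == null) {
        new SQLContext(sparkContext) // the constructor registers it as the singleton
      }
    }
    lastInstantiatedContext.get()
  }
}
```

With this shape, calling `getOrCreate` twice returns the same instance, and a context created directly with `new` becomes the new singleton.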

Contributor Author


@rxin Design question. Should contexts created directly (that is, not through getOrCreate) also be automatically considered for the singleton?

@SparkQA

SparkQA commented May 8, 2015

Test build #32220 has finished for PR 6006 at commit bc72868.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 8, 2015

Test build #32224 has finished for PR 6006 at commit f82ae81.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Contributor

marmbrus commented May 8, 2015

What about HiveContext?

@tdas
Contributor Author

tdas commented May 8, 2015

@marmbrus Good that you raised the question. @rxin and I had half a discussion offline about it. Correct me if I am wrong, but HiveContext is a superset of SQLContext in terms of functionality, so the last HiveContext created should also be accessible through this interface. Since the SQLContext constructor registers itself with the singleton, the HiveContext will do the same. This part made sense to me.

Going beyond that, if you think there should also be a HiveContext.getOrCreate, which would return the last instantiated HiveContext (not SQLContext), I can add that too in this PR.

@marmbrus
Contributor

marmbrus commented May 8, 2015

Yeah, the first part seems reasonable. But I think you also need a way to force getOrCreate to create a HiveContext.
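One possible shape for such a "force" is sketched below. This is a self-contained illustration with stand-in classes, not Spark's implementation, and the type-test approach is an assumption: `HiveContext.getOrCreate` reuses the registered singleton only if it already is a HiveContext, and otherwise creates (and thereby registers) one.

```scala
import java.util.concurrent.atomic.AtomicReference

// Illustrative stand-ins only; the real classes carry much more state.
class SparkContext
class SQLContext(val sparkContext: SparkContext) {
  SQLContext.lastInstantiatedContext.set(this) // constructor registers itself
}
class HiveContext(sc: SparkContext) extends SQLContext(sc)

object SQLContext {
  private[this] val lock = new Object()
  val lastInstantiatedContext = new AtomicReference[SQLContext]()

  def getOrCreate(sc: SparkContext): SQLContext = lock.synchronized {
    if (lastInstantiatedContext.get() == null) {
      new SQLContext(sc)
    }
    lastInstantiatedContext.get()
  }
}

object HiveContext {
  // Force a HiveContext: reuse the singleton only if it already is one;
  // otherwise create a new HiveContext, which re-registers as the singleton.
  // (A real version would take the same lock as SQLContext.getOrCreate.)
  def getOrCreate(sc: SparkContext): HiveContext =
    SQLContext.lastInstantiatedContext.get() match {
      case hive: HiveContext => hive
      case _ => new HiveContext(sc)
    }
}
```

As the squashed commit list at the end of this thread shows ("48adb14 ... Removed HiveContext.getOrCreate"), this variant ultimately did not survive into the merged patch.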

@tdas
Contributor Author

tdas commented May 8, 2015

test this please.

@tdas
Contributor Author

tdas commented May 8, 2015

So HiveContext.getOrCreate sounds good?

@SparkQA

SparkQA commented May 8, 2015

Test build #32264 has finished for PR 6006 at commit f82ae81.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 16, 2015

Test build #32880 has finished for PR 6006 at commit 83bc950.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

tdas added 3 commits May 15, 2015 23:54
@SparkQA

SparkQA commented May 16, 2015

Test build #32894 has finished for PR 6006 at commit b4e9721.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 16, 2015

Test build #32891 has finished for PR 6006 at commit d3ea8e4.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 16, 2015

Test build #32903 has finished for PR 6006 at commit dec5594.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 16, 2015

Test build #32904 has finished for PR 6006 at commit bf8cf50.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas
Contributor Author

tdas commented May 17, 2015

@marmbrus Can you help me debug this? HiveContextSuite seems to be failing due to the JDBC connection. What are the constraints on creating a HiveContext?

@scwf
Contributor

scwf commented May 17, 2015

HiveContext uses Derby as the metastore database by default, which only supports a single instance per metastore path.

I think you need to configure javax.jdo.option.ConnectionURL before creating the HiveContext. You could change getOrCreate like this:

  def getOrCreate(sparkContext: SparkContext, config: Map[String, String] = Map.empty): HiveContext = {
    INSTANTIATION_LOCK.synchronized {
      if (lastInstantiatedContext.get() == null) {
        // The constructor registers the new context as the singleton,
        // so no assignment is needed here.
        new HiveContext(sparkContext) {
          override def configure(): Map[String, String] = config
        }
      }
    }
    lastInstantiatedContext.get()
  }

Then, when you create the HiveContext in HiveContextSuite, you can configure it like this:

    val hiveContext = HiveContext.getOrCreate(testSparkContext, HiveContext.newTemporaryConfiguration)
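The thread does not show what `HiveContext.newTemporaryConfiguration` returns, so the following is only a hypothetical sketch of what such a helper plausibly supplies, built around the `javax.jdo.option.ConnectionURL` key mentioned above: point the Derby metastore at a fresh temporary directory so each HiveContext gets its own metastore path.

```scala
import java.nio.file.Files

// Hypothetical stand-in for a "temporary configuration" helper: each call
// yields a config that sends the Derby metastore to a fresh temp directory,
// avoiding Derby's one-instance-per-metastore-path restriction.
def newTemporaryConfiguration(): Map[String, String] = {
  val tempDir = Files.createTempDirectory("metastore").toFile
  Map(
    "javax.jdo.option.ConnectionURL" ->
      s"jdbc:derby:;databaseName=${tempDir.getAbsolutePath}/metastore_db;create=true"
  )
}
```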

tdas added 2 commits May 20, 2015 18:19
Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala
@SparkQA

SparkQA commented May 21, 2015

Test build #33199 has started for PR 6006 at commit 48adb14.

@tdas
Contributor Author

tdas commented May 21, 2015

@marmbrus Could you take a look?

@SparkQA

SparkQA commented May 21, 2015

Test build #33201 has finished for PR 6006 at commit c66ca76.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 21, 2015

Test build #33207 has finished for PR 6006 at commit 79fe069.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor


using the given SparkContext.

Contributor Author


Good catch.

@marmbrus
Contributor

Minor comments, otherwise LGTM.

@SparkQA

SparkQA commented May 21, 2015

Test build #33264 has finished for PR 6006 at commit 25f4da9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class TaskMemoryManager

@tdas
Contributor Author

tdas commented May 21, 2015

I addressed your comments and I am merging this. Thanks @marmbrus for reviewing!

asfgit pushed a commit that referenced this pull request May 21, 2015
Having a SQLContext singleton would make it easier for applications to use a lazily instantiated, shared instance of SQLContext when needed. It would avoid problems like:

1. In a REPL/notebook environment, rerunning the line {{val sqlContext = new SQLContext}} multiple times creates a different context each time while overwriting the reference to the previous one, leading to issues like registered temp tables going missing.

2. In Streaming, creating a SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. To work around this problem I previously had to suggest creating a singleton instance - https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/SqlNetworkWordCount.scala

This can be solved by {{SQLContext.getOrCreate}}, which gets or creates a singleton instance of SQLContext using either a given SparkContext or a given SparkConf.

rxin marmbrus

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #6006 from tdas/SPARK-7478 and squashes the following commits:

25f4da9 [Tathagata Das] Addressed comments.
79fe069 [Tathagata Das] Added comments.
c66ca76 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7478
48adb14 [Tathagata Das] Removed HiveContext.getOrCreate
bf8cf50 [Tathagata Das] Fix more bug
dec5594 [Tathagata Das] Fixed bug
b4e9721 [Tathagata Das] Remove unnecessary import
4ef513b [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7478
d3ea8e4 [Tathagata Das] Added HiveContext
83bc950 [Tathagata Das] Updated tests
f82ae81 [Tathagata Das] Fixed test
bc72868 [Tathagata Das] Added SQLContext.getOrCreate

(cherry picked from commit 3d0cccc)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
@asfgit asfgit closed this in 3d0cccc May 21, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
@jelez

jelez commented Mar 3, 2016

Can you guys point me to the proper way to use getOrCreate with HiveContext?

@mwws

mwws commented Mar 4, 2016

@jelez You can create a HiveContext singleton to work around it. Refer to the example "SqlNetworkWordCount".

@tdas Why did you remove HiveContext.getOrCreate? I can't find an obvious reason in the conversation.
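For reference, the singleton workaround mentioned above, in the style of the SqlNetworkWordCount example, follows roughly this shape (sketched here with stand-in classes instead of the real Spark ones, so it stays self-contained):

```scala
// Stand-ins for the real Spark classes, to keep the sketch self-contained.
class SparkContext
class SQLContext(val sparkContext: SparkContext)

// Lazily instantiated singleton, as in the SqlNetworkWordCount example:
// the instance is created on first use and reused afterwards, and the
// @transient field avoids capturing it in closure serialization.
object SQLContextSingleton {
  @transient private var instance: SQLContext = _

  def getInstance(sparkContext: SparkContext): SQLContext = synchronized {
    if (instance == null) {
      instance = new SQLContext(sparkContext)
    }
    instance
  }
}
```

Inside a streaming job, each batch would call `SQLContextSingleton.getInstance(rdd.sparkContext)` rather than constructing a new context.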
