Conversation

@tdas
Contributor

@tdas tdas commented May 8, 2015

Having a SQLContext singleton would make it easier for applications to use a lazily instantiated, shared instance of SQLContext when needed. It would avoid problems like:

  1. In a REPL/notebook environment, rerunning the line {{val sqlContext = new SQLContext}} multiple times creates a different context each time while overwriting the reference to the previous one, leading to issues like registered temp tables going missing.
  2. In Streaming, creating a SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. To work around this problem I previously had to suggest creating a singleton instance - https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/SqlNetworkWordCount.scala

This can be solved by {{SQLContext.getOrCreate}}, which gets or creates a singleton instance of SQLContext using either a given SparkContext or a given SparkConf.

@rxin @marmbrus
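The getOrCreate behavior described above can be sketched in miniature. This is a self-contained illustration, not Spark's actual source: `SparkContext` and `SQLContext` here are bare stand-in classes, and only the two key ideas from the proposal are shown: the lazily created singleton guarded by a lock, and the constructor registering every new context as the most recent one.

```scala
import java.util.concurrent.atomic.AtomicReference

// Minimal stand-ins; the real Spark classes carry far more state.
class SparkContext(val appName: String)

class SQLContext(val sparkContext: SparkContext) {
  // As proposed in the PR: the constructor registers the new context as the
  // most recently instantiated one, so directly created contexts are also
  // picked up by getOrCreate.
  SQLContext.lastInstantiatedContext.set(this)
}

object SQLContext {
  private val INSTANTIATION_LOCK = new Object()
  private val lastInstantiatedContext = new AtomicReference[SQLContext]()

  /** Get the last created SQLContext, or create one from the given SparkContext. */
  def getOrCreate(sparkContext: SparkContext): SQLContext = {
    INSTANTIATION_LOCK.synchronized {
      if (lastInstantiatedContext.get() == null) {
        new SQLContext(sparkContext) // the constructor registers it as the singleton
      }
    }
    lastInstantiatedContext.get()
  }
}
```

With this shape, calling `getOrCreate` twice returns the same instance, and a context created directly with `new` becomes the new singleton.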

Contributor Author


@rxin Design question. Should contexts created directly (that is, not through getOrCreate) also be automatically considered for the singleton?

@SparkQA

SparkQA commented May 8, 2015

Test build #32220 has finished for PR 6006 at commit bc72868.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 8, 2015

Test build #32224 has finished for PR 6006 at commit f82ae81.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Contributor

marmbrus commented May 8, 2015

What about HiveContext?

@tdas
Contributor Author

tdas commented May 8, 2015

@marmbrus Good that you raised the question. @rxin and I had half a discussion offline about it. Correct me if I am wrong, but HiveContext is a superset of SQLContext in terms of functionality, so the last HiveContext created should also be accessible through this interface. Since the SQLContext constructor registers itself with the singleton, the HiveContext will do the same. This part made sense to me.

Going beyond that, if you think there should also be a HiveContext.getOrCreate, which would return the last instantiated HiveContext (not SQLContext), I can add that too in this PR.

@marmbrus
Contributor

marmbrus commented May 8, 2015

Yeah, the first part seems reasonable. But I think you also need a way to force getOrCreate to create a HiveContext.
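One possible shape for such a "force" is sketched below. This is a self-contained illustration with stand-in classes, not Spark's implementation, and the type-test approach is an assumption: `HiveContext.getOrCreate` reuses the registered singleton only if it already is a HiveContext, and otherwise creates (and thereby registers) one.

```scala
import java.util.concurrent.atomic.AtomicReference

// Illustrative stand-ins only; the real classes carry much more state.
class SparkContext
class SQLContext(val sparkContext: SparkContext) {
  SQLContext.lastInstantiatedContext.set(this) // constructor registers itself
}
class HiveContext(sc: SparkContext) extends SQLContext(sc)

object SQLContext {
  private[this] val lock = new Object()
  val lastInstantiatedContext = new AtomicReference[SQLContext]()

  def getOrCreate(sc: SparkContext): SQLContext = lock.synchronized {
    if (lastInstantiatedContext.get() == null) {
      new SQLContext(sc)
    }
    lastInstantiatedContext.get()
  }
}

object HiveContext {
  // Force a HiveContext: reuse the singleton only if it already is one;
  // otherwise create a new HiveContext, which re-registers as the singleton.
  // (A real version would take the same lock as SQLContext.getOrCreate.)
  def getOrCreate(sc: SparkContext): HiveContext =
    SQLContext.lastInstantiatedContext.get() match {
      case hive: HiveContext => hive
      case _ => new HiveContext(sc)
    }
}
```

As the squashed commit list at the end of this thread shows ("48adb14 ... Removed HiveContext.getOrCreate"), this variant ultimately did not survive into the merged patch.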

@tdas
Contributor Author

tdas commented May 8, 2015

test this please.

@tdas
Contributor Author

tdas commented May 8, 2015

So HiveContext.getOrCreate sounds good?

@SparkQA

SparkQA commented May 8, 2015

Test build #32264 has finished for PR 6006 at commit f82ae81.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 16, 2015

Test build #32880 has finished for PR 6006 at commit 83bc950.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

tdas added 3 commits May 15, 2015 23:54
@SparkQA

SparkQA commented May 16, 2015

Test build #32894 has finished for PR 6006 at commit b4e9721.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 16, 2015

Test build #32891 has finished for PR 6006 at commit d3ea8e4.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 16, 2015

Test build #32903 has finished for PR 6006 at commit dec5594.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 16, 2015

Test build #32904 has finished for PR 6006 at commit bf8cf50.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas
Contributor Author

tdas commented May 17, 2015

@marmbrus Can you help me debug this? HiveContextSuite seems to be failing due to the JDBC connection. What are the constraints on creating a HiveContext?

@scwf
Contributor

scwf commented May 17, 2015

HiveContext uses Derby as the metastore database by default, which only supports a single instance per metastore path.

I think you need to configure javax.jdo.option.ConnectionURL before creating the HiveContext. You could change getOrCreate like this:

  def getOrCreate(sparkContext: SparkContext, config: Map[String, String] = Map.empty): HiveContext = {
    INSTANTIATION_LOCK.synchronized {
      if (lastInstantiatedContext.get() == null) {
        // The constructor registers the new context as the singleton,
        // so no assignment is needed here.
        new HiveContext(sparkContext) {
          override def configure(): Map[String, String] = config
        }
      }
    }
    lastInstantiatedContext.get()
  }

Then, when you create the HiveContext in HiveContextSuite, you can configure it like this:

    val hiveContext = HiveContext.getOrCreate(testSparkContext, HiveContext.newTemporaryConfiguration)
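The thread does not show what `HiveContext.newTemporaryConfiguration` returns, so the following is only a hypothetical sketch of what such a helper plausibly supplies, built around the `javax.jdo.option.ConnectionURL` key mentioned above: point the Derby metastore at a fresh temporary directory so each HiveContext gets its own metastore path.

```scala
import java.nio.file.Files

// Hypothetical stand-in for a "temporary configuration" helper: each call
// yields a config that sends the Derby metastore to a fresh temp directory,
// avoiding Derby's one-instance-per-metastore-path restriction.
def newTemporaryConfiguration(): Map[String, String] = {
  val tempDir = Files.createTempDirectory("metastore").toFile
  Map(
    "javax.jdo.option.ConnectionURL" ->
      s"jdbc:derby:;databaseName=${tempDir.getAbsolutePath}/metastore_db;create=true"
  )
}
```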

tdas added 2 commits May 20, 2015 18:19
Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala
@SparkQA

SparkQA commented May 21, 2015

Test build #33199 has started for PR 6006 at commit 48adb14.

@tdas
Contributor Author

tdas commented May 21, 2015

@marmbrus Could you take a look?

@SparkQA

SparkQA commented May 21, 2015

Test build #33201 has finished for PR 6006 at commit c66ca76.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 21, 2015

Test build #33207 has finished for PR 6006 at commit 79fe069.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor


using the given SparkContext.

Contributor Author


Good catch.

@marmbrus
Contributor

Minor comments, otherwise LGTM.

@SparkQA

SparkQA commented May 21, 2015

Test build #33264 has finished for PR 6006 at commit 25f4da9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class TaskMemoryManager

@tdas
Contributor Author

tdas commented May 21, 2015

I addressed your comments and I am merging this. Thanks @marmbrus for reviewing!

asfgit pushed a commit that referenced this pull request May 21, 2015
Having a SQLContext singleton would make it easier for applications to use a lazily instantiated, shared instance of SQLContext when needed. It would avoid problems like:

1. In a REPL/notebook environment, rerunning the line {{val sqlContext = new SQLContext}} multiple times creates a different context each time while overwriting the reference to the previous one, leading to issues like registered temp tables going missing.

2. In Streaming, creating a SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. To work around this problem I previously had to suggest creating a singleton instance - https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/SqlNetworkWordCount.scala

This can be solved by {{SQLContext.getOrCreate}}, which gets or creates a singleton instance of SQLContext using either a given SparkContext or a given SparkConf.

rxin marmbrus

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #6006 from tdas/SPARK-7478 and squashes the following commits:

25f4da9 [Tathagata Das] Addressed comments.
79fe069 [Tathagata Das] Added comments.
c66ca76 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7478
48adb14 [Tathagata Das] Removed HiveContext.getOrCreate
bf8cf50 [Tathagata Das] Fix more bug
dec5594 [Tathagata Das] Fixed bug
b4e9721 [Tathagata Das] Remove unnecessary import
4ef513b [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7478
d3ea8e4 [Tathagata Das] Added HiveContext
83bc950 [Tathagata Das] Updated tests
f82ae81 [Tathagata Das] Fixed test
bc72868 [Tathagata Das] Added SQLContext.getOrCreate

(cherry picked from commit 3d0cccc)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
@asfgit asfgit closed this in 3d0cccc May 21, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
@jelez

jelez commented Mar 3, 2016

Can you guys point me to the proper way to use getOrCreate with HiveContext?

@mwws

mwws commented Mar 4, 2016

@jelez You can create a HiveContext singleton to work around it. Refer to the example "SqlNetworkWordCount".

@tdas Why did you remove HiveContext.getOrCreate? I can't find an obvious reason in the conversation.
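For reference, the singleton workaround mentioned above, in the style of the SqlNetworkWordCount example, follows roughly this shape (sketched here with stand-in classes instead of the real Spark ones, so it stays self-contained):

```scala
// Stand-ins for the real Spark classes, to keep the sketch self-contained.
class SparkContext
class SQLContext(val sparkContext: SparkContext)

// Lazily instantiated singleton, as in the SqlNetworkWordCount example:
// the instance is created on first use and reused afterwards, and the
// @transient field avoids capturing it in closure serialization.
object SQLContextSingleton {
  @transient private var instance: SQLContext = _

  def getInstance(sparkContext: SparkContext): SQLContext = synchronized {
    if (instance == null) {
      instance = new SQLContext(sparkContext)
    }
    instance
  }
}
```

Inside a streaming job, each batch would call `SQLContextSingleton.getInstance(rdd.sparkContext)` rather than constructing a new context.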
