
Unable to create pyspark's SparkSession object via embedding python code in scala application using Jep #324

Closed
bhupesh-simpledatalabs opened this issue Apr 28, 2021 · 4 comments

Comments

@bhupesh-simpledatalabs

I am trying to run sample Python code inside Scala code using Jep. In my Python code I simply create a SparkSession object via "SparkSession.builder.appName('name').master('local[1]').getOrCreate()" and execute this Python code via Jep using a SubInterpreter. I have also added pyspark as a shared module in the JepConfig used to create the SubInterpreter instance. My Scala code looks like this:

val jepConfig = new JepConfig
jepConfig.addSharedModules("pyspark")
val interpreter = new SubInterpreter(jepConfig)
interpreter.exec(
      """import site
        |import pyspark
        |from pyspark.sql import *
        |from pyspark.sql.functions import *
        |import threading
        |spark = SparkSession.builder.appName('name').master('local[1]').getOrCreate()
        |sc = spark.sparkContext
        |contextString = sc.getConf().toDebugString()
        |""".stripMargin)
print(interpreter.getValue("contextString"))

I am passing the necessary environment variables, listed below, and jep.jar is on the classpath:

-Djava.library.path=/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/jep/
PYSPARK_PYTHON=/usr/local/bin/python3
SPARK_HOME=/usr/local/Cellar/apache-spark/3.0.1
PYTHONPATH=/usr/local/Cellar/apache-spark/3.0.1/libexec/python:/usr/local/Cellar/apache-spark/3.0.1/libexec/python/lib/py4j-0.10.9-src.zip
PATH=/usr/local/Cellar/apache-spark/3.0.1/bin:/Library/Frameworks/Python.framework/Versions/3.9/bin:/usr/local/opt/scala@2.12/bin:/Users/bhupeshgoel/Documents/apache-maven-3.6.3/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin
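
As a sanity check, something like the sketch below could be run from the same JVM to confirm that the embedded interpreter actually sees these paths; it only uses exec, getValue and close from the Jep API, and is an assumption on my side rather than part of the failing application.

import jep.SharedInterpreter

// Diagnostic sketch: print the embedded interpreter's sys.path to verify that
// the PYTHONPATH entries above (Spark's python dir and the py4j zip) are visible.
val diag = new SharedInterpreter()
try {
  diag.exec("import sys")
  diag.exec("paths = '\\n'.join(sys.path)")
  println(diag.getValue("paths"))
} finally {
  diag.close()
}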

But still, when I run the above Scala code, I get a segmentation fault:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x000000012d633bc9, pid=92789, tid=0x0000000000001603
#
# JRE version: OpenJDK Runtime Environment (8.0_275-b01) (build 1.8.0_275-b01)
# Java VM: OpenJDK 64-Bit Server VM (25.275-b01 mixed mode bsd-amd64 compressed oops)
# Problematic frame:
# C  [Python+0x7bbc9]  PyModule_GetState+0x9
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /Users/bhupeshgoel/Documents/codebase/prophecy/hs_err_pid92789.log
#
# If you would like to submit a bug report, please visit:
#   https://github.com/AdoptOpenJDK/openjdk-support/issues
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

The detailed error report file is also attached to this ticket.

I wanted to know whether pyspark is supported with Jep, especially when running pyspark code inside Scala/Java code. I was able to create a SparkSession instance in a Jep interactive session:

Bhupeshs-MacBook-Pro:~ bhupeshgoel$ jep
>>> import pyspark
>>> from pyspark.sql import *
>>> from pyspark.sql.functions import *
>>> spark = SparkSession.builder.appName('name').master("local[1]").getOrCreate()
21/04/28 13:03:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
>>> lit(1)
Column<b'1'>

Other environment details:

  • OS Platform, Distribution, and Version: MacOS Catalina v10.15.7
  • Python Distribution and Version: python3.9
  • Java Distribution and Version: OpenJDK 1.8
  • Jep Version: 3.9.1
  • Python packages used (e.g. numpy, pandas, tensorflow): pyspark v3.1.1

hs_err_pid92789.log

@bsteffensmeier
Member

It is unusual that it works in the jep interactive session but not in your application. The biggest difference is that the interactive session uses a SharedInterpreter. Have you tried using a SharedInterpreter instead of a SubInterpreter?

@bhupesh-simpledatalabs
Author

When I simply switch to SharedInterpreter, I get the error below. I haven't changed any other environment variables; the same environment setup was used with SubInterpreter.

<class 'ModuleNotFoundError'>: No module named 'py4j.protocol'
jep.JepException: <class 'ModuleNotFoundError'>: No module named 'py4j.protocol'
	at /usr/local/Cellar/apache-spark/3.0.1/libexec/python/pyspark/context.<module>(context.py:27)
	at /usr/local/Cellar/apache-spark/3.0.1/libexec/python/pyspark/__init__.<module>(__init__.py:51)
	at <string>.<module>(<string>:5)
	at jep.Jep.exec(Native Method)
	at jep.Jep.exec(Jep.java:478)

My code with SharedInterpreter looks like this:

val interpreter = new SharedInterpreter()
interpreter.exec(
      """import site
        |import pyspark
        |from pyspark.sql import *
        |from pyspark.sql.functions import *
        |import threading
        |spark = SparkSession.builder.appName('name').master('local[1]').getOrCreate()
        |sc = spark.sparkContext
        |contextString = sc.getConf().toDebugString()
        |""".stripMargin)
print(interpreter.getValue("contextString"))
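
One thing I might try next (just a sketch, not something verified here) is to prepend the py4j zip that ships with Spark to sys.path before importing pyspark, since that zip is already on PYTHONPATH; installing py4j with pip into the interpreter's Python would be an alternative. Assuming the Homebrew Spark 3.0.1 paths listed earlier:

val interpreter = new SharedInterpreter()
interpreter.exec(
      """import sys
        |# assumed paths: Spark's python dir and the py4j zip from the PYTHONPATH above
        |sys.path.insert(0, "/usr/local/Cellar/apache-spark/3.0.1/libexec/python/lib/py4j-0.10.9-src.zip")
        |sys.path.insert(0, "/usr/local/Cellar/apache-spark/3.0.1/libexec/python")
        |import pyspark
        |from pyspark.sql import SparkSession
        |""".stripMargin)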

@bsteffensmeier
Member

Closing due to inactivity.

@ndjensen
Member

ndjensen commented Sep 5, 2024

Related to #548.
