
Unable to create pyspark's SparkSession object via embedding python code in scala application using Jep #324

Closed
bhupesh-simpledatalabs opened this issue Apr 28, 2021 · 4 comments

Comments

@bhupesh-simpledatalabs

I am trying to run sample Python code inside Scala code using Jep. In my Python code I simply create a SparkSession object via "SparkSession.builder.appName('name').master('local[1]').getOrCreate()" and execute this Python code via Jep using a SubInterpreter. I have also added pyspark as a shared module in the JepConfig used to create the SubInterpreter instance. My Scala code looks like this:

val jepConfig = new JepConfig
jepConfig.addSharedModules("pyspark")
val interpreter = new SubInterpreter(jepConfig)
interpreter.exec(
      """import site
        |import pyspark
        |from pyspark.sql import *
        |from pyspark.sql.functions import *
        |import threading
        |spark = SparkSession.builder.appName('name').master('local[1]').getOrCreate()
        |sc = spark.sparkContext
        |contextString = sc.getConf().toDebugString()
        |""".stripMargin)
print(interpreter.getValue("contextString"))

I am passing the necessary environment variables, listed below, and jep.jar is on the classpath:

-Djava.library.path=/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/jep/
PYSPARK_PYTHON=/usr/local/bin/python3
SPARK_HOME=/usr/local/Cellar/apache-spark/3.0.1
PYTHONPATH=/usr/local/Cellar/apache-spark/3.0.1/libexec/python:/usr/local/Cellar/apache-spark/3.0.1/libexec/python/lib/py4j-0.10.9-src.zip
PATH=/usr/local/Cellar/apache-spark/3.0.1/bin:/Library/Frameworks/Python.framework/Versions/3.9/bin:/usr/local/opt/scala@2.12/bin:/Users/bhupeshgoel/Documents/apache-maven-3.6.3/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin
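
As a sanity check, something like the sketch below could be run from the same JVM to confirm that the embedded interpreter actually sees these paths; it only uses exec, getValue and close from the Jep API, and is an assumption on my side rather than part of the failing application.

import jep.SharedInterpreter

// Diagnostic sketch: print the embedded interpreter's sys.path to verify that
// the PYTHONPATH entries above (Spark's python dir and the py4j zip) are visible.
val diag = new SharedInterpreter()
try {
  diag.exec("import sys")
  diag.exec("paths = '\\n'.join(sys.path)")
  println(diag.getValue("paths"))
} finally {
  diag.close()
}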

But still, when I run the above Scala code, I get a segmentation fault:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x000000012d633bc9, pid=92789, tid=0x0000000000001603
#
# JRE version: OpenJDK Runtime Environment (8.0_275-b01) (build 1.8.0_275-b01)
# Java VM: OpenJDK 64-Bit Server VM (25.275-b01 mixed mode bsd-amd64 compressed oops)
# Problematic frame:
# C  [Python+0x7bbc9]  PyModule_GetState+0x9
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /Users/bhupeshgoel/Documents/codebase/prophecy/hs_err_pid92789.log
#
# If you would like to submit a bug report, please visit:
#   https://github.com/AdoptOpenJDK/openjdk-support/issues
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

The detailed error report file is also attached to this ticket.

I wanted to know whether pyspark is supported with Jep, especially when running pyspark code inside Scala/Java code. I was able to create a SparkSession instance in a Jep interactive session:

Bhupeshs-MacBook-Pro:~ bhupeshgoel$ jep
>>> import pyspark
>>> from pyspark.sql import *
>>> from pyspark.sql.functions import *
>>> spark = SparkSession.builder.appName('name').master("local[1]").getOrCreate()
21/04/28 13:03:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
>>> lit(1)
Column<b'1'>

Other environment details:

  • OS Platform, Distribution, and Version: MacOS Catalina v10.15.7
  • Python Distribution and Version: python3.9
  • Java Distribution and Version: OpenJDK 1.8
  • Jep Version: 3.9.1
  • Python packages used (e.g. numpy, pandas, tensorflow): pyspark v3.1.1

hs_err_pid92789.log

@bsteffensmeier
Member

It is unusual that it works in the jep interactive session but not in your application. The biggest difference is that the interactive session uses a SharedInterpreter. Have you tried using a SharedInterpreter instead of a SubInterpreter?

@bhupesh-simpledatalabs
Author

When I simply switch to SharedInterpreter, I get the error below. I haven't changed any other environment variables; the same environment setup was used with SubInterpreter.

<class 'ModuleNotFoundError'>: No module named 'py4j.protocol'
jep.JepException: <class 'ModuleNotFoundError'>: No module named 'py4j.protocol'
	at /usr/local/Cellar/apache-spark/3.0.1/libexec/python/pyspark/context.<module>(context.py:27)
	at /usr/local/Cellar/apache-spark/3.0.1/libexec/python/pyspark/__init__.<module>(__init__.py:51)
	at <string>.<module>(<string>:5)
	at jep.Jep.exec(Native Method)
	at jep.Jep.exec(Jep.java:478)

My code with SharedInterpreter looks like this:

val interpreter = new SharedInterpreter()
interpreter.exec(
      """import site
        |import pyspark
        |from pyspark.sql import *
        |from pyspark.sql.functions import *
        |import threading
        |spark = SparkSession.builder.appName('name').master('local[1]').getOrCreate()
        |sc = spark.sparkContext
        |contextString = sc.getConf().toDebugString()
        |""".stripMargin)
print(interpreter.getValue("contextString"))
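
One thing I might try next (just a sketch, not something verified here) is to prepend the py4j zip that ships with Spark to sys.path before importing pyspark, since that zip is already on PYTHONPATH; installing py4j with pip into the interpreter's Python would be an alternative. Assuming the Homebrew Spark 3.0.1 paths listed earlier:

val interpreter = new SharedInterpreter()
interpreter.exec(
      """import sys
        |# assumed paths: Spark's python dir and the py4j zip from the PYTHONPATH above
        |sys.path.insert(0, "/usr/local/Cellar/apache-spark/3.0.1/libexec/python/lib/py4j-0.10.9-src.zip")
        |sys.path.insert(0, "/usr/local/Cellar/apache-spark/3.0.1/libexec/python")
        |import pyspark
        |from pyspark.sql import SparkSession
        |""".stripMargin)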

@bsteffensmeier
Member

Closing due to inactivity.

@ndjensen
Member

ndjensen commented Sep 5, 2024

Related to #548.
