
updated ClassEnquirer.java to treat py4j as a Python package; this solves a py4j error when running pyspark scripts #548

Merged: 1 commit merged into ninia:dev_4.2 on Sep 16, 2024

Conversation

@NavodPeiris (Contributor)

This will solve the following error when running a pyspark script from Jep:

jep.JepException: <class 'ImportError'>: py4j.protocol

@bsteffensmeier (Member)

It's my understanding that py4j and Jep are incompatible, since they both expect to initialize Python and the JVM in different ways, so I don't understand the use case where this helps. Does this change enable you to use py4j and Jep in the same process?

@NavodPeiris (Contributor, Author)

Yes. I was able to successfully create a pyspark session using Jep and run it in the same process. Without this change, you can still run it using a custom ClassEnquirer:

package com.tests.unit

import jep._
import org.apache.spark.sql.{DataFrame, SparkSession}

object CustomClassEnquirer extends NamingConventionClassEnquirer {
  // Define the restricted package names
  val RESTRICTED_PKG_NAMES: Array[String] = Array("io", "re", "py4j")
}

object TestJep {

  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .appName("TestJep")
      .master("local[*]")
      .getOrCreate()

    val jepConfig = new JepConfig()
      .setClassLoader(getClass.getClassLoader)
      .addIncludePaths("D:\\Users\\npeiris2\\project\\python_scripts")
      .setClassEnquirer(CustomClassEnquirer)

    SharedInterpreter.setConfig(jepConfig)
    var jep: SharedInterpreter = null

    try {
      jep = new SharedInterpreter()
      val df: DataFrame = spark.read.option("header", "true").csv("D:\\Users\\npeiris2\\project\\csvs\\data.csv")

      // Convert the DataFrame to a JSON string and pass it into the interpreter
      val jsonString = df.toJSON.collect().mkString("[", ",", "]")
      jep.set("jsonString", jsonString)

      // Rebuild the data as a Spark DataFrame on the Python side
      jep.eval("import json")
      jep.eval("import pandas as pd")
      jep.eval("from pyspark.sql import SparkSession")
      jep.eval("data = json.loads(jsonString)")
      jep.eval("pandas_df = pd.DataFrame(data)")
      jep.eval("spark = SparkSession.builder.appName('TestJep').getOrCreate()")
      jep.eval("spark_df = spark.createDataFrame(pandas_df)")
      // process_dataframe.py on the include path is assumed to define a
      // function of the same name
      jep.eval("from process_dataframe import process_dataframe")
      val result = jep.getValue("process_dataframe(spark_df)")
      println(s"result: $result")
      jep.eval("spark.stop()")
    } catch {
      case e: JepException => e.printStackTrace()
    } finally {
      if (jep != null) jep.close()
    }
  }
}

But I think this is a bit laborious, and adding py4j to RESTRICTED_PKG_NAMES solves it.
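
For reference, a minimal sketch of what the setup could reduce to once py4j is in the restricted list (this assumes the default ClassEnquirer remains ClassList; the include path is a placeholder):

import jep.{JepConfig, SharedInterpreter}

object PostChangeExample {
  def main(args: Array[String]): Unit = {
    // With py4j in RESTRICTED_PKG_NAMES, the default ClassEnquirer already
    // treats py4j as a Python package, so no custom enquirer is needed.
    // The include path below stands in for wherever your Python scripts live.
    val config = new JepConfig().addIncludePaths("path/to/python_scripts")
    SharedInterpreter.setConfig(config)

    val interp = new SharedInterpreter()
    try {
      interp.eval("import py4j.protocol") // now resolves to the Python py4j package
    } finally {
      interp.close()
    }
  }
}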

@bsteffensmeier (Member)

Thank you for providing an example and helping me understand the background for this change. I do not think you need a custom class enquirer in your example. The NamingConventionClassEnquirer only treats a predefined list of top-level package names as Java packages, and since py4j is not in that list it would not be treated as one. You should be able to run your example using the NamingConventionClassEnquirer without any extension; in that case, solving the py4j import error only requires one extra line of code to set the ClassEnquirer.
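
Something like this (a minimal sketch of the one-line approach, assuming Jep 4.x's JepConfig API):

import jep.{JepConfig, NamingConventionClassEnquirer, SharedInterpreter}

object DefaultNamingExample {
  def main(args: Array[String]): Unit = {
    // The stock NamingConventionClassEnquirer only recognizes a fixed set of
    // top-level names (java, javax, com, org, ...) as Java packages, so
    // "import py4j" falls through to the Python package without subclassing.
    val config = new JepConfig()
      .setClassEnquirer(new NamingConventionClassEnquirer())
    SharedInterpreter.setConfig(config)
  }
}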

I am currently on the fence as to whether I think this change should be merged to Jep. This is not the first time pyspark has come up and it would be really nice to have that use case work with the default settings. But I am always hesitant to add code to Jep for specific use cases that we cannot test and will not officially support. The jep maintainers do not use pyspark or py4j and we would not test this compatibility in any future releases. Changes that support use cases we don't test make it hard to fix or refactor code because we won't know if we break something.

I'll keep thinking over it and also see if @ndjensen has an opinion.

@ndjensen (Member)

I am not opposed to this changeset since it helps some of our users, but I don't want to get in the habit of hardcoding modules to be excluded. This brings back to attention that we really need a ClassEnquirer that allows you to pass it a set of modules to exclude as Java packages, even if there are Java packages with the same package name on the classpath. Or put another way, a ClassEnquirer where you can explicitly declare certain packages as python packages, so they are not attempted to be imported from Java, regardless of what's on the classpath. I would like to write this new ClassEnquirer for Jep 4.2.1. I think we should keep this pull request open a bit longer and see if I can get to that new ClassEnquirer, cause it would make this change somewhat obsolete. But if you want to merge this change anyway, I am not opposed.
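
One possible shape for such an enquirer (a hedged sketch against Jep 4.x's ClassEnquirer interface; the class and parameter names here are hypothetical, not the API that later landed as #553): it delegates to an existing enquirer but forces a declared set of packages to resolve as Python.

import jep.{ClassEnquirer, ClassList}

class PythonPackageEnquirer(delegate: ClassEnquirer, pythonPackages: Set[String])
    extends ClassEnquirer {

  // True if name is one of the declared Python packages or a subpackage of one
  private def forcedPython(name: String): Boolean =
    pythonPackages.exists(p => name == p || name.startsWith(p + "."))

  // Declared Python packages are never reported as Java packages,
  // regardless of what is on the classpath
  override def isJavaPackage(name: String): Boolean =
    !forcedPython(name) && delegate.isJavaPackage(name)

  override def getClassNames(pkgName: String): Array[String] =
    if (forcedPython(pkgName)) Array.empty[String] else delegate.getClassNames(pkgName)

  override def getSubPackages(pkgName: String): Array[String] =
    if (forcedPython(pkgName)) Array.empty[String] else delegate.getSubPackages(pkgName)
}

// Usage, e.g.: new JepConfig().setClassEnquirer(
//   new PythonPackageEnquirer(ClassList.getInstance(), Set("py4j")))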

@bsteffensmeier changed the base branch from master to dev_4.2 on August 26, 2024.
@bsteffensmeier (Member) left a comment


I think it is worth merging this change so that spark users can have a better experience using jep.

@ndjensen (Member)

I added #553 to make it easier to force imports to resolve to the Python packages rather than the Java packages when using ClassList. Please let me know what you think.

@NavodPeiris (Contributor, Author)

Can we merge this? @ndjensen @bsteffensmeier

@bsteffensmeier merged commit 79b6fd2 into ninia:dev_4.2 on Sep 16, 2024 (1 check passed).
@bsteffensmeier (Member)

Merged, thanks. This will be included in the 4.2.1 release, which will come out in the next month.
