How to use existing Spark installation #137
Comments
Looking into this more, it seems that the issue with my first attempt is that the …

Yes, we don't support glob patterns for dependencies. You may find additional info in the Readme.
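Since globs aren't supported, one workaround (just an illustrative sketch, not a kernel feature; the helper name and output format are my own) is to generate the individual @file:DependsOn lines from the jar directory and paste them into a cell:

    import java.io.File

    // Hypothetical helper: list every jar under a directory and print one
    // @file:DependsOn(...) line per jar, ready to paste into a notebook cell.
    fun dependsOnLines(dir: String): List<String> =
        File(dir).listFiles { f -> f.extension == "jar" }
            .orEmpty()
            .sortedBy { it.name }
            .map { """@file:DependsOn("${it.absolutePath}")""" }

    fun main() {
        dependsOnLines("/usr/hdp/3.1.0.53-1/spark2/jars").forEach(::println)
    }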
Great, option 2 works if I add all of the individual JARs to the …

Dependencies are JVM dependencies, not configs and other files. It seems that your Spark server should load this config itself when starting the session.
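A minimal sketch of what that can look like on the notebook side, assuming spark-core is already on the class path and using the spark-defaults.conf path from this cluster (confFromDefaults is just an illustrative name): the config file's key/value pairs are applied to the SparkConf directly instead of being added as a dependency.

    import java.io.File
    import java.util.Properties
    import org.apache.spark.SparkConf

    // Sketch only: config files such as spark-defaults.conf are not JVM dependencies,
    // so rather than putting /etc/spark2/conf on the class path, copy its settings
    // onto the SparkConf before the context is created.
    fun confFromDefaults(path: String = "/etc/spark2/conf/spark-defaults.conf"): SparkConf {
        val props = Properties().apply { File(path).inputStream().use { load(it) } }
        val conf = SparkConf().setAppName("Kotlin Notebook").setMaster("yarn")
        for (key in props.stringPropertyNames()) {
            conf.set(key, props.getProperty(key))
        }
        return conf
    }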
Alright, so it seems that I made progress. I'm now able to create a …

However, now I'm running into the problem that the worker nodes don't seem to be able to load the code that I'm writing in my notebook. So my question is: how can I distribute the compiled classes from my notebook to the workers? This is the code in my cell:

    import org.apache.spark.SparkConf
    import org.apache.spark.api.java.JavaSparkContext

    val conf = SparkConf().setAppName("Kotlin Notebook").setMaster("yarn")
    val sc = JavaSparkContext(conf)
    sc.parallelize(listOf(1, 2, 3)).map { it + 1 }.collect()

And the exception: …
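For code that is already packaged into a jar on the driver machine, Spark's generic mechanism is addJar (or the spark.jars setting); a sketch reusing the sc from the cell above, with a made-up jar path. On its own this does not cover the classes the notebook compiles on the fly, which is what the rest of the thread runs into:

    // Sketch: ship a pre-built jar (hypothetical path) to every executor.
    // Classes generated on the fly by the notebook are a separate problem.
    sc.addJar("/path/to/notebook-helpers.jar")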
Ah, I think I just learned what the magic line …
Ok, I think this actually did not end up solving the problem. Whenever I map custom code over an RDD, like here:

    sc.parallelize(listOf(1, 2, 3)).map { it + 1 }.collect()

I get the ClassNotFoundException: …

And it also seems to have issues with sending closures to the executors:

    val x = 3
    sc.parallelize(listOf(1, 2, 3)).map { it + x }.collect()

Exception: …

Is this an expected limitation of the way the cells are being interpreted, or is it an issue on my end? I guess I'm also still a bit unclear about what exactly …
I could fix the basic … However, some snippets still don't work:

    val x = 3
    sc.parallelize(listOf(1, 2, 3)).map {
        it + x
    }.collect()

Exception: …

This works, however:

    val go = {
        val x = 3
        sc.parallelize(listOf(1, 2, 3)).map {
            it + x
        }.collect()
    }
    go()

This again doesn't work:

    val go = {
        fun f(i: Int) = i + 3
        sc.parallelize(listOf(1, 2, 3)).map {
            f(it)
        }.collect()
    }
    go()

Exception: …
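One reading of these results (my interpretation, not confirmed in this thread): Spark serializes a lambda together with everything it captures, and a lambda written at the top level of a cell has to capture the generated cell class in order to reach x, whereas inside go the captured x is just a local value. The working pattern can also be written with Kotlin's run, which is equivalent to defining and immediately invoking go:

    // Same pattern as the working go() example: x is a local of the wrapping
    // lambda, so the map closure captures only an Int, not top-level cell state.
    val result = run {
        val x = 3
        sc.parallelize(listOf(1, 2, 3)).map { it + x }.collect()
    }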
I have an existing Spark installation on a Hadoop cluster that I'd like to use from the Kotlin notebook, and I'm having trouble depending on the existing jar files on the system. I've tried two things:
Modify Config
I have tried adding the respective folders to the config.json: …
The first 4 additional lines in the classPath section correspond to the class path that's used when I do spark-shell on the cluster:

    $ SPARK_PRINT_LAUNCH_COMMAND=true spark-shell
    Spark Command: /usr/java/openjdk-1.8.0_252/bin/java -Dhdp.version=3.1.0.53-1 -cp /etc/spark2/conf/alpaca-spark23-1.0.1.127-full.jar:/etc/spark2/conf/:/usr/hdp/3.1.0.53-1/spark2/jars/*:/etc/hadoop/conf/ -Dscala.usejavacp=true -Xmx1g -Dhdp.version=3.1.0.53-1 org.apache.spark.deploy.SparkSubmit --class org.apache.spark.repl.Main --name Spark shell spark-shell
Based on the kernel console output, the class path config seems to be picked up: …

But in my notebook I am unable to import anything from org.apache.spark.

DependsOn

The other thing I tried (based on an answer in #49) is directly importing the jars in the notebook:
    @file:DependsOn("/usr/hdp/3.1.0.53-1/spark2/jars/spark-core_2.11-2.3.2.3.1.0.53-1.jar")

This allows me to import e.g. org.apache.spark.SparkConf, but of course all of the transitive dependencies are missing. I'd like to write something like

    @file:DependsOn("/usr/hdp/3.1.0.53-1/spark2/jars/*")

but that doesn't work either because the glob pattern isn't recognized.
Any help greatly appreciated!