This repository has been archived by the owner on Jan 9, 2020. It is now read-only.

Modified Python Dockerfiles to allow for the submission of Python applications without the --jars parameter #464

Open · wants to merge 4 commits into branch-2.2-kubernetes

Conversation

@sahilprasad commented Aug 25, 2017

Modified Python Dockerfiles to allow for the submission of Python applications without the --jars parameter. Fixes #409.

Not sure why the single wildcard path provided to -cp does not work, but this fixes it. It may also add extensibility: a SPARK_CLASSPATH could be provided in a custom-built spark-base image without requiring modification of the individual driver and executor images.

cc @erikerlandson @ifilonenko @mccheah

@mccheah commented Aug 25, 2017

> Not sure why the single wildcard path provided to -cp does not work, but this fixes it. It may also add extensibility: a SPARK_CLASSPATH could be provided in a custom-built spark-base image without requiring modification of the individual driver and executor images.

This should be handled by SPARK_EXTRA_CLASSPATH - SPARK_CLASSPATH isn't expected to be overridden by other Docker images. Although it looks like SPARK_EXTRA_CLASSPATH isn't supported in executors - this was supposed to have been done in https://github.com/apache-spark-on-k8s/spark/pull/383/files but I think we lost it in a bad merge somewhere.
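For illustration, one hypothetical way a derived image could feed extra jars through such a variable (the exact wiring in #383 may differ; paths are made up):

# Hypothetical: a custom image built FROM spark-base publishes its jars...
ENV SPARK_EXTRA_CLASSPATH /opt/extra-jars/*
# ...and the driver/executor CMD folds them into the classpath it hands to java:
CMD SPARK_CLASSPATH="${SPARK_HOME}/jars/*:${SPARK_EXTRA_CLASSPATH}" && \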

@mccheah commented Aug 25, 2017

I'd also like to trace down the specific problem the existing shell command is hitting.

@sahilprasad (Author)

@mccheah If it helps, I was able to replicate the error by using kubectl run on a driver pod and executing a stripped-down version of the existing shell command.

${JAVA_HOME}/bin/java -cp /opt/spark/jars/* com.google.gson.Gson gives me the error that we are currently seeing. SPARK_CLASSPATH resolves to /opt/spark/jars/* and the gson class is just a random class I picked out of the selected jars.

After playing around with this command, I got it to work by just adding a colon in front of the classpath — :/opt/spark/jars/*.
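A condensed version of that experiment (the jar named in the error is illustrative; any class from the bundled jars works in place of com.google.gson.Gson):

${JAVA_HOME}/bin/java -cp /opt/spark/jars/* com.google.gson.Gson
# Bash expands the glob, so java takes the first jar as the classpath and the
# second jar as the main class, e.g.:
# Error: Could not find or load main class /opt/spark/jars/antlr-2.7.7.jar

${JAVA_HOME}/bin/java -cp :/opt/spark/jars/* com.google.gson.Gson
# With the leading colon the pattern matches no actual path, so bash leaves it
# unexpanded and java expands the wildcard entry itself.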

@mccheah commented Aug 25, 2017

I suppose wildcards aren't being expanded out properly. I wonder if we can instead use ls to list the directory and then join the paths together with :, and pass that as SPARK_CLASSPATH. Alternatively, if we know exactly why prepending : works and have it documented then that will suffice.
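A sketch of that ls-based alternative (untested; assumes SPARK_HOME is already set in the image):

# Build an explicit colon-joined classpath instead of relying on wildcards.
SPARK_CLASSPATH=$(ls "${SPARK_HOME}"/jars/*.jar | paste -sd: -)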

@sahilprasad (Author)

Yeah, that's the problem. Just tried again with -cp /opt/spark/jars/"*" (just putting quotes around the asterisk) and it resolves properly. Should I forgo the colon for this solution?

@mccheah commented Aug 25, 2017

I'd like to know the rules of thumb for how Java is treating these inputs and how Bash is expanding them. That said, we had a tough time understanding issues like this in e.g. #444.

@sahilprasad (Author)

I'm not too sure myself. It might be a difference between how Bash and the JRE handle wildcards. Quoting the wildcard most likely delegates expansion to the JRE, and adding a colon forces the JRE to parse the provided classpath and then expand the parsed wildcard entries.
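A quick shell experiment that separates the two behaviors:

echo /opt/spark/jars/*      # unquoted: bash expands the glob, one word per jar
echo "/opt/spark/jars/*"    # quoted: the literal string survives; java's own
                            # classpath wildcard support (JDK 6+) expands dir/*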

Maybe @ash211 or @erikerlandson could provide some insight on why this is happening and/or what the best way to patch this would be (colons vs. asterisk)?

@erikerlandson (Member)

FYI, with #462 we should be able to see exactly what command is ultimately being executed by bash from the pod console outputs.

@sahilprasad (Author)

@erikerlandson I may be wrong, but I don't think set -x resolves environment variable values.

@ifilonenko (Member)

@erikerlandson @mccheah @sahilprasad any progress on this PR?

@sahilprasad (Author)

@ifilonenko @mccheah @erikerlandson I'm partial to just adding quotes. It's cleaner and solves the submission errors.

@@ -38,7 +38,7 @@ ENV PYSPARK_PYTHON python
 ENV PYSPARK_DRIVER_PYTHON python
 ENV PYTHONPATH ${SPARK_HOME}/python/:${SPARK_HOME}/python/lib/py4j-0.10.4-src.zip:${PYTHONPATH}

-CMD SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && \
+CMD SPARK_CLASSPATH="$SPARK_CLASSPATH:${SPARK_HOME}/jars/*" && \
Review comment (Member):

Maybe put {} around SPARK_CLASSPATH?
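i.e., presumably:

CMD SPARK_CLASSPATH="${SPARK_CLASSPATH}:${SPARK_HOME}/jars/*" && \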

@sahilprasad (Author)

@ssuchter @erikerlandson @ifilonenko thoughts on my latest commit?

@ssuchter (Member) left a comment

Seems like a fine way to do it to me, assuming the testing works.

@mccheah commented Sep 28, 2017

Integration tests are broken - "Could not find or load main class ...PythonRunner..." I wonder what's going on there?

@sahilprasad (Author)

@mccheah @ssuchter just pushed a version that should fix the failing integration tests. I opted for prepending a :, since assigning a wildcard to a variable in bash turns out to be a whole different mess than simply evaluating one.
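For anyone following along, a sketch of that distinction (Foo stands in for any main class):

SPARK_CLASSPATH="${SPARK_HOME}/jars/*"   # quotes only stop expansion here, at assignment
java -cp $SPARK_CLASSPATH Foo            # the stored pattern still globs at this use site
SPARK_CLASSPATH=":${SPARK_HOME}/jars/*"  # leading ':' makes the pattern match no real path
java -cp $SPARK_CLASSPATH Foo            # bash leaves it alone; java expands the wildcard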

@tmckayus commented Nov 3, 2017

Hey all, I ran into this issue independently yesterday and tracked it down before I saw #409

I think there is a better solution than prepending a ":" to the $SPARK_CLASSPATH value, and that is to quote $SPARK_CLASSPATH when it is used as the value for the -cp flag in the various Dockerfiles under resource-managers/kubernetes/docker-minimal-bundle/src/main/docker (not just driver-py and executor-py, this is potentially a general problem).

This is the standard way of preventing the shell from expanding a variable before passing it to a command. So in the driver-py case as an example, the last line of the Dockerfile just becomes:

${JAVA_HOME}/bin/java "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMORY -Xmx$SPARK_DRIVER_MEMORY -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS $PYSPARK_PRIMARY $PYSPARK_FILES $SPARK_DRIVER_ARGS

That's it, no other change to SPARK_CLASSPATH needed.

What's happening with the unmodified Dockerfile (to recap) is that the wildcard list is being expanded as a space-separated list after the -cp flag, and the second value is taken by java as the target to execute. The rest of the line is thrown away. (You can see this by rebuilding the image and inserting an "echo" of the java invocation before the actual command, like so):

...
if ! [ -z ${SPARK_MOUNTED_FILES_FROM_SECRET_DIR+x} ]; then cp -R "$SPARK_MOUNTED_FILES_FROM_SECRET_DIR/." .; fi && \
echo ${JAVA_HOME}/bin/java "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp $SPARK_CLASSPATH -Xms$SPARK_DRIVER_MEMORY -Xmx$SPARK_DRIVER_MEMORY -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS $PYSPARK_PRIMARY $PYSPARK_FILES $SPARK_DRIVER_ARGS && \
...

You can also verify this behavior easily by downloading the spark distro and playing from a shell:

[machine spark-2.2.0-k8s-0.5.0-bin-2.7.3]$ SPARK_CONTEXT=jars/*
[machine spark-2.2.0-k8s-0.5.0-bin-2.7.3]$ echo $SPARK_CONTEXT
jars/activation-1.1.1.jar jars/antlr-2.7.7.jar jars/antlr4-runtime-4.5.3.jar jars/antlr-runtime-3.4.jar jars/aopalliance-1.0.jar jars/aopalliance-repackaged-2.4.0-b34.jar jars/apacheds-i18n-2.0.0-M15.jar jars/apacheds-kerberos-codec-2.0.0-M15.jar ...
[machine spark-2.2.0-k8s-0.5.0-bin-2.7.3]$ echo "$SPARK_CONTEXT"
jars/*

The trouble I see with prepending a ":" is that yes, it happens to prevent the shell expansion because the shell doesn't recognize it as a valid path any longer, but it is also undefined behavior for java. It literally seems to imply an empty path on the classpath list. This might confuse noobs and I think clouds the actual issue (and could perpetuate confusion in other code :) ) Why not defeat the wildcard expansion with the standard shell mechanism?

I have a PR with changes to the set of docker files that use -cp $SPARK_CLASSPATH in this way, but I thought I would comment on the original issue here first. I can push that PR if you like.

Best, Trevor

@erikerlandson (Member)

I've been conferring with @tmckayus on this, and the evidence suggests we should just quote the classpath instead of prepending a colon. Either way we ought to get this nit resolved and close out the issue. I'm in favor of pushing the alternate PR for comparison.

@tmckayus commented Nov 3, 2017

Okay, for comparison the other PR is #541

@adelbertc

This can be closed, right? I hit #409 as well and just tried the fix in #541, which fixes it for me.

ifilonenko pushed a commit to ifilonenko/spark that referenced this pull request Feb 26, 2019
Successfully merging this pull request may close these issues: PySpark Submission fails without --jars (#409).