[SPARK-16110][YARN][PYSPARK] Fix allowing python version to be specified per submit for cluster mode. #13824
Conversation
propagated to YARN application master in cluster mode before applying spark.yarn.appMasterEnv.* conf settings so that they can be set per submission. This allows pyspark jobs submitted via Livy to specify Python 2 or 3.
Is it possible to move the "spark.yarn.appMasterEnv.*" related code to the bottom of the function? It looks like other env variables may also have the same problem you mentioned for PYSPARK_PYTHON.
That is possible. It was done as is to be a narrower fix, less likely to change other behavior. It does not look like the env hashmap is read by the rest of the function (it is only written to, or has values appended to it), so the appMasterEnv block could be moved down to give it precedence. I'll update the pull request, avoiding all issues with overriding env vars per submit.
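For concreteness, here is a minimal standalone sketch of that ordering idea; it is not the actual Client.setupLaunchEnv code, and sparkConf and env are stand-ins for the real locals:

import scala.collection.mutable.HashMap

object LaunchEnvOrderSketch {
  // Populate env from the submitter's environment first, then apply
  // spark.yarn.appMasterEnv.* last so per-submit conf values win.
  def setupLaunchEnv(sparkConf: Map[String, String]): HashMap[String, String] = {
    val env = new HashMap[String, String]()

    // Inherited from the submitter's shell environment.
    sys.env.get("PYSPARK_PYTHON").foreach(env("PYSPARK_PYTHON") = _)
    sys.env.get("PYSPARK_DRIVER_PYTHON").foreach(env("PYSPARK_DRIVER_PYTHON") = _)

    // Applied last: spark.yarn.appMasterEnv.FOO=bar sets env("FOO") = "bar",
    // overriding anything inherited above.
    val prefix = "spark.yarn.appMasterEnv."
    sparkConf.foreach { case (k, v) =>
      if (k.startsWith(prefix)) env(k.stripPrefix(prefix)) = v
    }
    env
  }
}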
Also, would you please add a unit test for it?
Unit tests involving env vars can get ugly:
@KevinGrealish you can read env vars through SparkConf.
ok to test |
Test build #61407 has finished for PR 13824 at commit
@KevinGrealish were you planning to add a unit test?
ping @KevinGrealish?
Was on vacation. Looking again today.
@vanzin the function setupLaunchEnv reads env vars directly using sys.env.get, so I don't understand how SparkConf helps you set up values for this function to read. See my earlier link about setting environment variables from Java/Scala code. Without the ability to set the values in test cases, verifying that environment vars are overridden by conf is difficult.
You could change that code to use SparkConf.getenv, which tests can override.
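That pattern, routing environment reads through an overridable hook, looks roughly like this standalone sketch (SparkConf's own getenv is package-private, so the EnvReader class here is illustrative, not Spark's API):

object EnvMockSketch {
  // Production code reads the environment through an overridable hook...
  class EnvReader {
    def getenv(name: String): String = System.getenv(name)
  }

  def pythonExec(env: EnvReader): String =
    Option(env.getenv("PYSPARK_PYTHON")).getOrElse("python")

  def main(args: Array[String]): Unit = {
    // ...so a test substitutes a fake environment instead of mutating the JVM's.
    val fakeEnv = new EnvReader {
      override def getenv(name: String): String =
        if (name == "PYSPARK_PYTHON") "python3" else null
    }
    assert(pythonExec(fakeEnv) == "python3")
  }
}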
Test build #62855 has finished for PR 13824 at commit
@vanzin I looked at refactoring the client to be more amenable to granular unit testing, including using the mockable env var access on conf you mentioned, but concluded it would be too disruptive to be part of this simple bug fix. I have included a test (which found an issue) and believe this should be good to go now. I'm not sure why spark.yarn.appMasterEnv was doing Java-path-style appending, but that appears to be simply incorrect and was changed to a plain set.
Test build #62899 has finished for PR 13824 at commit
I'm ok with the behavior change. Maybe other people have more background on that behavior though; @tgravescs?
I'm fine with the change allowing this to override the existing PYSPARK_PYTHON and others, but I'm against the change away from addPathToEnvironment. The problem is that some env variables don't need appending but others do. I think it was done this way in case the variable required classpath-style combining, for instance LD_LIBRARY_PATH, or any other env variable with "PATH" in its name. We don't currently have a way to differentiate when values should be combined with the existing ones versus just overwritten. If you want to change that, we need a way to differentiate, whether that is looking for "PATH" in the env variable name and appending, or having a config that says which variables use append. That should probably be a separate JIRA, though. Note: here is a very similar JIRA in MapReduce: https://issues.apache.org/jira/browse/MAPREDUCE-6491
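To make the append-versus-overwrite distinction concrete, here is a small self-contained sketch; addPathToEnvironment below paraphrases the appending behavior (the real helper lives in YarnSparkHadoopUtil and uses YARN's separator):

import java.io.File
import scala.collection.mutable.HashMap

object EnvCombineSketch {
  // Paraphrase of the appending helper: join with the path separator
  // instead of replacing the existing value.
  def addPathToEnvironment(env: HashMap[String, String], key: String, value: String): Unit =
    env(key) = if (env.contains(key)) env(key) + File.pathSeparator + value else value

  def main(args: Array[String]): Unit = {
    val env = HashMap("LD_LIBRARY_PATH" -> "/opt/native", "PYSPARK_PYTHON" -> "python2")
    // Sensible for list-of-paths variables:
    addPathToEnvironment(env, "LD_LIBRARY_PATH", "/usr/local/lib")
    println(env("LD_LIBRARY_PATH")) // /opt/native:/usr/local/lib
    // But wrong for single-value variables: "python2:python3" is not an executable.
    addPathToEnvironment(env, "PYSPARK_PYTHON", "python3")
    println(env("PYSPARK_PYTHON")) // python2:python3 -- the bug under discussion
  }
}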
Even for env var values that will be interpreted as lists of paths, the user intent could still be append or override. There could be a spark.yarn.appMasterAppendEnv.* config setting for appending (with the existing assumption about the list delimiter). Tom, how are you proposing we fix the PYSPARK_PYTHON need without the change? That environment variable does not work like a Java path. Different behavior based on whether the var name contains "PATH" seems like a temporary hack at best.
@tgravescs do you have a preference for this bug fix between a hardcoded list of PATH, LD_LIBRARY_PATH, CLASSPATH, APP_CLASSPATH (an append list) or a list of PYSPARK_PYTHON, PYSPARK_DRIVER_PYTHON (an override list)? The latter would narrow this fix and reduce the risk of breaking anyone, so I'll propose that. A distinct issue/JIRA should be added for making the append/override behavior work more generally long term, likely aligning with the Hadoop one. That should be done, but I don't think it should block fixing this bug.
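A sketch of what that narrow override-list fix could look like, reusing the addPathToEnvironment paraphrase from the sketch above (the names here are assumptions, not the code in this PR):

object OverrideListSketch {
  import scala.collection.mutable.HashMap
  import EnvCombineSketch.addPathToEnvironment

  // Only these two variables get plain-set semantics; everything else
  // keeps the legacy appending behavior.
  val overrideEnvVars = Set("PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON")

  def setAppMasterEnv(env: HashMap[String, String], key: String, value: String): Unit =
    if (overrideEnvVars.contains(key)) env(key) = value // per-submit value wins
    else addPathToEnvironment(env, key, value)          // append for PATH-like vars
}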
Created https://issues.apache.org/jira/browse/SPARK-16744 for the override/append issue, linked to SPARK-16110. This fix remains just about being able to run Python 3.
Test build #62905 has finished for PR 13824 at commit
I think there are two ways to solve the problem that might be a little better.

The first is to try to keep the current behavior: in L734 (where you're removing the current code that checks the env vars), ...

The second is to read certain env variables using a special method that first looks at the spark.yarn.appMasterEnv.* conf entries and falls back to the environment.

I think the latter is a better solution than what you currently have, since it avoids hardcoding these env variable names in more places.
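As a guess at the shape of that second suggestion (the comment above is truncated, so this is an assumption rather than the actual snippet):

object ConfOrEnvSketch {
  // One helper owns the precedence rule: the per-submit conf value wins,
  // and the submitter's environment is only a fallback.
  def confOrEnv(sparkConf: Map[String, String], name: String): Option[String] =
    sparkConf.get("spark.yarn.appMasterEnv." + name).orElse(sys.env.get(name))

  def main(args: Array[String]): Unit = {
    val conf = Map("spark.yarn.appMasterEnv.PYSPARK_PYTHON" -> "python3")
    println(confOrEnv(conf, "PYSPARK_PYTHON")) // Some(python3), regardless of the shell env
  }
}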
How about just this:
That is fine too. I'd just do something like ... to avoid the repetition. BTW, I just noticed you have a typo in the PR title.
Sorry, I missed that this wouldn't work without dropping the appending. Yeah, I like @vanzin's latter suggestion if this change is needed soon (just fix it for these two variables), then figure out the long-term way to support this with SPARK-16744.
Test build #62933 has finished for PR 13824 at commit
@vanzin, I think we are good to go now. Agree?
- private def testPySpark(clientMode: Boolean): Unit = {
+ private def testPySpark(clientMode: Boolean,
+     extraConf: Map[String, String] = Map(),
Sorry, one last nit. The style for multi-line args is:
def method(
    arg1: Foo,
    arg2: Bar): Unit = {
}
Test build #62935 has finished for PR 13824 at commit
Cool, LGTM. Merging to master.
What changes were proposed in this pull request?
This fix allows a pyspark job submission to specify Python 2 or 3.
Change the ordering in the setup of the application master environment so that the env vars PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON can be overridden by spark.yarn.appMasterEnv.* conf settings. This applies to YARN in cluster mode. It allows them to be set per submission without needing to unset the env vars (which is not always possible; e.g. batch submission with Livy only exposes the arguments to spark-submit).
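For illustration, the kind of per-submit override this change enables (app.py and the python3 value are placeholders, not taken from the PR):

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=python3 \
  --conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=python3 \
  app.py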
How was this patch tested?
Manual and existing unit tests.