[SPARK-16110][YARN][PYSPARK] Fix allowing python version to be specified per submit for cluster mode. #13824
Conversation
propagated to YARN application master in cluster mode before applying spark.yarn.appMasterEnv.* conf settings so that they can be set per submission. This allows pyspark jobs submitted via Livy to specify Python 2 or 3.
Is it possible to move the "spark.yarn.appMasterEnv.*" related code to the bottom of the function? It looks like other env variables may also have the same problem you mentioned for PYSPARK_PYTHON.
That is possible. It was done as is to be a narrower fix, less likely to change other behavior. It does not look like the env hashmap is read by the rest of the function (it is only written to, or has values appended to it), so the appMasterEnv block could be moved down to give it precedence. I'll update the pull request, avoiding all issues with overriding env vars per submit.
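For concreteness, here is a minimal standalone sketch of that ordering idea; it is not the actual Client.setupLaunchEnv code, and sparkConf and env are stand-ins for the real locals:

import scala.collection.mutable.HashMap

object LaunchEnvOrderSketch {
  // Populate env from the submitter's environment first, then apply
  // spark.yarn.appMasterEnv.* last so per-submit conf values win.
  def setupLaunchEnv(sparkConf: Map[String, String]): HashMap[String, String] = {
    val env = new HashMap[String, String]()

    // Inherited from the submitter's shell environment.
    sys.env.get("PYSPARK_PYTHON").foreach(env("PYSPARK_PYTHON") = _)
    sys.env.get("PYSPARK_DRIVER_PYTHON").foreach(env("PYSPARK_DRIVER_PYTHON") = _)

    // Applied last: spark.yarn.appMasterEnv.FOO=bar sets env("FOO") = "bar",
    // overriding anything inherited above.
    val prefix = "spark.yarn.appMasterEnv."
    sparkConf.foreach { case (k, v) =>
      if (k.startsWith(prefix)) env(k.stripPrefix(prefix)) = v
    }
    env
  }
}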
Also, would you please add a unit test for it?
Unit tests involving env vars can get ugly:
@KevinGrealish you can read env vars through SparkConf.
ok to test |
Test build #61407 has finished for PR 13824 at commit
@KevinGrealish were you planning to add a unit test?
ping @KevinGrealish?
Was on vacation. Looking again today.
@vanzin the function setupLaunchEnv reads env vars directly using sys.env.get, so I don't understand how SparkConf helps you set up values for this function to read. See my earlier link about setting environment variables from Java/Scala code. Without the ability to set the values in test cases, verifying that environment vars are overridden by conf is difficult.
You could change that code to use SparkConf.getenv, which tests can override.
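That pattern, routing environment reads through an overridable hook, looks roughly like this standalone sketch (SparkConf's own getenv is package-private, so the EnvReader class here is illustrative, not Spark's API):

object EnvMockSketch {
  // Production code reads the environment through an overridable hook...
  class EnvReader {
    def getenv(name: String): String = System.getenv(name)
  }

  def pythonExec(env: EnvReader): String =
    Option(env.getenv("PYSPARK_PYTHON")).getOrElse("python")

  def main(args: Array[String]): Unit = {
    // ...so a test substitutes a fake environment instead of mutating the JVM's.
    val fakeEnv = new EnvReader {
      override def getenv(name: String): String =
        if (name == "PYSPARK_PYTHON") "python3" else null
    }
    assert(pythonExec(fakeEnv) == "python3")
  }
}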
Test build #62855 has finished for PR 13824 at commit
@vanzin I looked at refactoring the client to be more amenable to granular unit testing, including using the mockable env var access on conf you mentioned, but concluded it would be too disruptive to be part of this simple bug fix. I have included a test (which found an issue) and believe this should be good to go now. I'm not sure why spark.yarn.appMasterEnv was doing Java-path-style appending, but that appears to be simply incorrect and was changed to a plain set.
Test build #62899 has finished for PR 13824 at commit
I'm ok with the behavior change. Maybe other people have more background on that behavior though; @tgravescs?
I'm fine with the change allowing this to override the existing PYSPARK_PYTHON and others, but I'm against the change away from addPathToEnvironment. The problem is that some env variables don't need appending but others do. I think it was done this way in case the variable required classpath-style combining, for instance LD_LIBRARY_PATH, or any other env variable with "PATH" in its name. We don't currently have a way to differentiate when values should be combined with the existing ones versus just overwritten. If you want to change that, we need a way to differentiate, whether that is looking for "PATH" in the env variable name and appending, or having a config that says which variables use append. That should probably be a separate JIRA, though. Note: here is a very similar JIRA in MapReduce: https://issues.apache.org/jira/browse/MAPREDUCE-6491
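To make the append-versus-overwrite distinction concrete, here is a small self-contained sketch; addPathToEnvironment below paraphrases the appending behavior (the real helper lives in YarnSparkHadoopUtil and uses YARN's separator):

import java.io.File
import scala.collection.mutable.HashMap

object EnvCombineSketch {
  // Paraphrase of the appending helper: join with the path separator
  // instead of replacing the existing value.
  def addPathToEnvironment(env: HashMap[String, String], key: String, value: String): Unit =
    env(key) = if (env.contains(key)) env(key) + File.pathSeparator + value else value

  def main(args: Array[String]): Unit = {
    val env = HashMap("LD_LIBRARY_PATH" -> "/opt/native", "PYSPARK_PYTHON" -> "python2")
    // Sensible for list-of-paths variables:
    addPathToEnvironment(env, "LD_LIBRARY_PATH", "/usr/local/lib")
    println(env("LD_LIBRARY_PATH")) // /opt/native:/usr/local/lib
    // But wrong for single-value variables: "python2:python3" is not an executable.
    addPathToEnvironment(env, "PYSPARK_PYTHON", "python3")
    println(env("PYSPARK_PYTHON")) // python2:python3 -- the bug under discussion
  }
}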
Even for env var values that will be interpreted as lists of paths, the user intent could still be append or override. There could be a spark.yarn.appMasterAppendEnv.* config setting for appending (with the existing assumption about the list delimiter). Tom, how are you proposing we fix the PYSPARK_PYTHON need without the change? That environment variable does not work like a Java path. Different behavior based on whether the var name contains "PATH" seems like a temporary hack at best.
@tgravescs do you have a preference for this bug fix between a hardcoded list of PATH, LD_LIBRARY_PATH, CLASSPATH, APP_CLASSPATH (an append list) or a list of PYSPARK_PYTHON, PYSPARK_DRIVER_PYTHON (an override list)? The latter would narrow this fix and reduce the risk of breaking anyone, so I'll propose that. A distinct issue/JIRA should be added for making the append/override behavior work more generally long term, likely aligning with the Hadoop one. That should be done, but I don't think it should block fixing this bug.
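A sketch of what that narrow override-list fix could look like, reusing the addPathToEnvironment paraphrase from the sketch above (the names here are assumptions, not the code in this PR):

object OverrideListSketch {
  import scala.collection.mutable.HashMap
  import EnvCombineSketch.addPathToEnvironment

  // Only these two variables get plain-set semantics; everything else
  // keeps the legacy appending behavior.
  val overrideEnvVars = Set("PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON")

  def setAppMasterEnv(env: HashMap[String, String], key: String, value: String): Unit =
    if (overrideEnvVars.contains(key)) env(key) = value // per-submit value wins
    else addPathToEnvironment(env, key, value)          // append for PATH-like vars
}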
Created https://issues.apache.org/jira/browse/SPARK-16744 for the override/append issue, linked to SPARK-16110. This fix remains just about being able to run Python 3.
Test build #62905 has finished for PR 13824 at commit
I think there are two ways to solve the problem that might be a little better.

The first is to try to keep the current behavior: in L734 (where you're removing the current code that checks the env vars), ...

The second is to read certain env variables using a special method that first looks at the spark.yarn.appMasterEnv.* conf entries and falls back to the environment.

I think the latter is a better solution than what you currently have, since it avoids hardcoding these env variable names in more places.
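As a guess at the shape of that second suggestion (the comment above is truncated, so this is an assumption rather than the actual snippet):

object ConfOrEnvSketch {
  // One helper owns the precedence rule: the per-submit conf value wins,
  // and the submitter's environment is only a fallback.
  def confOrEnv(sparkConf: Map[String, String], name: String): Option[String] =
    sparkConf.get("spark.yarn.appMasterEnv." + name).orElse(sys.env.get(name))

  def main(args: Array[String]): Unit = {
    val conf = Map("spark.yarn.appMasterEnv.PYSPARK_PYTHON" -> "python3")
    println(confOrEnv(conf, "PYSPARK_PYTHON")) // Some(python3), regardless of the shell env
  }
}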
How about just this:
That is fine too. I'd just do something like ... to avoid the repetition. BTW, I just noticed you have a typo in the PR title.
Sorry, I missed that this wouldn't work without dropping the appending. Yeah, I like @vanzin's latter suggestion if this change is needed soon (just fix it for these two variables), then figure out the long-term way to support this with SPARK-16744.
Test build #62933 has finished for PR 13824 at commit
@vanzin, I think we are good to go now. Agree?
- private def testPySpark(clientMode: Boolean): Unit = {
+ private def testPySpark(clientMode: Boolean,
+     extraConf: Map[String, String] = Map(),
Sorry, one last nit. The style for multi-line args is:
def method(
    arg1: Foo,
    arg2: Bar): Unit = {
}
Test build #62935 has finished for PR 13824 at commit
Cool, LGTM. Merging to master.
What changes were proposed in this pull request?
This fix allows a pyspark job submission to specify Python 2 or 3.
Change the ordering in the setup of the application master environment so that the env vars PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON can be overridden by spark.yarn.appMasterEnv.* conf settings. This applies to YARN in cluster mode. It allows them to be set per submission without needing to unset the env vars (which is not always possible; e.g. batch submission with Livy only exposes the arguments to spark-submit).
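For illustration, the kind of per-submit override this change enables (app.py and the python3 value are placeholders, not taken from the PR):

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=python3 \
  --conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=python3 \
  app.py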
How was this patch tested?
Manual and existing unit tests.