
Conversation

@KevinGrealish
Contributor

What changes were proposed in this pull request?

This fix allows a pyspark job submission to specify whether it runs under Python 2 or 3.

Change the ordering in the setup of the application master environment so that the env vars PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON can be overridden by spark.yarn.appMasterEnv.* conf settings. This applies to YARN in cluster mode. It allows the Python version to be set per submission without needing to unset the env vars, which is not always possible (e.g. batch submit with LIVY only exposes the arguments to spark-submit).
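
For illustration, a submission could then pick the interpreter purely through conf (a sketch; spark.yarn.appMasterEnv.* is the documented config prefix, and the interpreter path is a placeholder):

import org.apache.spark.SparkConf

// Illustrative only, not part of the patch: select the AM-side Python per
// submission via conf. The interpreter path is a placeholder for whatever
// the cluster actually provides.
val conf = new SparkConf()
  .set("spark.yarn.appMasterEnv.PYSPARK_PYTHON", "/usr/bin/python3")
  .set("spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON", "/usr/bin/python3")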

How was this patch tested?

Manual and existing unit tests.

propagated to YARN application master in cluster mode before applying
spark.yarn.appMasterEnv.* conf settings, so that they can be set
per submission. This allows pyspark jobs submitted via livy to
specify python 2 or 3.
@jerryshao
Contributor

Is it possible to move the "spark.yarn.appMasterEnv.*"-related code to the bottom of the function? It looks like other env variables may have the same problem you mentioned for PYSPARK_DRIVER_PYTHON.

@KevinGrealish
Contributor Author

That is possible. It was done this way to keep the fix narrow and make it less likely to change other behavior. It does not look like the env hashmap is read by the rest of the function (values are only written or appended to), so the appMasterEnv handling could be moved down to give it precedence. I'll update the pull request.
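
For illustration, the reordering amounts to applying the user's AM env conf last (a sketch with an invented helper name, not the merged diff):

import scala.collection.mutable
import org.apache.spark.SparkConf

// Sketch (helper name invented): apply spark.yarn.appMasterEnv.* entries
// after Spark has populated its own defaults, so the user's values win.
def applyUserAmEnvLast(sparkConf: SparkConf, env: mutable.Map[String, String]): Unit = {
  val prefix = "spark.yarn.appMasterEnv."
  sparkConf.getAll
    .filter { case (k, _) => k.startsWith(prefix) }
    .foreach { case (k, v) => env(k.stripPrefix(prefix)) = v }
}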

avoiding all issues with overriding env vars per submit.
@jerryshao
Contributor

Also, would you please add a unit test for it?

@KevinGrealish
Contributor Author

Unit tests involving env vars can get ugly:
http://stackoverflow.com/questions/318239/how-do-i-set-environment-variables-from-java. I'm assuming such hacks would not be acceptable in this codebase.

@vanzin
Contributor

vanzin commented Jun 28, 2016

@KevinGrealish you can read env vars through SparkConf.getenv and that allows you to override them in tests somewhat easily (look for SparkConfWithEnv).
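
For reference, the pattern is roughly the following (a reconstruction from the discussion; the real test helper in Spark's test sources may differ in detail):

import org.apache.spark.SparkConf

// Reconstruction of the SparkConfWithEnv idea: a SparkConf whose getenv is
// backed by a supplied map, letting tests fake environment variables.
// Note: SparkConf.getenv is private[spark], so this lives in that package.
class SparkConfWithEnv(env: Map[String, String]) extends SparkConf(false) {
  override def getenv(name: String): String =
    env.getOrElse(name, super.getenv(name))
}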

@vanzin
Contributor

vanzin commented Jun 28, 2016

ok to test

@SparkQA

SparkQA commented Jun 28, 2016

Test build #61407 has finished for PR 13824 at commit 4ad7904.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Jul 5, 2016

@KevinGrealish were you planning to add a unit test?

@vanzin
Contributor

vanzin commented Jul 12, 2016

ping @KevinGrealish?

@KevinGrealish
Contributor Author

Was on vacation. Looking again today.

@KevinGrealish
Contributor Author

@vanzin the function setupLaunchEnv reads env vars directly using sys.env.get, so I don't understand how SparkConf helps you set up values for this function to read. See my earlier link about setting environment variables from Java/Scala code. Without the ability to set the values in test cases, verifying that environment vars are overridden by conf is difficult.

@vanzin
Contributor

vanzin commented Jul 25, 2016

You could change that code to use sparkConf.getenv, since there's a SparkConf instance available.
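
Concretely, the suggestion amounts to something like this (a sketch; the helper name is invented):

import org.apache.spark.SparkConf

// Sketch: route env-var reads through the SparkConf instance instead of
// sys.env, so a test double like SparkConfWithEnv can inject values.
// getenv returns null when unset, hence the Option wrapper; the method is
// private[spark], so callers sit inside the org.apache.spark namespace.
def envOrNone(sparkConf: SparkConf, name: String): Option[String] =
  Option(sparkConf.getenv(name))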

@SparkQA

SparkQA commented Jul 26, 2016

Test build #62855 has finished for PR 13824 at commit cb41e9e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@KevinGrealish
Contributor Author

@vanzin I looked at refactoring the client to be more amenable to granular unit testing, including using the mockable env var access on conf that you mentioned, but concluded it would be too disruptive to be part of this simple bug fix. I have included a test (which found an issue) and believe this should be good to go now. I'm not sure why spark.yarn.appMasterEnv was doing Java-path-style appending, but that appears to be simply incorrect and was changed to a plain set.

@SparkQA

SparkQA commented Jul 26, 2016

Test build #62899 has finished for PR 13824 at commit 6791edc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Jul 26, 2016

I'm ok with the behavior change. Maybe other people have more background on that behavior though; @tgravescs?

@tgravescs
Contributor

I'm fine with the change to allow overriding the existing PYSPARK_PYTHON and the others, but I'm against the change away from addPathToEnvironment. The problem is that some env variables should be combined with existing values while others should be overwritten. I think it was done this way in case a variable required classpath-style combining, for instance LD_LIBRARY_PATH, or any other env variable with "PATH" in its name. We currently have no way to differentiate when values should be combined with existing ones versus just overwritten.

If you want to change that, we need a way to differentiate: whether that is looking for "PATH" in the env variable name and appending, or having a config that lists the variables to append rather than overwrite. This should probably be a separate jira, though.
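
A sketch of that name-based heuristic (illustrative only, not committed code; delimiter and naming assumptions are noted in the comments):

import java.io.File
import scala.collection.mutable

// Sketch of the heuristic floated above: variables with "PATH" in the name
// are treated as path lists and appended to; everything else is plainly
// overwritten. File.pathSeparator stands in for whatever separator the
// cluster side actually requires.
def setOrAppend(env: mutable.Map[String, String], key: String, value: String): Unit = {
  if (key.contains("PATH") && env.contains(key)) {
    env(key) = env(key) + File.pathSeparator + value
  } else {
    env(key) = value
  }
}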

Note: here is a very similar jira in MapReduce: https://issues.apache.org/jira/browse/MAPREDUCE-6491

@KevinGrealish
Contributor Author

Even when working with env var values that will be interpreted as lists of paths, the user's intent could still be append or override. There could be a spark.yarn.appMasterAppendEnv.* config setting for appending (with the existing assumption about the list delimiter). Tom, how are you proposing we fix the PYSPARK_PYTHON need without this change? That environment variable does not work like a java path. Different behavior based on whether the var name contains "PATH" seems like a temporary hack at best.

@KevinGrealish
Contributor Author

@tgravescs do you have a preference between this bug fix using a hard-coded list of PATH, LD_LIBRARY_PATH, CLASSPATH, APP_CLASSPATH (an append list) and a list of PYSPARK_PYTHON, PYSPARK_DRIVER_PYTHON (an override list)? The latter would narrow the fix and reduce the risk of breaking anyone, so I'll propose that.

A distinct issue/JIRA should be filed for making append/override work more generally long term, likely aligning with the Hadoop one. That should be done, but I don't think it should block fixing this bug.

@KevinGrealish
Contributor Author

Created https://issues.apache.org/jira/browse/SPARK-16744 for the override/append issue and linked it to SPARK-16110. This fix remains just about being able to run Python 3.

@SparkQA

SparkQA commented Jul 27, 2016

Test build #62905 has finished for PR 13824 at commit f2c2e4a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Jul 27, 2016

I think there are two ways to solve the problem that might be a little better...

The first is to try to keep the current behavior. If, in L734 (where you're removing the current code that checks the appMasterEnv conf), you make a copy of the keys that were added to the env variable, you can then apply addPathToEnvironment just to those keys. That way the code merges any user configuration with env variables created by Spark itself; otherwise it uses the user's override.

The second is to read certain env variables through a special method that first looks at spark.yarn.appMasterEnv.FOO and, if it doesn't exist, falls back to sys.env("FOO"). Then you could change the code that currently reads PYSPARK_DRIVER_PYTHON and friends to use that new method instead of peeking directly at sys.env. You could also apply it to the code that currently reads PYTHONPATH.

I think the latter is a better solution than what you currently have, since it avoids hardcoding these env variable names in more places.
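
For the second option, the lookup method would be along these lines (a sketch; the helper name is invented, not the code that was merged):

import org.apache.spark.SparkConf

// Sketch of the conf-first lookup: prefer spark.yarn.appMasterEnv.FOO,
// fall back to the process environment.
def amEnv(sparkConf: SparkConf, name: String): Option[String] =
  sparkConf.getOption("spark.yarn.appMasterEnv." + name)
    .orElse(sys.env.get(name))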

@KevinGrealish
Contributor Author

How about just this:

// propagate PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON to driver in cluster mode
if (!env.contains("PYSPARK_DRIVER_PYTHON")) {
  sys.env.get("PYSPARK_DRIVER_PYTHON").foreach(env("PYSPARK_DRIVER_PYTHON") = _)
}
if (!env.contains("PYSPARK_PYTHON")) {
  sys.env.get("PYSPARK_PYTHON").foreach(env("PYSPARK_PYTHON") = _)
}

@vanzin
Contributor

vanzin commented Jul 27, 2016

That is fine too. I'd just do something like:

Seq("PYSPARK_DRIVER_PYTHON", "PYSPARK_PYTHON").foreach { envname =>
  // code to set the value
}

To avoid the repetition. BTW I just noticed you have a typo in the PR title.
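
Filled in, that skeleton would read roughly as follows (a sketch; the merged change may differ in detail):

import scala.collection.mutable

// `env` stands in for the launch-environment map built in setupLaunchEnv,
// with spark.yarn.appMasterEnv.* settings assumed already applied, so a
// user-provided value is not clobbered by the submitter's environment.
val env = mutable.HashMap.empty[String, String]
Seq("PYSPARK_DRIVER_PYTHON", "PYSPARK_PYTHON").foreach { envname =>
  if (!env.contains(envname)) {
    sys.env.get(envname).foreach(env(envname) = _)
  }
}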

@tgravescs
Contributor

tgravescs commented Jul 27, 2016

Sorry, I missed that this wouldn't work with appending. Yeah, I like @vanzin's latter suggestion if this change is needed soon (just fix it for these 2 variables), then figure out a long-term way to support it with SPARK-16744.

@SparkQA

SparkQA commented Jul 27, 2016

Test build #62933 has finished for PR 13824 at commit ca4d5fa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@KevinGrealish KevinGrealish changed the title [SPARK-16110][YARN][PYSPARK] Fix allowing python version to be speicifed per submit for cluster mode. [SPARK-16110][YARN][PYSPARK] Fix allowing python version to be specified per submit for cluster mode. Jul 27, 2016
@KevinGrealish
Contributor Author

@vanzin, I think we are good to go now. Agree?


- private def testPySpark(clientMode: Boolean): Unit = {
+ private def testPySpark(clientMode: Boolean,
+ extraConf: Map[String, String] = Map(),
Contributor


Sorry, one last nit: the style for multi-line args is:

def method(
    arg1: Foo,
    arg2: Bar): Unit = {

}

@SparkQA

SparkQA commented Jul 27, 2016

Test build #62935 has finished for PR 13824 at commit 7ad3d77.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Jul 27, 2016

Cool, LGTM. Merging to master.

@asfgit asfgit closed this in b14d7b5 Jul 27, 2016