Conversation

@vanzin
Contributor

@vanzin vanzin commented Jul 1, 2016

This allows configuration to be more flexible, for example, when the cluster does
not have a homogeneous configuration (e.g. packages are installed on different
paths on different nodes). By allowing one to reference the environment from
the conf, it becomes possible to work around those differences in certain cases.
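
For illustration, a value could then look something like this (the environment
variable name is made up; the exact reference prefixes are discussed later in
the review):

import org.apache.spark.SparkConf

// Hypothetical example: point the metastore jars at a per-node location that
// is resolved from that node's environment when the config is read.
val conf = new SparkConf()
  .set("spark.sql.hive.metastore.jars", "${env:HIVE_METASTORE_JARS}")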

As part of the implementation, ConfigEntry now keeps track of all "known" configs
(i.e. those created through the use of ConfigBuilder), since that list is used
by the resolution code. This duplicates some code in SQLConf, which could
potentially be merged with this now. It will also make it simpler to implement
missing features such as filtering which configs show up in the UI or in event
logs, though those are not part of this change.

Another change is in the way ConfigEntry reads config data; it now takes a string
map and a function that reads env variables, so that it can be called from both
SparkConf and SQLConf. This means both places follow the same read path, instead
of having to replicate certain logic in SQLConf. There are still a couple of
methods in SQLConf that peek into fields of ConfigEntry directly, though.
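
As a rough sketch of the shape described above (a string map plus an env-lookup
function; the names here are illustrative, not necessarily the ones in the patch):

// Illustrative only: an entry that reads from a plain string map plus an
// env-lookup function, so SparkConf and SQLConf can share the same read path.
class StringEntry(val key: String, val default: String) {
  def readFrom(settings: java.util.Map[String, String],
               getenv: String => String): String = {
    // Expansion of ${...} references would be applied to the raw value here,
    // with `getenv` backing the env: prefix.
    Option(settings.get(key)).getOrElse(default)
  }
}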

Tested via unit tests, and by using the new variable expansion functionality
in a shell session with a custom spark.sql.hive.metastore.jars value.

…m props.

The feature is hooked up to spark.sql.hive.metastore.jars to show how to use
it, and because it's an example of said scenario that I ran into. It uses a
new "pathConf" config type that is a shorthand for enabling variable expansion
on string configs.
@vanzin
Contributor Author

vanzin commented Jul 1, 2016

Note this contains the code for #14020 also since I needed that to test the changes.

@SparkQA

SparkQA commented Jul 2, 2016

Test build #61643 has finished for PR 14022 at commit f4e772a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor Author

vanzin commented Jul 5, 2016

Randomly pinging a couple of people... @JoshRosen @srowen

@srowen
Member

srowen commented Jul 5, 2016

My only concern is the added complexity vs. how useful it would be. Do we have weird corner cases here like circular references? Or resolving references among many sources in the right order, leading to odd ambiguity? Maybe that's not possible and I haven't thought it through.

@vanzin
Contributor Author

vanzin commented Jul 5, 2016

Do we have weird corner cases here like circular references?

That's a good question; the current code doesn't handle that (if you reference a config that itself has references, those are not expanded), but it probably should for correctness. But implementing that creates the risk of circular references. Let me take a look; it shouldn't be hard to solve.

re: multiple sources, the prefix is mandatory, so there shouldn't be any ambiguity related to where the values come from. Also, only Spark configs that have been explicitly allowed to do so perform variable expansion.

re: usability, this is pretty much mandatory at least on YARN if the cluster doesn't have software installed in the same location on every node. That's more common than I'd like it to be, unfortunately.

Expand values of known configs when doing variable expansion, but
detect and error out when circular references are found.
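
A minimal sketch of that kind of cycle detection (illustrative only, not the code
in this commit; it follows the same matcher-loop shape as the snippet reviewed
below):

import java.util.regex.Pattern

// Illustrative only: expand ${name} references, erroring out when a reference
// chain loops back on itself. `lookup` resolves a reference name to its value.
val REF = Pattern.compile("\\$\\{([^}]+)\\}")

def expand(value: String,
           lookup: String => Option[String],
           expanding: Set[String] = Set.empty): String = {
  val matcher = REF.matcher(value)
  val result = new StringBuilder()
  var end = 0
  while (end < value.length() && matcher.find(end)) {
    val name = matcher.group(1)
    if (expanding.contains(name)) {
      throw new IllegalArgumentException(s"Circular reference in config value: $name")
    }
    // Keep the text before the reference, then append the referenced value
    // (recursively expanded); unresolvable references are left as-is.
    result.append(value.substring(end, matcher.start()))
    result.append(lookup(name).map(v => expand(v, lookup, expanding + name))
      .getOrElse(matcher.group(0)))
    end = matcher.end()
  }
  result.append(value.substring(end))
  result.toString()
}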
@SparkQA

SparkQA commented Jul 6, 2016

Test build #61795 has finished for PR 14022 at commit 240ab73.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Jul 6, 2016

Spark SQL allows env:xxx and system:xxx. We should follow the same here.

See VariableSubstitution

@vanzin
Contributor Author

vanzin commented Jul 7, 2016

Spark SQL allows env:xxx and system:xxx. We should follow the same here.

Sounds good. I looked briefly at the code and they could potentially be merged later, but to avoid issues like "how to support hiveconf:" I'll just modify this code to use "system:" and "env:" for the moment.

"env:" and "system:", to match how SQL does variable expansion. We
can try to merge the two later, but since this code doesn't have
access to things like hadoop or hive configuration, let's leave them
separate for now.
@SparkQA

SparkQA commented Jul 7, 2016

Test build #61924 has finished for PR 14022 at commit 392bddc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor Author

vanzin commented Jul 11, 2016

Any more comments? I'd like to get this in so I can work on cleaning up the code duplication between SQL and core.

@SparkQA

SparkQA commented Jul 13, 2016

Test build #62259 has finished for PR 14022 at commit b928a55.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor Author

vanzin commented Jul 13, 2016

retest this please

@SparkQA

SparkQA commented Jul 13, 2016

Test build #62275 has finished for PR 14022 at commit b928a55.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor Author

vanzin commented Jul 13, 2016

I'll leave this around a little more but if I don't see more feedback I plan to push it before the weekend.

val result = new StringBuilder()
var end = 0

while (end < value.length() && matcher.find(end)) {
Contributor

I think you can use Regex#replaceAllIn here instead of a loop.
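
For reference, a minimal sketch of what that could look like (illustrative; the
pattern and lookup function are placeholders):

import java.util.regex.Matcher
import scala.util.matching.Regex

// Illustrative only: let replaceAllIn drive the matching and call back for
// each ${...} occurrence, instead of advancing a Matcher by hand.
val ref: Regex = """\$\{([^}]+)\}""".r

def substitute(value: String, resolve: String => Option[String]): String =
  ref.replaceAllIn(value, m =>
    // quoteReplacement so "$" and "\" in resolved values are taken literally
    Matcher.quoteReplacement(resolve(m.group(1)).getOrElse(m.matched)))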

Contributor Author

That looks interesting, let me take a look.

@ericl
Contributor

ericl commented Jul 13, 2016

Instead of selectively enabling this for certain confs / config builders, how about a global flag for enabling config expansion? I think that would be less likely to be confusing. Also, perhaps a warning should be logged if there is a config value that looks expandable but the flag is off.

@vanzin
Contributor Author

vanzin commented Jul 13, 2016

how about a global flag for enabling config expansion

I think that would be more confusing. Why would someone disable expansion?

My only concern with enabling it for all options is performance - reading configs becomes more expensive. Maybe that's not a big deal and we can just do that.

@ericl
Contributor

ericl commented Jul 14, 2016

Wouldn't enabling it by default break backwards compatibility? I agree that would be better, but it seems likely that '${...}' may be used in existing configs.

I think performance is probably ok, configs shouldn't be resolved that often and there aren't that many of them.

@vanzin
Contributor Author

vanzin commented Jul 14, 2016

Wouldn't enabling it by default break backwards compatibility?

Yes, maybe. But having a flag to disable everything would also potentially break features that rely on it... although you could argue it's the user's fault at that point.

I just don't like adding new options that only exist because we can't make a decision about what their behavior should be. If variable substitution applies to all configs, then it does apply to all configs and you can't disable it. Since no configs that I know of have this functionality (the SQL variable substitution would be broken), from a user's perspective it's very unlikely doing it for all would break things.

@ericl
Contributor

ericl commented Jul 14, 2016

To reduce the risk, how about changing the semantics to

  • sparkconf: looks for the value in the spark config
  • system: looks for the value in the system properties
  • env: looks for the value in the environment
  • (any other or no prefix): left as-is

@vanzin
Contributor Author

vanzin commented Jul 14, 2016

That's mostly how it works; I would like to avoid an explicit sparkconf: prefix so we don't end up with things like sparkconf:spark.master, but I can easily enforce that only variables starting with spark. are expanded.

@ericl
Contributor

ericl commented Jul 14, 2016

IMO ${sparkconf:spark.master} is clearer to the unfamiliar reader, since then it's not different from env: or system:. There might also be an issue if someone adds a Spark conf that doesn't start with "spark.", which I think SparkConf allows.

Marcelo Vanzin added 2 commits July 14, 2016 11:20
Also enforce a "spark." prefix when doing substitution.
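
A tiny illustration of that check (illustrative names, not the exact code):

// Unprefixed references are resolved against the Spark conf, but only when
// they name a "spark."-prefixed config; anything else is not expanded.
def lookupConfRef(conf: Map[String, String], name: String): Option[String] =
  if (name.startsWith("spark.")) conf.get(name) else None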
@SparkQA

SparkQA commented Jul 14, 2016

Test build #62339 has finished for PR 14022 at commit cca8314.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor Author

vanzin commented Jul 15, 2016

if someone adds a spark conf that doesn't start with "spark.", which I think is allowed by spark conf.

Well, it is and it isn't. You can't set non-spark. configs from spark-submit or system properties. But you can set them directly in the code. In general, Spark completely ignores anything that does not start with spark. so I think that's ok here.

}
case "system" => sys.props.get(name)
case "env" => Option(getenv(name))
case _ => throw new IllegalArgumentException(s"Invalid prefix: $prefix")
Contributor

Should this throw?

Contributor Author

Yeah, I have to take a look at this. Throwing might actually break some existing code in SparkHadoopUtil.
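
For comparison, a non-throwing variant would simply treat an unknown prefix as
unresolvable and keep the original text (sketch only, not the code in the patch):

// Illustrative only: unknown prefixes resolve to None, so the surrounding
// expansion code keeps the original "${prefix:name}" text instead of failing.
def resolveRef(prefix: Option[String], name: String,
               conf: Map[String, String]): Option[String] = prefix match {
  case None => if (name.startsWith("spark.")) conf.get(name) else None
  case Some("system") => sys.props.get(name)
  case Some("env") => Option(System.getenv(name))
  case _ => None
}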

Marcelo Vanzin added 2 commits July 18, 2016 12:44
After thinking about it a bit more, it can lead to confusing behavior.
Since substitution only applies to configs declared internally by Spark,
random user configs are already unaffected, so there is no need for
another check to disable that.
@ericl
Contributor

ericl commented Jul 18, 2016

It looks good to me, except I really think you should keep the "spark." requirement. It seems plausible that some string like '${x}' can show up in spark confs, perhaps in confs like spark.r.command.

@vanzin
Contributor Author

vanzin commented Jul 18, 2016

I really think you should keep the "spark." requirement.

I reverted my last change since I forgot to change the test code anyway.

I'll spend some time to see if it's easy to handle SQLConf's WAREHOUSE_PATH with this code, to remove some manual variable expansion happening there, but if it's too much work I'll leave that for a later change.

This allows some custom expansion code to be removed from SQLConf.
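
For example, the warehouse path default could then be written as a plain
reference (sketch only; the exact builder API in SQLConf may differ):

// Illustrative only: with expansion handled by the config framework, a default
// like this replaces the hand-rolled substitution SQLConf used to do.
val WAREHOUSE_PATH = SQLConfigBuilder("spark.sql.warehouse.dir")
  .doc("The default location for managed databases and tables.")
  .stringConf
  .createWithDefault("${system:user.dir}/spark-warehouse")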
@SparkQA

SparkQA commented Jul 18, 2016

Test build #62485 has finished for PR 14022 at commit 6618bc4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 18, 2016

Test build #62488 has finished for PR 14022 at commit 3afe221.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 18, 2016

Test build #62489 has finished for PR 14022 at commit 33de81f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 19, 2016

Test build #62494 has finished for PR 14022 at commit ed5c18b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor Author

vanzin commented Jul 20, 2016

I'll push this later after another test run. retest this please

@SparkQA

SparkQA commented Jul 21, 2016

Test build #62637 has finished for PR 14022 at commit ed5c18b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor Author

vanzin commented Jul 21, 2016

Merging to master.

@asfgit asfgit closed this in 75a06aa Jul 21, 2016
@vanzin vanzin deleted the SPARK-16272 branch August 2, 2016 18:34