[SPARK-16272][core] Allow config values to reference conf, env, system props. #14022
Conversation
This allows configuration to be more flexible when the cluster does not have a homogeneous configuration (e.g. packages are installed on different paths on different nodes). By allowing one to reference the environment from the conf, it becomes possible to work around those differences in certain cases.

The feature is hooked up to spark.sql.hive.metastore.jars to show how to use it, and because it's an example of said scenario that I ran into. It uses a new "pathConf" config type that is shorthand for enabling variable expansion on string configs.

As part of the implementation, ConfigEntry now keeps track of all "known" configs (i.e. those created through the use of ConfigBuilder), since that list is used by the resolution code. This duplicates some code in SQLConf, which could potentially be merged with this now. It will also make it simpler to implement some missing features, such as filtering which configs show up in the UI or in event logs, which are not part of this change.

Another change is in the way ConfigEntry reads config data; it now takes a string map and a function that reads env variables, so that it can be called from both SparkConf and SQLConf. This makes both places follow the same read path, instead of having to replicate certain logic in SQLConf. There are still a couple of methods in SQLConf that peek into fields of ConfigEntry directly, though.

Tested via unit tests, and by using the new variable expansion functionality in a shell session with a custom spark.sql.hive.metastore.jars value.
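For illustration, here's the kind of conf this enables; the specific keys and paths are hypothetical, but the ${env:...} / ${system:...} reference syntax is the one this thread settles on below:

    # Hypothetical spark-defaults.conf entries using variable expansion.
    spark.sql.hive.metastore.jars     ${env:HIVE_HOME}/lib/*
    spark.executor.extraJavaOptions   -Dtmp.dir=${system:java.io.tmpdir}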
Note this contains the code for #14020 also, since I needed that to test the changes.
Test build #61643 has finished for PR 14022 at commit
Randomly pinging a couple of people... @JoshRosen @srowen
My only concern is the added complexity vs. how useful it would be. Do we have weird corner cases here, like circular references? Or resolving references among many sources in the right order leading to odd ambiguity? Maybe that's not possible and I haven't thought it through.
That's a good question; the current code doesn't handle that (if you reference a config that itself has references, those are not expanded), but it probably should for correctness. But implementing that creates the risk of circular references. Let me take a look; it shouldn't be hard to solve. Re: multiple sources, the prefix is mandatory, so there shouldn't be any ambiguity about where the values come from. Also, only Spark config values that have been explicitly allowed do variable expansion. Re: usability, this is pretty much mandatory at least on YARN if the cluster doesn't have software installed in the same location on every node. That's more common than I'd like it to be, unfortunately.
Expand values of known configs when doing variable expansion, but detect and error out when circular references are found.
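As a rough illustration of the approach (not the PR's actual code; the map-based lookup and names here are hypothetical), cycle detection can be done by threading a set of in-progress keys through the expansion:

    // Expand "${ref}" tokens in a conf value, erroring out on circular references.
    def expand(key: String, confs: Map[String, String],
        inProgress: Set[String] = Set.empty): String = {
      require(!inProgress.contains(key), s"Circular reference in config: $key")
      val raw = confs.getOrElse(key, "")
      val out = new StringBuilder
      var i = 0
      while (i < raw.length) {
        val start = raw.indexOf("${", i)
        val end = if (start >= 0) raw.indexOf("}", start) else -1
        if (end < 0) {
          // No (complete) reference left; copy the remainder verbatim.
          out.append(raw.substring(i))
          i = raw.length
        } else {
          out.append(raw.substring(i, start))
          out.append(expand(raw.substring(start + 2, end), confs, inProgress + key))
          i = end + 1
        }
      }
      out.toString
    }

With this, expand("a", Map("a" -> "${b}", "b" -> "${a}")) fails fast instead of looping forever.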
Test build #61795 has finished for PR 14022 at commit
Spark SQL already allows a similar kind of variable substitution; see VariableSubstitution.
Sounds good. I looked briefly at the code and they could potentially be merged later, but to avoid issues like "how to support hiveconf:" I'll just modify this code to use "system:" and "env:" for the moment.
"env:" and "system:", to match how SQL does variable expansion. We can try to merge the two later, but since this code doesn't have access to things like hadoop or hive configuration, let's leave them separate for now.
Test build #61924 has finished for PR 14022 at commit
Any more comments? I'd like to get this in so I can work on cleaning up the code duplication between SQL and core.
Test build #62259 has finished for PR 14022 at commit
retest this please
Test build #62275 has finished for PR 14022 at commit
I'll leave this around a little more, but if I don't see more feedback I plan to push it before the weekend.
    val result = new StringBuilder()
    var end = 0

    while (end < value.length() && matcher.find(end)) {
I think you can use Regex#replaceAllIn here instead of a loop.
That looks interesting, let me take a look.
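For reference, a sketch of what the replaceAllIn version can look like (a hypothetical helper, not necessarily what the PR ended up with):

    import scala.util.matching.Regex

    val ref: Regex = """\$\{([^}]+)\}""".r

    // Replace each "${name}" with its looked-up value, leaving unresolved
    // references untouched; quoteReplacement escapes '$' and '\' in results.
    def substitute(value: String, lookup: String => Option[String]): String =
      ref.replaceAllIn(value, m =>
        Regex.quoteReplacement(lookup(m.group(1)).getOrElse(m.matched)))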
Instead of selectively enabling this for certain confs / config builders, how about a global flag for enabling config expansion? I think that would be less likely to be confusing. Also, perhaps a warning should be logged if there is a config value that looks expandable but the flag is off.
I think that would be more confusing. Why would someone disable expansion? My only concern with enabling it for all options is performance: reading configs becomes more expensive. Maybe that's not a big deal and we can just do that.
Wouldn't enabling it by default break backwards compatibility? I agree that would be better, but it seems likely that '${...}' may be used in existing configs. I think performance is probably fine; configs shouldn't be resolved that often, and there aren't that many of them.
Yes, maybe. But having a flag to disable everything would also potentially break features that rely on it... although you could argue it's the user's fault at that point. I just don't like adding new options that only exist because we can't make a decision about what their behavior should be. If variable substitution applies to all configs, then it does apply to all configs and you can't disable it. Since no configs that I know of have this functionality (the SQL variable substitution would be broken), from a user's perspective it's very unlikely that doing it for all would break things.
To reduce the risk, how about changing the semantics to
That's mostly how it works; I would like to avoid an explicit
IMO
Also enforce a "spark." prefix when doing substitution.
Test build #62339 has finished for PR 14022 at commit
Well, it is and it isn't. You can't set non-
| } | ||
| case "system" => sys.props.get(name) | ||
| case "env" => Option(getenv(name)) | ||
| case _ => throw new IllegalArgumentException(s"Invalid prefix: $prefix") |
Should this throw?
Yeah, I have to take a look at this. Throwing might actually break some existing code in SparkHadoopUtil.
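A sketch of the more lenient alternative being weighed here (names are hypothetical): return None for unknown prefixes so the caller can leave the reference text untouched instead of failing the whole read.

    def lookup(prefix: String, name: String): Option[String] = prefix match {
      case "system" => sys.props.get(name)
      case "env"    => sys.env.get(name)
      case _        => None  // unknown prefix: caller keeps "${prefix:name}" as-is
    }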
After thinking about it a bit more, it can lead to confusing behavior. Since substitution only applies to configs declared internally by Spark, random user configs are already unaffected, so there's no need for another check to disable that.
It looks good to me, except I really think you should keep the "spark." requirement. It seems plausible that some string like '${x}' can show up in spark confs, perhaps in confs like spark.r.command.
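In that spirit, a guard along these lines (a hypothetical helper) would keep literal strings like '${x}' in values such as spark.r.command intact while still expanding real references:

    // Only treat "${...}" contents as a reference when it names a Spark conf
    // or uses one of the supported prefixes; anything else stays literal.
    def isReference(name: String): Boolean =
      name.startsWith("spark.") || name.startsWith("env:") || name.startsWith("system:")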
This reverts commit 3afe221.
I reverted my last change since I forgot to change the test code anyway. I'll spend some time to see if it's easy to handle SQLConf's WAREHOUSE_PATH with this code, to remove some manual variable expansion happening there, but if it's too much work I'll leave that for a later change.
This allows some custom expansion code to be removed from SQLConf.
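For example, with shared expansion in place a default value can itself carry a reference, so the warehouse path no longer needs hand-rolled substitution in SQLConf. A sketch using the internal ConfigBuilder API (the exact call chain and doc text here are from memory and may differ):

    // Sketch: a string conf whose default references a system property and is
    // expanded by the shared resolution code instead of SQLConf-specific logic.
    val WAREHOUSE_PATH = ConfigBuilder("spark.sql.warehouse.dir")
      .doc("The default location for managed databases and tables.")
      .stringConf
      .createWithDefault("${system:user.dir}/spark-warehouse")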
Test build #62485 has finished for PR 14022 at commit
Test build #62488 has finished for PR 14022 at commit
Test build #62489 has finished for PR 14022 at commit
Test build #62494 has finished for PR 14022 at commit
I'll push this later after another test run. retest this please
Test build #62637 has finished for PR 14022 at commit
Merging to master.