Conversation

@vanzin
Contributor

@vanzin vanzin commented Jul 1, 2016

This allows configuration to be more flexible, for example, when the cluster does
not have a homogeneous configuration (e.g. packages are installed on different
paths on different nodes). By allowing one to reference the environment from
the conf, it becomes possible to work around those differences in certain cases.
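
For illustration, a value could then look something like this (the environment
variable name is made up; the exact reference prefixes are discussed later in
the review):

import org.apache.spark.SparkConf

// Hypothetical example: point the metastore jars at a per-node location that
// is resolved from that node's environment when the config is read.
val conf = new SparkConf()
  .set("spark.sql.hive.metastore.jars", "${env:HIVE_METASTORE_JARS}")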

As part of the implementation, ConfigEntry now keeps track of all "known" configs
(i.e. those created through the use of ConfigBuilder), since that list is used
by the resolution code. This duplicates some code in SQLConf, which could
potentially be merged with this now. It will also make it simpler to implement
missing features such as filtering which configs show up in the UI or in event
logs, though those are not part of this change.

Another change is in the way ConfigEntry reads config data; it now takes a string
map and a function that reads env variables, so that it can be called from both
SparkConf and SQLConf. This means both places follow the same read path, instead
of having to replicate certain logic in SQLConf. There are still a couple of
methods in SQLConf that peek into fields of ConfigEntry directly, though.
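
As a rough sketch of the shape described above (a string map plus an env-lookup
function; the names here are illustrative, not necessarily the ones in the patch):

// Illustrative only: an entry that reads from a plain string map plus an
// env-lookup function, so SparkConf and SQLConf can share the same read path.
class StringEntry(val key: String, val default: String) {
  def readFrom(settings: java.util.Map[String, String],
               getenv: String => String): String = {
    // Expansion of ${...} references would be applied to the raw value here,
    // with `getenv` backing the env: prefix.
    Option(settings.get(key)).getOrElse(default)
  }
}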

Tested via unit tests, and by using the new variable expansion functionality
in a shell session with a custom spark.sql.hive.metastore.jars value.

…m props.

The feature is hooked up to spark.sql.hive.metastore.jars to show how to use
it, and because it's an example of said scenario that I ran into. It uses a
new "pathConf" config type that is a shorthand for enabling variable expansion
on string configs.
@vanzin
Contributor Author

vanzin commented Jul 1, 2016

Note this contains the code for #14020 also since I needed that to test the changes.

@SparkQA

SparkQA commented Jul 2, 2016

Test build #61643 has finished for PR 14022 at commit f4e772a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor Author

vanzin commented Jul 5, 2016

Randomly pinging a couple of people... @JoshRosen @srowen

@srowen
Member

srowen commented Jul 5, 2016

My only concern is the added complexity vs. how useful it would be. Do we have weird corner cases here like circular references? Or resolving references among many sources in the right order, leading to odd ambiguity? Maybe that's not possible and I haven't thought it through.

@vanzin
Contributor Author

vanzin commented Jul 5, 2016

Do we have weird corner cases here like circular references?

That's a good question; the current code doesn't handle that (if you reference a config that itself has references, those are not expanded), but it probably should for correctness. But implementing that creates the risk of circular references. Let me take a look; it shouldn't be hard to solve.

re: multiple sources, the prefix is mandatory, so there shouldn't be any ambiguity related to where the values come from. Also, only Spark configs that have been explicitly allowed to do so perform variable expansion.

re: usability, this is pretty much mandatory at least on YARN if the cluster doesn't have software installed in the same location on every node. That's more common than I'd like it to be, unfortunately.

Expand values of known configs when doing variable expansion, but
detect and error out when circular references are found.
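
A minimal sketch of that kind of cycle detection (illustrative only, not the code
in this commit; it follows the same matcher-loop shape as the snippet reviewed
below):

import java.util.regex.Pattern

// Illustrative only: expand ${name} references, erroring out when a reference
// chain loops back on itself. `lookup` resolves a reference name to its value.
val REF = Pattern.compile("\\$\\{([^}]+)\\}")

def expand(value: String,
           lookup: String => Option[String],
           expanding: Set[String] = Set.empty): String = {
  val matcher = REF.matcher(value)
  val result = new StringBuilder()
  var end = 0
  while (end < value.length() && matcher.find(end)) {
    val name = matcher.group(1)
    if (expanding.contains(name)) {
      throw new IllegalArgumentException(s"Circular reference in config value: $name")
    }
    // Keep the text before the reference, then append the referenced value
    // (recursively expanded); unresolvable references are left as-is.
    result.append(value.substring(end, matcher.start()))
    result.append(lookup(name).map(v => expand(v, lookup, expanding + name))
      .getOrElse(matcher.group(0)))
    end = matcher.end()
  }
  result.append(value.substring(end))
  result.toString()
}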
@SparkQA

SparkQA commented Jul 6, 2016

Test build #61795 has finished for PR 14022 at commit 240ab73.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Jul 6, 2016

Spark SQL allows env:xxx and system:xxx. We should follow the same here.

See VariableSubstitution

@vanzin
Contributor Author

vanzin commented Jul 7, 2016

Spark SQL allows env:xxx and system:xxx. We should follow the same here.

Sounds good. I looked briefly at the code and they could potentially be merged later, but to avoid issues like "how to support hiveconf:" I'll just modify this code to use "system:" and "env:" for the moment.

"env:" and "system:", to match how SQL does variable expansion. We
can try to merge the two later, but since this code doesn't have
access to things like hadoop or hive configuration, let's leave them
separate for now.
@SparkQA

SparkQA commented Jul 7, 2016

Test build #61924 has finished for PR 14022 at commit 392bddc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor Author

vanzin commented Jul 11, 2016

Any more comments? I'd like to get this in so I can work on cleaning up the code duplication between SQL and core.

@SparkQA

SparkQA commented Jul 13, 2016

Test build #62259 has finished for PR 14022 at commit b928a55.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor Author

vanzin commented Jul 13, 2016

retest this please

@SparkQA

SparkQA commented Jul 13, 2016

Test build #62275 has finished for PR 14022 at commit b928a55.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor Author

vanzin commented Jul 13, 2016

I'll leave this around a little more but if I don't see more feedback I plan to push it before the weekend.

val result = new StringBuilder()
var end = 0

while (end < value.length() && matcher.find(end)) {
Contributor

I think you can use Regex#replaceAllIn here instead of a loop.
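
For reference, a minimal sketch of what that could look like (illustrative; the
pattern and lookup function are placeholders):

import java.util.regex.Matcher
import scala.util.matching.Regex

// Illustrative only: let replaceAllIn drive the matching and call back for
// each ${...} occurrence, instead of advancing a Matcher by hand.
val ref: Regex = """\$\{([^}]+)\}""".r

def substitute(value: String, resolve: String => Option[String]): String =
  ref.replaceAllIn(value, m =>
    // quoteReplacement so "$" and "\" in resolved values are taken literally
    Matcher.quoteReplacement(resolve(m.group(1)).getOrElse(m.matched)))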

Contributor Author

That looks interesting, let me take a look.

@ericl
Contributor

ericl commented Jul 13, 2016

Instead of selectively enabling this for certain confs / config builders, how about a global flag for enabling config expansion? I think that would be less likely to be confusing. Also, perhaps a warning should be logged if there is a config value that looks expandable but the flag is off.

@vanzin
Contributor Author

vanzin commented Jul 13, 2016

how about a global flag for enabling config expansion

I think that would be more confusing. Why would someone disable expansion?

My only concern with enabling it for all options is performance - reading configs becomes more expensive. Maybe that's not a big deal and we can just do that.

@ericl
Contributor

ericl commented Jul 14, 2016

Wouldn't enabling it by default break backwards compatibility? I agree that would be better, but it seems likely that '${...}' may be used in existing configs.

I think performance is probably ok, configs shouldn't be resolved that often and there aren't that many of them.

@vanzin
Contributor Author

vanzin commented Jul 14, 2016

Wouldn't enabling it by default break backwards compatibility?

Yes, maybe. But having a flag to disable everything would also potentially break features that rely on it... although you could argue it's the user's fault at that point.

I just don't like adding new options that only exist because we can't make a decision about what their behavior should be. If variable substitution applies to all configs, then it does apply to all configs and you can't disable it. Since no configs that I know of have this functionality (the SQL variable substitution would be broken), from a user's perspective it's very unlikely doing it for all would break things.

@ericl
Contributor

ericl commented Jul 14, 2016

To reduce the risk, how about changing the semantics to

  • sparkconf: looks for the value in the spark config
  • system: looks for the value in the system properties
  • env: looks for the value in the environment
  • (any other or no prefix): left as-is

@vanzin
Contributor Author

vanzin commented Jul 14, 2016

That's mostly how it works; I would like to avoid an explicit sparkconf: prefix so we don't end up with things like sparkconf:spark.master, but I can easily enforce that only variables starting with spark. are expanded.

@ericl
Contributor

ericl commented Jul 14, 2016

IMO ${sparkconf:spark.master} is clearer to the unfamiliar reader, since then it's not different from env: or system:. There might also be an issue if someone adds a Spark conf that doesn't start with "spark.", which I think SparkConf allows.

Marcelo Vanzin added 2 commits July 14, 2016 11:20
Also enforce a "spark." prefix when doing substitution.
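
A tiny illustration of that check (illustrative names, not the exact code):

// Unprefixed references are resolved against the Spark conf, but only when
// they name a "spark."-prefixed config; anything else is not expanded.
def lookupConfRef(conf: Map[String, String], name: String): Option[String] =
  if (name.startsWith("spark.")) conf.get(name) else None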
@SparkQA

SparkQA commented Jul 14, 2016

Test build #62339 has finished for PR 14022 at commit cca8314.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor Author

vanzin commented Jul 15, 2016

if someone adds a spark conf that doesn't start with "spark.", which I think is allowed by spark conf.

Well, it is and it isn't. You can't set non-spark. configs from spark-submit or system properties. But you can set them directly in the code. In general, Spark completely ignores anything that does not start with spark. so I think that's ok here.

}
case "system" => sys.props.get(name)
case "env" => Option(getenv(name))
case _ => throw new IllegalArgumentException(s"Invalid prefix: $prefix")
Contributor

Should this throw?

Contributor Author

Yeah, I have to take a look at this. Throwing might actually break some existing code in SparkHadoopUtil.
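
For comparison, a non-throwing variant would simply treat an unknown prefix as
unresolvable and keep the original text (sketch only, not the code in the patch):

// Illustrative only: unknown prefixes resolve to None, so the surrounding
// expansion code keeps the original "${prefix:name}" text instead of failing.
def resolveRef(prefix: Option[String], name: String,
               conf: Map[String, String]): Option[String] = prefix match {
  case None => if (name.startsWith("spark.")) conf.get(name) else None
  case Some("system") => sys.props.get(name)
  case Some("env") => Option(System.getenv(name))
  case _ => None
}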

Marcelo Vanzin added 2 commits July 18, 2016 12:44
After thinking about it a bit more, it can lead to confusing behavior.
Since substitution only applies to configs declared internally by Spark,
random user configs are already unaffected, so there is no need for
another check to disable that.
@ericl
Contributor

ericl commented Jul 18, 2016

It looks good to me, except I really think you should keep the "spark." requirement. It seems plausible that some string like '${x}' can show up in spark confs, perhaps in confs like spark.r.command.

@vanzin
Contributor Author

vanzin commented Jul 18, 2016

I really think you should keep the "spark." requirement.

I reverted my last change since I forgot to change the test code anyway.

I'll spend some time to see if it's easy to handle SQLConf's WAREHOUSE_PATH with this code, to remove some manual variable expansion happening there, but if it's too much work I'll leave that for a later change.

This allows some custom expansion code to be removed from SQLConf.
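
For example, the warehouse path default could then be written as a plain
reference (sketch only; the exact builder API in SQLConf may differ):

// Illustrative only: with expansion handled by the config framework, a default
// like this replaces the hand-rolled substitution SQLConf used to do.
val WAREHOUSE_PATH = SQLConfigBuilder("spark.sql.warehouse.dir")
  .doc("The default location for managed databases and tables.")
  .stringConf
  .createWithDefault("${system:user.dir}/spark-warehouse")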
@SparkQA

SparkQA commented Jul 18, 2016

Test build #62485 has finished for PR 14022 at commit 6618bc4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 18, 2016

Test build #62488 has finished for PR 14022 at commit 3afe221.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 18, 2016

Test build #62489 has finished for PR 14022 at commit 33de81f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 19, 2016

Test build #62494 has finished for PR 14022 at commit ed5c18b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor Author

vanzin commented Jul 20, 2016

I'll push this later after another test run. retest this please

@SparkQA

SparkQA commented Jul 21, 2016

Test build #62637 has finished for PR 14022 at commit ed5c18b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor Author

vanzin commented Jul 21, 2016

Merging to master.

@asfgit asfgit closed this in 75a06aa Jul 21, 2016
@vanzin vanzin deleted the SPARK-16272 branch August 2, 2016 18:34