[SPARK-19720][CORE] Redact sensitive information from SparkSubmit console #17047
Conversation
…bmit console output

This change redacts sensitive information (based on the `spark.redaction.regex` property) from the Spark Submit console logs. Such sensitive information is already being redacted from event logs, YARN logs, etc. Testing was done manually to make sure that the console logs were not printing any sensitive information. Here's some output from the console:

```
Spark properties used, including those specified through --conf and those from the properties file /etc/spark2/conf/spark-defaults.conf:
(spark.yarn.appMasterEnv.HADOOP_CREDSTORE_PASSWORD,*********(redacted))
(spark.authenticate,false)
(spark.executorEnv.HADOOP_CREDSTORE_PASSWORD,*********(redacted))
```
Test build #73385 has finished for PR 17047 at commit
```scala
 * @param kvs
 * @return
 */
def redact(kvs: Map[String, String]): Seq[(String, String)] = {
```
(Nit: I'd omit param and return if they're not filled in.)
So this is used in cases where there isn't a conf object available yet, but the argument itself has the redaction config? I was slightly worried about the parallel implementation but that would be a reasonable reason to do it.
Correct, that's exactly the use case - where there isn't a conf object available yet. I will update the Javadoc. Thanks for reviewing!
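For readers following along, the conf-less variant under discussion might look roughly like this. This is a minimal sketch, not the exact Spark source: the key name `spark.redaction.regex` and the `*********(redacted)` replacement text come from this PR, while the default pattern and helper structure are assumptions for illustration.

```scala
import scala.util.matching.Regex

val REDACTION_REPLACEMENT_TEXT = "*********(redacted)"
val SECRET_REDACTION_DEFAULT = "(?i)secret|password"

// No SparkConf exists yet at this point in spark-submit, so the
// redaction regex is looked up in the very map being redacted,
// falling back to a default pattern.
def redact(kvs: Map[String, String]): Seq[(String, String)] = {
  val pattern: Regex =
    kvs.getOrElse("spark.redaction.regex", SECRET_REDACTION_DEFAULT).r
  kvs.toSeq.map { case (k, v) =>
    if (pattern.findFirstIn(k).isDefined) (k, REDACTION_REPLACEMENT_TEXT)
    else (k, v)
  }
}
```

Reading the pattern out of the argument map itself is what removes the need for a `SparkConf` object, at the cost of the parallel implementation the reviewer mentions.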
Test build #73554 has finished for PR 17047 at commit
vanzin left a comment
Looks good. Can you replace "spark submit" with "core" in the title (since that's the right component)?
```scala
private def redact(redactionPattern: Regex, kvs: Seq[(String, String)]): Seq[(String, String)] = {
  kvs.map { kv =>
    redactionPattern.findFirstIn(kv._1)
      .map { ignore => (kv._1, REDACTION_REPLACEMENT_TEXT) }
```
nit: s/ignore/_/
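Applied to the snippet above, the nit turns the unused binding into an underscore. A sketch of the resulting helper, under the assumption that `REDACTION_REPLACEMENT_TEXT` is the `*********(redacted)` constant shown in the PR description and that the `Option` falls back to the original pair:

```scala
import scala.util.matching.Regex

val REDACTION_REPLACEMENT_TEXT = "*********(redacted)"

def redact(redactionPattern: Regex, kvs: Seq[(String, String)]): Seq[(String, String)] = {
  kvs.map { kv =>
    redactionPattern.findFirstIn(kv._1)
      .map { _ => (kv._1, REDACTION_REPLACEMENT_TEXT) } // key matched: redact the value
      .getOrElse(kv)                                    // no match: keep the pair as-is
  }
}
```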
```scala
    .longConf
    .createWithDefault(4 * 1024 * 1024)

private[spark] val SECRET_REDACTION_PROPERTY = "spark.redaction.regex"
```
Actually, instead of this you could use SECRET_REDACTION_PATTERN.key and SECRET_REDACTION_PATTERN.defaultValue in Utils.scala.
Thanks @vanzin. Updated.
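The resulting single source of truth can be illustrated with a minimal stand-in for Spark's internal config entries. The `ConfigEntry` case class below is only illustrative (the real builder API lives in `org.apache.spark.internal.config`), and the default pattern is an assumption:

```scala
// Illustrative stand-in for a Spark ConfigEntry: one definition owns
// both the key and its default, so Utils.scala no longer needs a
// duplicate string constant.
case class ConfigEntry[T](key: String, defaultValue: T)

val SECRET_REDACTION_PATTERN: ConfigEntry[String] = ConfigEntry(
  key = "spark.redaction.regex",
  defaultValue = "(?i)secret|password")

// Utils.scala side: read the pattern via the entry instead of a
// hard-coded "spark.redaction.regex" literal.
def redactionRegex(conf: Map[String, String]): scala.util.matching.Regex =
  conf.getOrElse(SECRET_REDACTION_PATTERN.key, SECRET_REDACTION_PATTERN.defaultValue).r
```

With this shape, renaming the property or changing its default is a one-line change that both the config system and `Utils.scala` pick up.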
```scala
kvs.map { kv =>
  redactionPattern.findFirstIn(kv._1)
    .map { ignore => (kv._1, REDACTION_REPLACEMENT_TEXT) }
    .map {ignore => (kv._1, REDACTION_REPLACEMENT_TEXT) }
```
nit: replace "ignore" with "_" (in case my previous comment wasn't clear). also missing a space after '{'.
Ah, right, I misunderstood that - I took your comment as a bash + scala way of saying "eliminate the space" (partly because it's hard to read spacing in GitHub comments). My bad, let me fix that.
Test build #73625 has finished for PR 17047 at commit

Seems unrelated.

Jenkins, test this again, please.

retest this please

Test build #73626 has finished for PR 17047 at commit

Test build #73651 has finished for PR 17047 at commit

Merging to master.

Thanks @vanzin
…sole

This change redacts sensitive information (based on the default password and secret regex) from the Spark Submit console logs. Such sensitive information is already being redacted from event logs, YARN logs, etc.

Testing was done manually to make sure that the console logs were not printing any sensitive information. Here's some output from the console:

```
Spark properties used, including those specified through --conf and those from the properties file /etc/spark2/conf/spark-defaults.conf:
(spark.yarn.appMasterEnv.HADOOP_CREDSTORE_PASSWORD,*********(redacted))
(spark.authenticate,false)
(spark.executorEnv.HADOOP_CREDSTORE_PASSWORD,*********(redacted))
```

```
System properties:
(spark.yarn.appMasterEnv.HADOOP_CREDSTORE_PASSWORD,*********(redacted))
(spark.authenticate,false)
(spark.executorEnv.HADOOP_CREDSTORE_PASSWORD,*********(redacted))
```

There is a risk that if new print statements are added to the console down the road, sensitive information may still get leaked, since there is no test that asserts on the console log output. I considered it out of the scope of this JIRA to write an integration test to make sure new leaks don't happen in the future. Running unit tests to make sure nothing else is broken by this change.

Using reference from Mark Grover <mark@apache.org>

Closes apache#17047 for the 2.1.2 Spark version.
…sole

## What changes were proposed in this pull request?

This change redacts sensitive information (based on the `spark.redaction.regex` property) from the Spark Submit console logs. Such sensitive information is already being redacted from event logs, YARN logs, etc.

## How was this patch tested?

Testing was done manually to make sure that the console logs were not printing any sensitive information. Here's some output from the console:

```
Spark properties used, including those specified through --conf and those from the properties file /etc/spark2/conf/spark-defaults.conf:
(spark.yarn.appMasterEnv.HADOOP_CREDSTORE_PASSWORD,*********(redacted))
(spark.authenticate,false)
(spark.executorEnv.HADOOP_CREDSTORE_PASSWORD,*********(redacted))
```

```
System properties:
(spark.yarn.appMasterEnv.HADOOP_CREDSTORE_PASSWORD,*********(redacted))
(spark.authenticate,false)
(spark.executorEnv.HADOOP_CREDSTORE_PASSWORD,*********(redacted))
```

There is a risk that if new print statements are added to the console down the road, sensitive information may still get leaked, since there is no test that asserts on the console log output. I considered it out of the scope of this JIRA to write an integration test to make sure new leaks don't happen in the future. Running unit tests to make sure nothing else is broken by this change.

Author: Mark Grover <mark@apache.org>

Closes apache#17047 from markgrover/master_redaction.