[SPARK-16757] Set up Spark caller context to HDFS and YARN #14659
Conversation
|
Jenkins test this please |
|
Test build #63910 has finished for PR 14659 at commit
|
|
Hi, @srowen. Thank you so much for the review, and sorry for the test failure and late update. The failures happened because `jobID` was None or there was no `spark.app.name` in the SparkConf. I have updated the PR to set default values for `jobID` and `spark.app.name`; when a real application runs on Spark, it will always have a `jobID` and `spark.app.name`.

What's the use case for this? |
|
Hadoop's |
|
Hi, @steveloughran Thank you very much for the comments. I have created a Hadoop JIRA, HADOOP-13527, and attached the patch; could you please review it? I am unable to assign the JIRA to myself; could you please grant me the "contributor" role in Hadoop? Thanks again. |
|
Having some problems adding you as a contributor; JIRA scale issues, browser problems, etc. I've asked others to try and do it. Start with the coding; I'll sort out the contributor entry. |
|
Hello @Sherry302 . Thank you for the patch. From the HDFS perspective, we recommend against using spaces in the value of the caller context. The HDFS audit log frequently gets parsed by administrators using ad-hoc scripting. Spaces in the fields can make this more challenging for them. For example, if they used an awk script that parsed `$NF` expecting to find the callerContext, then the "on Spark" at the end would cause it to return "Spark" instead of the full caller context. May I suggest prepending "Spark" instead? Perhaps something like this: `callerContext=Spark_JobId_0_StageID_0_stageAttemptId_0_taskID_0_attemptNumber_0` |
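To make the awk point concrete, here is a tiny Scala sketch (both log lines are invented for illustration) of taking the last whitespace-separated field, which is what `$NF` does:

```scala
object LastFieldDemo extends App {
  // Invented audit-log endings: one with a space inside the caller context,
  // one using the suggested "Spark_" prefix instead.
  val withSpace = "cmd=open src=/tmp callerContext=JobId_0_TaskId_0 on Spark"
  val prefixed  = "cmd=open src=/tmp callerContext=Spark_JobId_0_TaskId_0"

  println(withSpace.split("\\s+").last) // "Spark" -- the context is lost
  println(prefixed.split("\\s+").last)  // "callerContext=Spark_JobId_0_TaskId_0"
}
```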
|
Chris: maybe the CallerContext class could check for bad characters, including spaces, newlines, "," and quotation marks... the usual things that break parsers. There's also the application ID (or in YARN, appID + attemptID); does that need to be included, or does the audit log include enough about the application that it's not needed?
Yes, good conversation to have over in HDFS JIRA. |
|
Actually a HADOOP JIRA... I filed HADOOP-13528 for follow-up. |
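In that spirit, a minimal sketch of the kind of validation being floated; the object name and the exact character set are assumptions, not the HADOOP-13528 design:

```scala
object CallerContextSanitizer {
  // Characters that commonly break whitespace- or comma-delimited log parsers.
  private val BadChars = "[\\s,\"']".r

  /** Replace parser-breaking characters with underscores. */
  def sanitize(context: String): String = BadChars.replaceAllIn(context, "_")
  // e.g. sanitize("JobId 0, stage \"x\"") == "JobId_0__stage__x_"
}
```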
…when Yarn client and ApplicationMaster invoke Hadoop caller context API
…when 'Task' invokes Hadoop caller context API
|
Hi, @steveloughran Thanks a lot for the comments. In the audit log, if users set some configuration in spark-defaults.conf, we can see the application ID. In commit 5ab2a41, the application ID and attempt ID (attempt ID only in Yarn cluster mode) are included in the value of the caller context when the Yarn client and ApplicationMaster invoke the Hadoop caller context API. In commit 1512775, the application ID, name, and attempt ID (again, only in Yarn cluster mode) are included in the value of the caller context when `Task` invokes the API. For commit 1512775, the application ID and attempt ID are passed to `Task`; is it good for `Task` to see that application information? What do you think about this, @steveloughran? Thanks. |
|
Thanks a lot for adding me as a "contributor" in Hadoop :) @steveloughran @cnauroth |
|
+1 (non-binding) for the latest revision. I think this is an improvement, speaking from the perspective of the HDFS audit log. Thank you, @Sherry302 . :-) |
|
Test build #64372 has finished for PR 14659 at commit
|
|
The only failure is 'basic functionality', but it passed locally. I'll re-trigger the tests. |
|
Retest this please. |
|
Test build #64380 has finished for PR 14659 at commit
|
|
Hi, @srowen Could you please review this PR? Thanks. |
|
Hi, @srowen Could you please review this PR again? |
|
I'm not sure I know enough about the APIs and the need for this to say I could review and merge it. |
```scala
  val callerContext = Utils.classForName("org.apache.hadoop.ipc.CallerContext")
  callerContext.getMethod("setCurrent", callerContext).invoke(null, ret)
} catch {
  case NonFatal(e) => logDebug(s"${e.getMessage}")
```
Better to use `logDebug(s"$e", e)`. Some exceptions (e.g. `NullPointerException`) return null from `getMessage()`; you may also want the stack trace.
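Assembled from the snippet above into a self-contained sketch: this uses plain `Class.forName` so it runs outside Spark, folds in the logging suggestion, and assumes Hadoop 2.8+'s `CallerContext`/`CallerContext$Builder` are on the classpath.

```scala
import scala.util.control.NonFatal

object CallerContextSketch {
  /** Set Hadoop's caller context via reflection, so the calling code still
    * loads against Hadoop versions that lack o.a.h.ipc.CallerContext. */
  def setCurrent(context: String): Boolean =
    try {
      val callerContext = Class.forName("org.apache.hadoop.ipc.CallerContext")
      val builder = Class.forName("org.apache.hadoop.ipc.CallerContext$Builder")
      // Equivalent to: ret = new CallerContext.Builder(context).build()
      val builderInst = builder.getConstructor(classOf[String])
        .newInstance(context).asInstanceOf[AnyRef]
      val ret = builder.getMethod("build").invoke(builderInst)
      // Equivalent to: CallerContext.setCurrent(ret)
      callerContext.getMethod("setCurrent", callerContext).invoke(null, ret)
      true
    } catch {
      case NonFatal(e) =>
        Console.err.println(s"Failed to set caller context: $e") // logDebug(s"$e", e) in Spark
        false
    }
}
```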
|
This context is just something passed over IPC to provide a general string for the audit logs; the main actual access of it is in the HDFS audit log. Otherwise it gets passed along RPC calls (along with the htrace context) and set by the various entry points (CLI, MR, etc.). It lets you find out which piece of code is bringing your NN to its knees; see HDFS-9184. I'd nominate @cnauroth as someone capable of reviewing the link-up with HDFS; he's happy now that there aren't spaces in the context string. Spark-wise: I've added a comment on how a test for this could be written. The HBase keytab reflection code showed me how brittle some of that stuff can be to changes in the reflected-on classes. |
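For reference, the caller context surfaces as the trailing field of an `hdfs-audit.log` record. The line below is an illustration with made-up values, not output from this PR:

```
2016-09-20 11:54:24,116 INFO FSNamesystem.audit: allowed=true ugi=user1 (auth:SIMPLE) ip=/127.0.0.1 cmd=getfileinfo src=/tmp/data dst=null perm=null proto=rpc callerContext=Spark_JobId_0_StageId_0
```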
|
Hi, @tgravescs Thank you so much for the review. I have updated the PR based on your every comment. The only question left is this one: to make the caller context more readable, at commit 10dbc6f I added the static strings. Yes, this PR will set up the caller context for both HDFS and YARN. At the very beginning, to make the review easier, I created two different JIRAs to set up caller contexts for HDFS (SPARK-16757) and YARN (SPARK-16758), although the code is the same. I have updated the JIRAs, the title of this PR, and the description of this PR. In the "How was this patch tested" section of the PR's description, you can see what shows up in the HDFS hdfs-audit.log and the Yarn RM audit log. When invoking the Hadoop CallerContext API in the Yarn Client, the caller context shows up in the Yarn RM log. Also, I have tested this with multiple tasks running in the same executor; my command line to run the tests is as below. In the Spark History application page, you can see there are two executors (one is the driver), and in the executor there are 46 tasks: |
|
Test build #65677 has finished for PR 14659 at commit
|
```scala
val JobID = if (jobID.isDefined) s"_JobID_${jobID.get}" else ""
val StageID = if (stageID.isDefined) s"_StageID_${stageID.get}" else ""
val StageAttemptId = if (stageAttemptId.isDefined) s"_${stageAttemptId.get}" else ""
val AppId = if (appId.isDefined) s"_AppId_${appId.get}" else ""
```
As mentioned, please remove `AppId_`... the application ID is pretty obvious in the logs since it starts with `application_`, so there's no need to print extra characters.
|
|
```scala
new CallerContext(Option(System.getProperty("spark.app.name")),
  Option(appAttemptId.getApplicationId.toString), attemptID).set()
new CallerContext("APPLICATION_MASTER",
```
Truncate to APPMASTER.
```scala
val StageID = if (stageID.isDefined) s"_StageID_${stageID.get}" else ""
val StageAttemptId = if (stageAttemptId.isDefined) s"_${stageAttemptId.get}" else ""
val AppId = if (appId.isDefined) s"_AppId_${appId.get}" else ""
val AppAttemptId = if (appAttemptId.isDefined) s"_AttemptId_${appAttemptId.get}" else ""
```
I don't agree with adding the `AttemptId_` string back. I understand it helps readability, but it's a lot of characters compared to the actual attempt ID. I want this thing to be as small as possible so as not to add extra overhead to the RPC calls. The strings from your example run are already 112 characters long (`SPARK_TASK_AppId_application_1474394339641_0006_AttemptId_1_JobId_0_StageId_0_AttemptId_0_TaskId_14_AttemptNum_0`); if you start getting into many tasks and stages, that could exceed the default 128 characters printed to the audit log and get truncated.
How about `SPARK_TASK_application_1474394339641_0006_1_JId_0_SId_0_0_TId_14_0`? That is only 66 characters. Yes, the user may have to look up the format, but I think that is OK. If parsing this really turns out to be that big of an issue we can change it later, but I would rather have it smaller, get better performance, and keep all the information (rather than having it possibly truncated).
Really, right now I think all you need is the task ID and attempt, because the numbers just increase across jobs and stages; but having the job ID and stage ID helps find the task ID quickly, and handles the case where that ever changes and we have task IDs unique per job or stage.
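A hedged sketch of assembling the compact form proposed here; the parameter names follow the diffs above, but the helper itself is hypothetical:

```scala
// SPARK_<from>_<appId>_<appAttemptId>_JId_<jobId>_SId_<stageId>_<stageAttemptId>_TId_<taskId>_<attemptNumber>
def compactContext(
    from: String, // e.g. "TASK", "CLIENT", "APPMASTER"
    appId: Option[String] = None,
    appAttemptId: Option[String] = None,
    jobId: Option[Int] = None,
    stageId: Option[Int] = None,
    stageAttemptId: Option[Int] = None,
    taskId: Option[Long] = None,
    taskAttemptNumber: Option[Int] = None): String = {
  val appIdStr = appId.map("_" + _).getOrElse("")
  val appAttemptIdStr = appAttemptId.map("_" + _).getOrElse("")
  val jobIdStr = jobId.map("_JId_" + _).getOrElse("")
  val stageIdStr = stageId.map("_SId_" + _).getOrElse("")
  val stageAttemptIdStr = stageAttemptId.map("_" + _).getOrElse("")
  val taskIdStr = taskId.map("_TId_" + _).getOrElse("")
  val taskAttemptNumberStr = taskAttemptNumber.map("_" + _).getOrElse("")
  s"SPARK_$from$appIdStr$appAttemptIdStr$jobIdStr$stageIdStr" +
    s"$stageAttemptIdStr$taskIdStr$taskAttemptNumberStr"
}

// compactContext("TASK", Some("application_1474394339641_0006"), Some("1"),
//   Some(0), Some(0), Some(0), Some(14L), Some(0))
// == "SPARK_TASK_application_1474394339641_0006_1_JId_0_SId_0_0_TId_14_0"
```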
|
Hi, @tgravescs Thank you very much. Yes. I have updated the PR to make the string of the caller context shorter. |
|
Test build #65799 has finished for PR 14659 at commit
|
```scala
* @param stageId id of the stage this task belongs to
* @param stageAttemptId attempt id of the stage this task belongs to
* @param taskId task id
* @param taskAttemptNumber task attempt number
```
Sorry I missed this before: this is a new feature and will only go into 2.1. Let's just remove the `@since` since this isn't a public API.
Done.
```scala
stageAttemptId: Option[Int] = None,
taskId: Option[Long] = None,
taskAttemptNumber: Option[Int] = None) extends Logging {
```
|
Make all these local vals start with lower case; add `Str` to them if you need to differentiate, e.g. `appIdStr`.
Done.
|
A couple of minor things; otherwise looks good. |
|
Hi, @tgravescs Thanks a lot for the comments. I have updated the PR to rename the local vals and remove the `@since` tag. |
|
Test build #65934 has finished for PR 14659 at commit
|
|
+1 |
|
thanks @Sherry302 |
|
Thanks a lot for the review. @tgravescs @cnauroth @steveloughran @srowen |
|
Hi, @tgravescs Should we also commit this PR to Branch-2? Thanks. |
|
@Sherry302 @tgravescs Oops, it looks like this causes master Maven builds to fail: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/ But not the SBT ones, which is weird. It also fails REPL tests with an odd error. I don't know what's actually going on there. |
|
This is a new feature, so generally we don't put these into point releases, which are for bug fixes. @srowen thanks for pointing it out; I will take a look. |
|
@tgravescs @srowen Sorry for the failure. I am looking into it. |
|
Hi, @tgravescs @srowen I am doing some tests and trying to figure out the root cause. |
|
Yeah, it's definitely caused by the call in `Task` to `Utils.CallerContext.setCurrent()`. It seems to come from it using the `ExecutorClassLoader` to check the class for the executor, I guess because `ExecutorClassLoader` overrides `findClass`, which is called from `loadClass`, which is called from `forName`. There must be something slightly different in the `MessageMatcher`, which I guess is generated on the fly by netty; loading the remote one must mismatch the local one. @srowen do you think we should revert this while we figure it out? |
|
@tgravescs @srowen Thanks. Using |
|
It's there to make sure everyone is using the same classloader and to handle the case where they are chained. I'm not really familiar with all the scenarios of the REPL; I see the suite itself is getting the classloader and loading some things based on a URI. I think for now you can put up a patch changing it to use `Class.forName`, and we can file a follow-up JIRA to investigate more. |
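The suggested fix amounts to a one-line swap in the reflection path (sketched here; surrounding code omitted):

```scala
// Before: Utils.classForName may resolve through the REPL's ExecutorClassLoader,
// which is where the mismatch showed up.
// val callerContext = Utils.classForName("org.apache.hadoop.ipc.CallerContext")

// After: load directly with the current class's loader.
val callerContext = Class.forName("org.apache.hadoop.ipc.CallerContext")
```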
|
Thanks @tgravescs, yes. I have created PR 15286. |
|
Hi, @tgravescs SPARK-17714 has been created for further investigation. |

What changes were proposed in this pull request?
- Pass `jobId` to `Task`.
- A new function `setCallerContext` is added in `Utils`. The `setCallerContext` function invokes APIs of `org.apache.hadoop.ipc.CallerContext` to set up Spark caller contexts, which will be written into `hdfs-audit.log` and the Yarn RM audit log.
- For HDFS: invoke `org.apache.hadoop.ipc.CallerContext` in `Task`, the Yarn `Client`, and `ApplicationMaster`.
- For Yarn: invoke `org.apache.hadoop.ipc.CallerContext` in the Yarn `Client`.
How was this patch tested?
Manual tests against some Spark applications in Yarn client mode and Yarn cluster mode, checking whether Spark caller contexts are written into the HDFS `hdfs-audit.log` and the Yarn RM audit log successfully.
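Note that HDFS and YARN only record the context when the feature is enabled on the Hadoop side; to my understanding these `core-site.xml` properties (Hadoop 2.8+) control it:

```xml
<!-- Enable propagation of the caller context over IPC (off by default). -->
<property>
  <name>hadoop.caller.context.enabled</name>
  <value>true</value>
</property>
<!-- Maximum length of the recorded context; 128 is the truncation limit
     mentioned in the review above. -->
<property>
  <name>hadoop.caller.context.max.size</name>
  <value>128</value>
</property>
```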
For example, run SparkKMeans in Yarn client mode:

Before: there will be no Spark caller context in records of `hdfs-audit.log` and the Yarn RM audit log.

After: Spark caller contexts will be written in records of `hdfs-audit.log` and the Yarn RM audit log.

These are records in `hdfs-audit.log`:

This is a record in the Yarn RM log: