[SPARK-26928][CORE] Add driver CPU Time to the metrics system #23838
Conversation
@@ -568,6 +569,10 @@ class SparkContext(config: SparkConf) extends Logging {
    _taskScheduler.postStartHook()
    _env.metricsSystem.registerSource(_dagScheduler.metricsSource)
    _env.metricsSystem.registerSource(new BlockManagerSource(_env.blockManager))
    // Register JVMCPUSource to the metrics system if spark.metrics.cpu.time.driver.enabled==true
The comment just repeats the code.
Why would you not enable this source? It seems to be always on for executors, what would be the drawback of doing the same for the driver?
Good point. Actually, I would not mind having this metrics gauge always on for the driver too; I don't think the overhead is high. However, I was being cautious here, following previous discussions on the implementation of this gauge for the executor instance [SPARK-25228]. I would like to ping @srowen for comment, as he helped back then with [SPARK-25228].
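For reference, here is a minimal sketch of the conditional registration under discussion, as a fragment of SparkContext initialization; it assumes the METRICS_CPU_TIME_DRIVER_ENABLED flag and JVMCPUSource class named in this PR's diff, and the final code may differ if the flag is dropped as suggested below.

```scala
// Sketch only: SparkContext initialization fragment as proposed in this PR.
_taskScheduler.postStartHook()
_env.metricsSystem.registerSource(_dagScheduler.metricsSource)
_env.metricsSystem.registerSource(new BlockManagerSource(_env.blockManager))
// Register the driver's JVM CPU time gauge only when the flag is enabled.
if (_conf.get(METRICS_CPU_TIME_DRIVER_ENABLED)) {
  _env.metricsSystem.registerSource(new JVMCPUSource())
}
```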
@@ -499,6 +499,13 @@ package object config {
      .stringConf
      .createOptional

  private[spark] val METRICS_CPU_TIME_DRIVER_ENABLED =
Does this need a flag -- is there a drawback to enabling it?
On second thought, adding a flag/configuration parameter is probably not necessary and can be removed, as the JVMCPU source should be quite safe. We have been using it for the executor instance for a while. Also, the implementation takes care of various error conditions that can occur, for example with JVMs that don't implement "ProcessCpuTime".
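For illustration, a sketch of how the proposed config entry might read; the key and the default of false come from the PR description, the doc string is an assumption, and (per the comment above) the flag may be dropped entirely.

```scala
// Sketch: proposed entry in the org.apache.spark.internal.config package object.
// Key and default are from the PR description; the doc text is assumed.
private[spark] val METRICS_CPU_TIME_DRIVER_ENABLED =
  ConfigBuilder("spark.metrics.cpu.time.driver.enabled")
    .doc("Whether to register the driver's JVM CPU time (jvmCpuTime) gauge " +
      "with the metrics system.")
    .booleanConf
    .createWithDefault(false)
```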
Force-pushed from c665fb5 to 5068836 (compare)
Force-pushed from 5146f55 to b50bcb6 (compare)
private[spark] class JVMCPUSource extends Source {

  override implicit val metricRegistry = new MetricRegistry()
Doesn't need to be implicit?
Indeed. I'll correct this, thanks.
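A rough sketch of what JVMCPUSource could look like with the implicit modifier dropped, based on the jvmCpuTime gauge this PR describes; the package, source name, and MXBean handling here are assumptions rather than the merged code.

```scala
package org.apache.spark.metrics.source  // assumed location for this sketch

import java.lang.management.ManagementFactory

import com.codahale.metrics.{Gauge, MetricRegistry}

private[spark] class JVMCPUSource extends Source {
  override val metricRegistry = new MetricRegistry()
  override val sourceName = "JVMCPU"

  // Report the JVM process CPU time in nanoseconds, or -1 if the JVM does not
  // expose ProcessCpuTime (e.g. no com.sun.management OperatingSystemMXBean).
  metricRegistry.register(MetricRegistry.name("jvmCpuTime"), new Gauge[Long] {
    override def getValue: Long =
      ManagementFactory.getOperatingSystemMXBean match {
        case os: com.sun.management.OperatingSystemMXBean => os.getProcessCpuTime
        case _ => -1L
      }
  })
}
```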
ok to test

Test build #103004 has finished for PR 23838 at commit

Test build #103009 has finished for PR 23838 at commit

Merging to master.
What changes were proposed in this pull request?
This proposes to add instrumentation for the driver's JVM CPU time via the Spark Dropwizard/Codahale metrics system. It follows directly from previous work SPARK-25228 and shares similar motivations: it is intended as an improvement to be used for Spark performance dashboards and monitoring tools/instrumentation.
Implementation details: this PR takes the code introduced in SPARK-25228 and moves it into a new, separate Source, JVMCPUSource, which is then used to register the jvmCpuTime gauge metric for both the executors and the driver.
The registration of the jvmCpuTime metric for the driver is conditional: a new configuration parameter, spark.metrics.cpu.time.driver.enabled (proposed default: false), is introduced for this purpose.

How was this patch tested?
Manually tested, using local mode and YARN.
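As a usage illustration, a hedged sketch of how the proposed flag could be enabled together with a console sink to spot-check the driver's jvmCpuTime gauge in local mode. The spark.metrics.conf.* sink keys follow Spark's standard metrics configuration convention; the spark.metrics.cpu.time.driver.enabled key is the one proposed here and may not exist if the flag is removed before merging.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: enable the proposed driver CPU time gauge and report driver metrics
// to the console every 10 seconds via Spark's built-in ConsoleSink.
val conf = new SparkConf()
  .setAppName("driver-cpu-time-demo")
  .setMaster("local[*]")
  .set("spark.metrics.cpu.time.driver.enabled", "true")  // proposed flag
  .set("spark.metrics.conf.driver.sink.console.class",
    "org.apache.spark.metrics.sink.ConsoleSink")
  .set("spark.metrics.conf.driver.sink.console.period", "10")

val sc = new SparkContext(conf)
sc.parallelize(1 to 1000000).map(_ * 2).count()  // some work to burn CPU time
// The console sink periodically prints driver gauges, including jvmCpuTime.
sc.stop()
```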