[SPARK-24594][YARN] Introducing metrics for YARN #21635

attilapiros · 2018-06-25T14:21:07Z

What changes were proposed in this pull request?

In this PR metrics are introduced for YARN. As up to now there was no metrics in the YARN module a new metric system is created with the name "applicationMaster".
To support both client and cluster mode the metric system lifecycle is bound to the AM.

How was this patch tested?

Both client and cluster mode was tested manually.
Before the test on one of the YARN node spark-core was removed to cause the allocation failure.
Spark was started as (in case of client mode):

spark2-submit \
  --class org.apache.spark.examples.SparkPi \
  --conf "spark.yarn.blacklist.executor.launch.blacklisting.enabled=true" --conf "spark.blacklist.application.maxFailedExecutorsPerNode=2" --conf "spark.dynamicAllocation.enabled=true" --conf "spark.metrics.conf.*.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink" \
  --master yarn \
  --deploy-mode client \
  original-spark-examples_2.11-2.4.0-SNAPSHOT.jar \
  1000

In both cases the YARN logs contained the new metrics as:

$ yarn logs --applicationId application_1529926424933_0015 
...
-- Gauges ----------------------------------------------------------------------
application_1531751594108_0046.applicationMaster.numContainersPendingAllocate
             value = 0
application_1531751594108_0046.applicationMaster.numExecutorsFailed
             value = 3
application_1531751594108_0046.applicationMaster.numExecutorsRunning
             value = 9
application_1531751594108_0046.applicationMaster.numLocalityAwareTasks
             value = 0
application_1531751594108_0046.applicationMaster.numReleasedContainers
             value = 0
...

SparkQA · 2018-06-25T14:45:40Z

Test build #92301 has finished for PR 21635 at commit 9b033cc.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin · 2018-06-25T23:47:38Z

resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala


  private var master: ApplicationMaster = _

+


nit: remove

vanzin · 2018-06-25T23:48:33Z

resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala


  private val securityMgr = new SecurityManager(sparkConf)

+  private[spark] val failureTracker = new FailureTracker(sparkConf, new SystemClock)


No need to specify the clock here?

As other metrics from YarnAllocator will be added this change is reverted.

vanzin · 2018-06-25T23:52:34Z

...gers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClusterSchedulerSource.scala

+  override val sourceName: String = "yarn_cluster"
+  override val metricRegistry: MetricRegistry = new MetricRegistry()
+
+  metricRegistry.register(


The mechanics of adding the metric source are ok, but have you thought of other metrics to expose? YarnAllocator has a lot of things that could be easily hooked up here.

Agreed, creating metric source with only one metrics seems overkill. Maybe we can mix this into FailureTracker.

I have added some new metrics. So the current metrics are:

yarn.numExecutorsFailed

yarn.numExecutorsRunning

yarn.numLocalityAwareTasks

yarn.numPendingLossReasonRequests

yarn.numReleasedContainers

SparkQA · 2018-06-26T13:10:37Z

Test build #92341 has finished for PR 21635 at commit 4968ad6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

tgravescs · 2018-06-27T13:28:20Z

resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala

          ApplicationMaster.EXIT_UNCAUGHT_EXCEPTION,
          "Uncaught exception: " + StringUtils.stringifyException(e))
+    } finally {
+      metricsSystem.report()


add yarn to the monitoring.md doc as component

Thanks Tom, documentation is added.

SparkQA · 2018-06-27T15:44:20Z

Test build #92385 has finished for PR 21635 at commit 9735525.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin · 2018-06-27T20:15:08Z

resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala

          ApplicationMaster.EXIT_UNCAUGHT_EXCEPTION,
          "Uncaught exception: " + StringUtils.stringifyException(e))
+    } finally {
+      metricsSystem.report()


metricsSystem can be null at this point, can't it? in case of some issue during startup.

Yes, better and more elegant to store metricSystem in an Option.

vanzin · 2018-06-27T20:18:08Z

...e-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnClusterSchedulerSource.scala

+    override def getValue: Int = yarnAllocator.numLocalityAwareTasks
+  })
+
+}


The size of getPendingAllocate might be an interesting metric, but need to check whether it requires synchronization... and it may be an expensive operation, not sure if the AM client has a better API to get the number of pending requests.

Yes, I have seen the call goes to YARN and I also was afraid abut its execution time so this is why I finally decided to leave it out.

But I will check whether there is something better to get it.

The getPendingAllocate seams to me quite cheap as it just uses local maps (and tables) to calculate a list of ContainerRequests.

SparkQA · 2018-06-27T21:20:53Z

Test build #92396 has finished for PR 21635 at commit a8f3146.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-06-27T21:40:09Z

Test build #92397 has finished for PR 21635 at commit b0ee4ec.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

tgravescs · 2018-07-02T14:54:11Z

...e-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnClusterSchedulerSource.scala

+  })
+
+  metricRegistry.register(MetricRegistry.name("numPendingLossReasonRequests"), new Gauge[Int] {
+    override def getValue: Int = yarnAllocator.getNumPendingLossReasonRequests


I'm not sure how useful this metric is, did you have specific use case?

Sorry I have no use case for it. I added it as in previous comments it was requested to have more metrics and this one was something easy to collect If it is totally useless then better to remove it.

yeah I would leave it out if no one specifically requested it and we can't think of use case. Its easier to add later then to remove.

SparkQA · 2018-07-02T16:52:33Z

Test build #92534 has finished for PR 21635 at commit 68ba47e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

tgravescs · 2018-07-09T16:26:25Z

...e-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnClusterSchedulerSource.scala

+    override def getValue: Int = yarnAllocator.numLocalityAwareTasks
+  })
+
+  metricRegistry.register(MetricRegistry.name("numPendingAllocate"), new Gauge[Int] {


should we give these a more clear name. Like numContainersPendingAllocate?

Sure we can do that.

tgravescs · 2018-07-09T16:32:34Z

one minor naming question otherwise lgtm.

tgravescs · 2018-07-10T13:23:17Z

resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala

          "Uncaught exception: " + StringUtils.stringifyException(e))
+    } finally {
+      metricsSystem.foreach { ms =>
+        ms.report()


is there any issues with this if AM gets killed badly or OOMs? Basically where we would want to wrap this in try catch and ignore any exceptions from it.

In case of OOM or any interrupt exception I expect Line 309 to catch the exception and log the error. But the exit code can be lost if metricSystems throws new exception so to be on the safe side I add the try catch to avoid this.

SparkQA · 2018-07-10T13:27:00Z

Test build #92819 has finished for PR 21635 at commit f3781bd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-07-10T18:39:20Z

Test build #92825 has finished for PR 21635 at commit f3781bd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-07-12T10:50:01Z

Test build #92927 has finished for PR 21635 at commit 6751ec5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jerryshao · 2018-07-12T12:29:15Z

docs/monitoring.md

 * `executor`: A Spark executor.
 * `driver`: The Spark driver process (the process in which your SparkContext is created).
 * `shuffleService`: The Spark shuffle service.
+* `yarn`: Spark resource allocations on YARN.


Is it better to change to application master for better understanding?

Sure we can do that. After this many change I would like to test it on a cluster again.
Soon I will come back with the result.

Successfully retested on cluster.

SparkQA · 2018-07-12T13:22:15Z

Test build #92934 has finished for PR 21635 at commit c0c4748.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

tgravescs · 2018-07-16T15:55:45Z

+1 . @jerryshao

jerryshao · 2018-07-17T03:38:36Z

resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala

+        }
+      } catch {
+        case e: Exception =>
+          logInfo("Exception during stopping of the metric system: ", e)


I would suggest to change to warning log if exception occurred.

jerryshao · 2018-07-17T03:43:12Z

...urce-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMasterSource.scala

+
+private[spark] class ApplicationMasterSource(yarnAllocator: YarnAllocator) extends Source {
+
+  override val sourceName: String = "applicationMaster"


In case this is the metrics output:

-- Gauges ---------------------------------------------------------------------- applicationMaster.numContainersPendingAllocate value = 0 applicationMaster.numExecutorsFailed value = 3 applicationMaster.numExecutorsRunning value = 9 applicationMaster.numLocalityAwareTasks value = 0 applicationMaster.numReleasedContainers value = 0 ...

I would suggest to add application id as a prefix to differentiate between different apps.

Ah good catch, I was thinking it automatically added the namespace but it looks like that is only on executor and driver instances. Perhaps we should just add it as system that will append in the spark.metrics.namespace setting. for yarn I see the applicationmaster metrics the same as the dag scheduler source, executor allocation manager, etc.. Allowing user to control this makes sense to me. thoughts?

@tgravescs Would you please explain more, are you going to add a new configuration "spark.metrics.namespace", also how do you use this configuration?

the config spark.metrics.namespace already exists. see the metrics section in http://spark.apache.org/docs/latest/monitoring.html. But if you look at the code https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala#L129 its only applied for executor and driver metrics. I think we should have it apply to the yarn metrics as well.

I see. But I think we may not get "spark.app.id" in AM side, instead I think we can get yarn application id, so either we can set this configuration with application id, or directly prepend to the source name.

I like the idea to make the metric names more app specific. So I will prepend the app ID to the sourcename. And rerun my test.

Ah for the client mode yes there is an order issue with spark.app.id. I'm fine with using the yarn app id since that is essentially what the driver executor use anyway, but I think we should also make it configurable. I like to see these stay consistent. If the user can set the driver/executor metrics with spark.metrics.namespace we should allow them to set the yarn ones so that they all could have similar prefix. Perhaps we add a spark.yarn.metrics.namespace?

application_1530654167152_24008.driver.LiveListenerBus.listenerProcessingTime.org.apache.spark.ExecutorAllocationManager$ExecutorAllocationListener
application_1530654167152_25538.2.executor.recordsRead

jerryshao · 2018-07-18T01:56:50Z

docs/monitoring.md

 * `driver`: The Spark driver process (the process in which your SparkContext is created).
 * `shuffleService`: The Spark shuffle service.
+* `applicationMaster`: The Spark application master on YARN.



I think it would be better to clarify as "The Spark ApplicationMaster when running on YARN."

Thanks, updated accordingly.

SparkQA · 2018-07-19T19:26:18Z

Test build #93293 has finished for PR 21635 at commit 7958525.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-07-19T20:50:09Z

Test build #93297 has finished for PR 21635 at commit 6761098.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

tgravescs · 2018-07-20T21:33:55Z

+1 @jerryshao

jerryshao · 2018-07-23T01:46:28Z

resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/config.scala

    .timeConf(TimeUnit.MILLISECONDS)
    .createWithDefaultString("100s")

+  private[spark] val YARN_METRICS_NAMESPACE = ConfigBuilder("spark.yarn.metrics.namespace")


Can you please add this configuration to the yarn doc?

SparkQA · 2018-07-23T14:38:18Z

Test build #93439 has finished for PR 21635 at commit 0b86788.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jerryshao

LGTM, merging to master branch.

Initial commit (yarn metrics)

9b033cc

vanzin reviewed Jun 25, 2018

View reviewed changes

attilapiros changed the title ~~[SPARK-24594][YARN] Introducing metrics for YARN executor allocation problems~~ [SPARK-24594][YARN] Introducing metrics for YARN Jun 26, 2018

adding more YARN metrics

4968ad6

tgravescs reviewed Jun 27, 2018

View reviewed changes

adding doc

9735525

vanzin reviewed Jun 27, 2018

View reviewed changes

attilapiros added 2 commits June 27, 2018 22:54

store metric system in Option

a8f3146

add numPendingAllocate as a metric

b0ee4ec

tgravescs reviewed Jul 2, 2018

View reviewed changes

Remove numPendingLossReasonRequests metric.

68ba47e

tgravescs reviewed Jul 9, 2018

View reviewed changes

rename numPendingAllocate to numContainersPendingAllocate

f3781bd

tgravescs reviewed Jul 10, 2018

View reviewed changes

Ignore exceptions during the stopping of the metric system.

6751ec5

jerryshao reviewed Jul 12, 2018

View reviewed changes

Rename metric source/system from yarn to applicationMaster.

c0c4748

jerryshao reviewed Jul 17, 2018

View reviewed changes

jerryshao reviewed Jul 18, 2018

View reviewed changes

adding namespace prefix

7958525

Merge branch 'master' into SPARK-24594

6761098

jerryshao reviewed Jul 23, 2018

View reviewed changes

Update running-on-yarn.md

0b86788

jerryshao approved these changes Jul 24, 2018

View reviewed changes

asfgit closed this in d2436a8 Jul 24, 2018


		private val securityMgr = new SecurityManager(sparkConf)

		private[spark] val failureTracker = new FailureTracker(sparkConf, new SystemClock)


		private[spark] class ApplicationMasterSource(yarnAllocator: YarnAllocator) extends Source {

		override val sourceName: String = "applicationMaster"

[SPARK-24594][YARN] Introducing metrics for YARN #21635

[SPARK-24594][YARN] Introducing metrics for YARN #21635

Uh oh!

Conversation

attilapiros commented Jun 25, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

How was this patch tested?

Uh oh!

SparkQA commented Jun 25, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Jun 26, 2018

Uh oh!

tgravescs Jun 27, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Jun 27, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Jun 27, 2018

Uh oh!

SparkQA commented Jun 27, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Jul 2, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tgravescs commented Jul 9, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Jul 10, 2018

Uh oh!

SparkQA commented Jul 10, 2018

Uh oh!

SparkQA commented Jul 12, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Jul 12, 2018

attilapiros commented Jun 25, 2018 •

edited

Loading

tgravescs Jun 27, 2018 •

edited

Loading