Conversation

@steveloughran
Contributor

This adds metrics to the history server, with the FsHistoryProvider metering its load, performance and reliability.

see SPARK-11373

The HistoryServer sets up the Codahale metric servlets on the web UI under metrics/: metrics/metrics serves the metric values, metrics/health serves any health probes, and metrics/threads serves a thread dump. There's currently no attempt to hook up JMX, etc. The web servlets are the ones tests can easily hit without needing extra infrastructure, so they are a good first step.

It then passes the metrics and health registries down to the providers in an ApplicationHistoryBinding case class, via a new method

def start(binding: ApplicationHistoryBinding): Unit

The base class provides a default implementation so that all existing providers still link properly; the base implementation currently checks its state and fails on re-entrant invocation.
Using a binding case class also ensures that if new binding information were added in future, existing implementations would still link.

The FsHistoryProvider implements the start() method, registering two counters and two timers.

  1. Counters: the number of update attempts and the number of failed updates, plus the same pair for app UI loads.
  2. Timers: the time taken by updates and by app UI loads.
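As a rough sketch of what those registrations could look like with the Codahale (Dropwizard) metrics library; the metric names here are hypothetical, and the real patch keeps this information in a Source subclass:

```scala
import com.codahale.metrics.{Counter, MetricRegistry, Timer}

// Hypothetical metric names; the actual names used by the patch may differ.
class FsHistoryMetrics(registry: MetricRegistry) {
  val updateCount: Counter = registry.counter("history.fs.update.count")
  val updateFailureCount: Counter = registry.counter("history.fs.update.failure.count")
  val updateTimer: Timer = registry.timer("history.fs.update.timer")

  /** Time one update attempt, counting attempts and failures. */
  def timedUpdate[T](operation: => T): T = {
    updateCount.inc()
    val context = updateTimer.time()
    try {
      operation
    } catch {
      case e: Exception =>
        updateFailureCount.inc()
        throw e
    } finally {
      context.stop()
    }
  }
}
```

The same pattern (a counter for attempts, a counter for failures, a timer around the operation) applies to the app UI load path.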

Points of note

  • Why not use Spark's MetricsSystem? I did start off with that, but it needs a SparkContext to run off, which the server doesn't have. Ideally that would be the way to go, as it would support all the SparkConf-based metrics setup. Someone who understands the MetricsSystem would need to get involved here, as it would make for a more complex patch. In FsHistoryProvider the registry information is all kept in a Source subclass for ease of future migration to the MetricsSystem.
  • Why the extra HealthRegistry? It's a nice way of allowing providers to indicate (possibly transient) health problems for monitoring tools/clients to hit. For the FS provider it could flag when there hadn't been any successful update for a specified time period (that could also be indicated by having a counter of "seconds since last update" and letting monitoring tools watch the counter value and act on it). Access problems on the directory are something else which may be considered a liveness problem: they won't get better without human intervention.
  • The FsHistoryProvider.start() method should really take the thread-start code from the class constructor's initialize() method. This would ensure that incompletely constructed classes don't get called by spawned threads, and it makes it possible for test-time subclasses to skip thread startup. I've not attempted to do that in this patch.
  • No tests for this yet. Hitting the three metrics servlets in the HistoryServer is the obvious route; the JSON payload of the metrics can be parsed and scanned for relevant counters too.
  • Part of the patch for HistoryServerSuite removes the call to HistoryServer.initialize() in the before clause. That was a duplicate call, one which hit the re-entrancy checks on the provider & registry. As well as cutting it, HistoryServer.initialize() has been made idempotent. That should not be needed, but it will prevent the problem from arising again.

Once the SPARK-1537 YARN timeline server history provider is committed, I'll add metrics support there too. The YARN timeline provider would:

  1. Add timers of REST operations as well as playback load times, which can count network delays as well as JSON deserialization overhead.
  2. Add a health check for connectivity too: the provider would be unhealthy if connections to the timeline server were either blocking or failing. And again, if there were security/auth problems, they'd be considered non-recoverable.
  3. Move thread launch under the start() method, with some test subclasses disabling thread launch.

@SparkQA

SparkQA commented Nov 9, 2015

Test build #45386 has finished for PR 9571 at commit cb7cddb.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@CharlesYeh

Thanks! I'll look into this

@CharlesYeh

I'm assuming you didn't make a HistorySource in HistoryServer (what MasterSource in Master does) because it requires the MetricsSystem.

Where does MetricsSystem require a SparkContext? It looks like it just takes SparkConf which HistoryServer has.

@steveloughran
Contributor Author

...I thought it did. If I'm wrong, then yes, all you need is a context. Once that's in there, the HistorySource should be what's needed

@steveloughran (Contributor, Author) commented on the diff:

given the repetition here, this could take a list of (servlet, path) pairs and apply that to each registration
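A minimal sketch of that refactor; the servlet names and the register call are hypothetical stand-ins for whatever the HistoryServer actually uses to attach a servlet at a path:

```scala
// Hypothetical stand-in for the real servlet-attachment call.
def register(servlet: String, path: String): String = s"$servlet -> $path"

// The three metric endpoints, expressed once as data rather than
// as three near-identical registration statements.
val servlets: Seq[(String, String)] = Seq(
  ("metricsServlet", "/metrics/metrics"),
  ("healthServlet", "/metrics/health"),
  ("threadDumpServlet", "/metrics/threads"))

val registered = servlets.map { case (servlet, path) => register(servlet, path) }
```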

@SparkQA

SparkQA commented Nov 26, 2015

Test build #46779 has finished for PR 9571 at commit f6bf558.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@steveloughran
Contributor Author

Added the reworked design:

  • all metrics go through the MetricsSystem; the providers return an optional Source from the start() call.
  • FsHistoryProvider metrics split lookups of missing files from failed attempts to replay the logs.
  • FsHistoryProvider metrics include time of the mergeListing() operation, which can be quite the CPU killer.
  • There's a HealthSource for health checks; it's being explicitly managed in the HistoryServer.
  • And there's an initial health check for the FS, which simply returns the FS safe-mode flag.
  • The ApplicationHistoryBinding class is now empty; it could be culled. The original goal of a binding class like that was to allow existing providers to handle changes without breaking. However, I think it's probably best to be ruthless and say "your provider needs to be compiled with the specific version of Spark"; that is: no compatibility guarantees. It saves regression testing & forces every provider to be in sync with the latest listener events.

Really, the health check logic needs its own HealthSystem for the register/unregister. Trying to design one that spans all the applications is more complex and I'm trying to avoid that.

Furthermore, those operations which fail asynchronously and just have their exceptions logged should have those exceptions saved to another health check (I may implement that). That way the health check will react to live system failures, including things like the log directory being deleted during a run.
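One way to sketch that idea in plain Scala (the class and method names here are hypothetical, not from the patch): asynchronous failures store their exception in an atomic holder, and the health probe reports unhealthy while one is present:

```scala
import java.util.concurrent.atomic.AtomicReference

// Holds the most recent asynchronous failure; cleared on the next success.
class AsyncFailureHealth {
  private val lastFailure = new AtomicReference[Throwable](null)

  def recordFailure(t: Throwable): Unit = lastFailure.set(t)
  def recordSuccess(): Unit = lastFailure.set(null)

  /** Healthy iff no failure has been seen since the last success. */
  def isHealthy: Boolean = lastFailure.get() == null

  def failureMessage: Option[String] = Option(lastFailure.get()).map(_.toString)
}
```

Because the holder is cleared on success, a transient problem (such as a deleted-then-recreated log directory) shows up as unhealthy only while it persists.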

@steveloughran
Contributor Author

jenkins, test this please

@steveloughran
Contributor Author

(Note that the POMs changed to pull in some more of the Codahale servlets, though only the health checks & thread dump are being registered. Hooking up those servlets through the MetricsSystem may be a bit tricky; as the MetricsSystem has its own metrics servlet, that is probably the one to go for. In which case, wrapping up JVM stats as a Source would give automatic access to those numbers.)

@SparkQA

SparkQA commented Nov 26, 2015

Test build #46782 has finished for PR 9571 at commit 25e77bd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 26, 2015

Test build #46780 has finished for PR 9571 at commit 1dcbb5f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 26, 2015

Test build #46781 has finished for PR 9571 at commit 1dcbb5f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 9, 2015

Test build #47440 has started for PR 9571 at commit 2e04803.

@steveloughran
Contributor Author

The latest branch cuts out health checks. They'd be nice, but as they are useful across all of Spark, it's probably best to wait for something central to go in and pick it up, rather than try to squeeze something in here.

@steveloughran steveloughran changed the title [SPARK-11373] [CORE] WiP Add metrics to the History Server and providers [SPARK-11373] [CORE] Add metrics to the History Server and FsHistoryProvider Jan 1, 2016
@SparkQA

SparkQA commented Jan 1, 2016

Test build #48567 has finished for PR 9571 at commit 06e5289.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 5, 2016

Test build #48768 has finished for PR 9571 at commit d6fa568.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 27, 2016

Test build #50202 has finished for PR 9571 at commit 984100d.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 5, 2016

Test build #50822 has finished for PR 9571 at commit 3ceeb9e.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 8, 2016

Test build #50921 has finished for PR 9571 at commit 889978c.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

steveloughran and others added 19 commits April 10, 2017 17:26
…e and set to time of update/update success. Makes update time visible for tests/users/ops
…s pulled up into HistoryMetricSource, including registration and prefix support. The existing test case "incomplete apps get refreshed" has been extended to check for metrics in the listings, alongside some specific probes for values in the fs history provider (loads, load time average > 0)
… the average load time is "0" and not a division by zero error. This validates the logic in the relevant lambda-expression
-nits
-recommended minor changes
+pull out lambda gauge and HistoryMetricSource into their own file, HistoryMetricSource.scala. Added tests to go with this, and made the toString call robust against failing gauges
+fixed a scaladoc warning in HistoryServer
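The zero-sample case mentioned in the commit above (average load time reporting "0" rather than raising a division-by-zero error) can be guarded inside the gauge's lambda itself. A stand-alone sketch in plain Scala, with hypothetical names:

```scala
// A gauge function computing mean load time that returns 0 rather than
// dividing by zero when nothing has been loaded yet.
def averageGauge(totalTimeMillis: () => Long, count: () => Long): () => Long =
  () => {
    val n = count()
    if (n == 0) 0L else totalTimeMillis() / n
  }
```

Wiring the returned function into a Codahale Gauge is then a one-liner in the provider's Source.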
@SparkQA

SparkQA commented Apr 11, 2017

Test build #75704 has finished for PR 9571 at commit ec1f2d7.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Change-Id: I7c3fc0865abce380fcbe08b8984cfd00b3ce0faa
@SparkQA

SparkQA commented Apr 11, 2017

Test build #75708 has finished for PR 9571 at commit 8903dcf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@steveloughran
Contributor Author

Line lengths fixed, tests all happy.

@vanzin, any chance of adding this to your review list?

@vanzin (Contributor) left a comment:

I was secretly hoping you'd just give up on this patch, since it will generate a lot of conflicts with the code I'm working on in parallel... but if you really want to get this in, I'd appreciate if you took the time to address the feedback I left in previous reviews.


  /** all metrics, including timers */
- private val allMetrics = counters ++ Seq(
+ private val allMetrics: Seq[(String, Metric with Counting)] = counters ++ Seq(
@vanzin (Contributor):

Is declaring the type useful here in some way?

@steveloughran (Contributor, Author):

Either I was just being explicit about what came in, or the IDE decided to get involved. Removed

  /**
-  * Bind to the History Server: threads should be started here; exceptions may be raised
+  * Start the provider: threads should be started here; exceptions may be raised
   * if the history provider cannot be started.
@vanzin (Contributor):

This makes this interface awkward. Why can't the HistoryServer keep track of that?

@steveloughran (Contributor, Author) commented Apr 18, 2017:


All the work on the YARN service model, and that of SmartFrog before it, has given me a fear of startup logic. I'll see about doing it there though. Anyway, cut: if problems arise, people can restate it. This patch does change HistoryServer such that HistoryServerSuite doesn't call initialize() twice, BTW.

One thing to consider here is that the FsHistoryProvider really should be starting its threads in the start() method, so that subclasses, like the test class SafeModeTestProvider, can be sure that they are fully initialized before the threads start. I left that alone.

lastScanTime.set(newLastScanTime)
metrics.updateLastSucceeded.setValue(newLastScanTime)
} catch {
case e: Exception => logError(
@vanzin (Contributor):

logError goes in the next line.

@steveloughran (Contributor, Author):

done

("appui.event.count", appUIEventCount),
("appui.event.replay.time", appUIEventReplayTime),
("update.timer", updateTimer),
("history.merge.timer", historyMergeTimer)))
@vanzin (Contributor):

This timer is still in this list, why? It's not useful. (This is the timer I was talking about in my previous comment.)

@steveloughran (Contributor, Author):

cut

@steveloughran
Contributor Author

I was secretly hoping you'd just give up on this patch, since it will generate a lot of conflicts with the code I'm working on in parallel...

No. Sorry

I do suspect the reviewers of my other outstanding patches are of the same mind ("if we ignore him, he'll go away"), which is why I'm effectively giving up filing new patches and bug reports related to Spark.

However, that doesn't mean that I'm going to stop updating and reminding people of my current set of patches. It's easier to accept them and hope I'll go away afterwards. After all, if you'd merged this in earlier: no merge conflicts. Better all round.

@vanzin
Contributor

vanzin commented Apr 18, 2017

if you'd merged this in earlier: no merge conflicts

Well, if you're going to follow that route, if you had addressed feedback earlier, the patch would have been merged long ago...

@steveloughran
Contributor Author

I'm going to close this PR and start one based on a reapplication of this patch onto master; that gets rid of all the merge pain and is intended to be more minimal. The latest comments on this one will be applied there.

@jiangxb1987
Contributor

Could you close this PR please? @steveloughran
