-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-11373] [CORE] Add metrics to the History Server and FsHistoryProvider #9571
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #45386 has finished for PR 9571 at commit
|
|
Thanks! I'll look into this |
|
I'm assuming you didn't make a HistorySource in HistoryServer (what MasterSource in Master does) because it requires the MetricsSystem. Where does MetricsSystem require a SparkContext? It looks like it just takes SparkConf which HistoryServer has. |
|
...I thought it did. If I'm wrong, then yes, all you need is a context. Once that's in there, the HistorySource should be what's needed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
given the repetition here, this could take a list of (servlet, path) and apply that to the registration
8130ae8 to
f6bf558
Compare
|
Test build #46779 has finished for PR 9571 at commit
|
|
Added reworked design
Really, the health check logic needs its own Furthermore, those operations which fail asynchronous and just have exceptions logged should have their exceptions saved to another health check (I may implement that). That way the health check will react to live system failures, including things like the log directory being deleted during a run. |
|
jenkins, test this please |
|
(Note that the POMs changed to pull in some more of the codahale servlets, though only the health checks & thread dump are being registered. Hooking up those servlets through the MetricsSystem may be a bit tricky; and as it has its own metrics, is probably the one to go for. In which case: wrapping up JVM stats as a |
|
Test build #46782 has finished for PR 9571 at commit
|
|
Test build #46780 has finished for PR 9571 at commit
|
|
Test build #46781 has finished for PR 9571 at commit
|
25e77bd to
2e04803
Compare
|
Test build #47440 has started for PR 9571 at commit |
|
latest branch cut out health checks. They'd be nice, but as they are useful across all of spark, it's probably best to wait for something central to go in and pick it up, rather than try and squeeze something in here. |
2e04803 to
06e5289
Compare
|
Test build #48567 has finished for PR 9571 at commit
|
06e5289 to
d6fa568
Compare
|
Test build #48768 has finished for PR 9571 at commit
|
d6fa568 to
984100d
Compare
|
Test build #50202 has finished for PR 9571 at commit
|
984100d to
3ceeb9e
Compare
|
Test build #50822 has finished for PR 9571 at commit
|
3ceeb9e to
889978c
Compare
|
Test build #50921 has finished for PR 9571 at commit
|
… checks, which aren't part of it
…e and set to time of update/update success. Makes update time visible for tests/users/ops
… played back during merge and U load
…s pulled up into HistoryMetricSource, including registration and prefix support. The existing test case "incomplete apps get refreshed" has been extended to check for metrics in the listings, alongside some specific probes for values in the fs history provider (loads, load time average > 0)
… the average load time is "0" and not a division by zero error. This validates the logic in the relevant lambda-expression
…e. Interesting, and not in a good way
-nits -recommended minor changes +pull out lambda gauge and HistoryMetricSource into their own file, HistoryMetricSource.scala. Added tests to go with this, and made the toString call robust against failing gauges +fixed a scaladoc warning in HistoryServer
d8ae876 to
ec1f2d7
Compare
|
Test build #75704 has finished for PR 9571 at commit
|
Change-Id: I7c3fc0865abce380fcbe08b8984cfd00b3ce0faa
|
Test build #75708 has finished for PR 9571 at commit
|
|
Line lengths fixed, tests all happy. @vanzin —any chance of adding this to your review list? |
vanzin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was secretly hoping you'd just give up on this patch, since it will generate a lot of conflicts with the code I'm working on in parallel... but if you really want to get this in, I'd appreciate if you took the time to address the feedback I left in previous reviews.
|
|
||
| /** all metrics, including timers */ | ||
| private val allMetrics = counters ++ Seq( | ||
| private val allMetrics: Seq[(String, Metric with Counting)] = counters ++ Seq( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is declaring the type useful here in some way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either I was just being explicit about what came in, or the IDE decided to get involved. Removed
| /** | ||
| * Bind to the History Server: threads should be started here; exceptions may be raised | ||
| * Start the provider: threads should be started here; exceptions may be raised | ||
| * if the history provider cannot be started. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes this interface awkward. Why can't the HistoryServer keep track of that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the work on Yarn service model, and that of SmartFrog before it, have given me a fear of startup logic. I'll see about doing it there though. Anyway, cut: if problems arise, people can restate it. This patch does change HistoryServer such that HistoryServerSuite doesn't call initalize() twice BTW
One thing to consider here is that really the FsHistoryProvider should be starting its threads in the start() method, so that subclasses, like the test class SafeModeTestProvider can be sure that they are fully inited before they start. I left that alone.
| lastScanTime.set(newLastScanTime) | ||
| metrics.updateLastSucceeded.setValue(newLastScanTime) | ||
| } catch { | ||
| case e: Exception => logError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logError goes in the next line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| ("appui.event.count", appUIEventCount), | ||
| ("appui.event.replay.time", appUIEventReplayTime), | ||
| ("update.timer", updateTimer), | ||
| ("history.merge.timer", historyMergeTimer))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This times is still in this list, why? It's not useful. (This is the timer I was talking about in my previous comment.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cut
No. Sorry I do suspect the reviewers of my other outstanding patches are of the same mind "if we ignore him, he'll go away". which is why I'm effectively giving up filing new patches and bug reports related to Spark. However, that doesn't mean that I'm going to stop updating and reminding people of my current set of patches. It's easier to accept them and hope I'll go away afterwards. After all, if you'd merged this in earlier: no merge conflicts. Better all round. |
Well, if you're going to follow that route, if you had addressed feedback earlier, the patch would have been merged long ago... |
|
I'm going to close this PR and start one based on a reapplication of this patch onto master; gets rid of all the merge pain and is intended to be more minimal. The latest comments of this one will be applied |
|
Could you close this PR please? @steveloughran |
This adds metrics to the history server, with the
FsHistoryProvidermetering its load, performance and reliability.see SPARK-11373
The
HistoryServersets up the codahale metrics for the Web under metrics/ with metrics/metrics behind metrics, metrics/health any health probes and metrics/threads a thread dump. There's currently no attempt to hook up JMX, etc. The Web servlets are the ones tests can easily hit and don't need infrastructure, so are the good initial first step.It then passes the metrics and health registries down to the providers in a
ApplicationHistoryBindingcase class, via a new methodThe base class has implementation so that all existing providers will still link properly; the base implementation currently checks and fails
the use of a binding case class is also to ensure that if new binding information were added in future, existing implementations would still link.
The
FsHistoryProviderimplements thestart()method, registering two counters and two timers.Points of note
MetricsSystem? I did start off with that, but it needs aSparkContextto run off, which the server doesn't have. Ideally that would be way to go, as it would support all the spark conf -based metrics setup. Someone who understands theMetricsSystemwould need to get involved here as would make for a more complex patch. InFsHistoryProviderthe registry information is all kept in aSourcesubclass for ease of future migration toMetricsSystem.HealthRegistry? It's a nice way of allowing providers to indicate (possibly transient) health problems for monitoring tools/clients to hit. For the FS provider it could maybe flag when there hadn't been any successful update for a specified time period. (that could also be indicated by having a counter of "seconds since last update" and let monitoring tools monitor the counter value and act on it). Access control problems to the directory is something else which may be considered a liveness problem: it won't get better without human interventionFsHistoryProvider.start()method should really take the thread start code from from class constructor'sinitialize()method. This would ensure that incomplete classes don't get called by spawned threads, and makes it possible for test-time subclasses to skip thread startup. I've not attempted to do that in this patch.HistoryServerSuiteremoves the call toHistoryServer.initialize()thebeforeclause. That was a duplicate call, one which hit the re-entrancy tests on the provider & registry. As well as cutting it,HistoryServer.initialize()has been made idempotent. That should not be needed -but it will eliminate the problem arising again.Once the SPARK-1537 YARN timeline server history provider is committed, then I'll add metrics support there too. The YARN timeline provider would:
start()method, with some test subclasses disabling thread launch.