Conversation

@steveloughran
Contributor

This adds metrics to the history server, with the FsHistoryProvider metering its load, performance and reliability.

see SPARK-11373

The HistoryServer sets up the Codahale metric servlets on the web UI under metrics/: metrics/metrics serves the metric values, metrics/health serves any health probes, and metrics/threads serves a thread dump. There's currently no attempt to hook up JMX, etc. The web servlets are the ones tests can easily hit without needing extra infrastructure, so they are a good first step.

It then passes the metrics and health registries down to the providers in an ApplicationHistoryBinding case class, via a new method

def start(binding: ApplicationHistoryBinding): Unit

The base class provides a default implementation so that all existing providers still link properly; the base implementation currently checks its state and fails on re-entrant invocation.
Using a binding case class also ensures that if new binding information were added in future, existing implementations would still link.

The FsHistoryProvider implements the start() method, registering two counters and two timers.

  1. Counters: the number of update attempts and the number of failed updates, plus the same pair for app UI loads.
  2. Timers: the time taken by updates and by app UI loads.
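As a rough sketch of what those registrations could look like with the Codahale (Dropwizard) metrics library; the metric names here are hypothetical, and the real patch keeps this information in a Source subclass:

```scala
import com.codahale.metrics.{Counter, MetricRegistry, Timer}

// Hypothetical metric names; the actual names used by the patch may differ.
class FsHistoryMetrics(registry: MetricRegistry) {
  val updateCount: Counter = registry.counter("history.fs.update.count")
  val updateFailureCount: Counter = registry.counter("history.fs.update.failure.count")
  val updateTimer: Timer = registry.timer("history.fs.update.timer")

  /** Time one update attempt, counting attempts and failures. */
  def timedUpdate[T](operation: => T): T = {
    updateCount.inc()
    val context = updateTimer.time()
    try {
      operation
    } catch {
      case e: Exception =>
        updateFailureCount.inc()
        throw e
    } finally {
      context.stop()
    }
  }
}
```

The same pattern (a counter for attempts, a counter for failures, a timer around the operation) applies to the app UI load path.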

Points of note

  • Why not use Spark's MetricsSystem? I did start off with that, but it needs a SparkContext to run off, which the server doesn't have. Ideally that would be the way to go, as it would support all the SparkConf-based metrics setup. Someone who understands the MetricsSystem would need to get involved here, as it would make for a more complex patch. In FsHistoryProvider the registry information is all kept in a Source subclass for ease of future migration to the MetricsSystem.
  • Why the extra HealthRegistry? It's a nice way of allowing providers to indicate (possibly transient) health problems for monitoring tools/clients to hit. For the FS provider it could flag when there hadn't been any successful update for a specified time period (that could also be indicated by having a counter of "seconds since last update" and letting monitoring tools watch the counter value and act on it). Access problems on the directory are something else which may be considered a liveness problem: they won't get better without human intervention.
  • The FsHistoryProvider.start() method should really take the thread-start code from the class constructor's initialize() method. This would ensure that incompletely constructed classes don't get called by spawned threads, and it makes it possible for test-time subclasses to skip thread startup. I've not attempted to do that in this patch.
  • No tests for this yet. Hitting the three metrics servlets in the HistoryServer is the obvious route; the JSON payload of the metrics can be parsed and scanned for relevant counters too.
  • Part of the patch for HistoryServerSuite removes the call to HistoryServer.initialize() in the before clause. That was a duplicate call, one which hit the re-entrancy checks on the provider & registry. As well as cutting it, HistoryServer.initialize() has been made idempotent. That should not be needed, but it will prevent the problem from arising again.

Once the SPARK-1537 YARN timeline server history provider is committed, I'll add metrics support there too. The YARN timeline provider would:

  1. Add timers of REST operations as well as playback load times, which can count network delays as well as JSON deserialization overhead.
  2. Add a health check for connectivity too: the provider would be unhealthy if connections to the timeline server were either blocking or failing. And again, if there were security/auth problems, they'd be considered non-recoverable.
  3. Move thread launch under the start() method, with some test subclasses disabling thread launch.

@SparkQA

SparkQA commented Nov 9, 2015

Test build #45386 has finished for PR 9571 at commit cb7cddb.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@CharlesYeh

Thanks! I'll look into this

@CharlesYeh

I'm assuming you didn't make a HistorySource in HistoryServer (what MasterSource in Master does) because it requires the MetricsSystem.

Where does MetricsSystem require a SparkContext? It looks like it just takes SparkConf which HistoryServer has.

@steveloughran
Contributor Author

...I thought it did. If I'm wrong, then yes, all you need is a context. Once that's in there, the HistorySource should be what's needed

@steveloughran (Contributor, Author) commented on the diff:

given the repetition here, this could take a list of (servlet, path) pairs and apply that to each registration
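A minimal sketch of that refactor; the servlet names and the register call are hypothetical stand-ins for whatever the HistoryServer actually uses to attach a servlet at a path:

```scala
// Hypothetical stand-in for the real servlet-attachment call.
def register(servlet: String, path: String): String = s"$servlet -> $path"

// The three metric endpoints, expressed once as data rather than
// as three near-identical registration statements.
val servlets: Seq[(String, String)] = Seq(
  ("metricsServlet", "/metrics/metrics"),
  ("healthServlet", "/metrics/health"),
  ("threadDumpServlet", "/metrics/threads"))

val registered = servlets.map { case (servlet, path) => register(servlet, path) }
```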

@SparkQA

SparkQA commented Nov 26, 2015

Test build #46779 has finished for PR 9571 at commit f6bf558.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@steveloughran
Contributor Author

Added the reworked design:

  • all metrics go through the MetricsSystem; the providers return an optional Source from the start() call.
  • FsHistoryProvider metrics split lookups of missing files from failed attempts to replay the logs.
  • FsHistoryProvider metrics include time of the mergeListing() operation, which can be quite the CPU killer.
  • There's a HealthSource for health checks; it's being explicitly managed in the HistoryServer.
  • And there's an initial health check for the FS, which simply returns the FS safe-mode flag.
  • The ApplicationHistoryBinding class is now empty; it could be culled. The original goal of a binding class like that was to allow existing providers to handle changes without breaking. However, I think it's probably best to be ruthless and say "your provider needs to be compiled with the specific version of Spark"; that is: no compatibility guarantees. It saves regression testing & forces every provider to be in sync with the latest listener events.

Really, the health check logic needs its own HealthSystem for the register/unregister. Trying to design one that spans all the applications is more complex and I'm trying to avoid that.

Furthermore, those operations which fail asynchronously and just have their exceptions logged should have those exceptions saved to another health check (I may implement that). That way the health check will react to live system failures, including things like the log directory being deleted during a run.
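One way to sketch that idea in plain Scala (the class and method names here are hypothetical, not from the patch): asynchronous failures store their exception in an atomic holder, and the health probe reports unhealthy while one is present:

```scala
import java.util.concurrent.atomic.AtomicReference

// Holds the most recent asynchronous failure; cleared on the next success.
class AsyncFailureHealth {
  private val lastFailure = new AtomicReference[Throwable](null)

  def recordFailure(t: Throwable): Unit = lastFailure.set(t)
  def recordSuccess(): Unit = lastFailure.set(null)

  /** Healthy iff no failure has been seen since the last success. */
  def isHealthy: Boolean = lastFailure.get() == null

  def failureMessage: Option[String] = Option(lastFailure.get()).map(_.toString)
}
```

Because the holder is cleared on success, a transient problem (such as a deleted-then-recreated log directory) shows up as unhealthy only while it persists.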

@steveloughran
Contributor Author

jenkins, test this please

@steveloughran
Contributor Author

(Note that the POMs changed to pull in some more of the Codahale servlets, though only the health checks & thread dump are being registered. Hooking up those servlets through the MetricsSystem may be a bit tricky; as the MetricsSystem has its own metrics servlet, that is probably the one to go for. In which case, wrapping up JVM stats as a Source would give automatic access to those numbers.)

@SparkQA

SparkQA commented Nov 26, 2015

Test build #46782 has finished for PR 9571 at commit 25e77bd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 26, 2015

Test build #46780 has finished for PR 9571 at commit 1dcbb5f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 26, 2015

Test build #46781 has finished for PR 9571 at commit 1dcbb5f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 9, 2015

Test build #47440 has started for PR 9571 at commit 2e04803.

@steveloughran
Contributor Author

The latest branch cuts out health checks. They'd be nice, but as they are useful across all of Spark, it's probably best to wait for something central to go in and pick it up, rather than try to squeeze something in here.

@steveloughran steveloughran changed the title [SPARK-11373] [CORE] WiP Add metrics to the History Server and providers [SPARK-11373] [CORE] Add metrics to the History Server and FsHistoryProvider Jan 1, 2016
@SparkQA

SparkQA commented Jan 1, 2016

Test build #48567 has finished for PR 9571 at commit 06e5289.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 5, 2016

Test build #48768 has finished for PR 9571 at commit d6fa568.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 27, 2016

Test build #50202 has finished for PR 9571 at commit 984100d.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 5, 2016

Test build #50822 has finished for PR 9571 at commit 3ceeb9e.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 8, 2016

Test build #50921 has finished for PR 9571 at commit 889978c.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

steveloughran and others added 19 commits April 10, 2017 17:26
…e and set to time of update/update success. Makes update time visible for tests/users/ops
…s pulled up into HistoryMetricSource, including registration and prefix support. The existing test case "incomplete apps get refreshed" has been extended to check for metrics in the listings, alongside some specific probes for values in the fs history provider (loads, load time average > 0)
… the average load time is "0" and not a division by zero error. This validates the logic in the relevant lambda-expression
-nits
-recommended minor changes
+pull out lambda gauge and HistoryMetricSource into their own file, HistoryMetricSource.scala. Added tests to go with this, and made the toString call robust against failing gauges
+fixed a scaladoc warning in HistoryServer
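The zero-sample case mentioned in the commit above (average load time reporting "0" rather than raising a division-by-zero error) can be guarded inside the gauge's lambda itself. A stand-alone sketch in plain Scala, with hypothetical names:

```scala
// A gauge function computing mean load time that returns 0 rather than
// dividing by zero when nothing has been loaded yet.
def averageGauge(totalTimeMillis: () => Long, count: () => Long): () => Long =
  () => {
    val n = count()
    if (n == 0) 0L else totalTimeMillis() / n
  }
```

Wiring the returned function into a Codahale Gauge is then a one-liner in the provider's Source.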
@SparkQA

SparkQA commented Apr 11, 2017

Test build #75704 has finished for PR 9571 at commit ec1f2d7.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Change-Id: I7c3fc0865abce380fcbe08b8984cfd00b3ce0faa
@SparkQA

SparkQA commented Apr 11, 2017

Test build #75708 has finished for PR 9571 at commit 8903dcf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@steveloughran
Contributor Author

Line lengths fixed, tests all happy.

@vanzin, any chance of adding this to your review list?

@vanzin (Contributor) left a comment:

I was secretly hoping you'd just give up on this patch, since it will generate a lot of conflicts with the code I'm working on in parallel... but if you really want to get this in, I'd appreciate if you took the time to address the feedback I left in previous reviews.


  /** all metrics, including timers */
- private val allMetrics = counters ++ Seq(
+ private val allMetrics: Seq[(String, Metric with Counting)] = counters ++ Seq(
@vanzin (Contributor):

Is declaring the type useful here in some way?

@steveloughran (Contributor, Author):

Either I was just being explicit about what came in, or the IDE decided to get involved. Removed

  /**
-  * Bind to the History Server: threads should be started here; exceptions may be raised
+  * Start the provider: threads should be started here; exceptions may be raised
   * if the history provider cannot be started.
@vanzin (Contributor):

This makes this interface awkward. Why can't the HistoryServer keep track of that?

@steveloughran (Contributor, Author) commented Apr 18, 2017:


All the work on the YARN service model, and that of SmartFrog before it, has given me a fear of startup logic. I'll see about doing it there though. Anyway, cut: if problems arise, people can restate it. This patch does change HistoryServer such that HistoryServerSuite doesn't call initialize() twice, BTW.

One thing to consider here is that the FsHistoryProvider really should be starting its threads in the start() method, so that subclasses, like the test class SafeModeTestProvider, can be sure that they are fully initialized before the threads start. I left that alone.

lastScanTime.set(newLastScanTime)
metrics.updateLastSucceeded.setValue(newLastScanTime)
} catch {
case e: Exception => logError(
@vanzin (Contributor):

logError goes in the next line.

@steveloughran (Contributor, Author):

done

("appui.event.count", appUIEventCount),
("appui.event.replay.time", appUIEventReplayTime),
("update.timer", updateTimer),
("history.merge.timer", historyMergeTimer)))
@vanzin (Contributor):

This timer is still in this list, why? It's not useful. (This is the timer I was talking about in my previous comment.)

@steveloughran (Contributor, Author):

cut

@steveloughran
Contributor Author

I was secretly hoping you'd just give up on this patch, since it will generate a lot of conflicts with the code I'm working on in parallel...

No. Sorry

I do suspect the reviewers of my other outstanding patches are of the same mind ("if we ignore him, he'll go away"), which is why I'm effectively giving up filing new patches and bug reports related to Spark.

However, that doesn't mean that I'm going to stop updating and reminding people of my current set of patches. It's easier to accept them and hope I'll go away afterwards. After all, if you'd merged this in earlier: no merge conflicts. Better all round.

@vanzin
Contributor

vanzin commented Apr 18, 2017

if you'd merged this in earlier: no merge conflicts

Well, if you're going to follow that route, if you had addressed feedback earlier, the patch would have been merged long ago...

@steveloughran
Contributor Author

I'm going to close this PR and start one based on a reapplication of this patch onto master; that gets rid of all the merge pain and is intended to be more minimal. The latest comments on this one will be applied there.

@jiangxb1987
Contributor

Could you close this PR please? @steveloughran
