[SPARK-2321] Several progress API improvements / refactorings #3197

JoshRosen · 2014-11-11T00:58:43Z

This PR refactors / extends the status API introduced in #2696.

- Change StatusAPI from a mixin trait to a class. Before, the new status API methods were directly accessible through SparkContext, whereas now they're accessed through a `sc.statusAPI` field. As long as we were going to add these methods directly to SparkContext, the mixin trait seemed like a good idea, but this might be simpler to reason about and may avoid pitfalls that I've run into while attempting to refactor other parts of SparkContext to use mixins (see #3071, "Move SparkContext accumulator methods to Accumulators.scala", for example).
- Change the name from SparkStatusAPI to SparkStatusTracker.
- Make `getJobIdsForGroup(null)` return ids for jobs that aren't associated with any job group.
- Add `getActiveStageIds()` and `getActiveJobIds()` methods that return the ids of whatever's currently active in this SparkContext. This should simplify @davies's progress bar code.
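
A minimal sketch of the two behaviours listed above, for readers who want to see the intended semantics in isolation. `JobRecord`, `ToyStatusTracker`, and `ToyStatusTrackerDemo` are hypothetical names invented for this illustration; only the method names `getJobIdsForGroup` and `getActiveJobIds` come from the PR, and this is not Spark's implementation.

```scala
// Toy stand-in for the tracker; NOT Spark's implementation.
// JobRecord and ToyStatusTracker are hypothetical names.
final case class JobRecord(jobId: Int, group: Option[String], active: Boolean)

final class ToyStatusTracker(jobs: Seq[JobRecord]) {
  // A null group selects the jobs that have no job group at all,
  // because Option(null) is None.
  def getJobIdsForGroup(jobGroup: String): Array[Int] =
    jobs.filter(_.group == Option(jobGroup)).map(_.jobId).toArray

  // Ids of the jobs that are currently marked active.
  def getActiveJobIds(): Array[Int] =
    jobs.filter(_.active).map(_.jobId).toArray
}

object ToyStatusTrackerDemo {
  val tracker = new ToyStatusTracker(Seq(
    JobRecord(0, Some("etl"), active = false),
    JobRecord(1, None, active = true),
    JobRecord(2, Some("etl"), active = true)
  ))
}
```

With this sample data, `ToyStatusTrackerDemo.tracker.getJobIdsForGroup(null)` contains only job 1, and `getActiveJobIds()` contains jobs 1 and 2. A `getActiveStageIds()` method would presumably follow the same pattern over stage records.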

This makes binary compatibility easier to reason about and might avoid some pitfalls that I’ve run into while attempting to refactor other parts of SparkContext to use mixin traits (see apache#3071, for example). Requiring users to access status API methods through `sc.statusAPI.*` also avoids SparkContext bloat and buys us extra freedom for adding parallel higher / lower-level APIs.
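
To make the mixin-versus-field trade-off concrete, here is a rough sketch of the two shapes. Every name in it (`StatusMixin`, `MixinContext`, `FieldStatusTracker`, `FieldContext`) is hypothetical and chosen only for illustration; the real SparkContext is far larger and is not modelled here.

```scala
// Hypothetical names throughout; this models the shape of the change,
// not the real SparkContext.

// Before: a mixin trait puts every status method on the context itself.
trait StatusMixin {
  protected def activeJobs: Seq[Int]
  def getActiveJobIds(): Array[Int] = activeJobs.toArray
}

final class MixinContext extends StatusMixin {
  protected def activeJobs: Seq[Int] = Seq(3, 4)
}

// After: a separate class owns the status methods and the context
// exposes it through a single field.
final class FieldStatusTracker(activeJobs: () => Seq[Int]) {
  def getActiveJobIds(): Array[Int] = activeJobs().toArray
}

final class FieldContext {
  private def activeJobs: Seq[Int] = Seq(3, 4)
  val statusTracker = new FieldStatusTracker(() => activeJobs)
}
```

In the second shape the context's public surface grows by one member no matter how many status methods are added later, which is the "avoids SparkContext bloat" point made above.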

SparkQA · 2014-11-11T01:05:20Z

Test build #23179 has started for PR 3197 at commit 2cc7353.

This patch merges cleanly.

SparkQA · 2014-11-11T01:06:58Z

Test build #23179 has finished for PR 3197 at commit 2cc7353.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class SparkContext(config: SparkConf) extends Logging

AmplabJenkins · 2014-11-11T01:06:59Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23179/

SparkQA · 2014-11-11T01:17:33Z

Test build #23183 has started for PR 3197 at commit d1b08d8.

This patch merges cleanly.

SparkQA · 2014-11-11T02:43:09Z

Test build #23183 has finished for PR 3197 at commit d1b08d8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-11-11T02:43:13Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23183/

davies · 2014-11-12T23:31:20Z

core/src/main/scala/org/apache/spark/SparkContext.scala

Could we just call it status w/o API?

+1, just status would be better, I think.

davies · 2014-11-12T23:52:38Z

@JoshRosen this re-org looks better than before. How about moving all of these APIs into a namespace called sc.internal and marking all of them as `@DeveloperApi`?

JoshRosen · 2014-11-13T02:43:11Z

@davies This is a low-level API that's not really designed for typical users to use, but it's also an API that we want to stabilize. @pwendell do you think this should be `@DeveloperApi` even if we plan to maintain fairly strong compatibility guarantees for it?

pwendell · 2014-11-13T21:34:53Z

I think we want to keep this as a public API - that was the whole reason for adding this in the first place, since it is technically redundant with things users could already do with the SparkListener interface and projects like Hive want us to support a stable API.

In terms of the naming, it did seem slightly weird to me to call this statusAPI when for other things we don't use the term "API" anywhere (e.g. SparkListener). I'll defer to @rxin @kayousterhout or anyone else who has feelings about this for more feedback.

pwendell · 2014-11-14T05:50:07Z

Okay I talked offline with @kayousterhout and the best name we could come up with was the following:

class SparkStatusTracker
...
val statusTracker = new SparkStatusTracker(this)

IMO this is nicer than the current name since API is sort of implicit in the fact that this is an exposed class (i.e. in some sense everything is an API). The name "Tracker" implies that this is an object that is actively tracking changes. So this is my favorite option. I also think SparkStatus and val status is alright. I prefer both of these to the current naming.

rxin · 2014-11-14T05:51:38Z

core/src/main/scala/org/apache/spark/SparkStatusAPI.scala

why bother having this? we can just do new ....

The goal here was to hide this class's constructor from users so that we're free to change it later. I think that making constructors part of public APIs is a bad idea.

Can't we just make the constructor package private? It is really awkward to me that you have to create a factory for this because users are not supposed to create the status listener by themselves.

If you really want to be air tight, a more common way is to expose the interface, and then you have a concrete implementation of the interface. Then you don't have the constructor problem. But that seems overkill to me too for this.
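
For reference, the "expose the interface, hide the implementation" approach mentioned above could look roughly like this. `StatusView`, `HidingContext`, and `StatusViewImpl` are hypothetical names used only for this sketch, not anything in the patch.

```scala
// Hypothetical names; a sketch of exposing a trait and hiding the concrete class.
trait StatusView {
  def getActiveJobIds(): Array[Int]
}

object HidingContext {
  // The concrete class is private to HidingContext, so outside code
  // can neither name it nor call its constructor.
  private final class StatusViewImpl(active: Seq[Int]) extends StatusView {
    def getActiveJobIds(): Array[Int] = active.toArray
  }

  // Callers only ever see the trait type.
  val status: StatusView = new StatusViewImpl(Seq(7))
}
```

The constructor never becomes part of the public surface here, at the cost of an extra type, which is the overhead the comment above calls overkill.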

If you really want a factory, I'd use something other than apply.

Is there a way to make a Scala constructor Java-package-private as opposed to Scala-package-private, since that can become public from Java's point of view? I think I originally used this factory pattern for JavaSparkStatusAPI and just kept the same approach here.

Also, I think that CompanionObject.apply() might be a fairly common idiom; I think it's used in several of the Scala standard libraries. I don't really care what we call it, one way or the other, so I can change it if you think that apply is confusing.
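
The companion-object `apply` idiom referred to above, sketched with a hypothetical `FactoryTracker` class (the name and its `label` field are made up for illustration):

```scala
// Hypothetical class; shows a private constructor plus a companion-object factory.
final class FactoryTracker private (val label: String)

object FactoryTracker {
  // The companion can call the private constructor.
  // FactoryTracker("x") is shorthand for FactoryTracker.apply("x").
  def apply(label: String): FactoryTracker = new FactoryTracker(label)
}
```

Callers write `FactoryTracker("ui")` and cannot write `new FactoryTracker("ui")`, so the constructor signature stays free to change.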

Sorry, the main problem I have is that I don't get why we need to protect the constructor at all. It is not something we expect the users to call. Why don't you just remove all of this stuff, and add a line in the Javadoc for the constructor saying we don't expect users to call this constructor?

Sure, that's fine; I'll just make both constructors private[spark] and add a note; as long as we've warned users not to call the constructor and hidden it from the Scaladoc, then I don't think anyone should complain if we need to change it later.
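
A small sketch of what a `private[spark]` constructor looks like. To keep it compilable as a single file, an object named `spark` stands in for the `org.apache.spark` package; `Tracker`, `Context`, and `owner` are hypothetical names, not the classes in this patch.

```scala
// `spark` is a stand-in object playing the role of the org.apache.spark package.
// All names below are hypothetical.
object spark {
  // Only code nested inside `spark` may call this constructor.
  final class Tracker private[spark] (val owner: String)

  final class Context(name: String) {
    // Allowed: Context lives inside the same enclosing scope.
    val statusTracker: Tracker = new Tracker(name)
  }
}

// Outside `spark`, `new spark.Tracker("x")` is rejected by the Scala compiler.
```

On the Java-visibility question raised earlier in this thread: as far as I know, Scala's qualified `private[...]` is enforced by the Scala compiler only and the member is emitted as public in the bytecode, so Java callers would not be stopped at compile time. That is why the doc note still matters.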

rxin · 2014-11-14T07:51:54Z

+1 on statusTracker

Remove factory methods and replace with private[spark] constructors.

SparkQA · 2014-11-15T06:20:17Z

Test build #23413 has started for PR 3197 at commit 30b0afa.

This patch merges cleanly.

rxin · 2014-11-15T06:34:25Z

Hmmm in case I haven't expressed this earlier, I really like this new API.

rxin · 2014-11-15T07:19:14Z

LGTM.

SparkQA · 2014-11-15T07:42:58Z

Test build #23413 has finished for PR 3197 at commit 30b0afa.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-11-15T07:43:01Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23413/

rxin · 2014-11-15T07:46:14Z

Merging in master & branch-1.2. Thanks!

This PR refactors / extends the status API introduced in #2696.

- Change StatusAPI from a mixin trait to a class. Before, the new status API methods were directly accessible through SparkContext, whereas now they're accessed through a `sc.statusAPI` field. As long as we were going to add these methods directly to SparkContext, the mixin trait seemed like a good idea, but this might be simpler to reason about and may avoid pitfalls that I've run into while attempting to refactor other parts of SparkContext to use mixins (see #3071, for example).
- Change the name from SparkStatusAPI to SparkStatusTracker.
- Make `getJobIdsForGroup(null)` return ids for jobs that aren't associated with any job group.
- Add `getActiveStageIds()` and `getActiveJobIds()` methods that return the ids of whatever's currently active in this SparkContext. This should simplify davies's progress bar code.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #3197 from JoshRosen/progress-api-improvements and squashes the following commits:

30b0afa [Josh Rosen] Rename SparkStatusAPI to SparkStatusTracker.
d1b08d8 [Josh Rosen] Add missing newlines
2cc7353 [Josh Rosen] Add missing file.
d5eab1f [Josh Rosen] Add getActive[Stage|Job]Ids() methods.
a227984 [Josh Rosen] getJobIdsForGroup(null) should return jobs for default group
c47e294 [Josh Rosen] Remove StatusAPI mixin trait.

(cherry picked from commit 40eb8b6)
Signed-off-by: Reynold Xin <rxin@databricks.com>

JoshRosen added 4 commits November 10, 2014 14:38

getJobIdsForGroup(null) should return jobs for default group

a227984

Add getActive[Stage|Job]Ids() methods.

d5eab1f

Add missing file.

2cc7353

Add missing newlines

d1b08d8

davies reviewed Nov 12, 2014

rxin reviewed Nov 14, 2014

JoshRosen changed the title ~~Several progress API improvements / refactorings~~ [SPARK-2321] Several progress API improvements / refactorings Nov 14, 2014

Rename SparkStatusAPI to SparkStatusTracker.

30b0afa

Remove factory methods and replace with private[spark] constructors.

asfgit closed this in 40eb8b6 Nov 15, 2014

6 participants