[SPARK-15926] Improve readability of DAGScheduler stage creation methods #13677

kayousterhout · 2016-06-15T00:57:52Z

What changes were proposed in this pull request?

This pull request refactors parts of the DAGScheduler to improve readability, focusing on the code around stage creation. One goal of this change it to make it clearer which functions may create new stages (as opposed to looking up stages that already exist). There are no functionality changes in this pull request. In more detail:

shuffleToMapStage was renamed to shuffleIdToMapStage (when reading the existing code I have sometimes struggled to remember what the key is -- is it a stage? A stage id? This change is intended to avoid that confusion)
Cleaned up the code to create shuffle map stages. Previously, creating a shuffle map stage involved 3 different functions (newOrUsedShuffleStage, newShuffleMapStage, and getShuffleMapStage), and it wasn't clear what the purpose of each function was. With the new code, a single function (getOrCreateShuffleMapStage) is responsible for getting a stage (if it already exists) or creating new shuffle map stages and any missing ancestor stages, and it delegates to createShuffleMapStage when new stages need to be created. There's some remaining confusion here because the getOrCreateParentStages call in createShuffleMapStage may recursively create ancestor stages; this is an issue I plan to fix in a future pull request, because it's trickier to fix and involves a slight functionality change.
newResultStage was renamed to createResultStage, for consistency with naming around shuffle map stages.
getParentStages has been renamed to getOrCreateParentStages, to make it clear that this function will sometimes create missing ancestor stages.
The only slight functionality change is that on line 478, updateJobIdStageIdMaps now uses a stage's parents instance variable rather than re-calculating them (I couldn't see any reason why they'd need to be re-calculated, and suspect this is just leftover from older code).
getAncestorShuffleDependencies was renamed to getMissingAncestorShuffleDependencies, to make it clear that this only returns dependencies that have not yet been run.

cc @squito @markhamstra @JoshRosen (who requested more DAG scheduler commenting long ago -- an issue this pull request tries, in part, to address)

FYI @rxin

This commit renames shuffleToMapStage to shuffleIdToMapStage to make it clear that the shuffle id (and not the shuffle dependency, or the shuffle map stage, or the shuffle map stage id) is the key in the hash map.

I think having it nested made the code harder to read.

SparkQA · 2016-06-15T02:36:17Z

Test build #60540 has finished for PR 13677 at commit 1b7a338.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

markhamstra · 2016-06-15T03:04:02Z

One goal of this change it to make it clearer which functions may create new stages (as opposed to looking up stages that already exist).

Something that I have been looking at of late, and I know that @squito has looked at some, too. In short, I'm pretty confident that we are doing some silliness around creating new stages instead of reusing already existing stages, then recognizing that all the tasks for the "new" stages are already completed (at least we're smart enough to reuse the map outputs), so the "new" stages just become "skipped".

I'll take a closer look at this tomorrow, and may have a follow-on PR in the not too distant future.

squito · 2016-06-15T05:00:12Z

core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala

+      jobId: Int,
+      callSite: CallSite): ResultStage = {
+    val id = nextStageId.getAndIncrement()
+    val stage = new ResultStage(


getOrCreateParentStages should be called before getting the id for the result stage, otherwise the result stage will get numbered below the dependent stages.

from a quick check, I think that might solve all the failing tests

squito · 2016-06-15T05:24:13Z

everything proposed here makes sense to me, I think just needs the one minor reordering I mentioned.

about creating extra stages, I looked into this some here: https://issues.apache.org/jira/browse/SPARK-10193. Actually seems fairly straight-forward, though I never followed up on profiling the extra memory involved. (happy to let someone else takeover, I doubt I'll get back to it anytime soon ...)

kayousterhout · 2016-06-15T17:27:15Z

Thanks for pointing out the re-ordering issue! Fixed.

Re: creating extra stages, I agree we could do better with that, because it is confusing. I have another commit with some light cleanup of that (but wanted to keep refactoring separate from functionality changes).

SparkQA · 2016-06-15T19:41:28Z

Test build #60580 has finished for PR 13677 at commit 22c6e42.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

kayousterhout · 2016-06-15T19:44:51Z

This test failure is the flakey scheduler test -- but I'll hold off on re-running Jenkins since I'm guessing there will be some comments to address on this anyway (so the tests will need to be re-run after that).

markhamstra · 2016-06-15T20:57:27Z

LGTM, and runs without flakiness for me when rebased onto master with the #13688 HOTFIX.

kayousterhout · 2016-06-17T16:48:35Z

Jenkins, retest this please

SparkQA · 2016-06-17T18:54:26Z

Test build #60714 has finished for PR 13677 at commit 22c6e42.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

kayousterhout · 2016-06-17T19:13:35Z

Merged into master (FYI @squito, since I'm guessing there may be minor merge conflicts with your blacklisting work)

kayousterhout added 7 commits June 14, 2016 17:34

Renamed shuffleToMapStaqe in DAGScheduler.

4dca800

This commit renames shuffleToMapStage to shuffleIdToMapStage to make it clear that the shuffle id (and not the shuffle dependency, or the shuffle map stage, or the shuffle map stage id) is the key in the hash map.

Cleaned up the code to create new shuffle map stages

1468b91

Various comment and naming cleanups

a1a51ac

More cleanup of shuffle map stage creation

671d5de

Some more naming and commenting improvements

43841a7

Moved createShuffleMapStage to be a non-nested function.

49156f2

I think having it nested made the code harder to read.

Renamed newResultStage to createResultStage and added commenting

1b7a338

squito reviewed Jun 15, 2016
View reviewed changes

Create parent stages before assigning stage ID

22c6e42

kayousterhout changed the title ~~[SPARK 15926] Improve readability of DAGScheduler stage creation methods~~ [SPARK-15926] Improve readability of DAGScheduler stage creation methods Jun 17, 2016

asfgit closed this in c8809db Jun 17, 2016

viirya mentioned this pull request Jul 7, 2016

[SPARK-15422] [Core] Remove unnecessary calculation of stage's parents #13123

Closed

kayousterhout deleted the SPARK-15926 branch April 11, 2017 22:09

[SPARK-15926] Improve readability of DAGScheduler stage creation methods #13677

[SPARK-15926] Improve readability of DAGScheduler stage creation methods #13677

Uh oh!

Conversation

kayousterhout commented Jun 15, 2016

What changes were proposed in this pull request?

Uh oh!

SparkQA commented Jun 15, 2016

Uh oh!

markhamstra commented Jun 15, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

squito Jun 15, 2016

Choose a reason for hiding this comment

Uh oh!

squito Jun 15, 2016

Choose a reason for hiding this comment

Uh oh!

squito commented Jun 15, 2016

Uh oh!

kayousterhout commented Jun 15, 2016

Uh oh!

SparkQA commented Jun 15, 2016

Uh oh!

kayousterhout commented Jun 15, 2016

Uh oh!

markhamstra commented Jun 15, 2016

Uh oh!

kayousterhout commented Jun 17, 2016

Uh oh!

SparkQA commented Jun 17, 2016

Uh oh!

kayousterhout commented Jun 17, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

markhamstra commented Jun 15, 2016 •

edited

Loading