Skip to content

Conversation

@kayousterhout
Copy link
Contributor

What changes were proposed in this pull request?

This pull request refactors parts of the DAGScheduler to improve readability, focusing on the code around stage creation. One goal of this change it to make it clearer which functions may create new stages (as opposed to looking up stages that already exist). There are no functionality changes in this pull request. In more detail:

  • shuffleToMapStage was renamed to shuffleIdToMapStage (when reading the existing code I have sometimes struggled to remember what the key is -- is it a stage? A stage id? This change is intended to avoid that confusion)
  • Cleaned up the code to create shuffle map stages. Previously, creating a shuffle map stage involved 3 different functions (newOrUsedShuffleStage, newShuffleMapStage, and getShuffleMapStage), and it wasn't clear what the purpose of each function was. With the new code, a single function (getOrCreateShuffleMapStage) is responsible for getting a stage (if it already exists) or creating new shuffle map stages and any missing ancestor stages, and it delegates to createShuffleMapStage when new stages need to be created. There's some remaining confusion here because the getOrCreateParentStages call in createShuffleMapStage may recursively create ancestor stages; this is an issue I plan to fix in a future pull request, because it's trickier to fix and involves a slight functionality change.
  • newResultStage was renamed to createResultStage, for consistency with naming around shuffle map stages.
  • getParentStages has been renamed to getOrCreateParentStages, to make it clear that this function will sometimes create missing ancestor stages.
  • The only slight functionality change is that on line 478, updateJobIdStageIdMaps now uses a stage's parents instance variable rather than re-calculating them (I couldn't see any reason why they'd need to be re-calculated, and suspect this is just leftover from older code).
  • getAncestorShuffleDependencies was renamed to getMissingAncestorShuffleDependencies, to make it clear that this only returns dependencies that have not yet been run.

cc @squito @markhamstra @JoshRosen (who requested more DAG scheduler commenting long ago -- an issue this pull request tries, in part, to address)

FYI @rxin

This commit renames shuffleToMapStage to shuffleIdToMapStage to make
it clear that the shuffle id (and not the shuffle dependency, or the
shuffle map stage, or the shuffle map stage id) is the key in the
hash map.
I think having it nested made the code harder to read.
@SparkQA
Copy link

SparkQA commented Jun 15, 2016

Test build #60540 has finished for PR 13677 at commit 1b7a338.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@markhamstra
Copy link
Contributor

markhamstra commented Jun 15, 2016

One goal of this change it to make it clearer which functions may create new stages (as opposed to looking up stages that already exist).

Something that I have been looking at of late, and I know that @squito has looked at some, too. In short, I'm pretty confident that we are doing some silliness around creating new stages instead of reusing already existing stages, then recognizing that all the tasks for the "new" stages are already completed (at least we're smart enough to reuse the map outputs), so the "new" stages just become "skipped".

I'll take a closer look at this tomorrow, and may have a follow-on PR in the not too distant future.

jobId: Int,
callSite: CallSite): ResultStage = {
val id = nextStageId.getAndIncrement()
val stage = new ResultStage(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getOrCreateParentStages should be called before getting the id for the result stage, otherwise the result stage will get numbered below the dependent stages.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from a quick check, I think that might solve all the failing tests

@squito
Copy link
Contributor

squito commented Jun 15, 2016

everything proposed here makes sense to me, I think just needs the one minor reordering I mentioned.

about creating extra stages, I looked into this some here: https://issues.apache.org/jira/browse/SPARK-10193. Actually seems fairly straight-forward, though I never followed up on profiling the extra memory involved. (happy to let someone else takeover, I doubt I'll get back to it anytime soon ...)

@kayousterhout
Copy link
Contributor Author

Thanks for pointing out the re-ordering issue! Fixed.

Re: creating extra stages, I agree we could do better with that, because it is confusing. I have another commit with some light cleanup of that (but wanted to keep refactoring separate from functionality changes).

@SparkQA
Copy link

SparkQA commented Jun 15, 2016

Test build #60580 has finished for PR 13677 at commit 22c6e42.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kayousterhout
Copy link
Contributor Author

This test failure is the flakey scheduler test -- but I'll hold off on re-running Jenkins since I'm guessing there will be some comments to address on this anyway (so the tests will need to be re-run after that).

@markhamstra
Copy link
Contributor

LGTM, and runs without flakiness for me when rebased onto master with the #13688 HOTFIX.

@kayousterhout
Copy link
Contributor Author

Jenkins, retest this please

@SparkQA
Copy link

SparkQA commented Jun 17, 2016

Test build #60714 has finished for PR 13677 at commit 22c6e42.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kayousterhout
Copy link
Contributor Author

Merged into master (FYI @squito, since I'm guessing there may be minor merge conflicts with your blacklisting work)

@kayousterhout kayousterhout changed the title [SPARK 15926] Improve readability of DAGScheduler stage creation methods [SPARK-15926] Improve readability of DAGScheduler stage creation methods Jun 17, 2016
@asfgit asfgit closed this in c8809db Jun 17, 2016
@kayousterhout kayousterhout deleted the SPARK-15926 branch April 11, 2017 22:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants