[SPARK-25299] Add shuffle map output un-registration hooks upon fetch failure #609

Merged 7 commits into master on Oct 4, 2019

Conversation

@mccheah commented Oct 4, 2019

We realized that there's complexity around whether or not map outputs should be unregistered, and thus recomputed, upon a fetch failure.

Previously we used a boolean, and then a three-way switch. But these modes do not capture the intricacies of how this should work. In particular, for our async upload proof of concept, we end up unregistering all the map outputs written by an executor, despite the fact that this invalidates and forces a re-write of all the map outputs that were already persisted to the remote storage system.
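
For context, a minimal sketch of what a three-way unregistration strategy can look like. Only the HOST variant is confirmed by the diff below; the other variant names are illustrative assumptions, not the PR's actual definitions.

// Sketch only: HOST appears in this PR's diff; the other variants are
// assumed for illustration.
object MapOutputUnregistrationStrategy extends Enumeration {
  val MAP_OUTPUT_ONLY = Value // drop only the specific lost map output
  val EXECUTOR = Value        // drop every output written by the failing executor
  val HOST = Value            // drop every output written on the failing host
}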

@@ -36,31 +35,8 @@ class PluginShuffleDataIO(sparkConf: SparkConf) extends ShuffleDataIO {

class PluginShuffleDriverComponents(delegate: ShuffleDriverComponents)
mccheah (Author):

I'd like to write more comprehensive tests against a particular implementation that checks for persisted files.

@@ -1673,8 +1673,11 @@ private[spark] class DAGScheduler(
       // TODO: mark the executor as failed only if there were lots of fetch failures on it
       if (bmAddress != null) {
         if (bmAddress.executorId == null) {
+          if (shuffleDriverComponents.unregistrationStrategyOnFetchFailure() ==
+              MapOutputUnregistrationStrategy.HOST) {
-          if (unRegisterOutputOnHostOnFetchFailure &&
mccheah (Author):

Let's carefully consider if this is correct - not entirely sure, to be honest.

@@ -71,6 +72,7 @@ class SparkEnv (
     val metricsSystem: MetricsSystem,
     val memoryManager: MemoryManager,
     val outputCommitCoordinator: OutputCommitCoordinator,
+    val shuffleDataIo: ShuffleDataIO,
Reviewer:

Is it weird to have the shuffleDataIO as part of the developer API? My concern is that users could use shuffleDataIO.driver() in the executor and vice versa, but maybe it's not a problem.

If you wanted to get around this, you could pass the driver components, or pass in a function, as part of the method call to MapOutputTracker, since you already have the driver components in the DAGScheduler.
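
A rough sketch of that alternative, with hypothetical names and signatures (the real MapOutputTracker API does not take such a callback):

// Hypothetical sketch: instead of exposing ShuffleDataIO through SparkEnv,
// the DAGScheduler (which already holds the driver components) passes a
// callback into the method call.
class MapOutputTrackerSketch {
  def unregisterOutputsOnFetchFailure(
      shuffleId: Int,
      unregisterPersistedOutputs: Int => Unit): Unit = {
    // ... update the tracker's own bookkeeping for shuffleId ...
    unregisterPersistedOutputs(shuffleId) // supplied by the DAGScheduler
  }
}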

mccheah (Author):

We don't want to pass functional modules around through method calls - modules should be dependency injected at construction time.

Reviewer:

What's the rationale for not passing functional modules through method calls?

mccheah (Author):

Ownership of the module becomes unclear - the dependency tree should be more or less static and clear.
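
To illustrate the distinction, a sketch with hypothetical names rather than the actual classes:

trait DriverComponentsSketch {
  def unregistrationStrategy(): String
}

// Constructor injection, as argued for here: the module is wired exactly
// once, so ownership is unambiguous and the dependency tree stays static.
class SchedulerWithInjection(components: DriverComponentsSketch) {
  def handleFetchFailure(): Unit =
    println(s"strategy: ${components.unregistrationStrategy()}")
}

// The alternative being argued against: the module travels through the
// method call, so any caller can supply any instance and ownership blurs.
class SchedulerWithCallSiteModule {
  def handleFetchFailure(components: DriverComponentsSketch): Unit =
    println(s"strategy: ${components.unregistrationStrategy()}")
}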

mccheah (Author):

I think it's fine to eventually have ShuffleDataIO in this API, but it'll be tricky when it comes to upstream Spark because our APIs will still be in experimental status for now. I'm ok with deferring to the community to see what they think here, since I can't think of a much better way to set up the dependency injection.

bulldozer-bot merged commit d551551 into master on Oct 4, 2019
bulldozer-bot deleted the add-shuffle-removal-hooks branch on Oct 4, 2019 at 23:42
@yifeih added the ess-changes label (changes related to external shuffle service work; see SPARK-25299) on Jan 14, 2020
rshkv added a commit that referenced this pull request Jul 3, 2020
jdcasale pushed a commit that referenced this pull request Aug 11, 2020
rshkv added a commit that referenced this pull request Jan 25, 2021
rshkv added a commit that referenced this pull request Jan 25, 2021