[SPARK-25299] Add shuffle map output un-registration hooks upon fetch failure #609
Conversation
…e map outputs or not
This reverts commit 5abde44.
@@ -36,31 +35,8 @@ class PluginShuffleDataIO(sparkConf: SparkConf) extends ShuffleDataIO {

class PluginShuffleDriverComponents(delegate: ShuffleDriverComponents)
I'd like to write more comprehensive tests with a particular implementation of checking for persisted files.
@@ -1673,8 +1673,11 @@ private[spark] class DAGScheduler(
// TODO: mark the executor as failed only if there were lots of fetch failures on it
if (bmAddress != null) {
  if (bmAddress.executorId == null) {
    if (shuffleDriverComponents.unregistrationStrategyOnFetchFailure() ==
        MapOutputUnregistrationStrategy.HOST) {
  if (unRegisterOutputOnHostOnFetchFailure &&
Let's carefully consider if this is correct - not entirely sure, to be honest.
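For discussion, here is a self-contained sketch of the decision this branch encodes: the boolean config is replaced by a three-way strategy returned by the plugin's driver components. Only `HOST` appears in the diff, so the other member names and the helper below are assumptions, not the exact code in this PR.

```scala
// Illustrative model of the branch above, for discussion only.
// Only HOST is visible in the diff; EXECUTOR and NONE are assumed names for
// the other two modes of the three-way switch described later in this thread.
object UnregistrationDecisionSketch {

  sealed trait MapOutputUnregistrationStrategy
  case object HOST extends MapOutputUnregistrationStrategy     // drop every output on the failed host
  case object EXECUTOR extends MapOutputUnregistrationStrategy // drop only the failed executor's outputs
  case object NONE extends MapOutputUnregistrationStrategy     // keep all outputs registered

  /** Which scope of map outputs a fetch failure at (host, execId) would unregister. */
  def scopeToUnregister(
      strategy: MapOutputUnregistrationStrategy,
      host: String,
      execId: String): Option[String] = strategy match {
    case HOST     => Some(s"all outputs on host $host")
    case EXECUTOR => Some(s"outputs written by executor $execId")
    case NONE     => None // e.g. the files live in remote storage and are still readable
  }

  def main(args: Array[String]): Unit = {
    println(scopeToUnregister(HOST, "host-1", "exec-7"))
    println(scopeToUnregister(NONE, "host-1", "exec-7"))
  }
}
```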
@@ -71,6 +72,7 @@ class SparkEnv (
    val metricsSystem: MetricsSystem,
    val memoryManager: MemoryManager,
    val outputCommitCoordinator: OutputCommitCoordinator,
    val shuffleDataIo: ShuffleDataIO,
Is it weird to have the shuffleDataIO as part of the developer API? My concern is that users could use shuffleDataIO.driver() in the executor and vice versa, but maybe it's not a problem.
If you wanted to get around this, you could pass the driver components, or pass in a function, as part of the method call to MapOutputTracker, since you already have the driver components in the DAGScheduler.
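Roughly what that alternative might look like; none of these signatures exist in Spark, they are only meant to show the driver components (or a function derived from them) travelling through the method call instead of through SparkEnv.

```scala
// Hypothetical shape of the alternative suggested above; names and signatures
// are invented. The tracker never holds ShuffleDataIO itself: the scheduler,
// which already owns the driver components, passes in only the decision needed.
class MapOutputTrackerSketch {
  def unregisterOutputsOnFetchFailure(
      shuffleId: Int,
      execId: String,
      shouldUnregister: Int => Boolean): Unit = {
    if (shouldUnregister(shuffleId)) {
      // ... drop the map statuses for shuffleId that were written by execId ...
    }
  }
}

object CallerSideSketch {
  // DAGScheduler-ish caller; outputsStillReadable is a stand-in for whatever
  // the driver components would report about remotely persisted outputs.
  def onFetchFailure(
      tracker: MapOutputTrackerSketch,
      shuffleId: Int,
      execId: String,
      outputsStillReadable: Int => Boolean): Unit = {
    tracker.unregisterOutputsOnFetchFailure(shuffleId, execId, id => !outputsStillReadable(id))
  }
}
```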
We don't want to pass functional modules around through method calls - modules should be dependency injected at construction time.
What's the rationale for not passing functional modules through method calls?
Ownership of the module becomes unclear - the dependency tree should be more or less static and clear.
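A toy contrast of the two wiring styles being debated, with made-up types; the only difference is where the dependency enters.

```scala
// Illustration only, no Spark types involved.
trait DriverComponents { def shouldUnregister(shuffleId: Int): Boolean }

// Construction-time injection: ownership is fixed once and the dependency
// tree is visible from the constructor signatures.
class SchedulerWithInjection(components: DriverComponents) {
  def handleFetchFailure(shuffleId: Int): Unit = {
    if (components.shouldUnregister(shuffleId)) { /* unregister outputs */ }
  }
}

// Passing the module through the method call: every caller can hand in a
// different instance, so ownership of the module becomes unclear.
class SchedulerWithMethodPassing {
  def handleFetchFailure(shuffleId: Int, components: DriverComponents): Unit = {
    if (components.shouldUnregister(shuffleId)) { /* unregister outputs */ }
  }
}
```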
I think it's fine to eventually have ShuffleDataIO in this API, but it'll be tricky when it comes to upstream Spark because our APIs will still be in experimental status for now. I'm ok with deferring to the community to see what they think here, because I can't think of a much better way to set up the dependency injection.
We realized that there's complexity in deciding whether or not map outputs should be unregistered, and thus recomputed.
Previously we were using a boolean, then a three-way switch. But these modes do not capture a lot of the intricacies of how this should work. In particular, for our async upload proof of concept, we end up unregistering all the map outputs written by an executor, despite the fact that this invalidates and re-writes map outputs that were already persisted to the remote storage system.
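To make that limitation concrete, here is a sketch of the kind of per-output hook that the fixed modes cannot express; all names are invented for illustration and are not the API this PR adds.

```scala
// Invented names, illustration only.
final case class MapOutputId(shuffleId: Int, mapId: Int)

trait MapOutputUnregistrationHook {
  /** Called on fetch failure; return only the outputs that truly need recomputation. */
  def outputsToInvalidate(lostExecutorId: String, candidates: Seq[MapOutputId]): Seq[MapOutputId]
}

// An async-upload plugin would invalidate only the outputs that never finished
// uploading; everything already persisted remotely stays registered.
class AsyncUploadHook(uploaded: Set[MapOutputId]) extends MapOutputUnregistrationHook {
  override def outputsToInvalidate(
      lostExecutorId: String,
      candidates: Seq[MapOutputId]): Seq[MapOutputId] = {
    candidates.filterNot(uploaded.contains)
  }
}
```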