Skip to content

Conversation

@JoshRosen
Copy link
Contributor

What changes were proposed in this pull request?

In SPARK-21444, @sitalkedia reported an issue where the Broadcast.destroy() call in MapOutputTracker's ShuffleStatus.invalidateSerializedMapOutputStatusCache() was failing with an IOException, causing the DAGScheduler to crash and bring down the entire driver.

This is a bug introduced by #17955. In the old code, we removed a broadcast variable by calling BroadcastManager.unbroadcast with blocking=false, but the new code simply calls Broadcast.destroy() which is capable of failing with an IOException in case certain blocking RPCs time out.

The fix implemented here is to replace this with a call to destroy(blocking = false) and to wrap the entire operation in Utils.tryLogNonFatalError.

How was this patch tested?

I haven't written regression tests for this because it's really hard to inject mocks to simulate RPC failures here. Instead, this class of issue is probably best uncovered with more generalized error injection / network unreliability / fuzz testing tools.

@sameeragarwal
Copy link
Member

cc @jiangxb1987

@sitalkedia
Copy link

@JoshRosen - Thanks for the lightning fast response. The change looks reasonable to me.

@cloud-fan
Copy link
Contributor

LGTM

Copy link
Contributor

@jiangxb1987 jiangxb1987 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@SparkQA
Copy link

SparkQA commented Jul 18, 2017

Test build #79691 has finished for PR 18662 at commit a5ebcac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in 5952ad2 Jul 18, 2017
@JoshRosen JoshRosen deleted the SPARK-21444 branch July 18, 2017 03:44
@JoshRosen
Copy link
Contributor Author

Merged to master. Thanks for the quick reviews.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants