[SPARK-21444] Be more defensive when removing broadcasts in MapOutputTracker #18662
What changes were proposed in this pull request?
In SPARK-21444, @sitalkedia reported an issue where the `Broadcast.destroy()` call in `MapOutputTracker`'s `ShuffleStatus.invalidateSerializedMapOutputStatusCache()` was failing with an `IOException`, causing the DAGScheduler to crash and bring down the entire driver.
This is a bug introduced by #17955. In the old code, we removed a broadcast variable by calling `BroadcastManager.unbroadcast` with `blocking=false`, but the new code simply calls `Broadcast.destroy()`, which is capable of failing with an `IOException` in case certain blocking RPCs time out.
The fix implemented here is to replace this with a call to `destroy(blocking = false)` and to wrap the entire operation in `Utils.tryLogNonFatalError`.
How was this patch tested?
I haven't written regression tests for this because it's really hard to inject mocks to simulate RPC failures here. Instead, this class of issue is probably best uncovered with more generalized error injection / network unreliability / fuzz testing tools.
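For readers who want to see the shape of the fix described above, here is a minimal, self-contained sketch. It is not the code from this patch: `BroadcastLike` and `removeBroadcast` are hypothetical names, and the local `tryLogNonFatalError` is a simplified stand-in assumed to behave roughly like Spark's internal `Utils.tryLogNonFatalError` (log and swallow non-fatal exceptions).

```scala
import java.io.IOException
import scala.util.control.NonFatal

object DefensiveDestroySketch {
  // Hypothetical stand-in for a broadcast handle; this is NOT Spark's Broadcast class.
  trait BroadcastLike {
    def destroy(blocking: Boolean): Unit
  }

  // Simplified stand-in (an assumption) for Utils.tryLogNonFatalError:
  // run the block, log and swallow non-fatal exceptions, let fatal ones propagate.
  def tryLogNonFatalError(block: => Unit): Unit = {
    try {
      block
    } catch {
      case NonFatal(t) =>
        System.err.println(
          s"Uncaught exception in thread ${Thread.currentThread().getName}: $t")
    }
  }

  // Mirrors the idea of the fix: destroy without blocking, and never let a
  // cleanup failure escape to the caller.
  def removeBroadcast(b: BroadcastLike): Unit = {
    tryLogNonFatalError {
      b.destroy(blocking = false)
    }
  }

  def main(args: Array[String]): Unit = {
    // Simulate the failure mode from the report: destroy() throws an IOException.
    val failing = new BroadcastLike {
      def destroy(blocking: Boolean): Unit =
        throw new IOException("simulated RPC timeout")
    }
    removeBroadcast(failing)
    println("caller still running")
  }
}
```

The point of the sketch is that a failed cleanup is logged rather than propagated, so the caller (the DAGScheduler event loop in the real code) keeps running even if the broadcast cannot be removed.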