[WIP][SPARK-29257][Core][Shuffle] Use task attempt number as noop reduce id to handle disk failures during shuffle #25941
Conversation
Can you give more details on the design? It seems to me that the shuffle reader and writer need to work together to support multiple disks well. I don't understand how the noop reduce id is related to it.
@cloud-fan you're right, this will break the dependency between the reader and the writer. I will try to work out a proper solution.
About this issue, here is my proposal. The shuffle index file is kept per executor to store the partition lengths, and when the external shuffle service (ESS) is enabled it is read by the ESS. I think we can define a parameter, say spark.shuffle.index.maxRetry, to control the maximum number of retries for picking a NOOPReduceId, so we avoid being routed to the same bad disk every time; its default value can be the number of localDirs. We can define an inner class in IndexShuffleResolver and use an AtomicInteger to hold the current NOOPReduceId. When we fail to get the index file, we incrementAndGet a new NOOPReduceId, which cannot exceed the configured maximum. So the name of the index file becomes shuffleId-mapId-retriedNOOPReduceId, and when reading data from the index file, we should try the index files named with each candidate retried NOOPReduceId.
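A rough sketch of how this retry loop might look (everything here is illustrative: spark.shuffle.index.maxRetry is a proposed, not existing, config, and indexFileFor stands in for however the resolver maps a candidate id onto a file under one of the local dirs):

```scala
import java.io.{File, IOException}
import java.util.concurrent.atomic.AtomicInteger

object NoopReduceIdRetrySketch {

  // Stand-in for the proposed spark.shuffle.index.maxRetry (not an existing config);
  // the comment above suggests defaulting it to the number of local dirs.
  private val maxRetry = 12
  private val noopReduceId = new AtomicInteger(0)

  // `indexFileFor` stands in for however the resolver turns a candidate noop reduce id
  // into a concrete file under one of the local dirs (hash of the name % number of dirs).
  def createIndexFileWithRetry(indexFileFor: Int => File): File = {
    var attempts = 0
    while (attempts <= maxRetry) {
      val file = indexFileFor(noopReduceId.get())
      try {
        val parent = file.getParentFile
        if ((parent.isDirectory || parent.mkdirs()) && (file.isFile || file.createNewFile())) {
          return file
        }
      } catch {
        case _: IOException => // bad disk: fall through and retry with a new id
      }
      noopReduceId.incrementAndGet()
      attempts += 1
    }
    throw new IOException(s"Failed to create a shuffle index file after $maxRetry retries")
  }
}
```

The open question, raised in the next comment, is that the reader side and the external shuffle service would need to know the same maximum in order to probe the candidate ids.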
Oh, there is a question: how do we transfer this parameter to the ESS?
Test build #111414 has finished for PR 25941 at commit
Test build #111413 has finished for PR 25941 at commit
How should the shuffle writer write these files? Do you mean we need to slow down shuffle writing to write duplicated files?
I prefer that we propagate attemptNumber with the map statuses.
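A minimal sketch, assuming a hypothetical extra field on the map status (the real MapStatus does not carry this today), of how a reducer could then reconstruct the exact file name written by the winning attempt:

```scala
// Hypothetical, simplified map status carrying the attempt number, so the reducer can
// build the exact shuffle file name written by the successful attempt instead of
// assuming a fixed noop reduce id of 0.
object AttemptNumberInMapStatusSketch {

  case class MapStatusWithAttempt(
      execId: String,
      mapId: Long,
      attemptNumber: Int,
      partitionLengths: Array[Long])

  // Reducer side: derive the block name from the propagated attempt number,
  // following the naming pattern quoted later in this thread.
  def dataFileName(shuffleId: Int, status: MapStatusWithAttempt): String =
    s"shuffle_${shuffleId}_${status.mapId}_${status.attemptNumber}_0.data"

  def main(args: Array[String]): Unit = {
    val status = MapStatusWithAttempt(
      execId = "exec-1", mapId = 7L, attemptNumber = 2, partitionLengths = Array(10L, 20L))
    println(dataFileName(1, status)) // prints shuffle_1_7_2_0.data
  }
}
```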
This is a high-level question: can we be tolerant of disk failures without data duplication? If not, what's the design here for duplicating shuffle files?
We will not create duplicate shuffle files at all. The unsuccessful attempts may have a chance to create or pick a subdirectory under the "blockmgr-xxx" directory before the disk failure hits it, but they cannot commit the index and data files because of the disk failure. The “shuffle_$shuffleId_$mapId_$attemptNumber_0.index” and “shuffle_$shuffleId_$mapId_$attemptNumber_0.data” files may be located on different disks, and if either of them lands on the bad disk, the write process is aborted and the task attempt fails. Only the successful task can create those files. In the other case, if the disk failure happens in the shuffle read phase, it may cause a fetch-failed exception and re-run the dependent map tasks, but I guess our processing logic for those duplicated files will not change.
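To make the "cannot commit" point concrete, here is a simplified sketch of the write-to-temp-then-rename pattern it relies on (illustrative only, not the actual IndexShuffleBlockResolver code):

```scala
import java.io.{File, IOException}
import java.nio.file.{Files, StandardCopyOption}

// Simplified commit-on-success pattern: data is written to a temporary file first, and the
// final "shuffle_..." name only appears once the attempt has fully succeeded, so a failed
// attempt never leaves a committed index/data file behind.
object CommitOnSuccessSketch {

  def writeAndCommit(tmp: File, finalFile: File)(write: File => Unit): Unit = {
    try {
      write(tmp) // any disk failure here throws and aborts the attempt
      Files.move(tmp.toPath, finalFile.toPath, StandardCopyOption.ATOMIC_MOVE)
    } catch {
      case e: IOException =>
        tmp.delete() // nothing was committed under the final name
        throw e
    }
  }
}
```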
OK, I get your point now. Please notice that, after #25620, …
@cloud-fan, thanks for pointing me to #25620; the unique taskAttemptId can produce the same effect as …
closing this because it isn't a problem anymore. |
What changes were proposed in this pull request?
The noop reduce id used to be 0, which results in a fixed index/data file name. When the nodes in your cluster have more than one disk, if one of them is broken and the fixed file name's hash code % disk number happens to point to it, all attempts of a task will inevitably access the broken disk, which leads to meaningless task failures or can even tear down the whole job. Here we change the noop reduce id to the task attempt number, producing a different file name per attempt, which may land on another, healthy disk.
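A rough, self-contained sketch of the failure mode (the directory-selection logic below is a simplification of how Spark hashes a file name onto a local dir, and the file names are illustrative):

```scala
object NoopReduceIdSketch {

  private def nonNegativeHash(s: String): Int = {
    val h = s.hashCode
    if (h == Int.MinValue) 0 else math.abs(h)
  }

  // A shuffle file is placed under localDirs(hash(fileName) % localDirs.length).
  private def chooseLocalDir(fileName: String, localDirs: Array[String]): String =
    localDirs(nonNegativeHash(fileName) % localDirs.length)

  def main(args: Array[String]): Unit = {
    val localDirs = Array("/data1", "/data2", "/data3") // suppose one of these disks is bad

    // Old behaviour: the noop reduce id is always 0, so every attempt of the same map
    // task produces the same file name and therefore always hits the same disk.
    println(chooseLocalDir("shuffle_1_42_0.index", localDirs))

    // Proposed behaviour: using the task attempt number as the noop reduce id gives each
    // retry a different name, so a retry has a chance to land on a healthy disk.
    (0 to 3).foreach { attempt =>
      val name = s"shuffle_1_42_$attempt.index"
      println(s"attempt $attempt -> ${chooseLocalDir(name, localDirs)}")
    }
  }
}
```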
Why are the changes needed?
We have an HDFS/YARN cluster with about 2k~3k nodes; each node has 12 × 8 TB local disks for storage and shuffle. Sometimes one or more disks get into a bad state during computation. Sometimes this causes a job-level failure, sometimes it doesn't.
The following picture shows one failed job caused by 4 task attempts that were all delivered to the same node and failed with almost the same exception while writing the temporary index file to the same bad disk.
This happens for two reasons:
Does this PR introduce any user-facing change?
NO
How was this patch tested?
Added some unit tests.
ping @cloud-fan @srowen @squito