[SPARK-8366] maxNumExecutorsNeeded should properly handle failed tasks #6817
Conversation
- when a task fails, a new attempt will be appended, so I post a resubmit event here to let the ExecutorAllocationManager know (see the sketch after this list).
- if numFailures is greater than maxTaskFailures, there is no need to append and submit a new task.
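For context, here is a minimal sketch, assuming illustrative names (`pendingTasks`, `runningTasks`, `tasksPerExecutor`) rather than the actual Spark source, of the arithmetic behind `maxNumExecutorsNeeded`: the executor target is derived from the pending plus running task count, so whether a failed task is counted as pending again directly determines whether a new executor gets requested for its retry.

```scala
// Editor's sketch, not the actual Spark code: the executor target is driven by
// how many tasks are pending or running, so a failed task that is never put
// back into the pending count will not trigger a new executor request.
object AllocationMath {
  def maxNumExecutorsNeeded(pendingTasks: Int, runningTasks: Int, tasksPerExecutor: Int): Int = {
    val totalTasks = pendingTasks + runningTasks
    // Ceiling division: one executor can run up to `tasksPerExecutor` tasks concurrently.
    (totalTasks + tasksPerExecutor - 1) / tasksPerExecutor
  }
}
```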
|
@sryza @andrewor14 Can you have a look? |
|
err, sorry @XuTingjun I was looking at an old version of the page, I deleted my comment about the jira |
|
Hi @XuTingjun I'll look at this today if not tomorrow thanks for the ping. |
|
ok to test |
|
Test build #35170 has finished for PR 6817 at commit
|
|
Test build #35207 has finished for PR 6817 at commit
|
|
I think the failed unit tests are unrelated to this patch; please retest. |
|
Jenkins, retest this please. |
|
Test build #35506 has finished for PR 6817 at commit
|
|
@XuTingjun From the description, I'm still having a hard time trying to understand what the symptom is. From the JIRA:
What do you mean by "won't add"? Are the resubmitted tasks not being run on the new executor? Or are we not requesting new executors? Could you give a detailed example of a case where this happens? |
|
Also, echoing @squito's comments, I don't really see why a new |
|
@andrewor14, sorry for my poor English. The problem is: |
|
@andrewor14, have you understood the problem? The new attempt tasks will be counted into |
|
@andrewor14, sorry to bother you again. I think it's really a bug; I wish you would have another look, thanks! |
|
Hi @XuTingjun, sorry for letting this slip. I'll have another look at this tomorrow. |
|
@XuTingjun I dug into the scheduler code a little. When a task is resubmitted, it uses a new task ID but not a new task index. To calculate the number of pending tasks, we use the task index, not the task ID. Therefore, it should handle resubmits correctly, since task indices are the same across multiple attempts of the same task. Could you clarify what the resulting behavior of this bug is? It would be useful to describe the symptoms without referring to the low-level implementation. I just want to know what consequences the issue has for the Spark user who knows nothing about |
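As a rough illustration of the index-based bookkeeping described in the comment above (a sketch with assumed names such as `stageIdToNumTasks` and `stageIdToTaskIndices`, not the literal Spark code), pending tasks per stage are derived from the set of task indices that have started, so a resubmitted attempt that reuses an already-recorded index leaves the pending count unchanged:

```scala
import scala.collection.mutable

// Editor's sketch of the bookkeeping under discussion; field names are assumptions.
class PendingTaskTracker {
  private val stageIdToNumTasks = mutable.Map[Int, Int]()
  private val stageIdToTaskIndices = mutable.Map[Int, mutable.HashSet[Int]]()

  def onStageSubmitted(stageId: Int, numTasks: Int): Unit =
    stageIdToNumTasks(stageId) = numTasks

  def onTaskStart(stageId: Int, taskIndex: Int): Unit = {
    // Keyed by task index, not task ID: a resubmitted attempt reuses its index,
    // so adding it again is a no-op and the pending count does not grow back.
    stageIdToTaskIndices.getOrElseUpdate(stageId, new mutable.HashSet[Int]) += taskIndex
  }

  def totalPendingTasks(): Int =
    stageIdToNumTasks.map { case (stageId, numTasks) =>
      numTasks - stageIdToTaskIndices.get(stageId).map(_.size).getOrElse(0)
    }.sum
}
```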
|
If anything, I would think that we should remove this line. If this task fails, then the next attempt would go to the else case of stageIdNumTasks.getOrElse(stageId, -1), which is not technically correct. It's safe to remove it because we remove it in stageCompleted anyway.
|
Can we delete lines 557-585? I think stageCompleted has also done this.
|
I'm not following... L557-585 refer to adding executors. stageCompleted does not even deal with executors.
|
Let me answer the question, “Could you clarify what the resulting behavior of this bug is?” @andrewor14 I don't agree with your opinion. Let me give an example: a stage has only one task, with task index 1, so one executor is allocated to this task. But when that task fails because of executor loss, a new attempt task will start, and no executor will be allocated, because |
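Plugging the one-task example into the two sketches above (still illustrative, reusing their assumed `PendingTaskTracker` and `AllocationMath` names), once the single task index has been recorded as started, the pending count stays at zero even after the attempt fails, so no executor is requested for the retry:

```scala
// Editor's worked example for the one-task stage described above.
val tracker = new PendingTaskTracker
tracker.onStageSubmitted(stageId = 0, numTasks = 1)
tracker.onTaskStart(stageId = 0, taskIndex = 1)    // first attempt starts on the executor
// ... the executor is lost, the attempt fails, and a new attempt is resubmitted ...
assert(tracker.totalPendingTasks() == 0)           // the retry is invisible to the manager
assert(AllocationMath.maxNumExecutorsNeeded(
  pendingTasks = tracker.totalPendingTasks(),
  runningTasks = 0,
  tasksPerExecutor = 1) == 0)                      // so no new executor is ever requested
```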
|
Yes, so in that case we should just remove the task from … Do you understand my proposal? |
|
Yeah, I got it. I think we can add the code below into |
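The code referred to in the comment above is not preserved in this extract, but the proposal amounts to something like the following sketch (assumed names, not the merged patch verbatim): when an attempt ends unsuccessfully, drop its index from the stage's started-index set so the upcoming retry counts as pending again.

```scala
import scala.collection.mutable

// Editor's sketch of the proposed tweak; names are assumptions, not the exact patch.
def handleFailedTaskEnd(
    stageIdToTaskIndices: mutable.Map[Int, mutable.HashSet[Int]],
    stageId: Int,
    taskIndex: Int,
    succeeded: Boolean): Unit = {
  if (!succeeded) {
    // Release the failed attempt's index so the pending count sees the retry.
    stageIdToTaskIndices.get(stageId).foreach(_.remove(taskIndex))
  }
}
```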
|
Test build #233 has finished for PR 6817 at commit
|
|
Test build #39862 has finished for PR 6817 at commit
|
|
retest this please, this is failing a test that was already fixed in upstream |
|
Test build #40049 has finished for PR 6817 at commit
|
|
Test build #40160 has finished for PR 6817 at commit
|
|
update or remove this comment
|
@XuTingjun The fix here is not correct, because after all tasks have been scheduled we keep asking for executors even though we don't need them. I would just add back the entire code block you deleted in L602-610 and call …
(1) We stop requesting executors once we have started all tasks (not finished).
Does that make sense? |
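One plausible shape of the behavior settled on here, as a hedged sketch rather than the merged diff (the hook names `onSchedulerQueueEmpty` and `onSchedulerBacklogged` and the field names are assumptions): executor requests stop once every task has started, and a failure both frees the task index and marks the scheduler as backlogged again so a request goes out for the retry.

```scala
import scala.collection.mutable

// Editor's sketch only; hook and field names are assumptions, not the merged code.
class AllocationHooks {
  var backlogged = false
  def onSchedulerQueueEmpty(): Unit = backlogged = false  // stop asking for executors
  def onSchedulerBacklogged(): Unit = backlogged = true   // start asking again
}

class StageBookkeeping(numTasks: Int, hooks: AllocationHooks) {
  private val startedIndices = mutable.HashSet[Int]()

  private def pendingTasks: Int = numTasks - startedIndices.size

  def onTaskStart(taskIndex: Int): Unit = {
    startedIndices += taskIndex
    // (1) Stop requesting executors once all tasks have *started*, not finished.
    if (pendingTasks == 0) hooks.onSchedulerQueueEmpty()
  }

  def onTaskEnd(taskIndex: Int, succeeded: Boolean): Unit = {
    if (!succeeded) {
      // A failed attempt will be resubmitted: mark the scheduler as backlogged
      // again if the queue had drained, and free the index so the retry is pending.
      if (pendingTasks == 0) hooks.onSchedulerBacklogged()
      startedIndices -= taskIndex
    }
  }
}
```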
|
@andrewor14 Makes better sense to me. Thanks for the explanation, Andrew. |
|
@andrewor14, I understand what you mean. |
|
Maybe we can change the code below, right? to |
|
@XuTingjun yes I think that looks fine. Would you mind testing this change on a real cluster? This scenario is somewhat tricky and it would be good to verify that it works outside of unit tests. |
|
@andrewor14, I have tested it on a real cluster; it works fine. |
|
retest this please |
|
Latest changes LGTM, will merge once tests pass. Thanks for your persistence @XuTingjun |
|
Test build #40476 timed out for PR 6817 at commit |
|
retest this please |
|
retest this please |
|
Test build #40509 timed out for PR 6817 at commit |
|
Test build #1461 has finished for PR 6817 at commit
|
|
Test build #40526 has finished for PR 6817 at commit
|
|
Alright, merging into master and 1.5. |
Author: xutingjun <xutingjun@huawei.com>
Author: meiyoula <1039320815@qq.com>

Closes #6817 from XuTingjun/SPARK-8366.

(cherry picked from commit b85f9a2)
Signed-off-by: Andrew Or <andrew@databricks.com>