-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-6973]remove skipped stage ID from completed set on the allJobsPage #5550
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Can one of the admins verify this patch? |
|
No, I disagree with this change. Skipped stages will never be completed, so this will show things like "2/3 (1 skipped)" which sounds like something is still waiting to run. |
|
Yeah, there will be this result. But consider the bug described in the jira, I think it's more reasonable. |
|
I think you're describing a different question in the JIRA, which is how to count retried stages? I think retried stages are counted separately, but not skipped stages. Is that the issue? |
|
Sorry, I can't attach the screenshot of stages, so maybe I described not clearly. |
|
OK, the screenshot and new description clarify it. Stage 0 completes, so is counted towards the total and completed stage IDs. It is retried because Stage 1 failed (?) and is skipped. It's still considered to have completed, but now does not count against the total since it was also skipped (the second time). The same logic works for tasks since tasks are not retried under the same ID, I think? The problem with this change as it is, is what I mentioned -- if a stage doesn't complete and is skipped then you have something like "2/3 (1 skipped)" which isn't great. However to fix this, should a stage's ID be removed from the set of completed stages if it is retried? |
|
@srowen , do you mean we should count the first time of stages into total/completed stages? |
|
As far as I can see, the fix is as I say -- when a stage begins a retry, it should no longer be considered a completed stage. I'm not clear if that messes up other logic, but, it sounds right. It should solve this problem. |
|
@srowen As far as you said, a retry stage should be considered as an uncompleted stage. |
|
I think that it would then show BTW I don't think you want to change the logic for task counts since it doesn't have the same problem, I think. There's no reused task ID as there was with stage ID, I think. |
|
@srowen I have update the patch code, delete the skipped stage from completed set. |
|
I like that you've introduced a set for skipped stages, since there is also a problem with the same stage being skipped many times! But can It still feels funny to count a stage as completed and skipped at the same time. How about just removing a stage ID from the set of completed stages in Then you don't need a new CC @JoshRosen to make sure we're not going off track here, as I think you added the skipped-stage logic here. |
|
@srowen Sorry for my English, I may miss something. Actually your idea is that, we should count the last retry status of stage into completed/total, right? |
|
@XuTingjun If a stage completes, but then is retried and skipped on the second execution, then it's counted as both "completed" and "skipped". Yes, I think it should not be counted as "completed". Your change does that, but I'm saying it might make this more error-prone later to count the stage ID in both the "completed" and "skipped" sets. It also requires this new |
|
But I think I have done that in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this case? A stage didn't fail but never submitted, so is it skipped?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, like the case described in the jira.
|
@srowen I have updated the code, please have a look. |
|
Jenkins, retest this please |
|
This is looking good although I'd still love to get more experienced eyes on this. Are you able to verify this fixed your problem? |
|
Test build #30837 has started for PR 5550 at commit |
|
@srowen I have tested it, it has fixed my problem. Thank you very much. |
|
Test build #30837 has finished for PR 5550 at commit
|
|
Test PASSed. |
|
@JoshRosen Can you have a look on this? We need your opinion. |
|
Can one of the admins verify this patch? |
|
@srowen Can you deal with this patch? Thanks ! |
|
I'm reasonably confident about this change given the discussion. I suppose it's also a mostly cosmetic change so lower risk. Let me see again if @JoshRosen can take a look as a double-check? |
|
Sorry for letting this sit in my review queue for so long. The original idea behind skipped stages was to indicate that there were pending stages that might have been needed to be computed if previous stages' outputs were lost, but which weren't actually executed as part of this job. Intuitively, I think it makes sense for a completed job's progress bar to be at 100%. It looks like this patch is trying to address some corner-cases where we report a numeric progress (n/m) that's greater than 100%, which seems to happen when we either have pending stages that are actually executed instead of being skipped or when stages are recomputed due to failures. I guess there are a couple of options here for fixing the numeric progress indicator: we can either decrease the numerator or increase the denominator. Intuitively, it probably makes sense for the number of stages listed in the progress bar to match the number of stages actually completed on the job details page, so I guess that's increasing the denominator. The fact that we display (j failed) already conveys that extra work was performed, so I don't think we need to allow > 100% progress indicators to convey that the job may have done extra work due to failures. |
|
Thanks @JoshRosen that makes sense. In that case we're back to @XuTingjun's original patch, which simply doesn't subtract the number of skipped stages from the total (denominator). Can a stage ever be skipped without first completing once? This could be a dumb question. That is, is it possible that a stage is then counted in the total (denominator) but not in completed stages (numerator) because it was skipped and was never seen to complete? Then we have the problem that it will never reach m/m complete. If that can't happen, then I agree with the original change actually. Then I think we need both the current state of this change, which defends against a slightly different problem (stage completes, retries, but then fails, but is always counted complete), and the original change, which just removed count of skipped from the total. |
|
I agree the opinion said by @srowen before. In my case, job details page has one completed and one skipped stage, so I think decrease the numerator is better. |
|
Hm, opinion seems pretty evenly split on whether to a) "decrease the numerator" or b) "increase the denominator". I think the current state of this patch is a), and adding back the original commit makes it into b), so both are easy enough. Does anyone feel strongly one way or the other? I'm inclined to proceed as-is just to resolve this, which puts it in a slightly better state than today no matter what. |
|
Going once, twice ... adding this change seems better than leaving it as-is, and the other behavior is easy to implement if desired with another one-line change. Any objections to merging? |
…sPage Though totalStages = allStages - skippedStages is understandable. But consider the problem [SPARK-6973], I think totalStages = allStages is more reasonable. Like "2/1 (2 failed) (1 skipped)", this item also shows the skipped num, it also will be understandable. Author: Xu Tingjun <xutingjun@huawei.com> Author: Xutingjun <xutingjun@huawei.com> Author: meiyoula <1039320815@qq.com> Closes #5550 from XuTingjun/allJobsPage and squashes the following commits: a742541 [Xu Tingjun] delete the loop 40ce94b [Xutingjun] remove stage id from completed set if it retries again 6459238 [meiyoula] delete space 9e23c71 [Xu Tingjun] recover numSkippedStages b987ea7 [Xutingjun] delete skkiped stages from completed set 47525c6 [Xu Tingjun] modify total stages/tasks on the allJobsPage
|
@srowen, The jira has been updated to resolved, I think this patch can merged, right? |
|
@XuTingjun Yes it was already merged: a8077e5 It didn't auto-close for some reason. You can close it. |
…sPage Though totalStages = allStages - skippedStages is understandable. But consider the problem [SPARK-6973], I think totalStages = allStages is more reasonable. Like "2/1 (2 failed) (1 skipped)", this item also shows the skipped num, it also will be understandable. Author: Xu Tingjun <xutingjun@huawei.com> Author: Xutingjun <xutingjun@huawei.com> Author: meiyoula <1039320815@qq.com> Closes apache#5550 from XuTingjun/allJobsPage and squashes the following commits: a742541 [Xu Tingjun] delete the loop 40ce94b [Xutingjun] remove stage id from completed set if it retries again 6459238 [meiyoula] delete space 9e23c71 [Xu Tingjun] recover numSkippedStages b987ea7 [Xutingjun] delete skkiped stages from completed set 47525c6 [Xu Tingjun] modify total stages/tasks on the allJobsPage
…sPage Though totalStages = allStages - skippedStages is understandable. But consider the problem [SPARK-6973], I think totalStages = allStages is more reasonable. Like "2/1 (2 failed) (1 skipped)", this item also shows the skipped num, it also will be understandable. Author: Xu Tingjun <xutingjun@huawei.com> Author: Xutingjun <xutingjun@huawei.com> Author: meiyoula <1039320815@qq.com> Closes apache#5550 from XuTingjun/allJobsPage and squashes the following commits: a742541 [Xu Tingjun] delete the loop 40ce94b [Xutingjun] remove stage id from completed set if it retries again 6459238 [meiyoula] delete space 9e23c71 [Xu Tingjun] recover numSkippedStages b987ea7 [Xutingjun] delete skkiped stages from completed set 47525c6 [Xu Tingjun] modify total stages/tasks on the allJobsPage
Though totalStages = allStages - skippedStages is understandable. But consider the problem [SPARK-6973], I think totalStages = allStages is more reasonable. Like "2/1 (2 failed) (1 skipped)", this item also shows the skipped num, it also will be understandable.