Skip to content

Conversation

@XuTingjun
Copy link
Contributor

Though totalStages = allStages - skippedStages is understandable. But consider the problem [SPARK-6973], I think totalStages = allStages is more reasonable. Like "2/1 (2 failed) (1 skipped)", this item also shows the skipped num, it also will be understandable.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@srowen
Copy link
Member

srowen commented Apr 17, 2015

No, I disagree with this change. Skipped stages will never be completed, so this will show things like "2/3 (1 skipped)" which sounds like something is still waiting to run.

@XuTingjun
Copy link
Contributor Author

Yeah, there will be this result. But consider the bug described in the jira, I think it's more reasonable.

@srowen
Copy link
Member

srowen commented Apr 17, 2015

I think you're describing a different question in the JIRA, which is how to count retried stages? I think retried stages are counted separately, but not skipped stages. Is that the issue?

@XuTingjun
Copy link
Contributor Author

Sorry, I can't attach the screenshot of stages, so maybe I described not clearly.
I have update the description in jira. @srowen

@srowen
Copy link
Member

srowen commented Apr 17, 2015

OK, the screenshot and new description clarify it. Stage 0 completes, so is counted towards the total and completed stage IDs. It is retried because Stage 1 failed (?) and is skipped. It's still considered to have completed, but now does not count against the total since it was also skipped (the second time).

The same logic works for tasks since tasks are not retried under the same ID, I think?

The problem with this change as it is, is what I mentioned -- if a stage doesn't complete and is skipped then you have something like "2/3 (1 skipped)" which isn't great.

However to fix this, should a stage's ID be removed from the set of completed stages if it is retried?
CC @JoshRosen

@XuTingjun
Copy link
Contributor Author

@srowen , do you mean we should count the first time of stages into total/completed stages?
@JoshRosen , Do you have any suggestions?

@XuTingjun XuTingjun closed this Apr 20, 2015
@XuTingjun XuTingjun reopened this Apr 20, 2015
@srowen
Copy link
Member

srowen commented Apr 20, 2015

As far as I can see, the fix is as I say -- when a stage begins a retry, it should no longer be considered a completed stage. I'm not clear if that messes up other logic, but, it sounds right. It should solve this problem.

@SaintBacchus
Copy link
Contributor

@srowen As far as you said, a retry stage should be considered as an uncompleted stage.
The case described in the jira will be showed as 0/1 (1 skipped)(2 failed) for a successful job though there were some retries in it. Can it be acceptable for the spark user?

@srowen
Copy link
Member

srowen commented Apr 21, 2015

I think that it would then show 1/1 (1 skipped) (2 failed) The first task would not be considered completed, but then would not count against the total as it was skipped. The only odd thing about this is that it doesn't reflect the fact that the first task completed once, but that's the limit of how much this "x/y" string can show, I guess. It makes some sense because we say "1 skipped" and "2 failed" which are things that happened in addition to the report 1/1 success. This might be the most coherent thing to do.

BTW I don't think you want to change the logic for task counts since it doesn't have the same problem, I think. There's no reused task ID as there was with stage ID, I think.

@XuTingjun
Copy link
Contributor Author

@srowen I have update the patch code, delete the skipped stage from completed set.

@srowen
Copy link
Member

srowen commented Apr 22, 2015

I like that you've introduced a set for skipped stages, since there is also a problem with the same stage being skipped many times! But can numSkippedStages now be removed?

It still feels funny to count a stage as completed and skipped at the same time. How about just removing a stage ID from the set of completed stages in onStageSubmitted? if a stage is being run (again), it can no longer be called "completed", I think?

Then you don't need a new diff method or to subtract out the skipped stages. I think the set is still probably correct for skipped stages since a stage could conceivably completed, restart, and be skipped multiple times (?)

CC @JoshRosen to make sure we're not going off track here, as I think you added the skipped-stage logic here.

@XuTingjun
Copy link
Contributor Author

@srowen Sorry for my English, I may miss something. Actually your idea is that, we should count the last retry status of stage into completed/total, right?

@srowen
Copy link
Member

srowen commented Apr 23, 2015

@XuTingjun If a stage completes, but then is retried and skipped on the second execution, then it's counted as both "completed" and "skipped". Yes, I think it should not be counted as "completed". Your change does that, but I'm saying it might make this more error-prone later to count the stage ID in both the "completed" and "skipped" sets. It also requires this new diff operation. What about removing a stage ID from the completed set the moment it is retried? then everything else works out? might be simpler.

@XuTingjun
Copy link
Contributor Author

But I think I have done that in onStageSubmitted, removing a stage ID from the completed set the moment it is retried. Have you seen that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this case? A stage didn't fail but never submitted, so is it skipped?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, like the case described in the jira.

@XuTingjun XuTingjun changed the title [SPARK-6973]modify total stages/tasks on the allJobsPage [SPARK-6973]remove skipped stage ID from completed set on the allJobsPage Apr 23, 2015
@XuTingjun
Copy link
Contributor Author

@srowen I have updated the code, please have a look.

@srowen
Copy link
Member

srowen commented Apr 23, 2015

Jenkins, retest this please

@srowen
Copy link
Member

srowen commented Apr 23, 2015

This is looking good although I'd still love to get more experienced eyes on this. Are you able to verify this fixed your problem?

@SparkQA
Copy link

SparkQA commented Apr 23, 2015

Test build #30837 has started for PR 5550 at commit a742541.

@XuTingjun
Copy link
Contributor Author

@srowen I have tested it, it has fixed my problem. Thank you very much.

@SparkQA
Copy link

SparkQA commented Apr 23, 2015

Test build #30837 has finished for PR 5550 at commit a742541.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30837/
Test PASSed.

@XuTingjun
Copy link
Contributor Author

@JoshRosen Can you have a look on this? We need your opinion.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@XuTingjun
Copy link
Contributor Author

@srowen Can you deal with this patch? Thanks !

@srowen
Copy link
Member

srowen commented May 4, 2015

I'm reasonably confident about this change given the discussion. I suppose it's also a mostly cosmetic change so lower risk. Let me see again if @JoshRosen can take a look as a double-check?

@JoshRosen
Copy link
Contributor

Sorry for letting this sit in my review queue for so long.

The original idea behind skipped stages was to indicate that there were pending stages that might have been needed to be computed if previous stages' outputs were lost, but which weren't actually executed as part of this job.

Intuitively, I think it makes sense for a completed job's progress bar to be at 100%. It looks like this patch is trying to address some corner-cases where we report a numeric progress (n/m) that's greater than 100%, which seems to happen when we either have pending stages that are actually executed instead of being skipped or when stages are recomputed due to failures. I guess there are a couple of options here for fixing the numeric progress indicator: we can either decrease the numerator or increase the denominator. Intuitively, it probably makes sense for the number of stages listed in the progress bar to match the number of stages actually completed on the job details page, so I guess that's increasing the denominator.

The fact that we display (j failed) already conveys that extra work was performed, so I don't think we need to allow > 100% progress indicators to convey that the job may have done extra work due to failures.

@srowen
Copy link
Member

srowen commented May 4, 2015

Thanks @JoshRosen that makes sense. In that case we're back to @XuTingjun's original patch, which simply doesn't subtract the number of skipped stages from the total (denominator).

Can a stage ever be skipped without first completing once? This could be a dumb question. That is, is it possible that a stage is then counted in the total (denominator) but not in completed stages (numerator) because it was skipped and was never seen to complete? Then we have the problem that it will never reach m/m complete.

If that can't happen, then I agree with the original change actually.

Then I think we need both the current state of this change, which defends against a slightly different problem (stage completes, retries, but then fails, but is always counted complete), and the original change, which just removed count of skipped from the total.

@XuTingjun
Copy link
Contributor Author

I agree the opinion said by @srowen before. In my case, job details page has one completed and one skipped stage, so I think decrease the numerator is better.

@XuTingjun
Copy link
Contributor Author

@JoshRosen

@srowen
Copy link
Member

srowen commented May 17, 2015

Hm, opinion seems pretty evenly split on whether to a) "decrease the numerator" or b) "increase the denominator". I think the current state of this patch is a), and adding back the original commit makes it into b), so both are easy enough.

Does anyone feel strongly one way or the other? I'm inclined to proceed as-is just to resolve this, which puts it in a slightly better state than today no matter what.

@srowen
Copy link
Member

srowen commented May 28, 2015

Going once, twice ... adding this change seems better than leaving it as-is, and the other behavior is easy to implement if desired with another one-line change. Any objections to merging?

XuTingjun added a commit that referenced this pull request Jun 6, 2015
…sPage

Though totalStages = allStages - skippedStages is understandable. But consider the problem [SPARK-6973], I think totalStages = allStages is more reasonable. Like "2/1 (2 failed) (1 skipped)", this item also shows the skipped num, it also will be understandable.

Author: Xu Tingjun <xutingjun@huawei.com>
Author: Xutingjun <xutingjun@huawei.com>
Author: meiyoula <1039320815@qq.com>

Closes #5550 from XuTingjun/allJobsPage and squashes the following commits:

a742541 [Xu Tingjun] delete the loop
40ce94b [Xutingjun] remove stage id from completed set if it retries again
6459238 [meiyoula] delete space
9e23c71 [Xu Tingjun] recover numSkippedStages
b987ea7 [Xutingjun] delete skkiped stages from completed set
47525c6 [Xu Tingjun] modify total stages/tasks on the allJobsPage
@XuTingjun
Copy link
Contributor Author

@srowen, The jira has been updated to resolved, I think this patch can merged, right?

@srowen
Copy link
Member

srowen commented Jun 8, 2015

@XuTingjun Yes it was already merged: a8077e5 It didn't auto-close for some reason. You can close it.

@XuTingjun XuTingjun closed this Jun 8, 2015
@XuTingjun XuTingjun deleted the allJobsPage branch June 8, 2015 08:38
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
…sPage

Though totalStages = allStages - skippedStages is understandable. But consider the problem [SPARK-6973], I think totalStages = allStages is more reasonable. Like "2/1 (2 failed) (1 skipped)", this item also shows the skipped num, it also will be understandable.

Author: Xu Tingjun <xutingjun@huawei.com>
Author: Xutingjun <xutingjun@huawei.com>
Author: meiyoula <1039320815@qq.com>

Closes apache#5550 from XuTingjun/allJobsPage and squashes the following commits:

a742541 [Xu Tingjun] delete the loop
40ce94b [Xutingjun] remove stage id from completed set if it retries again
6459238 [meiyoula] delete space
9e23c71 [Xu Tingjun] recover numSkippedStages
b987ea7 [Xutingjun] delete skkiped stages from completed set
47525c6 [Xu Tingjun] modify total stages/tasks on the allJobsPage
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
…sPage

Though totalStages = allStages - skippedStages is understandable. But consider the problem [SPARK-6973], I think totalStages = allStages is more reasonable. Like "2/1 (2 failed) (1 skipped)", this item also shows the skipped num, it also will be understandable.

Author: Xu Tingjun <xutingjun@huawei.com>
Author: Xutingjun <xutingjun@huawei.com>
Author: meiyoula <1039320815@qq.com>

Closes apache#5550 from XuTingjun/allJobsPage and squashes the following commits:

a742541 [Xu Tingjun] delete the loop
40ce94b [Xutingjun] remove stage id from completed set if it retries again
6459238 [meiyoula] delete space
9e23c71 [Xu Tingjun] recover numSkippedStages
b987ea7 [Xutingjun] delete skkiped stages from completed set
47525c6 [Xu Tingjun] modify total stages/tasks on the allJobsPage
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants