Marking as minimum upstream severity instead of max #1789
Conversation
…when ANY of its upstream tasks is `FAILED` or `DISABLED`. Moreover, when the wrapper task is marked as `UPSTREAM_DISABLED`, luigi shuts the worker down, as it should. This causes other PENDING tasks in the wrapper task not to run because the worker is down. This behavior should change: marking as `UPSTREAM_` should be ALL instead of ANY.
```diff
-                              for a_task_id in dep.deps),
-                          key=UPSTREAM_SEVERITY_KEY)
+        upstream_severities = list(upstream_status_table.get(a_task_id) for a_task_id in dep.deps if a_task_id in upstream_status_table) or ['']
+        status = min(upstream_severities, key=UPSTREAM_SEVERITY_KEY)
```
Oh, why wasn't this as simple as just switching `min` to `max`?
I saw a case that breaks when just switching `max` to `min`, but I don't remember it now. Let me try to reproduce it; I will explain it if it is still valid.
@Tarrasch Here is the case:

When any `SUCCESS` task exists among the upstream tasks, `upstream_status_table.get(a_task_id, '')` returns the empty string `''` even if all the other upstream tasks are `FAILED` or `DISABLED`, because there is no record in `upstream_status_table` for `SUCCESS` tasks. This causes the wrapper to be reported as PENDING all the time, even when there is no PENDING task among its upstream tasks. It also lets the worker run forever.
Here are some screenshots showing the two cases.

The case with just switching `max` to `min`:
As you can see, the wrapper task, `TaskLevelRetryPolicy`, runs all the time even though its dependencies are `SUCCESS` and `DISABLED`. This is happening because there is at least one `SUCCESS` task.
The case in my commit:

In this case, the empty-string upstream severity is ignored. First, it tries to find the minimum among the elements that are not the empty string. Even if there is a `SUCCESS` task, which would produce an empty-string entry, the selection is done among the real severities. You can see it in the screenshot.

So all I mean is: `max` works fine, but `min` doesn't work the same way.
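To make the point above concrete, here is a minimal standalone sketch of the failure mode; the severity order and table contents are illustrative assumptions rather than luigi's literal constants:

```python
# Illustrative upstream-severity ranking (roughly mirroring luigi's
# UPSTREAM_SEVERITY_ORDER); '' means "no upstream problem recorded".
UPSTREAM_SEVERITY_ORDER = ['', 'UPSTREAM_RUNNING', 'UPSTREAM_FAILED', 'UPSTREAM_DISABLED']
UPSTREAM_SEVERITY_KEY = UPSTREAM_SEVERITY_ORDER.index

# Hypothetical wrapper with three upstream tasks: one is SUCCESS (so it has
# no entry in the table), the other two are blocked.
deps = ['task_a', 'task_b', 'task_c']
upstream_status_table = {'task_b': 'UPSTREAM_FAILED', 'task_c': 'UPSTREAM_DISABLED'}

# Naively switching max to min: the SUCCESS dependency contributes '', the
# least severe value, so the wrapper never looks blocked and stays PENDING.
naive = min((upstream_status_table.get(a_task_id, '') for a_task_id in deps),
            key=UPSTREAM_SEVERITY_KEY)
print(naive)  # ''

# The approach in this PR: drop dependencies with no recorded severity first,
# then take the minimum of what remains.
upstream_severities = [upstream_status_table[a_task_id]
                       for a_task_id in deps
                       if a_task_id in upstream_status_table] or ['']
print(min(upstream_severities, key=UPSTREAM_SEVERITY_KEY))  # 'UPSTREAM_FAILED'
```

The `or ['']` fallback keeps the result well defined when every upstream task has already finished and the table therefore has no entries for any of them.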
Ah, beautiful. I was just about to tell you that "Hey, we need to ignore SUCCESS states", but now I realize that is what you've worked around here. Good job!
Okay. Let me try to write that test case now and verify that the behaviour changes as I expected. If it does, I'll merge this, followed by sending a PR with my own test case.
Previously, only the latter test case passed. Since spotify#1789 both tests pass. The hairy details can be understood by reading the test case, its comments, and the related pull requests.
There's really only one narrow case where this PR keeps workers alive: if all runnable tasks are currently being run by other workers. If all of those tasks finish (success or fail), then the worker will die. This will only make a difference if you rely on retrying often and expect new tasks to switch from FAILED to PENDING while the tasks are running. It seems that the identical effect could be achieved by simply adding a configuration parameter to count RUNNING tasks in the PENDING count in order to keep workers alive. Alternatively, we could add a timeout where workers stay alive until a configurable amount of time has passed without seeing any running tasks.

Meanwhile, my visualizer now gives a misleading under-count of the number of upstream-disabled and failed jobs. It seems quite weird that this count should grow as more jobs complete successfully. If an important job is being blocked by a failure, I don't want to have to wait until every other dependency is complete to learn about it. Am I missing something here?

Also, note that this change is equivalent to just changing the `UPSTREAM_SEVERITY_ORDER`, which I think is pretty weird, as it makes depending on a RUNNING job more severe than depending on a DISABLED job.
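For what it's worth, the ordering point can be illustrated with the same kind of sketch (the severity order here is again an assumption, not a quote of luigi's source): once the empty entries are filtered out, taking the minimum lets a RUNNING dependency outrank a DISABLED one.

```python
UPSTREAM_SEVERITY_ORDER = ['', 'UPSTREAM_RUNNING', 'UPSTREAM_FAILED', 'UPSTREAM_DISABLED']
UPSTREAM_SEVERITY_KEY = UPSTREAM_SEVERITY_ORDER.index

# One upstream task is still running, another is disabled.
severities = ['UPSTREAM_RUNNING', 'UPSTREAM_DISABLED']

# Old behaviour: the most severe dependency wins.
print(max(severities, key=UPSTREAM_SEVERITY_KEY))  # 'UPSTREAM_DISABLED'

# Behaviour after this PR: the least severe (non-empty) dependency wins,
# which is what reversing the severity order would also give you.
print(min(severities, key=UPSTREAM_SEVERITY_KEY))  # 'UPSTREAM_RUNNING'
```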
What you say is correct. The interpretation of "task X is an upstream disabled task" changed with this patch:
The *star is because it's actually disabled + upstream disabled. I think both definitions are pretty hard to grasp, and many luigi users don't care to learn the difference. But if anything, I actually do agree that the former is more intuitive. I'm not sure if it's better, though.

The old definition is very sensitive. Imagine a big enterprise where a leaf task from team Z has failed. Team A knows nothing about the details of team Z's tasks, but they transitively depend on it. Still, their task is marked as reddish even though most dependencies are progressing. However, team Z is probably fully aware of the problem; they've got their error emails and can see that their task is "FAILED" in the scheduler.

I'm kind of open to reverting this if you really think the old way to count is more useful, as long as the test case from #1790 still passes. The only thing I'm against is making the two ways of counting configurable, as it is too detailed for the typical luigi user to care about.
My point of view is that if any upstream task is failed or disabled, your …

I'll put something together to handle the test case. I didn't see that was …
@daveFNbuck you are right, but upstreams are not just used for that; one of the upstream statuses, `UPSTREAM_DISABLED`, is also used when counting pending tasks. Here is how luigi uses it:

```python
if task.status == PENDING and in_workers:
    upstream_status = self._upstream_status(task.id, upstream_table)
    if upstream_status != UPSTREAM_DISABLED:
        locally_pending_tasks += 1
        if len(task.workers) == 1 and not assistant:
            n_unique_pending += 1
```

I understand what you mean, but isn't it problematic to not let workers keep working on a task that needs to be retried when there is no pending task? I think this is a more critical issue than not letting it retry a task in this case. If you think the kind of monitoring you mentioned is also important, which I agree with, I think we should do it in a way that does not revert the feature in this PR.
Thanks for everyone's input. My summary so far would be that we want these features in luigi:
An idea for a solution could be to …
I have a different solution that I think fixes the issues. We just count FAILED tasks as pending in count_pending in the scheduler. That way we keep the worker alive until all jobs are fully blocked by disabled jobs. I have a branch that does this, but it breaks some worker tests because they assume everything will stop once all jobs are failed, so it'll take a bit of effort to fix each of those appropriately. With this, we don't need BLOCKED_DISABLED, as we're more directly getting at the idea of allowing re-runs until jobs are disabled. I personally like having UPSTREAM_FAILURE, but I can see your side too.

Someone on my team is also currently adding a READY state to our visualizer that will filter to just show PENDING tasks where every dependency is DONE. I think this is more useful than UPSTREAM_FAILURE, and it keeps the nice balance of 8 items in two rows of 4 if we remove a status when we add this.
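A rough, hypothetical sketch of the count-FAILED-as-pending idea (names and data shapes here are made up for illustration; the real change would live in the scheduler's pending-count logic, similar to the loop quoted earlier):

```python
PENDING, FAILED, DISABLED = 'PENDING', 'FAILED', 'DISABLED'
UPSTREAM_DISABLED = 'UPSTREAM_DISABLED'

def count_pending(tasks, upstream_status_of):
    """Count tasks that should keep a worker alive.

    FAILED tasks are counted as pending too, so the worker only exits once
    every remaining task is DISABLED or blocked by a disabled upstream task.
    """
    locally_pending = 0
    for task in tasks:
        if task['status'] in (PENDING, FAILED):
            if upstream_status_of(task['id']) != UPSTREAM_DISABLED:
                locally_pending += 1
    return locally_pending

# A FAILED task that is still retryable keeps the worker alive; a DISABLED
# task does not.
tasks = [{'id': 'a', 'status': FAILED}, {'id': 'b', 'status': DISABLED}]
print(count_pending(tasks, lambda task_id: ''))  # 1
```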
As for the count-FAILED-as-pending idea: it rings well in my ears, probably because we make clear the intention that those tasks will be re-run. As for the READY state, I also like it. @daveFNbuck, were you up for eventually implementing any of these ideas?
Yeah, I have an implementation mostly done. I'll be on vacation this week, …
🎉 as in vacation. Have fun! :D
Description
Luigi marks a wrapper task as `UPSTREAM_FAILED` or `UPSTREAM_DISABLED` when ANY of its upstream tasks is `FAILED` or `DISABLED`. This should be ALL instead of ANY.

Motivation and Context
Luigi marks a wrapper task as `UPSTREAM_FAILED` or `UPSTREAM_DISABLED` when ANY of its upstream tasks is `FAILED` or `DISABLED`. Moreover, when the wrapper task is marked as `UPSTREAM_DISABLED`, luigi shuts the worker down, as it should. This causes other PENDING tasks in the wrapper task not to run because the worker is down. This behavior should change: marking as `UPSTREAM_` should be ALL instead of ANY. This is discussed in PR #1782; for more detail about the problem, please check it.

This is also related to #871.
Have you tested this? If so, how?
Using tox locally and debugging it in PyCharm.