Marking as minimum upstream severity instead of max #1789

Merged: 1 commit merged into spotify:master on Jul 28, 2016

Conversation

@javrasya (Contributor) commented Jul 28, 2016

Description

Luigi marks a wrapper task as UPSTREAM_FAILED or UPSTREAM_DISABLED when ANY of its upstream tasks is FAILED or DISABLED. This should be ALL instead of ANY.

Motivation and Context

Luigi marks a wrapper task as UPSTREAM_FAILED or UPSTREAM_DISABLED when ANY of its upstream tasks is FAILED or DISABLED. Moreover, when the wrapper task is marked as UPSTREAM_DISABLED, luigi shuts the worker down, as it should. This causes other PENDING tasks under the wrapper task not to run because the worker is down. This behavior should change: marking as UPSTREAM_ should be based on ALL instead of ANY.

This is discussed in PR #1782; for more detail about the problem, please check it.

This is also related to #871.

Have you tested this? If so, how?

Using tox locally and debugging it in PyCharm.

-            status = max((upstream_status_table.get(a_task_id, '')
-                          for a_task_id in dep.deps),
-                         key=UPSTREAM_SEVERITY_KEY)
+            upstream_severities = list(upstream_status_table.get(a_task_id) for a_task_id in dep.deps if a_task_id in upstream_status_table) or ['']
+            status = min(upstream_severities, key=UPSTREAM_SEVERITY_KEY)
@Tarrasch (Contributor)

Oh, why wasn't this as simple as just switching min to max?

@javrasya (Contributor Author)

I saw a case that breaks when just switching max to min, but I don't remember it now. Let me try to reproduce it; I'll explain it if it's still valid.

@javrasya (Contributor Author) commented Jul 28, 2016

@Tarrasch Here is the case:

When any SUCCESS task exists among the upstream tasks, upstream_status_table.get(a_task_id, '') returns the empty string '' even if all the other upstream tasks are FAILED or DISABLED, because there is no record in upstream_status_table for SUCCESS tasks. This causes the wrapper to stay PENDING all the time even when there is no PENDING task among its upstream tasks, which also lets the worker run forever.

Here are two screenshots showing the two cases.

The case with just switching max to min:

[screenshot: 2016-07-28 at 11:11:53]

As you can see, the wrapper task, TaskLevelRetryPolicy, runs all the time, even though its dependencies are SUCCESS and DISABLED. This is happening because there is at least one SUCCESS task.

The case with what is in my commit:

[screenshot: 2016-07-28 at 11:14:51]

In this case, the empty-string upstream severity is ignored: it first tries to find the min among the elements that are not the empty string. Even when there is a SUCCESS task, which causes an empty-string entry, the elimination is done among the real severities. You can see it in the screenshot.

So all I mean is: max works fine, but min doesn't work the same way.
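
To make it concrete, here is a minimal runnable sketch (not luigi's actual code; the severity constants and the table contents are simplified assumptions) contrasting the naive min with what the commit does:

# Simplified stand-ins for luigi's constants (assumption, for illustration).
UPSTREAM_SEVERITY_ORDER = ('', 'UPSTREAM_RUNNING', 'UPSTREAM_MISSING_INPUT',
                           'UPSTREAM_FAILED', 'UPSTREAM_DISABLED')
UPSTREAM_SEVERITY_KEY = UPSTREAM_SEVERITY_ORDER.index

deps = ['a', 'b', 'c']
# SUCCESS tasks have no entry in the table, so the SUCCESS task 'a' is absent.
upstream_status_table = {'b': 'UPSTREAM_FAILED', 'c': 'UPSTREAM_DISABLED'}

# Naive min: the SUCCESS task contributes '', the least severe value, so the
# result is '' (the task looks merely PENDING) whenever any dependency succeeded.
naive = min((upstream_status_table.get(t, '') for t in deps),
            key=UPSTREAM_SEVERITY_KEY)
assert naive == ''

# The commit's approach: skip tasks with no recorded severity, falling back
# to [''] only when nothing is recorded at all.
severities = [upstream_status_table[t] for t in deps
              if t in upstream_status_table] or ['']
fixed = min(severities, key=UPSTREAM_SEVERITY_KEY)
assert fixed == 'UPSTREAM_FAILED'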

@Tarrasch (Contributor) commented Jul 28, 2016

Ah, beautiful. I was just about to tell you that "Hey, we need to ignore SUCCESS states", but now I realize that is what you've worked around here. Good job!

@Tarrasch (Contributor)

Okay. Let me try to write that test case now and verify that the behaviour changes as I expected. If it does, I'll merge this, followed by a PR with my own test case.

Tarrasch merged commit b12be00 into spotify:master on Jul 28, 2016
Tarrasch added a commit to Tarrasch/luigi that referenced this pull request Jul 28, 2016
Previously, only the latter test case passed. Since spotify#1789 both tests
pass. The hairy details can be understood by reading the test case, its
comments, and the related pull requests.
Tarrasch added a commit that referenced this pull request Jul 29, 2016
Previously, only the latter test case passed. Since #1789 both tests
pass. The hairy details can be understood by reading the test case, its
comments, and the related pull requests.
@daveFNbuck (Contributor)

There's really only one narrow case where this PR keeps workers alive: when all runnable tasks are currently being run by other workers. If all of those tasks finish (success or fail), then the worker will die. This will only make a difference if you rely on retrying often and expect new tasks to switch from FAILED to PENDING while the tasks are running.

It seems that the identical effect could be achieved by simply adding a configuration parameter to count RUNNING tasks in the PENDING count in order to keep workers alive. Alternatively, we could add a timeout where workers stay alive until a configurable amount of time has passed without seeing any running tasks.

Meanwhile, my visualizer now gives a misleading under-count of the number of upstream disabled and failed jobs. It seems quite weird that this count should grow as more jobs complete successfully. If an important job is being blocked by a failure, I don't want to have to wait until every other dependency is complete to learn about it.

Am I missing something here?

Also, note that this change is equivalent to just changing the UPSTREAM_SEVERITY_ORDER to

UPSTREAM_SEVERITY_ORDER = (
    '',
    UPSTREAM_DISABLED,
    UPSTREAM_FAILED,
    UPSTREAM_MISSING_INPUT,
    UPSTREAM_RUNNING,
)

I think this is pretty weird, as it makes depending on a RUNNING job more severe than depending on a DISABLED job.
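
A quick way to sanity-check that equivalence claim (a sketch with stand-in strings, not luigi's actual constants):

import itertools

ORIGINAL = ('', 'UPSTREAM_RUNNING', 'UPSTREAM_MISSING_INPUT',
            'UPSTREAM_FAILED', 'UPSTREAM_DISABLED')
REVERSED = ('', 'UPSTREAM_DISABLED', 'UPSTREAM_FAILED',
            'UPSTREAM_MISSING_INPUT', 'UPSTREAM_RUNNING')

def min_skipping_success(severities):
    # The behaviour after this PR: drop '' (SUCCESS) entries, then take min.
    recorded = [s for s in severities if s] or ['']
    return min(recorded, key=ORIGINAL.index)

def max_with_reversed_order(severities):
    # The old max, but with the severity order flipped as shown above.
    return max(severities, key=REVERSED.index)

# The two agree on every combination of three upstream severities.
for combo in itertools.product(ORIGINAL, repeat=3):
    assert min_skipping_success(combo) == max_with_reversed_order(combo)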

daveFNbuck added a commit to Houzz/luigi that referenced this pull request Nov 10, 2016
@Tarrasch (Contributor)

> Meanwhile, my visualizer now gives a misleading under-count of the number of upstream disabled and failed jobs. It seems quite weird that this count should grow as more jobs complete successfully. If an important job is being blocked by a failure, I don't want to have to wait until every other dependency is complete to learn about it.
>
> Am I missing something here?

What you say is correct. The interpretation of "task X is an upstream disabled task" changed with this patch:

  • Before, it meant "task X has some dependency which is disabled* (but no dependency is worse off)".
  • Now it reads "Nothing on X or its dependencies is progressing! All non-complete dependencies of X are (at least) disabled*".

The * is because it's actually disabled + upstream disabled.


I think both definitions are pretty hard to grasp, and many luigi users don't care to learn the difference. If anything, I actually do agree that the former is more intuitive. I'm not sure it's better, though. The old definition is very sensitive: imagine a big enterprise where a leaf task from team Z has failed. Team A knows nothing about the details of team Z's tasks, but they transitively depend on it, and still their task is marked as reddish even though most dependencies are progressing. However, team Z is probably fully aware of the problem; they've got their error emails and can see that their task is "FAILED" in the scheduler.

I'm kind of open to reverting this if you really think the old way of counting is more useful, as long as the test case from #1790 still passes.

The only thing I'm against is making the two ways of counting configurable, as it is too detailed for the typical luigi user to care about.

@daveFNbuck (Contributor)

My point of view is that if any upstream task is failed or disabled, your task will not run. It's a bit weird not to mark it reddish until it completely stops getting new dependencies run. Seeing an accurate upstream-failed count makes it easier to know that there's a serious problem. Seeing a job you're tracking marked as upstream failed tells you that there's something you need to look into to make sure it gets done. Maybe you need to fix it, maybe you need to ping another team to make sure they're on it.

I'll put something together to handle the test case. I didn't see that it was there!


@javrasya (Contributor Author) commented Nov 11, 2016

> Seeing an accurate upstream failed count makes it easier to know that there's a serious problem. Seeing a job you're tracking marked as upstream failed tells you that there's something you need to look into to make sure it gets done.

@daveFNbuck you are right, but upstream statuses are not used just for that; one of them, UPSTREAM_DISABLED, is also used to decide whether workers should keep working or not. I mean, it is not only used to signal "hey, there is a problem, let's take a look". (There is no problem with UPSTREAM_FAILED, by the way.) Whenever a wrapper task is UPSTREAM_DISABLED, that also makes luigi stop working. This is why we changed the old behavior, which calculated the upstream status as the MAX among upstreams, to MIN.

Here is how luigi uses UPSTREAM_DISABLED.

            if task.status == PENDING and in_workers:
                upstream_status = self._upstream_status(task.id, upstream_table)
                if upstream_status != UPSTREAM_DISABLED:
                    locally_pending_tasks += 1
                    if len(task.workers) == 1 and not assistant:
                        n_unique_pending += 1

I understand what you mean, but isn't it problematic not to let workers keep working for a task that needs to be retried, just because there is no pending task? I think not letting it retry a task in this case is the more critical issue.

If you guys think the kind of monitoring you mentioned is also important, which I agree it is, then I think we should achieve it in a way that does not revert the feature in this PR.

@Tarrasch (Contributor) commented Nov 11, 2016

Thanks for everyone's input. My summary so far would be that we want these features in luigi:

  • The visualizer should indicate if there's anything at all downstream for a task that's DISABLED, as that thing most probably requires human intervention to get resolved.
  • Keep-alive workers should be smart and not quit just because there's failed stuff left, but they should give up on DISABLED. (Exactly this is tested in #1790, "tests: Add a testcase confirming #1789", as I recall.)

An idea for a solution could be to:

  • Remove UPSTREAM_FAILURE from the visualizer. People shouldn't care about it. Failures are normal and people (for dependent tasks) shouldn't be alarmed unnecessarily.
  • Introduce something called BLOCKED_DISABLED: UPSTREAM_DISABLED stays the more sensitive ANY-version, and BLOCKED_DISABLED is the ALL-version (see the sketch after this list).
  • Perhaps BLOCKED_DISABLED doesn't need to be in the visualizer as it might be confusing. I'm not sure about this.
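
A minimal sketch of the two semantics (hypothetical helper names; BLOCKED_DISABLED is only proposed here, not an existing luigi status):

# Each dependency's state, simplified to 'DONE', 'DISABLED', or anything else.
def upstream_disabled_any(dep_statuses):
    # The sensitive ANY-version: one disabled dependency taints the task.
    return any(s == 'DISABLED' for s in dep_statuses)

def blocked_disabled_all(dep_statuses):
    # The ALL-version: every non-complete dependency is disabled,
    # so nothing can make progress without human intervention.
    incomplete = [s for s in dep_statuses if s != 'DONE']
    return bool(incomplete) and all(s == 'DISABLED' for s in incomplete)

assert upstream_disabled_any(['DONE', 'DISABLED', 'PENDING'])
assert not blocked_disabled_all(['DONE', 'DISABLED', 'PENDING'])
assert blocked_disabled_all(['DONE', 'DISABLED'])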

@daveFNbuck (Contributor)

I have a different solution that I think fixes the issues. We just count FAILED tasks as pending in count_pending in the scheduler. That way we keep the worker alive until all jobs are fully blocked by disabled jobs. I have a branch that does this, but it breaks some worker tests because they assume everything will stop once all jobs are failed. So it'll take a bit of effort to fix each of those appropriately. With this, we don't need BLOCKED_DISABLED, as we're more directly getting at the idea of allowing re-runs until jobs are disabled.
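
For illustration, a hypothetical sketch of that idea as a diff against the scheduler snippet quoted earlier (the actual branch may differ):

-            if task.status == PENDING and in_workers:
+            # Treat FAILED like PENDING when counting work left, so workers
+            # stay alive until every remaining task is blocked by a DISABLED upstream.
+            if task.status in (PENDING, FAILED) and in_workers:
                 upstream_status = self._upstream_status(task.id, upstream_table)
                 if upstream_status != UPSTREAM_DISABLED:
                     locally_pending_tasks += 1
                     if len(task.workers) == 1 and not assistant:
                         n_unique_pending += 1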

I personally like having UPSTREAM_FAILURE, but I can see your side too. Someone on my team is also currently adding a READY state to our visualizer, which just shows PENDING tasks where every dependency is DONE. I think this is more useful than UPSTREAM_FAILURE, and it keeps the nice balance of 8 items in two rows of 4 if we remove a status when we add this.

@Tarrasch (Contributor)

@daveFNbuck:

As for the count-FAILED-as-pending idea: it rings well in my ears, probably because it makes the intention clear that those tasks will be re-run.

As for the READY state, I also like it; BLOCKED_DISABLED is a bit long. As you pointed out, we should always keep the 8-item balance in mind too. :)

@daveFNbuck, would you be up for eventually implementing any of these ideas?

@daveFNbuck (Contributor)

Yeah, I have an implementation mostly done. I'll be on vacation this week, but I'll pick it up again after that.


@Tarrasch (Contributor)

🎉 as in vacation. Have fun! :D

daveFNbuck added a commit to Houzz/luigi that referenced this pull request Nov 21, 2016
daveFNbuck added a commit to Houzz/luigi that referenced this pull request Nov 28, 2016
Tarrasch pushed a commit that referenced this pull request Dec 1, 2016