Marking as minimum upstream severity instead of max #1789

Merged: 1 commit merged into spotify:master on Jul 28, 2016

Conversation

@javrasya (Contributor) commented Jul 28, 2016

Description

Luigi marks a wrapper task as UPSTREAM_FAILED or UPSTREAM_DISABLED when ANY of its upstream tasks is FAILED or DISABLED. This should be ALL instead of ANY.

Motivation and Context

Luigi marks a wrapper task as UPSTREAM_FAILED or UPSTREAM_DISABLED when ANY of its upstream tasks is FAILED or DISABLED. Moreover, when the wrapper task is marked as UPSTREAM_DISABLED, luigi shuts the worker down, as it should. This causes other PENDING tasks under the wrapper task not to run because the worker is down. This behavior should change: marking as UPSTREAM_ should be based on ALL instead of ANY.

This is discussed in PR #1782; for more detail about the problem, please check it.

This is also related to #871.

Have you tested this? If so, how?

Using tox locally and debugging it in PyCharm.

-            status = max((upstream_status_table.get(a_task_id, '')
-                          for a_task_id in dep.deps),
-                         key=UPSTREAM_SEVERITY_KEY)
+            upstream_severities = list(upstream_status_table.get(a_task_id) for a_task_id in dep.deps if a_task_id in upstream_status_table) or ['']
+            status = min(upstream_severities, key=UPSTREAM_SEVERITY_KEY)
@Tarrasch (Contributor)

Oh, why wasn't this as simple as just switching min to max?

@javrasya (Contributor Author)

I saw a case that breaks when just switching max to min, but I don't remember it now. Let me try to reproduce it; I'll explain it if it's still valid.

@javrasya (Contributor Author) commented Jul 28, 2016

@Tarrasch Here is the case:

When any SUCCESS task exists among the upstream tasks, upstream_status_table.get(a_task_id, '') returns the empty string '' even if all the other upstream tasks are FAILED or DISABLED, because there is no record in upstream_status_table for SUCCESS tasks. This causes the wrapper to stay PENDING all the time even when there is no PENDING task among its upstream tasks, which also lets the worker run forever.

Here are two screenshots showing the two cases.

The case with just switching max to min:

[screenshot: 2016-07-28 at 11:11:53]

As you can see, the wrapper task, TaskLevelRetryPolicy, runs all the time, even though its dependencies are SUCCESS and DISABLED. This is happening because there is at least one SUCCESS task.

The case with what is in my commit:

[screenshot: 2016-07-28 at 11:14:51]

In this case, the empty-string upstream severity is ignored: it first tries to find the min among the elements that are not the empty string. Even when there is a SUCCESS task, which causes an empty-string entry, the elimination is done among the real severities. You can see it in the screenshot.

So all I mean is: max works fine, but min doesn't work the same way.
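
To make it concrete, here is a minimal runnable sketch (not luigi's actual code; the severity constants and the table contents are simplified assumptions) contrasting the naive min with what the commit does:

# Simplified stand-ins for luigi's constants (assumption, for illustration).
UPSTREAM_SEVERITY_ORDER = ('', 'UPSTREAM_RUNNING', 'UPSTREAM_MISSING_INPUT',
                           'UPSTREAM_FAILED', 'UPSTREAM_DISABLED')
UPSTREAM_SEVERITY_KEY = UPSTREAM_SEVERITY_ORDER.index

deps = ['a', 'b', 'c']
# SUCCESS tasks have no entry in the table, so the SUCCESS task 'a' is absent.
upstream_status_table = {'b': 'UPSTREAM_FAILED', 'c': 'UPSTREAM_DISABLED'}

# Naive min: the SUCCESS task contributes '', the least severe value, so the
# result is '' (the task looks merely PENDING) whenever any dependency succeeded.
naive = min((upstream_status_table.get(t, '') for t in deps),
            key=UPSTREAM_SEVERITY_KEY)
assert naive == ''

# The commit's approach: skip tasks with no recorded severity, falling back
# to [''] only when nothing is recorded at all.
severities = [upstream_status_table[t] for t in deps
              if t in upstream_status_table] or ['']
fixed = min(severities, key=UPSTREAM_SEVERITY_KEY)
assert fixed == 'UPSTREAM_FAILED'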

@Tarrasch (Contributor) commented Jul 28, 2016

Ah, beautiful. I was just about to tell you that "Hey, we need to ignore SUCCESS states", but now I realize that is what you've worked around here. Good job!

@Tarrasch (Contributor)

Okay. Let me try to write that test case now and verify that the behaviour changes as I expected. If it does, I'll merge this, followed by a PR with my own test case.

Tarrasch merged commit b12be00 into spotify:master on Jul 28, 2016
Tarrasch added a commit to Tarrasch/luigi that referenced this pull request Jul 28, 2016
Previously, only the latter test case passed. Since spotify#1789 both tests
pass. The hairy details can be understood by reading the test case, its
comments, and the related pull requests.
Tarrasch added a commit that referenced this pull request Jul 29, 2016
Previously, only the latter test case passed. Since #1789 both tests
pass. The hairy details can be understood by reading the test case, its
comments, and the related pull requests.
@daveFNbuck (Contributor)

There's really only one narrow case where this PR keeps workers alive: when all runnable tasks are currently being run by other workers. If all of those tasks finish (success or fail), then the worker will die. This will only make a difference if you rely on retrying often and expect new tasks to switch from FAILED to PENDING while the tasks are running.

It seems that the identical effect could be achieved by simply adding a configuration parameter to count RUNNING tasks in the PENDING count in order to keep workers alive. Alternatively, we could add a timeout where workers stay alive until a configurable amount of time has passed without seeing any running tasks.

Meanwhile, my visualizer now gives a misleading under-count of the number of upstream disabled and failed jobs. It seems quite weird that this count should grow as more jobs complete successfully. If an important job is being blocked by a failure, I don't want to have to wait until every other dependency is complete to learn about it.

Am I missing something here?

Also, note that this change is equivalent to just changing the UPSTREAM_SEVERITY_ORDER to

UPSTREAM_SEVERITY_ORDER = (
    '',
    UPSTREAM_DISABLED,
    UPSTREAM_FAILED,
    UPSTREAM_MISSING_INPUT,
    UPSTREAM_RUNNING,
)

I think this is pretty weird, as it makes depending on a RUNNING job more severe than depending on a DISABLED job.
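
A quick way to sanity-check that equivalence claim (a sketch with stand-in strings, not luigi's actual constants):

import itertools

ORIGINAL = ('', 'UPSTREAM_RUNNING', 'UPSTREAM_MISSING_INPUT',
            'UPSTREAM_FAILED', 'UPSTREAM_DISABLED')
REVERSED = ('', 'UPSTREAM_DISABLED', 'UPSTREAM_FAILED',
            'UPSTREAM_MISSING_INPUT', 'UPSTREAM_RUNNING')

def min_skipping_success(severities):
    # The behaviour after this PR: drop '' (SUCCESS) entries, then take min.
    recorded = [s for s in severities if s] or ['']
    return min(recorded, key=ORIGINAL.index)

def max_with_reversed_order(severities):
    # The old max, but with the severity order flipped as shown above.
    return max(severities, key=REVERSED.index)

# The two agree on every combination of three upstream severities.
for combo in itertools.product(ORIGINAL, repeat=3):
    assert min_skipping_success(combo) == max_with_reversed_order(combo)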

daveFNbuck added a commit to Houzz/luigi that referenced this pull request Nov 10, 2016
@Tarrasch (Contributor)

> Meanwhile, my visualizer now gives a misleading under-count of the number of upstream disabled and failed jobs. It seems quite weird that this count should grow as more jobs complete successfully. If an important job is being blocked by a failure, I don't want to have to wait until every other dependency is complete to learn about it.
>
> Am I missing something here?

What you say is correct. The interpretation of "task X is an upstream disabled task" changed with this patch:

  • Before, it meant "task X has some dependency which is disabled* (but no dependency is worse off)".
  • Now it reads "Nothing on X or its dependencies is progressing! All non-complete dependencies of X are (at least) disabled*".

The * is because it's actually disabled + upstream disabled.


I think both definitions are pretty hard to grasp, and many luigi users don't care to learn the difference. If anything, I actually do agree that the former is more intuitive. I'm not sure it's better, though. The old definition is very sensitive: imagine a big enterprise where a leaf task from team Z has failed. Team A knows nothing about the details of team Z's tasks, but they transitively depend on it, and still their task is marked as reddish even though most dependencies are progressing. However, team Z is probably fully aware of the problem; they've got their error emails and can see that their task is "FAILED" in the scheduler.

I'm kind of open to reverting this if you really think the old way of counting is more useful, as long as the test case from #1790 still passes.

The only thing I'm against is making the two ways of counting configurable, as it is too detailed for the typical luigi user to care about.

@daveFNbuck (Contributor)

My point of view is that if any upstream task is failed or disabled, your task will not run. It's a bit weird not to mark it reddish until it completely stops getting new dependencies run. Seeing an accurate upstream-failed count makes it easier to know that there's a serious problem. Seeing a job you're tracking marked as upstream failed tells you that there's something you need to look into to make sure it gets done. Maybe you need to fix it, maybe you need to ping another team to make sure they're on it.

I'll put something together to handle the test case. I didn't see that it was there!


@javrasya (Contributor Author) commented Nov 11, 2016

> Seeing an accurate upstream failed count makes it easier to know that there's a serious problem. Seeing a job you're tracking marked as upstream failed tells you that there's something you need to look into to make sure it gets done.

@daveFNbuck you are right, but upstream statuses are not used just for that; one of them, UPSTREAM_DISABLED, is also used to decide whether workers should keep working or not. I mean, it is not only used to signal "hey, there is a problem, let's take a look". (There is no problem with UPSTREAM_FAILED, by the way.) Whenever a wrapper task is UPSTREAM_DISABLED, that also makes luigi stop working. This is why we changed the old behavior, which calculated the upstream status as the MAX among upstreams, to MIN.

Here is how luigi uses UPSTREAM_DISABLED.

            if task.status == PENDING and in_workers:
                upstream_status = self._upstream_status(task.id, upstream_table)
                if upstream_status != UPSTREAM_DISABLED:
                    locally_pending_tasks += 1
                    if len(task.workers) == 1 and not assistant:
                        n_unique_pending += 1

I understand what you mean, but isn't it problematic not to let workers keep working for a task that needs to be retried, just because there is no pending task? I think not letting it retry a task in this case is the more critical issue.

If you guys think the kind of monitoring you mentioned is also important, which I agree it is, then I think we should achieve it in a way that does not revert the feature in this PR.

@Tarrasch (Contributor) commented Nov 11, 2016

Thanks for everyone's input. My summary so far would be that we want these features in luigi:

  • The visualizer should indicate if there's anything at all downstream for a task that's DISABLED, as that thing most probably requires human intervention to get resolved.
  • Keep-alive workers should be smart and not quit just because there's failed stuff left, but they should give up on DISABLED. (Exactly this is tested in #1790, "tests: Add a testcase confirming #1789", as I recall.)

An idea for a solution could be to:

  • Remove UPSTREAM_FAILURE from the visualizer. People shouldn't care about it. Failures are normal and people (for dependent tasks) shouldn't be alarmed unnecessarily.
  • Introduce something called BLOCKED_DISABLED: UPSTREAM_DISABLED stays the more sensitive ANY-version, and BLOCKED_DISABLED is the ALL-version (see the sketch after this list).
  • Perhaps BLOCKED_DISABLED doesn't need to be in the visualizer as it might be confusing. I'm not sure about this.
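
A minimal sketch of the two semantics (hypothetical helper names; BLOCKED_DISABLED is only proposed here, not an existing luigi status):

# Each dependency's state, simplified to 'DONE', 'DISABLED', or anything else.
def upstream_disabled_any(dep_statuses):
    # The sensitive ANY-version: one disabled dependency taints the task.
    return any(s == 'DISABLED' for s in dep_statuses)

def blocked_disabled_all(dep_statuses):
    # The ALL-version: every non-complete dependency is disabled,
    # so nothing can make progress without human intervention.
    incomplete = [s for s in dep_statuses if s != 'DONE']
    return bool(incomplete) and all(s == 'DISABLED' for s in incomplete)

assert upstream_disabled_any(['DONE', 'DISABLED', 'PENDING'])
assert not blocked_disabled_all(['DONE', 'DISABLED', 'PENDING'])
assert blocked_disabled_all(['DONE', 'DISABLED'])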

@daveFNbuck (Contributor)

I have a different solution that I think fixes the issues. We just count FAILED tasks as pending in count_pending in the scheduler. That way we keep the worker alive until all jobs are fully blocked by disabled jobs. I have a branch that does this, but it breaks some worker tests because they assume everything will stop once all jobs are failed. So it'll take a bit of effort to fix each of those appropriately. With this, we don't need BLOCKED_DISABLED, as we're more directly getting at the idea of allowing re-runs until jobs are disabled.
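
For illustration, a hypothetical sketch of that idea as a diff against the scheduler snippet quoted earlier (the actual branch may differ):

-            if task.status == PENDING and in_workers:
+            # Treat FAILED like PENDING when counting work left, so workers
+            # stay alive until every remaining task is blocked by a DISABLED upstream.
+            if task.status in (PENDING, FAILED) and in_workers:
                 upstream_status = self._upstream_status(task.id, upstream_table)
                 if upstream_status != UPSTREAM_DISABLED:
                     locally_pending_tasks += 1
                     if len(task.workers) == 1 and not assistant:
                         n_unique_pending += 1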

I personally like having UPSTREAM_FAILURE, but I can see your side too. Someone on my team is also currently adding a READY state to our visualizer, which just shows PENDING tasks where every dependency is DONE. I think this is more useful than UPSTREAM_FAILURE, and it keeps the nice balance of 8 items in two rows of 4 if we remove a status when we add this.

@Tarrasch (Contributor)

@daveFNbuck:

As for the count-FAILED-as-pending idea: it rings well in my ears, probably because it makes the intention clear that those tasks will be re-run.

As for the READY state, I also like it; BLOCKED_DISABLED is a bit long. As you pointed out, we should always keep the 8-item balance in mind too. :)

@daveFNbuck, would you be up for eventually implementing any of these ideas?

@daveFNbuck (Contributor)

Yeah, I have an implementation mostly done. I'll be on vacation this week, but I'll pick it up again after that.


@Tarrasch (Contributor)

🎉 as in vacation. Have fun! :D

daveFNbuck added a commit to Houzz/luigi that referenced this pull request Nov 21, 2016
daveFNbuck added a commit to Houzz/luigi that referenced this pull request Nov 28, 2016
Tarrasch pushed a commit that referenced this pull request Dec 1, 2016