-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update retcodes to handle new cases #1771
Conversation
@fabriziodemaria, thanks for your PR! By analyzing the annotation information on this pull request, we identified @erikbern, @daveFNbuck and @freider to be potential reviewers |
Cool! I did a quick glance. One question quickly rose for me. Should we really convey that "unknown-->bad"? I think of unknown like literally unknown. Sometimes it's out of resources (not bad) and sometimes that task is disabled or still have a timeout for it's FAILED state (maybe bad, but the worker that previously failed has already sent error emails and stuff). What I'm saying is that this is a really good patch, because it allows for luigi users to react exactly the way the want based on return codes. But I propose that (1) we don't use |
@@ -481,18 +481,21 @@ We recommend that you copy this set of exit codes to your ``luigi.cfg`` file: | |||
missing_data=20 | |||
task_failed=30 | |||
scheduling_error=35 | |||
unknown_reason=38 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... and (2) I suggest we say 38 or 0
. And then explain more detailed when you might want what number.
I changed |
What about |
28766d3
to
dde2a24
Compare
Name it "pending_for_unknown_reason" maybe? That's what it is. And that hopefully also shows how it is worse than "pending". After all we're using ":|" for |
For when a task does not run successfully because of an unknown reason. Despite | ||
this case can be expected in an error free execution, it does not guarantee | ||
completeness of the root task. Return code 38 is advised only for cases in which | ||
Luigi return code 0 is used by the user to guarantee root tasks' completeness. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missed the comment suggesting "38 or 0". The comment doesn't describe the benefit. I would -1 this convoluted wording as overcomplication.
0 is already the default value. The existing users who perform retcode configuration won't be taken by surprise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, lets not complicate things. The benefit I was looking for was to say that this return code is "no problem, don't worry".
What about using a number but a much lower number? I noticed we already say they are in increasing order of severity (I had forgotten this). Maybe use the number 15
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, I was going to suggest 25. (As said, for a human operator, "not run for unknown reason" IMO evaluates worse than "not run for <insert known reason, like missing data>".)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I imagined 10 < x < 30
. So 25 sounds good because of the reason you said.
Sure, let's call it pending for unknown reason. Actually, I oversaw perhaps the most common "reason", that is when you let two separate workers (typically on 2 separate computers) run the same task, but one isn't allowed to run it because the other one has already finished running it. Actually, even better name would probably be "not_run_for_unknown_reason". If we call it "pending_for_unknown_reason", users might think that it's state according to the central scheduler also is pending, that's usually not the case. |
Shouldn't this case be covered in the |
Sometimes but not always. The summary-printing worker will only know that somebody else ran it if it was RUNNING and that worker asked for work at that time. It could be the case that the other worker was fast and the tasks where DONE already when the summary-printing worker asks for work. |
I have one more proposal for this. |
@fabriziodemaria Your proposal sounds good to me in it's entirety. I like it. Did you consider to have the info logs say |
Denial can also occur passively/accidentally, as in the case when connection failure causes task purge, or in the case of an eventual scheduler bug. The "wasn't permitted" presents some controlledness flair when it may not be the case. But I agree that the pattern how worker gets work out is something users often are confused about. The description of the probable cases in the execution summary can include the phrasing "wasn't permitted to run [by scheduler [because X or Y or Z]]". |
because of lack of resources, because the task has been already run by | ||
another worker or because the attempted task is in DISABLED state. | ||
Connectivity issues with the central scheduler might also cause this. | ||
This does not include the cases for which a run is not allowed due to missing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice.
Come to think about it, I find it illogical that already_running has a lower severity in this scheme than the not_run case where a task has already been run and succeeded. But I don't consider that a blocker now.
\o/ Looks good. |
I squashed this as there were to many implementation-detail commits. Hopefully the reworded commit is clearer and more direct to the point of what changed. Good job! :) |
* spotify/master: (24 commits) Add DateSecondParameter to parameter.py (spotify#1779) tox: Specify sphinx dependency better flake8: Unbreak travis build Excludes .tox from flake8 to prevent checking third-party libraries (spotify#1785) README: Remove monthly downloads badge Rename CentralPlannerScheduler to Scheduler (spotify#1781) Remove abstract Scheduler class (spotify#1778) Assistants: Don't affect longevity of tasks (spotify#1772) tests: Skip a inttermittently failing s3 test (spotify#1777) Update retcodes to handle new cases (spotify#1771) tests: Fix warning in remote_scheduler_test.py (spotify#1774) Remove sitecustomize file (spotify#1755) Fix exist method for ftp server Update copy() to return number and size of files copied Remove the confusing "dummy_test_module" directory (spotify#1756) Disable codecov comments on GitHub PRs (spotify#1754) Fix "owner_email" log message. (spotify#1762) docs: Install sphinx 1.4.4 in setup.py (spotify#1761) docs: Set minimum versions for sphinx (spotify#1760) Normalize ListParameter to be Immutable (spotify#1759) ...
Description
This pull request is a continuation of #1612. The purpose of this pull request is to cover more cases for which tasks does not run successfully and Luigi exit code is 0.
More specifically, the following two cases are covered:
Motivation and Context
This relates to #1660.
Have you tested this? If so, how?
I have included unit tests.