Update retcodes to handle new cases #1771

fabriziodemaria · 2016-07-18T09:08:50Z

Description

This pull request is a continuation of #1612. Its purpose is to cover more cases in which tasks do not run successfully yet Luigi's exit code is 0.
More specifically, the following two cases are covered:

Tasks that are not run because the task-limit is reached now count as scheduling_error
A return code can be set for cases in which tasks fail or are left pending for unknown_reason

Motivation and Context

This relates to #1660.

Have you tested this? If so, how?

I have included unit tests.

mention-bot · 2016-07-18T09:08:53Z

@fabriziodemaria, thanks for your PR! By analyzing the annotation information on this pull request, we identified @erikbern, @daveFNbuck and @freider to be potential reviewers

Tarrasch · 2016-07-18T09:26:32Z

Cool! I took a quick glance, and one question quickly arose. Should we really convey that "unknown-->bad"? I think of unknown as literally unknown. Sometimes it's out of resources (not bad) and sometimes the task is disabled or still has a timeout on its FAILED state (maybe bad, but the worker that previously failed has already sent error emails and stuff).

What I'm saying is that this is a really good patch, because it allows Luigi users to react exactly the way they want based on return codes. But I propose that (1) we don't use :( as I don't consider it bad. Maybe this makes things confusing, but perhaps we could introduce :D to indicate that the root task is done, and let unknown stay as :).

Tarrasch · 2016-07-18T09:27:41Z

doc/configuration.rst

@@ -481,18 +481,21 @@ We recommend that you copy this set of exit codes to your ``luigi.cfg`` file:
 missing_data=20
 task_failed=30
 scheduling_error=35
+ unknown_reason=38


... and (2) I suggest we say 38 or 0, and then explain in more detail when you might want which number.
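As a rough illustration of how the recommended numbers could be consumed, the sketch below parses a `[retcode]` section with Python's standard `configparser` and falls back to 0 for any key that is not set. The `load_retcodes` helper and its fallback handling are assumptions made for this example, not Luigi's own code; the numbers are the ones shown in the diff above.

```python
import configparser

# Values copied from the diff above; the helper itself is hypothetical.
EXAMPLE_CFG = """
[retcode]
missing_data=20
task_failed=30
scheduling_error=35
unknown_reason=38
"""

KNOWN_KEYS = ("missing_data", "task_failed", "scheduling_error", "unknown_reason")


def load_retcodes(cfg_text):
    """Return a dict of retcode name -> int, using 0 for unset keys."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return {
        key: parser.getint("retcode", key, fallback=0)
        for key in KNOWN_KEYS
    }


codes = load_retcodes(EXAMPLE_CFG)
print(codes["unknown_reason"])
```

Leaving a key out of the section keeps that case at 0, so a user who wants "unknown" to count as harmless can simply not configure it.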

fabriziodemaria · 2016-07-18T12:14:00Z

But I propose that (1) we don't use :( as I don't consider it bad.

I replaced :( with :| for the unknown_reason case. Would that be good enough in your opinion?

xeago · 2016-07-18T12:37:09Z

What about :¿? I use that to indicate my confusion.

ulzha · 2016-07-18T13:54:05Z

Name it "pending_for_unknown_reason" maybe? That's what it is. And that hopefully also shows how it is worse than "pending". After all we're using ":|" for still_pending_ext ("pending_for_known_reason"). A situation where uncertainty is added is not better.

ulzha · 2016-07-18T14:02:31Z

doc/configuration.rst

+ For when a task does not run successfully because of an unknown reason. Despite
+ this case can be expected in an error free execution, it does not guarantee
+ completeness of the root task. Return code 38 is advised only for cases in which
+ Luigi return code 0 is used by the user to guarantee root tasks' completeness.


Missed the comment suggesting "38 or 0". The comment doesn't describe the benefit. I would -1 this convoluted wording as overcomplication.

0 is already the default value. The existing users who perform retcode configuration won't be taken by surprise.

Tarrasch · 2016-07-19

You're right, let's not complicate things. The benefit I was looking for was to say that this return code is "no problem, don't worry".

What about using a number, but a much lower one? I noticed we already say they are in increasing order of severity (I had forgotten this). Maybe use the number 15?

ulzha · 2016-07-19

Indeed, I was going to suggest 25. (As said, for a human operator, "not run for unknown reason" IMO evaluates worse than "not run for <insert known reason, like missing data>".)

Tarrasch · 2016-07-19

Ok. I imagined 10 < x < 30. So 25 sounds good because of the reason you said.
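The "increasing order of severity" idea can be sketched as follows: when several conditions occur in one run, report the highest configured code. This is a minimal illustration under that assumption, not a copy of Luigi's implementation; `pick_exit_code` and the mapping passed to it are hypothetical, and 25 is the value proposed in this thread.

```python
# Hypothetical mapping; 25 for the new case is the value proposed above.
RETCODES = {
    "missing_data": 20,
    "unknown_reason": 25,
    "task_failed": 30,
    "scheduling_error": 35,
}


def pick_exit_code(occurred, retcodes=RETCODES):
    """Return the most severe configured code among the conditions that occurred.

    Conditions without a configured code count as 0, as does an empty run.
    """
    return max((retcodes.get(name, 0) for name in occurred), default=0)


print(pick_exit_code({"missing_data", "unknown_reason"}))
print(pick_exit_code(set()))
```

With this ordering, a run that hits both a known pending reason and the unknown case reports the unknown one, which matches the argument that added uncertainty is worse.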

Tarrasch · 2016-07-19T02:46:11Z

Sure, let's call it pending for unknown reason. Actually, I overlooked perhaps the most common "reason": when you let two separate workers (typically on two separate computers) run the same task, but one isn't allowed to run it because the other one has already finished running it.

Actually, an even better name would probably be "not_run_for_unknown_reason". If we call it "pending_for_unknown_reason", users might think that its state according to the central scheduler is also pending, which is usually not the case.

fabriziodemaria · 2016-07-19T07:57:21Z

Actually, I overlooked perhaps the most common "reason": when you let two separate workers (typically on two separate computers) run the same task, but one isn't allowed to run it because the other one has already finished running it.

Shouldn't this case be covered in the run_by_other_worker set?

Tarrasch · 2016-07-19T08:15:16Z

Shouldn't this case be covered in the run_by_other_worker set?

Sometimes but not always. The summary-printing worker will only know that somebody else ran it if it was RUNNING and that worker asked for work at that time. It could be the case that the other worker was fast and the tasks were already DONE when the summary-printing worker asked for work.
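A toy model of the timing issue described here, assuming the summary-printing worker learns a task's state only at the moment it asks for work. The function name, the time units, and the returned labels are all illustrative and not part of Luigi.

```python
def observed_category(ask_time, other_start, other_end):
    """Classify what the asking worker can conclude about a task.

    other_start / other_end: when another worker started and finished the task.
    ask_time: when the summary-printing worker asked the scheduler for work.
    """
    if ask_time < other_start:
        # Nobody had claimed the task yet, so this worker would get it.
        return "ran_it_myself"
    if ask_time < other_end:
        # Task was RUNNING elsewhere at ask time: the worker can see that.
        return "run_by_other_worker"
    # Task was already DONE: the worker only knows it was not handed the task.
    return "unknown_reason"


print(observed_category(ask_time=5, other_start=0, other_end=10))
print(observed_category(ask_time=12, other_start=0, other_end=10))
```

The second call shows the gap: a fast peer finishes before the question is asked, so the task cannot be attributed to run_by_other_worker.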

fabriziodemaria · 2016-07-19T08:32:20Z

I have one more proposal for this.
Tasks in this set did not enter the run() phase, and there are a number of reasons that can cause this; such reasons have been listed in the various comments of this pull request. In such a context, I think unknown_reason might be misleading since we are not dealing with an unexpected/unknown error. Nevertheless, the set is too broad to better define the specific reason for which the task didn't run.
What if we simply call the set not_run and list the possible causes in the docs? I believe this name fits the other sets' names in terms of descriptive granularity (i.e. completed, already_run, not_run, failed,... ). I would agree on using 25 as error code.
Regarding the info logs and execution summary, did not run might not be enough. We might just mention the most probable causes for it.

Tarrasch · 2016-07-19T08:43:56Z

@fabriziodemaria Your proposal sounds good to me in its entirety. I like it.

Did you consider having the info logs say wasn't permitted to run as I suggested before? I think it conveys that the worker did actually ask for work but was denied.

ulzha · 2016-07-19T09:07:42Z

Denial can also occur passively or accidentally, as when a connection failure causes a task purge, or in the case of an eventual scheduler bug. The phrase "wasn't permitted" suggests a deliberate decision when that may not be the case.

But I agree that the way a worker gets work handed out is something users are often confused about. The description of the probable cases in the execution summary can include the phrasing "wasn't permitted to run [by scheduler [because X or Y or Z]]".
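One way to picture the phrasing suggested above is a small helper that builds the summary line from a list of probable causes. Everything here is a hypothetical sketch: `format_not_run_line` is not a Luigi function, and the causes are simply the ones mentioned in this thread.

```python
# Probable causes named in this discussion; the wording is illustrative.
PROBABLE_CAUSES = (
    "lack of resources",
    "a prior run by another worker",
    "a disabled task",
    "a connection failure with the scheduler",
)


def format_not_run_line(count, causes=PROBABLE_CAUSES):
    """Build a one-line summary for tasks the worker was not permitted to run."""
    noun = "task" if count == 1 else "tasks"
    verb = "wasn't" if count == 1 else "weren't"
    reasons = ", ".join(causes)
    return f"{count} {noun} {verb} permitted to run by the scheduler (possible causes: {reasons})"


print(format_not_run_line(2))
```

Listing the causes as possibilities rather than stating one keeps the message honest about what the worker actually knows.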

ulzha · 2016-07-20T07:26:26Z

doc/configuration.rst

+ because of lack of resources, because the task has been already run by
+ another worker or because the attempted task is in DISABLED state.
+ Connectivity issues with the central scheduler might also cause this.
+ This does not include the cases for which a run is not allowed due to missing


Very nice.

Come to think of it, I find it illogical that already_running has a lower severity in this scheme than the not_run case where a task has already been run and succeeded. But I don't consider that a blocker now.

ulzha · 2016-07-20T07:27:58Z

\o/ Looks good.

Tarrasch · 2016-07-20T08:34:33Z

I squashed this as there were too many implementation-detail commits. Hopefully the reworded commit is clearer and more to the point about what changed. Good job! :)

* spotify/master: (24 commits)
  Add DateSecondParameter to parameter.py (spotify#1779)
  tox: Specify sphinx dependency better
  flake8: Unbreak travis build
  Excludes .tox from flake8 to prevent checking third-party libraries (spotify#1785)
  README: Remove monthly downloads badge
  Rename CentralPlannerScheduler to Scheduler (spotify#1781)
  Remove abstract Scheduler class (spotify#1778)
  Assistants: Don't affect longevity of tasks (spotify#1772)
  tests: Skip a inttermittently failing s3 test (spotify#1777)
  Update retcodes to handle new cases (spotify#1771)
  tests: Fix warning in remote_scheduler_test.py (spotify#1774)
  Remove sitecustomize file (spotify#1755)
  Fix exist method for ftp server
  Update copy() to return number and size of files copied
  Remove the confusing "dummy_test_module" directory (spotify#1756)
  Disable codecov comments on GitHub PRs (spotify#1754)
  Fix "owner_email" log message. (spotify#1762)
  docs: Install sphinx 1.4.4 in setup.py (spotify#1761)
  docs: Set minimum versions for sphinx (spotify#1760)
  Normalize ListParameter to be Immutable (spotify#1759)
  ...

fabriziodemaria added 7 commits July 17, 2016 15:25

Add retcode for tasks in state unknown_reason

bca9c50

Reaching task-limit counts as scheduling_error

8ebb073

Add test for unknown_reason exit code

dc1c7c9

Update execution summary for unknwon_reason case

fa906a0

Add test for unknown_reason execution summary

c0d4b9e

Update unknown_reason error description in docs

19e934a

Update scheduling error messages/documentation

e08185e

Tarrasch reviewed Jul 18, 2016

Update docs/logs for unknown_reason

54ca4a3

Return code 0 strictly corresponds to success

dde2a24

fabriziodemaria force-pushed the continue-exit-code branch from 28766d3 to dde2a24 Compare July 18, 2016 13:21

ulzha reviewed Jul 18, 2016

Change unknown_reason into not_run

df526af

ulzha reviewed Jul 20, 2016

Tarrasch merged commit 62b6aa8 into master Jul 20, 2016

Tarrasch deleted the continue-exit-code branch July 20, 2016 08:33

This was referenced Jun 29, 2022

no mo enum 34 #3180

Closed

enum34 be gone #3181

Closed

mdragilev mentioned this pull request Jun 28, 2024

for S3 contrib package move to boto3 Affirm/luigi#26

Merged

