Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update retcodes to handle new cases #1771

Merged
merged 10 commits into from
Jul 20, 2016
Merged

Update retcodes to handle new cases #1771

merged 10 commits into from
Jul 20, 2016

Conversation

fabriziodemaria
Copy link
Member

@fabriziodemaria fabriziodemaria commented Jul 18, 2016

Description

This pull request is a continuation of #1612. The purpose of this pull request is to cover more cases for which tasks does not run successfully and Luigi exit code is 0.
More specifically, the following two cases are covered:

  • Tasks are not run because task-limit is reached counts as scheduling_error
  • Return code can be set for cases in which tasks fail or are left pending for unknown_reason

Motivation and Context

This relates to #1660.

Have you tested this? If so, how?

I have included unit tests.

@mention-bot
Copy link

@fabriziodemaria, thanks for your PR! By analyzing the annotation information on this pull request, we identified @erikbern, @daveFNbuck and @freider to be potential reviewers

@Tarrasch
Copy link
Contributor

Cool! I did a quick glance. One question quickly rose for me. Should we really convey that "unknown-->bad"? I think of unknown like literally unknown. Sometimes it's out of resources (not bad) and sometimes that task is disabled or still have a timeout for it's FAILED state (maybe bad, but the worker that previously failed has already sent error emails and stuff).

What I'm saying is that this is a really good patch, because it allows for luigi users to react exactly the way the want based on return codes. But I propose that (1) we don't use :( as I don't consider it bad. Maybe we make things confusing, but perhaps we could introduce :D for indicating that the root task is done, and unknown stays as :).

@@ -481,18 +481,21 @@ We recommend that you copy this set of exit codes to your ``luigi.cfg`` file:
missing_data=20
task_failed=30
scheduling_error=35
unknown_reason=38
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... and (2) I suggest we say 38 or 0. And then explain more detailed when you might want what number.

@fabriziodemaria
Copy link
Member Author

But I propose that (1) we don't use :( as I don't consider it bad.

I changed :( with :| for the unknown_reason case, would that be good enough in your opinion?

@xeago
Copy link

xeago commented Jul 18, 2016

What about :¿? I make that to indicate my confusion.

@ulzha
Copy link
Contributor

ulzha commented Jul 18, 2016

Name it "pending_for_unknown_reason" maybe? That's what it is. And that hopefully also shows how it is worse than "pending". After all we're using ":|" for still_pending_ext ("pending_for_known_reason"). A situation where uncertainty is added is not better.

For when a task does not run successfully because of an unknown reason. Despite
this case can be expected in an error free execution, it does not guarantee
completeness of the root task. Return code 38 is advised only for cases in which
Luigi return code 0 is used by the user to guarantee root tasks' completeness.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missed the comment suggesting "38 or 0". The comment doesn't describe the benefit. I would -1 this convoluted wording as overcomplication.

0 is already the default value. The existing users who perform retcode configuration won't be taken by surprise.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, lets not complicate things. The benefit I was looking for was to say that this return code is "no problem, don't worry".

What about using a number but a much lower number? I noticed we already say they are in increasing order of severity (I had forgotten this). Maybe use the number 15?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, I was going to suggest 25. (As said, for a human operator, "not run for unknown reason" IMO evaluates worse than "not run for <insert known reason, like missing data>".)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I imagined 10 < x < 30. So 25 sounds good because of the reason you said.

@Tarrasch
Copy link
Contributor

Sure, let's call it pending for unknown reason. Actually, I oversaw perhaps the most common "reason", that is when you let two separate workers (typically on 2 separate computers) run the same task, but one isn't allowed to run it because the other one has already finished running it.

Actually, even better name would probably be "not_run_for_unknown_reason". If we call it "pending_for_unknown_reason", users might think that it's state according to the central scheduler also is pending, that's usually not the case.

@fabriziodemaria
Copy link
Member Author

Actually, I oversaw perhaps the most common "reason", that is when you let two separate workers (typically on 2 separate computers) run the same task, but one isn't allowed to run it because the other one has already finished running it.

Shouldn't this case be covered in the run_by_other_worker set?

@Tarrasch
Copy link
Contributor

Shouldn't this case be covered in the run_by_other_worker set?

Sometimes but not always. The summary-printing worker will only know that somebody else ran it if it was RUNNING and that worker asked for work at that time. It could be the case that the other worker was fast and the tasks where DONE already when the summary-printing worker asks for work.

@fabriziodemaria
Copy link
Member Author

I have one more proposal for this.
Tasks in this set did not enter the run() phase and there is a certain number of reasons causing this; such reasons have been listed in the various comments of this pull request. In such a context, I think unknown_reason might be misleading since we are not dealing with an unexpected/unknown error. Nevertheless, the set is too broad to better define the specific motivation for which the task didn't run.
What if we simply call the set not_run, and we list the possible causes in the docs. I believe this name fits the other sets' names in terms of descriptive granularity (i.e. completed, already_run, not_run, failed,... ). I would agree on using 25 as error code.
Regarding the info logs and execution summary, did not run might be not enough. We might just mention the most probable causes for it.

@Tarrasch
Copy link
Contributor

@fabriziodemaria Your proposal sounds good to me in it's entirety. I like it.

Did you consider to have the info logs say wasn't permitted to run as I suggested before? I think it conveys that the worker did actually ask for work but was denied.

@ulzha
Copy link
Contributor

ulzha commented Jul 19, 2016

Denial can also occur passively/accidentally, as in the case when connection failure causes task purge, or in the case of an eventual scheduler bug. The "wasn't permitted" presents some controlledness flair when it may not be the case.

But I agree that the pattern how worker gets work out is something users often are confused about. The description of the probable cases in the execution summary can include the phrasing "wasn't permitted to run [by scheduler [because X or Y or Z]]".

because of lack of resources, because the task has been already run by
another worker or because the attempted task is in DISABLED state.
Connectivity issues with the central scheduler might also cause this.
This does not include the cases for which a run is not allowed due to missing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice.

Come to think about it, I find it illogical that already_running has a lower severity in this scheme than the not_run case where a task has already been run and succeeded. But I don't consider that a blocker now.

@ulzha
Copy link
Contributor

ulzha commented Jul 20, 2016

\o/ Looks good.

@Tarrasch Tarrasch merged commit 62b6aa8 into master Jul 20, 2016
@Tarrasch Tarrasch deleted the continue-exit-code branch July 20, 2016 08:33
@Tarrasch
Copy link
Contributor

I squashed this as there were to many implementation-detail commits. Hopefully the reworded commit is clearer and more direct to the point of what changed. Good job! :)

p7k pushed a commit to Celmatix/luigi that referenced this pull request Jul 28, 2016
* spotify/master: (24 commits)
  Add DateSecondParameter to parameter.py (spotify#1779)
  tox: Specify sphinx dependency better
  flake8: Unbreak travis build
  Excludes .tox from flake8 to prevent checking third-party libraries (spotify#1785)
  README: Remove monthly downloads badge
  Rename CentralPlannerScheduler to Scheduler (spotify#1781)
  Remove abstract Scheduler class (spotify#1778)
  Assistants: Don't affect longevity of tasks (spotify#1772)
  tests: Skip a inttermittently failing s3 test (spotify#1777)
  Update retcodes to handle new cases (spotify#1771)
  tests: Fix warning in remote_scheduler_test.py (spotify#1774)
  Remove sitecustomize file (spotify#1755)
  Fix exist method for ftp server
  Update copy() to return number and size of files copied
  Remove the confusing "dummy_test_module" directory (spotify#1756)
  Disable codecov comments on GitHub PRs (spotify#1754)
  Fix "owner_email" log message. (spotify#1762)
  docs: Install sphinx 1.4.4 in setup.py (spotify#1761)
  docs: Set minimum versions for sphinx (spotify#1760)
  Normalize ListParameter to be Immutable (spotify#1759)
  ...
This was referenced Jun 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants