
Catch-up doesn't respect retries limit #369

Closed
mbafford opened this issue Jan 17, 2021 · 3 comments

Comments

@mbafford

mbafford commented Jan 17, 2021

Description

The catch-up flag does not respect the retries limit.

If you have a process with the following criteria:

  • Catch-Up (Run All) checked
  • Timeout configured
  • Process (consistently) takes longer than timeout

The task will continuously restart after the initial event is triggered.

It appears the "ran on schedule" flag isn't cleared unless the job completes successfully. My expectation was that it would attempt to run the task once to catch up, but not retry if a failure occurred.
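
In other words, I expected the scheduler to handle a catch-up run roughly like this (illustrative pseudocode only, not Cronicle's actual logic; the property and function names here are placeholders):

// Hypothetical sketch of the expected completion handling for a catch-up event.
function onJobComplete(event, job) {
    if (job.code == 0) {
        // Success: advance the catch-up cursor past this scheduled tick.
        event.cursor = job.scheduled_time;
    }
    else if (event.retries > 0 && job.retry_count < event.retries) {
        // Failure with retries remaining: re-queue the same tick.
        requeueJob(event, job.scheduled_time);
    }
    else {
        // Failure with retries exhausted (or set to None): still advance
        // the cursor so catch-up does not endlessly re-run the same tick.
        event.cursor = job.scheduled_time;
    }
}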

Details

Version 0.8.54

For example, my job below, titled "too long", is configured to run:

  • at hour:05
  • with a 1-minute timeout
  • retries set to None
  • catch-up selected
  • a script plug-in that just runs sleep 5m

[Screenshots: event configuration showing the timeout, retries, and catch-up settings]
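
For reference, the event settings above would look roughly like this in the event's JSON record (a sketch only; the field names are my best guess at Cronicle's event format and may not match exactly; timeout is in seconds):

// Hypothetical event record, shown as a JS object for illustration.
{
    "title": "too long",
    "plugin": "shellplug",               // built-in Shell Script plug-in
    "params": { "script": "sleep 5m" },
    "timing": { "minutes": [5] },        // run at :05 of every hour
    "timeout": 60,                       // 1 minute maximum run time
    "retries": 0,                        // retries set to None
    "catch_up": 1                        // Catch-Up (Run All) checked
}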

The run log shows the task starting on schedule and erroring out, but then running again, despite retries being disabled:

[Screenshot: run log showing the same event repeatedly re-running after each failure]

The job output confirms it's killed due to maximum run time:

# Job ID: jkk1a270qog
# Event Title: too long
# Hostname: fridgenas
# Date/Time: 2021/01/17 10:11:33 (GMT-5)

[2021/01/17 10:11:33] Sleeping for 5 minutes on a 1 minute max job
Caught SIGTERM, killing child: 1676820
Caught SIGTERM, killing child: 1676820

# Job failed at 2021/01/17 10:12:52 (GMT-5).
# Error: Job Aborted: Exceeded maximum run time (1 minute)
# End of log.

Comments

For my case, I will work around this by disabling the timeout for this job (or the catch-up, since the server isn't likely to be down in a way that would cause problems).

I noticed this because I have multiple backup jobs in a category with a category limit of 1 job at a time. The long-running (and incorrectly configured) prune event was timing out, then re-running, seemingly delaying the entire backup schedule.

@mikeTWC1984

That seems to be a side effect of the default job-abort behavior. Aborted jobs go back into the catch-up queue (if catch-up is checked) by default. To avoid this, the no_rewind flag should be set to 1, which currently only happens when you abort a job manually. I guess it's neither a bug nor a feature, just a weird scenario. I'd agree that a timeout abort should come with no_rewind=1 by default.
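
Roughly, the suggestion would look something like this in the timeout-abort path (an illustrative sketch only, not Cronicle's actual source; abortJob and the property names other than no_rewind are placeholders):

// Hypothetical sketch: treat a timeout abort like a manual abort.
function abortJobForTimeout(job) {
    job.abort_reason = 'Exceeded maximum run time';
    // With no_rewind set, the scheduler does not rewind the event's
    // catch-up cursor, so the aborted tick is not immediately re-run.
    job.no_rewind = 1;
    abortJob(job);
}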

@jhuckaby
Owner

Fixed in Cronicle v0.8.56. Thanks @mikeTWC1984 for the assist.

@jhuckaby
Owner

Also thanks to @mbafford for the detailed issue report!
