
Catch-up doesn't respect retries limit #369

Closed
mbafford opened this issue Jan 17, 2021 · 3 comments

Comments

@mbafford

mbafford commented Jan 17, 2021

Description

The catch-up flag does not respect the retries limit.

If you have a process with the following criteria:

  • Catch-Up (Run All) checked
  • Timeout configured
  • Process (consistently) takes longer than timeout

The task will continuously restart after the initial event is triggered.

It appears the "ran on schedule" flag isn't cleared unless the job completes successfully. My expectation was that it would attempt to run the task once to catch up, but not retry if a failure occurred.
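
In other words, I expected the scheduler to handle a catch-up run roughly like this (illustrative pseudocode only, not Cronicle's actual logic; the property and function names here are placeholders):

// Hypothetical sketch of the expected completion handling for a catch-up event.
function onJobComplete(event, job) {
    if (job.code == 0) {
        // Success: advance the catch-up cursor past this scheduled tick.
        event.cursor = job.scheduled_time;
    }
    else if (event.retries > 0 && job.retry_count < event.retries) {
        // Failure with retries remaining: re-queue the same tick.
        requeueJob(event, job.scheduled_time);
    }
    else {
        // Failure with retries exhausted (or set to None): still advance
        // the cursor so catch-up does not endlessly re-run the same tick.
        event.cursor = job.scheduled_time;
    }
}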

Details

Version 0.8.54

For example, my job below, titled "too long", is configured to run:

  • at hour:05
  • with a 1-minute timeout
  • retries set to None
  • catch-up selected
  • a script plug-in that just runs sleep 5m

[Screenshots: event configuration showing the timeout, retries, and catch-up settings]
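
For reference, the event settings above would look roughly like this in the event's JSON record (a sketch only; the field names are my best guess at Cronicle's event format and may not match exactly; timeout is in seconds):

// Hypothetical event record, shown as a JS object for illustration.
{
    "title": "too long",
    "plugin": "shellplug",               // built-in Shell Script plug-in
    "params": { "script": "sleep 5m" },
    "timing": { "minutes": [5] },        // run at :05 of every hour
    "timeout": 60,                       // 1 minute maximum run time
    "retries": 0,                        // retries set to None
    "catch_up": 1                        // Catch-Up (Run All) checked
}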

The run log shows the task starting on schedule and erroring out, but then running again, despite retries being disabled:

[Screenshot: run log showing the same event repeatedly re-running after each failure]

The job output confirms it's killed due to maximum run time:

# Job ID: jkk1a270qog
# Event Title: too long
# Hostname: fridgenas
# Date/Time: 2021/01/17 10:11:33 (GMT-5)

[2021/01/17 10:11:33] Sleeping for 5 minutes on a 1 minute max job
Caught SIGTERM, killing child: 1676820
Caught SIGTERM, killing child: 1676820

# Job failed at 2021/01/17 10:12:52 (GMT-5).
# Error: Job Aborted: Exceeded maximum run time (1 minute)
# End of log.

Comments

For my case, I will work around this by disabling the timeout for this job (or the catch-up, since the server isn't likely to be down in a way that would cause problems).

I noticed this because I have multiple backup jobs in a category with a category limit of 1 job at a time. The long-running (and incorrectly configured) prune event was timing out, then re-running, seemingly delaying the entire backup schedule.

@mikeTWC1984

That seems to be a side effect of the default job-abort behavior. Aborted jobs go back into the catch-up queue (if catch-up is checked) by default. To avoid this, the no_rewind flag should be set to 1, which currently only happens when you abort a job manually. I guess it's neither a bug nor a feature, just a weird scenario. I'd agree that a timeout abort should come with no_rewind=1 by default.
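
Roughly, the suggestion would look something like this in the timeout-abort path (an illustrative sketch only, not Cronicle's actual source; abortJob and the property names other than no_rewind are placeholders):

// Hypothetical sketch: treat a timeout abort like a manual abort.
function abortJobForTimeout(job) {
    job.abort_reason = 'Exceeded maximum run time';
    // With no_rewind set, the scheduler does not rewind the event's
    // catch-up cursor, so the aborted tick is not immediately re-run.
    job.no_rewind = 1;
    abortJob(job);
}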

@jhuckaby
Owner

Fixed in Cronicle v0.8.56. Thanks @mikeTWC1984 for the assist.

@jhuckaby
Owner

Also thanks to @mbafford for the detailed issue report!
