Cron jobs can be missed if deploys are timed just right #1484

tjefferson08 · 2024-09-11T19:50:54Z

TL;DR

we deployed our app almost exactly at noon ET, and our cron scheduled for noon did not execute

just glancing at the code in CronManager, this seems possible (albeit unlikely), depending on very specific timing:

the old worker is shut down with the cron JUST about to run
the new worker spins up and starts cron JUST after the cron schedule was due

I would think in a blue-green deployment world, this would not be possible 🤔 (but apparently our heroku worker deployments are not blue-green)

bensheldon · 2024-09-11T22:52:55Z

Briefly, I'm imagining that there could be a configuration like config.cron_start_lookback = 5.minutes where it would see if there was a previous enqueue that would have fallen within that time period, and if so, attempt to enqueue the job (with all of the existing uniqueness protections).

I also wonder if it should have a similar lookahead where it's like "if the cron will start in the next [lookback-period] don't bother either" because I imagine if the job is being run every 1 minute or something rapidly like that, it shouldn't be necessary. It's more when the cron entry is like once per day that it's a problem, I imagine.
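
A minimal sketch of the lookback/lookahead check described above, in plain Ruby. The method name `missed_cron_time` and its parameters are hypothetical and not part of GoodJob; the previous and next occurrence times are assumed to come from the cron parser, and the actual enqueue (with its uniqueness protections) is left out.

```ruby
# Hypothetical helper: decides at boot whether a just-missed cron time
# should be enqueued. Returns the missed Time, or nil to do nothing.
def missed_cron_time(previous_at:, next_at:, now:, lookback:)
  # Lookback: the previous occurrence is older than the window, so skip it.
  return nil if now - previous_at > lookback
  # Lookahead: the next occurrence is imminent (e.g. an every-minute cron),
  # so catching up is not worth it.
  return nil if next_at - now <= lookback
  previous_at
end

noon = Time.utc(2024, 9, 11, 16, 0, 0) # noon ET expressed in UTC
boot = noon + 20                       # new worker starts 20 seconds late
catch_up = missed_cron_time(
  previous_at: noon, next_at: noon + 86_400, now: boot, lookback: 300
)
```

With a 5-minute lookback, the daily entry above is returned for catch-up, while an every-minute entry booted at the same moment would return `nil` because its next run is only seconds away.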

tjefferson08 · 2024-09-12T01:19:27Z

Sounds like it'd work! Seems the time window value would be determined by the longest possible downtime you'd take during deployment (I suppose ideally this is 0)

Ok obviously huge change... but would it be totally nuts to try and run cron scheduling through ActiveJob's API?

Job.set(wait_until: next_cron_at).perform_later

Altho hm I guess you'd have to wrap the job somehow so you could tell it to reschedule for the next occurrence after executing.

It'd be cool if the cron manager could just boot up, schedule upcoming crons, and be done... possibly even removing the need for the dedicated thread?
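
A rough sketch of what such a self-rescheduling wrapper could look like, written in plain Ruby so it runs on its own. `SelfReschedulingJob` and the in-memory `QUEUE` are hypothetical stand-ins; a real version would presumably call `Job.set(wait_until: ...).perform_later` where this one pushes onto the array.

```ruby
# Stand-in for the job backend: each entry is [run_at, job].
QUEUE = []

# Hypothetical wrapper that re-queues itself after every run.
class SelfReschedulingJob
  def initialize(interval, &work)
    @interval = interval # seconds between occurrences
    @work = work
  end

  def schedule(run_at)
    QUEUE << [run_at, self]
  end

  def perform(scheduled_at)
    @work.call(scheduled_at)
  ensure
    # Reschedule even when the work raises; otherwise the chain ends for good.
    schedule(scheduled_at + @interval)
  end
end

runs = []
noon = Time.utc(2024, 9, 11, 16)
SelfReschedulingJob.new(86_400) { |at| runs << at }.schedule(noon)

run_at, job = QUEUE.shift # a worker picks up the due entry
job.perform(run_at)
```

The next occurrence is computed from the scheduled time rather than the execution time, so a late run does not shift the schedule. The chain still depends on a waiting job existing in the queue at all times, which is the part that would need care.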

bensheldon · 2024-09-12T02:58:00Z

The way that GoodJob's cron works is intentional, but it is a design decision.

GoodJob Cron's design is to be analogous to Unix System Cron, which differentiates between the scheduler and the task. That makes it possible to enqueue jobs at every scheduled time, and not just at the time that a job (eventually) executes. That separation also makes it easy to cancel or change the schedule without having to search for a waiting job and then find/modify/destroy it.

Also:

I generally consider job execution to be in Application-land. So GoodJob can be relied upon to enqueue a job at the scheduled time, but it's up to the application/operation-configuration to ensure there are execution resources to perform it.
I discourage using scheduled jobs for anything other than error handling/incremental backoff. I discourage putting jobs in the queue that are expected to execute far into the future (or "smearing" jobs to spread out their execution). GoodJob's job table is optimized to function as a queue rather than a general data-store.

Sorry, I know that's TMI, but wanted to explain why it is the way it is 😄

Maybe I'd do it differently if I did it again, but only maybe. I think Solid Queue's implementation is interesting/different because it does put it all in the database (whereas GoodJob generally tries to keep stuff in configuration): https://github.com/rails/solid_queue/blob/main/test/dummy/db/schema.rb#L97

eleith · 2024-09-12T03:19:36Z

thanks for the informative response!

this helps us evolve our approach both to how we deploy our workers as well as how we queue up our jobs.

tjefferson08 · 2024-09-12T14:14:37Z

yes, thanks for the explanation 🙏 much appreciated

and thanks for good_job, we love it!

bensheldon · 2024-09-14T17:48:37Z

@jjb helpfully shared an example in sidekiq-cron, which has nice naming ("reschedule_grace_period"): sidekiq-cron/sidekiq-cron#465

Also TIL about Fugit "within": floraison/fugit@89d102d

bensheldon · 2024-09-14T18:54:15Z

I have a PR up for this in #1488

bensheldon mentioned this issue Sep 14, 2024

Add cron_graceful_restart_period to avoid missing recurring jobs that occurred during deployment downtime #1488

Merged

bensheldon closed this as completed in #1488 Oct 8, 2024
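
For anyone landing here later, a sketch of how the option named in #1488 might be configured. This assumes it is set under `config.good_job` like GoodJob's other cron options; the entry name, job class, and the one-minute value are made up for illustration, so check the README for your GoodJob version.

```ruby
# config/application.rb (or an environment file); a config fragment, not a full file
config.good_job.enable_cron = true
config.good_job.cron = {
  noon_report: {                          # hypothetical entry name
    cron: "0 12 * * * America/New_York",  # noon ET every day
    class: "NoonReportJob"                # hypothetical job class
  }
}
# Option name taken from #1488; the value here is only an example and
# should roughly match the longest downtime expected during a deploy.
config.good_job.cron_graceful_restart_period = 1.minute
```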
