Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cron-like jobs not always executed, possible reasons? #359

Closed
aried3r opened this issue Sep 1, 2021 · 7 comments
Closed

Cron-like jobs not always executed, possible reasons? #359

aried3r opened this issue Sep 1, 2021 · 7 comments

Comments

@aried3r
Copy link
Contributor

aried3r commented Sep 1, 2021

I'm not sure if this is perhaps just a configuration error, but we're seeing some issues with cron-like jobs not always being executed.

We're still experimenting a bit with GoodJob, so configuration is still changing and not final. Currently we have it all in an initializer because we don't want to have configuration in multiple places for now.

Regular jobs using .perform_later work as expected from what we can see, but we're having issues with some cron-like jobs.

As of now have very mild workloads regarding jobs. Perhaps a couple hundred a day, less than 500 even. This will increase as we add more background jobs in our app. And, as you'll see in our configuration below, 2 cron jobs that execute a total of 6 times per week.
Could this perhaps have anything to do with when GoodJob "wakes up", to handle cron-like jobs? I'm not exactly sure how GoodJob works under the hood for these sort of jobs, so our best guess was that we don't have enough workload for GoodJob to "wake up" more often and thus miss the scheduled time.

Could you maybe shed some light on how it works internally in :async mode or in general if there is no difference? We'd love to know when and what is triggering the GoodJob started cron with 2 jobs. log messages. I've found the source but haven't yet tracked down when exactly that is being executed.

Our configuration looks like this:

# config/initializers/good_job.rb
# frozen_string_literal: true

Rails.application.configure do
  config.active_job.queue_adapter = :good_job

  config.good_job.execution_mode = :async
  config.good_job.max_threads = 5
  # Disable polling in development.
  config.good_job.poll_interval = Rails.env.development? ? -1 : 10 # seconds
  config.good_job.shutdown_timeout 60 # seconds
  config.good_job.enable_cron = true
  config.good_job.cron = {
      reminders: { cron: '0 9 * * MON-FRI', class: 'ReminderNotifierJob' },
      clean_up_preserved_jobs: { cron: '0 19 * * SUN', class: 'CleanUpPreservedJobsJob' }
  }
  config.good_job.queues = '*'
end

GoodJob.active_record_parent_class = 'ApplicationRecord'
GoodJob.logger = Logger.new(Rails.root.join('log', 'good_job.log'))

# Useful for analyzing the jobs in the dashboard.
GoodJob.preserve_job_records = true

GoodJob.retry_on_unhandled_error = false

GoodJob.on_thread_error = ->(exception) { ExceptionNotifier.notify_exception(exception) }

# https://github.com/bensheldon/good_job#actionmailer-retries
ActionMailer::MailDeliveryJob.around_perform do |_job, block|
  block.call
rescue StandardError => e
  ExceptionNotifier.notify_exception(e)
  raise
end

And from our logs. On the 30th the cron-like job executed as expected at 09:00.
I also posted some lines before and after in case there's context. But the important line, I assume, is the one from 2021-08-30T09:00:00.061463. At this point, we only had one job scheduled for MON-FRI at 09:00.

I, [2021-08-30T08:37:11.990167 #4465]  INFO -- : Notifier unsubscribed with UNLISTEN
I, [2021-08-30T08:42:10.805318 #5729]  INFO -- : Executed GoodJob a201f554-1fd4-40ae-ad7c-70334d82d0ce
I, [2021-08-30T08:42:10.824844 #5729]  INFO -- : Executed GoodJob da41bdfb-6bcc-4b47-ac8e-2b89d4c24858
I, [2021-08-30T08:59:12.881274 #5460]  INFO -- : GoodJob started scheduler with queues=* max_threads=5.
I, [2021-08-30T08:59:12.948432 #5460]  INFO -- : Notifier subscribed with LISTEN
I, [2021-08-30T08:59:13.696594 #5460]  INFO -- : GoodJob started cron with 1 jobs.
I, [2021-08-30T09:00:00.061463 #5460]  INFO -- : Executed GoodJob 652bfe2c-9ed6-447b-8868-aebdb5db4b2c
I, [2021-08-30T09:04:10.239188 #5460]  INFO -- : Notifier unsubscribed with UNLISTEN
I, [2021-08-30T09:05:44.315151 #5788]  INFO -- : GoodJob started scheduler with queues=* max_threads=5.
I, [2021-08-30T09:05:44.382429 #5788]  INFO -- : Notifier subscribed with LISTEN
I, [2021-08-30T09:05:45.138633 #5788]  INFO -- : GoodJob started cron with 1 jobs.

On August 31st and September 1st we had the following logs. Here just the one from today. By now we have two scheduled jobs, as can be seen in the configuration above. However the job scheduled for MON-FRI at 09:00 is not being executed.

I, [2021-09-01T08:33:33.696067 #18968]  INFO -- : Notifier unsubscribed with UNLISTEN
I, [2021-09-01T08:38:10.310581 #19570]  INFO -- : GoodJob started scheduler with queues=* max_threads=5.
I, [2021-09-01T08:38:10.384731 #19570]  INFO -- : Notifier subscribed with LISTEN
I, [2021-09-01T08:38:11.200807 #19570]  INFO -- : GoodJob started cron with 2 jobs.
I, [2021-09-01T08:43:07.590751 #19570]  INFO -- : Notifier unsubscribed with UNLISTEN
I, [2021-09-01T08:43:41.843787 #19860]  INFO -- : GoodJob started scheduler with queues=* max_threads=5.
I, [2021-09-01T08:43:41.913695 #19860]  INFO -- : Notifier subscribed with LISTEN
I, [2021-09-01T08:43:42.725652 #19860]  INFO -- : GoodJob started cron with 2 jobs.
I, [2021-09-01T08:48:39.165449 #19860]  INFO -- : Notifier unsubscribed with UNLISTEN
I, [2021-09-01T09:07:17.278840 #20672]  INFO -- : GoodJob started scheduler with queues=* max_threads=5.
I, [2021-09-01T09:07:17.347101 #20672]  INFO -- : Notifier subscribed with LISTEN
I, [2021-09-01T09:07:18.147800 #20672]  INFO -- : GoodJob started cron with 2 jobs.
I, [2021-09-01T09:12:44.901158 #20672]  INFO -- : Notifier unsubscribed with UNLISTEN
I, [2021-09-01T09:24:41.532541 #21320]  INFO -- : GoodJob started scheduler with queues=* max_threads=5.
I, [2021-09-01T09:24:41.601807 #21320]  INFO -- : Notifier subscribed with LISTEN
I, [2021-09-01T09:24:42.388406 #21320]  INFO -- : GoodJob started cron with 2 jobs.

I'm posting here, because I wasn't sure if it's a bug or if it should go to Q&A Discussions. Feel free to move this issue to a discussion if this is the wrong place!

Thanks!

@bensheldon
Copy link
Owner

bensheldon commented Sep 1, 2021

Immediately, looking at those logs, it seems like the Rails process is crashing out and restarting over and over:

# startup
I, [2021-09-01T08:43:41.913695 #19860]  INFO -- : Notifier subscribed with LISTEN
I, [2021-09-01T08:43:42.725652 #19860]  INFO -- : GoodJob started cron with 2 jobs.
# shuts down before 9am
I, [2021-09-01T08:48:39.165449 #19860]  INFO -- : Notifier unsubscribed with UNLISTEN
# starts up again after 9am
I, [2021-09-01T09:07:17.278840 #20672]  INFO -- : GoodJob started scheduler with queues=* max_threads=5.
I, [2021-09-01T09:07:18.147800 #20672]  INFO -- : GoodJob started cron with 2 jobs.

If the process that includes GoodJob::Cron, is not running exactly at 9am on Friday when it is set to trigger, the job will not be enqueued.

So my hypothesis is that the process is crashing/restarting. Why? I will try to reproduce that, but hopefully that helps you debug on your end too.

Edit: Here is where the log message originates:

def start
ActiveSupport::Notifications.instrument("cron_manager_start.good_job", cron_jobs: @schedules) do
@running = true
schedules.each_key { |cron_key| create_task(cron_key) }
end
end

@aried3r
Copy link
Contributor Author

aried3r commented Sep 1, 2021

Immediately, looking at those logs, it seems like the Rails process is crashing out and restarting over and over:

A-ha! That's something I haven't considered. We're using passenger+nginx and checking its logs definitely reveal a correlation in timestamps.

[ N 2021-09-01 08:43:11.9332 28219/T4 age/Cor/CoreMain.cpp:1147 ]: Checking whether to disconnect long-running connections for process 19686
[ N 2021-09-01 08:48:39.1641 28219/T4 age/Cor/CoreMain.cpp:1147 ]: Checking whether to disconnect long-running connections for process 19085
[ N 2021-09-01 09:12:44.8984 28219/T4 age/Cor/CoreMain.cpp:1147 ]: Checking whether to disconnect long-running connections for process 20789]: Checking whether to disconnect long-running connections for process 19085

I'll have to look into if passenger + :async mode don't make a good team.

@aried3r
Copy link
Contributor Author

aried3r commented Sep 1, 2021

Yep, that was a good call. I didn't correlate it to passenger as I didn't know the log output indicates process restart.

To quote a few passages of the passenger docs, emphasis mine:

Instead of running an application inside its process space, Passenger launches the application as external processes, and manages them. Passenger load balances traffic between processes, shuts down processes when they're no longer needed or when they misbehave, keeps them running and restarts them when they crash, etc.
https://www.phusionpassenger.com/library/walkthroughs/basics/ruby/process_management.html

By default, Passenger spawns and shuts down application processes according to traffic. This allows it to use more resources during busy times, while conserving resources during idle times.
https://www.phusionpassenger.com/library/config/nginx/optimization/

Which ultimately leads me to believe that setting passenger_pool_idle_time to 0 might solve our issue.

From the docs, emphasis mine:

The maximum number of seconds that an application process may be idle. That is, if an application process hasn't received any traffic after the given number of seconds, then it will be shutdown in order to conserve memory.

When this value is set to 0, application processes will not be shutdown unless it's really necessary.

Setting the value to 0 is recommended if you're on a non-shared host that's only running a few applications, each which must be available at all times.

I can't say for certain that this will solve our issue, but I think the evidence so far would support this theory.

So the default settings of passenger do not play well with :async mode. I'd assume :external mode is different. Initially, because of our low workload regarding background jobs we wanted to stick with :async mode to be economic, like the README suggests, but I'd wager having one GoodJob process running is less taxing than keeping all passenger processes alive at all times. Some testing will definitely have to go into this.

Thank you for the quick answer! In hindsight, if this is indeed the case, it seems obvious.

@bensheldon
Copy link
Owner

fhew, glad that we have a clear culprit (Passenger). I think you're on the right track.

I'd assume :external mode is different.

:external mode assumes you have some other process running (like the cli: $ good_job start; or some complex configuration where you have some webservers running :async, but don't want it to be all of them).

Another idea, if you're running this on your own server,is to simply use a rake task + crontab to schedule those jobs, rather than GoodJob's cron. GoodJob's cron is good, but I would recommend using a system/platform-level scheduler if you have it.

@aried3r
Copy link
Contributor Author

aried3r commented Sep 6, 2021

Another idea, if you're running this on your own server,is to simply use a rake task + crontab to schedule those jobs, rather than GoodJob's cron. GoodJob's cron is good, but I would recommend using a system/platform-level scheduler if you have it.

Definitely! Critical tasks are being scheduled using crontab, but we also want to provide our app developers to easily schedule "nice to have" tasks where we have a relative certainty that they'll be executed. Adapting our passenger config should yield the proper results.

We're working and testing different configs now and I'll report back if it worked. Feel free to close this issue already, since it doesn't seem to be an issue with this gem.

@clajiness
Copy link

@aried3r, sorry for waking an ancient thread, but did you ever sort out your Passenger config? I ran into the same issue, but wasn't sure if anyone else effectively solved it. Thanks!

@buncis
Copy link

buncis commented Jun 3, 2023

@clajiness I think here #373 its added on the readme

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants