Work is not being picked up at the expected rate #802

Closed
jgrau opened this issue Jan 16, 2023 · 16 comments

@jgrau
Contributor

jgrau commented Jan 16, 2023

Hi @bensheldon

I recently switched from the default queue configuration (*) to having dedicated thread pools for separate queues. I don't know for sure that this was the change that introduced what I'm seeing now, but I'm fairly confident it is.

My workers are very slow to pick up new jobs, and it feels like not all of the threads are actually working. This is my config/initializers/good_job.rb:

Rails.application.configure do
  # Use the same logger as rails is using
  config.good_job.logger = Rails.logger

  # Set the parent class for all our active record models
  config.good_job.active_record_parent_class = "DirectConnection"

  # To begin with we keep the job records for inspection
  # but we should not preserve jobs once we're confident
  # everything is running smoothly.
  config.good_job.preserve_job_records = true

  # If GoodJob receives an error we want to log it to Sentry.
  config.good_job.retry_on_unhandled_error = false
  config.good_job.on_thread_error = ->(exception) do
    Sentry.capture_exception(exception)
  end

  # The plus sign means that the queues are prioritized from first to last
  # The star at the end will perform any other queue last. This will catch e.g. if no queue name
  # is given, in which case the queue name is 'default'.
  # The number specifies the number of threads to dedicate to that queue. The default thread count is 5.
  # This configuration will result in 2 + 2 + 5 = 9 threads.
  config.good_job.queues = "+high_priority:2;low_priority:2;*"

  config.good_job.enable_cron = true
  config.good_job.cron = { <remove for brevity but I have 10 cron entries> }

  # Use high_priority queue for emails
  config.action_mailer.deliver_later_queue_name = "high_priority"
end

# Any configuration in ApplicationJob will have to
# be duplicated on ActionMailer::MailDeliveryJob because
# ActionMailer uses a custom class, ActionMailer::MailDeliveryJob,
# which inherits from ActiveJob::Base, rather than your
# application's ApplicationJob.
ActionMailer::MailDeliveryJob.retry_on(
  StandardError,
  wait: :exponentially_longer,
  attempts: 10,
)

# Log all mail delivery errors in jobs to sentry
ActionMailer::MailDeliveryJob.around_perform do |_job, block|
  block.call
rescue StandardError => e
  Sentry.capture_exception(e)
  raise
end

I run in :external execution mode locally, and I can see the one worker process correctly:

[Screenshot: 2023-01-16 at 20:13:21]

But in production I don't see the processes:

[Screenshot: 2023-01-16 at 20:12:28]

Also notice that there's only 1 running job while there are a lot of queued jobs waiting to be picked up. With my configuration I would expect around 7 running jobs (2 from the dedicated threads on the low_priority queue and 5 from the wildcard (*)).

I'm running with one worker instance (replica). I tried to run 3 worker pods and the count of running jobs did not change significantly.

Do you have any clues on what could be going on?

@bensheldon
Owner

That's no good!

Quickly, the thing that sticks out to me is that you're not seeing a Process in production. That tells me that your good_job worker process is not healthy, which would explain why your jobs are not running.
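
A minimal way to check that from a Rails console (a sketch; GoodJob::Process is the record the Dashboard reads, and the exact keys inside state can vary by GoodJob version):

# Each healthy good_job worker process should register one GoodJob::Process record.
GoodJob::Process.count

# Inspect what each worker reported about itself: pid, hostname, and the
# schedulers (queues and max_threads) it booted with.
GoodJob::Process.find_each { |process| pp process.state }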

@jgrau
Contributor Author

jgrau commented Jan 16, 2023

Yeah, something is not healthy. The worker is there and it is working. I am seeing that it is failing health checks, though:

[Screenshot: 2023-01-16 at 20:43:07]

Here's my Kubernetes manifest in case that's useful:

      containers:
        - name: main
          securityContext: {}
          image: registry.digitalocean.com/landfolk/api:latest
          imagePullPolicy: IfNotPresent
          command:
            - bundle
            - exec
            - good_job
            - start
          ports:
            - name: probe-port
              containerPort: 7001
          startupProbe:
            httpGet:
              path: '/status/started'
              port: probe-port
            failureThreshold: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: '/status/connected'
              port: probe-port
            failureThreshold: 1
            periodSeconds: 10
          resources:
            requests:
              memory: '2Gi'
              cpu: '1000m'
            limits:
              memory: '2Gi'
              cpu: '1000m'
          env:
            - name: DB_POOL
              value: '18'
            - name: GOOD_JOB_PROBE_PORT
              value: '7001'
          envFrom:
            - configMapRef:
                name: api
            - secretRef:
                name: api
            - secretRef:
                name: aws
            - secretRef:
                name: postgres
            - secretRef:
                name: sentry

@jgrau
Contributor Author

jgrau commented Jan 16, 2023

I'm pretty sure it's not a resource problem:

[Screenshot: 2023-01-16 at 20:46:30]

[Screenshot: 2023-01-16 at 20:46:24]

@jgrau
Contributor Author

jgrau commented Jan 16, 2023

I don't have a lot of info on the failing health checks, but here it is:
[Screenshot: 2023-01-16 at 20:49:19]

@bensheldon
Owner

Those health checks look pretty suspicious. They aren't simply returning an HTTP error status; they're not even connecting. That's not good.

Your URL and port configuration look correct and consistent. I'm stumped!

Can you verify locally that the probe port is running?

If so, can you open that worker to the internet and see if you can access the health check? My thoughts are:

  • the probe server is broken
  • the URL/port that Kube is checking is wrong
  • there's some Kube networking config we're overlooking and that port isn't correctly wired up
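
For the first check, something along these lines can be run from inside the pod (a sketch; it assumes the probe server is listening on localhost and that GOOD_JOB_PROBE_PORT is 7001, as in the manifest above):

# probe_check.rb -- hit the GoodJob probe server endpoints directly.
require "net/http"

port = Integer(ENV.fetch("GOOD_JOB_PROBE_PORT", "7001"))

["/status/started", "/status/connected"].each do |path|
  response = Net::HTTP.get_response("localhost", path, port)
  puts "#{path} -> #{response.code} #{response.body}"
end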

@jgrau
Contributor Author

jgrau commented Jan 16, 2023

I'll try debugging the health checks, but I want to point out that work is being performed. For example, right now the worker has been running for 80 minutes with just one restart (due to a health check failure).

@jgrau
Contributor Author

jgrau commented Jan 16, 2023

I could curl the health checks from a pod inside the cluster:

Startedapi-shell-754f67c6f9-sb6wc:/app# curl -v 10.244.4.253:7001/status/connected
*   Trying 10.244.4.253:7001...
* Connected to 10.244.4.253 (10.244.4.253) port 7001 (#0)
> GET /status/connected HTTP/1.1
> Host: 10.244.4.253:7001
> User-Agent: curl/7.87.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Server: WEBrick/1.7.0 (Ruby/3.1.3/2022-11-24)
< Date: Mon, 16 Jan 2023 20:25:02 GMT
< Content-Length: 9
< Connection: Keep-Alive
< 
* Connection #0 to host 10.244.4.253 left intact
Connectedapi-shell-754f67c6f9-sb6wc:/app# curl -v 10.244.4.253:7001/status/started
*   Trying 10.244.4.253:7001...
* Connected to 10.244.4.253 (10.244.4.253) port 7001 (#0)
> GET /status/started HTTP/1.1
> Host: 10.244.4.253:7001
> User-Agent: curl/7.87.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Server: WEBrick/1.7.0 (Ruby/3.1.3/2022-11-24)
< Date: Mon, 16 Jan 2023 20:25:04 GMT
< Content-Length: 7
< Connection: Keep-Alive
< 
* Connection #0 to host 10.244.4.253 left intact
Startedapi-shell-754f67c6f9-sb6wc:/app# 

but of course if Kubernetes is pinging another URL/port then that doesn't mean much...

@bensheldon
Owner

Maybe I misunderstood.

The fire I see here is that there aren't any GoodJob::Process records being created. Everything else is smoke.

But where I may have misunderstood was in thinking the health checks were continuously failing, leading to something like:

  • GoodJob starts up
  • Maybe a job or two gets executed
  • Health check probe fails
  • Kube bounces the process
  • Repeat

But if the health checks are failing because the process is in a bad state, then I'm back to stumped. Are you able to find logs like "Notifier started LISTENing"? The Notifier is also what creates the GoodJob::Process record for the Dashboard, and it's also what the connected status check is reporting on (that the Notifier is listening).
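
One way to cross-check the Notifier from the database side (a sketch, and only a heuristic, since pg_stat_activity shows each connection's most recent statement):

# Rails console: look for Postgres sessions whose last statement was a LISTEN,
# which is what a healthy GoodJob Notifier connection will have issued.
listeners = ActiveRecord::Base.connection.select_all(<<~SQL)
  SELECT pid, application_name, state, query
  FROM pg_stat_activity
  WHERE query ILIKE 'LISTEN%'
SQL
pp listeners.to_a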

@jgrau
Contributor Author

jgrau commented Jan 16, 2023

Hmm... could this be a dashboard issue? It looks like the GoodJob::Process record is there:

Landfolk(production):001> GoodJob::Process.count
=> 1
Landfolk(production):002> GoodJob::Process.first
=> 
#<GoodJob::Process:0x00007fc9be61a370                                                
 id: "dfdbb125-242a-4f80-8596-1897de9c4a6a",                                         
 created_at: Mon, 16 Jan 2023 19:23:49.382429000 UTC +00:00,                         
 updated_at: Mon, 16 Jan 2023 19:23:49.382429000 UTC +00:00,                         
 state:                                                                              
  {"id"=>"dfdbb125-242a-4f80-8596-1897de9c4a6a",                                     
   "pid"=>1,                                                                         
   "hostname"=>"api-worker-57d4f68648-dfqrg",                                        
   "proctitle"=>"/app/vendor/bundle/ruby/3.1.0/bin/good_job",                        
   "schedulers"=>                                                                    
    ["GoodJob::Scheduler(queues=+high_priority max_threads=2)", "GoodJob::Scheduler(queues=low_priority max_threads=2)", "GoodJob::Scheduler(queues=* max_threads=5)"],
   "cron_enabled"=>true,                                                             
   "preserve_job_records"=>true,                                                     
   "retry_on_unhandled_error"=>false}>    

Yet the dashboard looks like this:

[Screenshot: 2023-01-16 at 21:41:49]

@jgrau
Contributor Author

jgrau commented Jan 16, 2023

Are you able to find logs like "Notifier started LISTENing"? The Notifier is also what creates the GoodJob::Process record for the Dashboard, and it's also what the connected status check is reporting on (that the Notifier is listening).

Yes, but there's not a lot of info (at least from what I can tell):

[Screenshot: 2023-01-16 at 21:47:13]

Here's what happened just prior to the UNLISTEN:

2023-01-16T19:23:17.827321149Z stdout F I, [2023-01-16T19:23:17.659144 #1]  INFO -- : [GoodJob] [1] [GoodJob::Scheduler(queues=* max_threads=5)-thread-9] Executed GoodJob 27c95602-f88b-423f-8982-4b734435a6ec
2023-01-16T19:23:17.827231225Z stdout F I, [2023-01-16T19:23:17.658771 #1]  INFO -- : [ActiveJob] [ActiveStorage::AnalyzeJob] [426a2ae2-8408-4ed6-bff9-df482302ba01] Performed ActiveStorage::AnalyzeJob (Job ID: 426a2ae2-8408-4ed6-bff9-df482302ba01) from GoodJob(default) in 1341.07ms
2023-01-16T19:23:16.730746744Z stdout F I, [2023-01-16T19:23:16.728357 #1]  INFO -- : [ActiveJob] [ActiveStorage::AnalyzeJob] [426a2ae2-8408-4ed6-bff9-df482302ba01]   S3 Storage (100.1ms) Downloaded file from key: o0xhpz0y2oxtbobg9ohqmv25zqxc
2023-01-16T19:23:16.591349076Z stdout F I, [2023-01-16T19:23:16.586900 #1]  INFO -- : [ActiveJob] [ActiveStorage::AnalyzeJob] [426a2ae2-8408-4ed6-bff9-df482302ba01]   S3 Storage (264.6ms) Downloaded file from key: o0xhpz0y2oxtbobg9ohqmv25zqxc
2023-01-16T19:23:16.591338653Z stdout F I, [2023-01-16T19:23:16.321154 #1]  INFO -- : [ActiveJob] [ActiveStorage::AnalyzeJob] [426a2ae2-8408-4ed6-bff9-df482302ba01] Performing ActiveStorage::AnalyzeJob (Job ID: 426a2ae2-8408-4ed6-bff9-df482302ba01) from GoodJob(default) enqueued at 2023-01-16T19:12:37Z with arguments: #<GlobalID:0x00007f423a18d6b0 @uri=#<URI::GID gid://landfolk/ActiveStorage::Blob/c5e6865b-8c24-46e4-b708-4c600ce83560>>
2023-01-16T19:23:16.591326091Z stdout F I, [2023-01-16T19:23:16.288075 #1]  INFO -- : [GoodJob] [1] [GoodJob::Scheduler(queues=* max_threads=5)-thread-11] Executed GoodJob 32297c4f-2a4e-492c-9968-805021e10c41
2023-01-16T19:23:16.591258611Z stdout F I, [2023-01-16T19:23:16.287769 #1]  INFO -- : [ActiveJob] [ActiveStorage::AnalyzeJob] [6b0c7336-6bda-42e5-9f44-29abf0e7087f] Performed ActiveStorage::AnalyzeJob (Job ID: 6b0c7336-6bda-42e5-9f44-29abf0e7087f) from GoodJob(default) in 9290.82ms
2023-01-16T19:23:07.29184909Z stdout F I, [2023-01-16T19:23:07.283798 #1]  INFO -- : [ActiveJob] [ActiveStorage::AnalyzeJob] [6b0c7336-6bda-42e5-9f44-29abf0e7087f]   S3 Storage (149.4ms) Downloaded file from key: pq2f4vupvnc4doa4sct7sqvxa2ez
2023-01-16T19:23:07.095051208Z stdout F I, [2023-01-16T19:23:07.088416 #1]  INFO -- : [ActiveJob] [ActiveStorage::AnalyzeJob] [6b0c7336-6bda-42e5-9f44-29abf0e7087f]   S3 Storage (86.6ms) Downloaded file from key: pq2f4vupvnc4doa4sct7sqvxa2ez
2023-01-16T19:23:07.095039217Z stdout F I, [2023-01-16T19:23:07.000743 #1]  INFO -- : [ActiveJob] [ActiveStorage::AnalyzeJob] [6b0c7336-6bda-42e5-9f44-29abf0e7087f] Performing ActiveStorage::AnalyzeJob (Job ID: 6b0c7336-6bda-42e5-9f44-29abf0e7087f) from GoodJob(default) enqueued at 2023-01-16T19:12:37Z with arguments: #<GlobalID:0x00007f423b6711f8 @uri=#<URI::GID gid://landfolk/ActiveStorage::Blob/f8af224a-ec18-4669-9d22-aebfbf9514bf>>
2023-01-16T19:23:07.09502387Z stdout F I, [2023-01-16T19:23:06.968640 #1]  INFO -- : [GoodJob] [1] [GoodJob::Scheduler(queues=* max_threads=5)-thread-9] Executed GoodJob 854ebaf0-1bf4-413f-aa5f-3cbb28700405
2023-01-16T19:23:07.094905987Z stdout F I, [2023-01-16T19:23:06.968299 #1]  INFO -- : [ActiveJob] [ActiveStorage::AnalyzeJob] [5ec6cae4-dbb0-43c3-8daa-abe1bcd2da41] Performed ActiveStorage::AnalyzeJob (Job ID: 5ec6cae4-dbb0-43c3-8daa-abe1bcd2da41) from GoodJob(default) in 3544.12ms
2023-01-16T19:23:06.094875574Z stdout F I, [2023-01-16T19:23:06.092621 #1]  INFO -- : [ActiveJob] [ActiveStorage::AnalyzeJob] [5ec6cae4-dbb0-43c3-8daa-abe1bcd2da41]   S3 Storage (73.4ms) Downloaded file from key: s3wp5pr0duyuupidmnlvtvm91d85
2023-01-16T19:23:05.980051069Z stdout F I, [2023-01-16T19:23:05.631672 #1]  INFO -- : [ActiveJob] [ActiveStorage::AnalyzeJob] [5ec6cae4-dbb0-43c3-8daa-abe1bcd2da41]   S3 Storage (1506.1ms) Downloaded file from key: s3wp5pr0duyuupidmnlvtvm91d85
2023-01-16T19:23:05.980043465Z stdout F I, [2023-01-16T19:23:04.623369 #1]  INFO -- : [GoodJob] [1] GoodJob shutting down scheduler...
2023-01-16T19:23:05.98003717Z stdout F I, [2023-01-16T19:23:04.622958 #1]  INFO -- : [GoodJob] [1] GoodJob scheduler is shut down.
2023-01-16T19:23:05.980030016Z stdout F I, [2023-01-16T19:23:04.525566 #1]  INFO -- : [GoodJob] [1] [GoodJob::Scheduler(queues=low_priority max_threads=2)-thread-2] Executed GoodJob e7f4cac5-a764-4d87-b100-9d3af8e5d78b
2023-01-16T19:23:05.980022145Z stdout F I, [2023-01-16T19:23:04.525248 #1]  INFO -- : [ActiveJob] [ImportContentTranslationsJob] [e99e7f87-6f04-4376-9d33-7ebf8c8dc894] Performed ImportContentTranslationsJob (Job ID: e99e7f87-6f04-4376-9d33-7ebf8c8dc894) from GoodJob(low_priority) in 55823.37ms
2023-01-16T19:23:05.980015197Z stdout F I, [2023-01-16T19:23:04.425711 #1]  INFO -- : [GoodJob] [1] GoodJob shutting down scheduler...
2023-01-16T19:23:05.980006429Z stdout F I, [2023-01-16T19:23:04.124584 #1]  INFO -- : [ActiveJob] [ActiveStorage::AnalyzeJob] [5ec6cae4-dbb0-43c3-8daa-abe1bcd2da41] Performing ActiveStorage::AnalyzeJob (Job ID: 5ec6cae4-dbb0-43c3-8daa-abe1bcd2da41) from GoodJob(default) enqueued at 2023-01-16T19:12:37Z with arguments: #<GlobalID:0x00007f42417b65f8 @uri=#<URI::GID gid://landfolk/ActiveStorage::Blob/be0e720e-3760-42d6-8c9c-1fb893486c09>>
2023-01-16T19:23:05.979993292Z stdout F I, [2023-01-16T19:23:03.823577 #1]  INFO -- : [GoodJob] Notifier unsubscribed with UNLISTEN

@bensheldon
Owner

The Dashboard should only show records that have an active Advisory Lock on them (to be more accurate, the Notifier creates the record and then takes a lock on it).

The advisory lock is to differentiate records that are left over after a SIGKILL.
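
A rough way to see whether that lock is actually held (a sketch and only a coarse check, since other code can also take advisory locks):

# Rails console: compare worker process records with the advisory locks
# currently granted in Postgres. A healthy worker holds a lock on its own
# GoodJob::Process record; records left over after a SIGKILL do not.
process_count = GoodJob::Process.count
advisory_lock_count = ActiveRecord::Base.connection.select_value(
  "SELECT count(*) FROM pg_locks WHERE locktype = 'advisory' AND granted"
)
puts "GoodJob::Process records: #{process_count}"
puts "granted advisory locks:   #{advisory_lock_count}"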

@bensheldon
Owner

Idea: try removing the health check from your Kube config. That would at least remove the question of whether Kube is restarting the process problematically.

@jgrau
Contributor Author

jgrau commented Jan 16, 2023

Idea: try removing the health check from your Kube config. That would at least remove the question of whether Kube is restarting the process problematically.

Thanks for the idea. The change is on its way to production.

I don't have any experience with pg locks, but could my issue be related to that? Actually, the reason I started diving deep into the good_job setup was that we started to experience multiple workers picking up the same job. That sounds like it could be a locking problem...

@bensheldon
Owner

Aha! I believe you're setting the Active Record parent class incorrectly.

There isn't a config.good_job.active_record_parent_class. It must be assigned directly on the GoodJob constant.

https://github.com/bensheldon/good_job#pgbouncer-compatibility
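
For reference, the linked README section assigns the parent class directly on the GoodJob module, so the initializer above would change to something like this (a sketch, keeping the DirectConnection class from the original config):

# config/initializers/good_job.rb
# Assign the parent class directly on the GoodJob constant;
# config.good_job.active_record_parent_class is not read by GoodJob.
GoodJob.active_record_parent_class = "DirectConnection"

Rails.application.configure do
  config.good_job.logger = Rails.logger
  # ...the rest of the config.good_job.* settings stay as they are...
end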

@jgrau
Contributor Author

jgrau commented Jan 16, 2023

I literally discovered that the moment you wrote that post!

@jgrau
Contributor Author

jgrau commented Jan 16, 2023

Yes, that fixed it! ❤️

[Screenshot: 2023-01-16 at 22:54:55]

@bensheldon thank you so much for all your help and this great gem ❤️

@jgrau jgrau closed this as completed Jan 16, 2023