Work is not being picked up at the expected rate #802

Closed
jgrau opened this issue Jan 16, 2023 · 16 comments

@jgrau
Contributor

jgrau commented Jan 16, 2023

Hi @bensheldon

I recently switched from the default queue configuration (*) to having dedicated thread pools for separate queues. I don't know for sure that this was the change that introduced what I'm seeing now, but I'm fairly confident it is.

My workers are very slow to pick up new jobs, and it feels like not all of the threads are actually working. This is my config/initializers/good_job.rb:

Rails.application.configure do
  # Use the same logger as rails is using
  config.good_job.logger = Rails.logger

  # Set the parent class for all our active record models
  config.good_job.active_record_parent_class = "DirectConnection"

  # To begin with we keep the job records for inspection
  # but we should not preserve jobs once we're confident
  # everything is running smoothly.
  config.good_job.preserve_job_records = true

  # If GoodJob receives an error we want to log it to Sentry.
  config.good_job.retry_on_unhandled_error = false
  config.good_job.on_thread_error = ->(exception) do
    Sentry.capture_exception(exception)
  end

  # The plus sign means that the queues are prioritized from first to last
  # The star at the end will perform any other queue last. This will catch e.g. if no queue name
  # is given, in which case the queue name is 'default'.
  # The number specifies the number of threads to dedicate to that queue. The default thread count is 5.
  # This configuration will result in 2 + 2 + 5 = 9 threads.
  config.good_job.queues = "+high_priority:2;low_priority:2;*"

  config.good_job.enable_cron = true
  config.good_job.cron = { <remove for brevity but I have 10 cron entries> }

  # Use high_priority queue for emails
  config.action_mailer.deliver_later_queue_name = "high_priority"
end

# Any configuration in ApplicationJob will have to
# be duplicated on ActionMailer::MailDeliveryJob because
# ActionMailer uses a custom class, ActionMailer::MailDeliveryJob,
# which inherits from ActiveJob::Base, rather than your
# application's ApplicationJob.
ActionMailer::MailDeliveryJob.retry_on(
  StandardError,
  wait: :exponentially_longer,
  attempts: 10,
)

# Log all mail delivery errors in jobs to sentry
ActionMailer::MailDeliveryJob.around_perform do |_job, block|
  block.call
rescue StandardError => e
  Sentry.capture_exception(e)
  raise
end

I run in :external execution mode locally, and I can see the one worker process correctly:

[Screenshot: 2023-01-16 at 20:13:21]

But in production I don't see the processes:

[Screenshot: 2023-01-16 at 20:12:28]

Also notice that there's only 1 running job while there are a lot of queued jobs waiting to be picked up. With my configuration I would expect around 7 running jobs (2 from the dedicated threads on the low_priority queue and 5 from the wildcard (*)).

I'm running with one worker instance (replica). I tried to run 3 worker pods and the count of running jobs did not change significantly.

Do you have any clues on what could be going on?

@bensheldon
Owner

That's no good!

Quickly, the thing that sticks out to me is that you're not seeing a Process in production. That tells me that your good_job worker process is not healthy, which would explain why your jobs are not running.
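
A minimal way to check that from a Rails console (a sketch; GoodJob::Process is the record the Dashboard reads, and the exact keys inside state can vary by GoodJob version):

# Each healthy good_job worker process should register one GoodJob::Process record.
GoodJob::Process.count

# Inspect what each worker reported about itself: pid, hostname, and the
# schedulers (queues and max_threads) it booted with.
GoodJob::Process.find_each { |process| pp process.state }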

@jgrau
Contributor Author

jgrau commented Jan 16, 2023

Yeah, something is not healthy. The worker is there and it is working. I am seeing that it is failing health checks, though:

[Screenshot: 2023-01-16 at 20:43:07]

Here's my Kubernetes manifest in case that's useful:

      containers:
        - name: main
          securityContext: {}
          image: registry.digitalocean.com/landfolk/api:latest
          imagePullPolicy: IfNotPresent
          command:
            - bundle
            - exec
            - good_job
            - start
          ports:
            - name: probe-port
              containerPort: 7001
          startupProbe:
            httpGet:
              path: '/status/started'
              port: probe-port
            failureThreshold: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: '/status/connected'
              port: probe-port
            failureThreshold: 1
            periodSeconds: 10
          resources:
            requests:
              memory: '2Gi'
              cpu: '1000m'
            limits:
              memory: '2Gi'
              cpu: '1000m'
          env:
            - name: DB_POOL
              value: '18'
            - name: GOOD_JOB_PROBE_PORT
              value: '7001'
          envFrom:
            - configMapRef:
                name: api
            - secretRef:
                name: api
            - secretRef:
                name: aws
            - secretRef:
                name: postgres
            - secretRef:
                name: sentry

@jgrau
Contributor Author

jgrau commented Jan 16, 2023

I'm pretty sure it's not a resource problem:

[Screenshot: 2023-01-16 at 20:46:30]

[Screenshot: 2023-01-16 at 20:46:24]

@jgrau
Contributor Author

jgrau commented Jan 16, 2023

I don't have a lot of info on the failing health checks, but here it is:
[Screenshot: 2023-01-16 at 20:49:19]

@bensheldon
Owner

Those health checks look pretty suspicious. They aren't simply returning an HTTP error status; they're not even connecting. That's not good.

Your URL and port configuration look correct and consistent. I'm stumped!

Can you verify locally that the probe port is running?

If so, can you open that worker to the internet and see if you can access the health check? My thoughts are:

  • the probe server is broken
  • the URL/port that Kube is checking is wrong
  • there's some Kube networking config we're overlooking and that port isn't correctly wired up
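
For the first check, something along these lines can be run from inside the pod (a sketch; it assumes the probe server is listening on localhost and that GOOD_JOB_PROBE_PORT is 7001, as in the manifest above):

# probe_check.rb -- hit the GoodJob probe server endpoints directly.
require "net/http"

port = Integer(ENV.fetch("GOOD_JOB_PROBE_PORT", "7001"))

["/status/started", "/status/connected"].each do |path|
  response = Net::HTTP.get_response("localhost", path, port)
  puts "#{path} -> #{response.code} #{response.body}"
end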

@jgrau
Contributor Author

jgrau commented Jan 16, 2023

I'll try debugging the health checks, but I want to point out that work is being performed. For example, right now the worker has been running for 80 minutes with just one restart (due to a health check failure).

@jgrau
Contributor Author

jgrau commented Jan 16, 2023

I could curl the health checks from a pod inside the cluster:

Startedapi-shell-754f67c6f9-sb6wc:/app# curl -v 10.244.4.253:7001/status/connected
*   Trying 10.244.4.253:7001...
* Connected to 10.244.4.253 (10.244.4.253) port 7001 (#0)
> GET /status/connected HTTP/1.1
> Host: 10.244.4.253:7001
> User-Agent: curl/7.87.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Server: WEBrick/1.7.0 (Ruby/3.1.3/2022-11-24)
< Date: Mon, 16 Jan 2023 20:25:02 GMT
< Content-Length: 9
< Connection: Keep-Alive
< 
* Connection #0 to host 10.244.4.253 left intact
Connectedapi-shell-754f67c6f9-sb6wc:/app# curl -v 10.244.4.253:7001/status/started
*   Trying 10.244.4.253:7001...
* Connected to 10.244.4.253 (10.244.4.253) port 7001 (#0)
> GET /status/started HTTP/1.1
> Host: 10.244.4.253:7001
> User-Agent: curl/7.87.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Server: WEBrick/1.7.0 (Ruby/3.1.3/2022-11-24)
< Date: Mon, 16 Jan 2023 20:25:04 GMT
< Content-Length: 7
< Connection: Keep-Alive
< 
* Connection #0 to host 10.244.4.253 left intact
Startedapi-shell-754f67c6f9-sb6wc:/app# 

but of course if Kubernetes is pinging another URL/port then that doesn't mean much...

@bensheldon
Owner

Maybe I misunderstood.

The fire I see here is that there aren't any GoodJob::Process records being created. Everything else is smoke.

But where I may have misunderstood was in thinking the health checks were continuously failing, leading to something like:

  • GoodJob starts up
  • Maybe a job or two gets executed
  • Health check probe fails
  • Kube bounces the process
  • Repeat

But if the health checks are failing because the process is in a bad state, then I'm back to stumped. Are you able to find logs like "Notifier started LISTENing"? The Notifier is also what creates the GoodJob::Process record for the Dashboard, and it's also what the connected status check is reporting on (that the Notifier is listening).
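
One way to cross-check the Notifier from the database side (a sketch, and only a heuristic, since pg_stat_activity shows each connection's most recent statement):

# Rails console: look for Postgres sessions whose last statement was a LISTEN,
# which is what a healthy GoodJob Notifier connection will have issued.
listeners = ActiveRecord::Base.connection.select_all(<<~SQL)
  SELECT pid, application_name, state, query
  FROM pg_stat_activity
  WHERE query ILIKE 'LISTEN%'
SQL
pp listeners.to_a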

@jgrau
Contributor Author

jgrau commented Jan 16, 2023

Hmm... could this be a dashboard issue? It looks like the GoodJob::Process record is there:

Landfolk(production):001> GoodJob::Process.count
=> 1
Landfolk(production):002> GoodJob::Process.first
=> 
#<GoodJob::Process:0x00007fc9be61a370                                                
 id: "dfdbb125-242a-4f80-8596-1897de9c4a6a",                                         
 created_at: Mon, 16 Jan 2023 19:23:49.382429000 UTC +00:00,                         
 updated_at: Mon, 16 Jan 2023 19:23:49.382429000 UTC +00:00,                         
 state:                                                                              
  {"id"=>"dfdbb125-242a-4f80-8596-1897de9c4a6a",                                     
   "pid"=>1,                                                                         
   "hostname"=>"api-worker-57d4f68648-dfqrg",                                        
   "proctitle"=>"/app/vendor/bundle/ruby/3.1.0/bin/good_job",                        
   "schedulers"=>                                                                    
    ["GoodJob::Scheduler(queues=+high_priority max_threads=2)", "GoodJob::Scheduler(queues=low_priority max_threads=2)", "GoodJob::Scheduler(queues=* max_threads=5)"],
   "cron_enabled"=>true,                                                             
   "preserve_job_records"=>true,                                                     
   "retry_on_unhandled_error"=>false}>    

Yet the dashboard looks like this:

[Screenshot: 2023-01-16 at 21:41:49]

@jgrau
Contributor Author

jgrau commented Jan 16, 2023

Are you able to find logs like "Notifier started LISTENing"? The Notifier is also what creates the GoodJob::Process record for the Dashboard, and it's also what the connected status check is reporting on (that the Notifier is listening).

Yes, but there's not a lot of info (at least from what I can tell):

[Screenshot: 2023-01-16 at 21:47:13]

Here's what happened just prior to the UNLISTEN:

2023-01-16T19:23:17.827321149Z stdout F I, [2023-01-16T19:23:17.659144 #1]  INFO -- : [GoodJob] [1] [GoodJob::Scheduler(queues=* max_threads=5)-thread-9] Executed GoodJob 27c95602-f88b-423f-8982-4b734435a6ec
2023-01-16T19:23:17.827231225Z stdout F I, [2023-01-16T19:23:17.658771 #1]  INFO -- : [ActiveJob] [ActiveStorage::AnalyzeJob] [426a2ae2-8408-4ed6-bff9-df482302ba01] Performed ActiveStorage::AnalyzeJob (Job ID: 426a2ae2-8408-4ed6-bff9-df482302ba01) from GoodJob(default) in 1341.07ms
2023-01-16T19:23:16.730746744Z stdout F I, [2023-01-16T19:23:16.728357 #1]  INFO -- : [ActiveJob] [ActiveStorage::AnalyzeJob] [426a2ae2-8408-4ed6-bff9-df482302ba01]   S3 Storage (100.1ms) Downloaded file from key: o0xhpz0y2oxtbobg9ohqmv25zqxc
2023-01-16T19:23:16.591349076Z stdout F I, [2023-01-16T19:23:16.586900 #1]  INFO -- : [ActiveJob] [ActiveStorage::AnalyzeJob] [426a2ae2-8408-4ed6-bff9-df482302ba01]   S3 Storage (264.6ms) Downloaded file from key: o0xhpz0y2oxtbobg9ohqmv25zqxc
2023-01-16T19:23:16.591338653Z stdout F I, [2023-01-16T19:23:16.321154 #1]  INFO -- : [ActiveJob] [ActiveStorage::AnalyzeJob] [426a2ae2-8408-4ed6-bff9-df482302ba01] Performing ActiveStorage::AnalyzeJob (Job ID: 426a2ae2-8408-4ed6-bff9-df482302ba01) from GoodJob(default) enqueued at 2023-01-16T19:12:37Z with arguments: #<GlobalID:0x00007f423a18d6b0 @uri=#<URI::GID gid://landfolk/ActiveStorage::Blob/c5e6865b-8c24-46e4-b708-4c600ce83560>>
2023-01-16T19:23:16.591326091Z stdout F I, [2023-01-16T19:23:16.288075 #1]  INFO -- : [GoodJob] [1] [GoodJob::Scheduler(queues=* max_threads=5)-thread-11] Executed GoodJob 32297c4f-2a4e-492c-9968-805021e10c41
2023-01-16T19:23:16.591258611Z stdout F I, [2023-01-16T19:23:16.287769 #1]  INFO -- : [ActiveJob] [ActiveStorage::AnalyzeJob] [6b0c7336-6bda-42e5-9f44-29abf0e7087f] Performed ActiveStorage::AnalyzeJob (Job ID: 6b0c7336-6bda-42e5-9f44-29abf0e7087f) from GoodJob(default) in 9290.82ms
2023-01-16T19:23:07.29184909Z stdout F I, [2023-01-16T19:23:07.283798 #1]  INFO -- : [ActiveJob] [ActiveStorage::AnalyzeJob] [6b0c7336-6bda-42e5-9f44-29abf0e7087f]   S3 Storage (149.4ms) Downloaded file from key: pq2f4vupvnc4doa4sct7sqvxa2ez
2023-01-16T19:23:07.095051208Z stdout F I, [2023-01-16T19:23:07.088416 #1]  INFO -- : [ActiveJob] [ActiveStorage::AnalyzeJob] [6b0c7336-6bda-42e5-9f44-29abf0e7087f]   S3 Storage (86.6ms) Downloaded file from key: pq2f4vupvnc4doa4sct7sqvxa2ez
2023-01-16T19:23:07.095039217Z stdout F I, [2023-01-16T19:23:07.000743 #1]  INFO -- : [ActiveJob] [ActiveStorage::AnalyzeJob] [6b0c7336-6bda-42e5-9f44-29abf0e7087f] Performing ActiveStorage::AnalyzeJob (Job ID: 6b0c7336-6bda-42e5-9f44-29abf0e7087f) from GoodJob(default) enqueued at 2023-01-16T19:12:37Z with arguments: #<GlobalID:0x00007f423b6711f8 @uri=#<URI::GID gid://landfolk/ActiveStorage::Blob/f8af224a-ec18-4669-9d22-aebfbf9514bf>>
2023-01-16T19:23:07.09502387Z stdout F I, [2023-01-16T19:23:06.968640 #1]  INFO -- : [GoodJob] [1] [GoodJob::Scheduler(queues=* max_threads=5)-thread-9] Executed GoodJob 854ebaf0-1bf4-413f-aa5f-3cbb28700405
2023-01-16T19:23:07.094905987Z stdout F I, [2023-01-16T19:23:06.968299 #1]  INFO -- : [ActiveJob] [ActiveStorage::AnalyzeJob] [5ec6cae4-dbb0-43c3-8daa-abe1bcd2da41] Performed ActiveStorage::AnalyzeJob (Job ID: 5ec6cae4-dbb0-43c3-8daa-abe1bcd2da41) from GoodJob(default) in 3544.12ms
2023-01-16T19:23:06.094875574Z stdout F I, [2023-01-16T19:23:06.092621 #1]  INFO -- : [ActiveJob] [ActiveStorage::AnalyzeJob] [5ec6cae4-dbb0-43c3-8daa-abe1bcd2da41]   S3 Storage (73.4ms) Downloaded file from key: s3wp5pr0duyuupidmnlvtvm91d85
2023-01-16T19:23:05.980051069Z stdout F I, [2023-01-16T19:23:05.631672 #1]  INFO -- : [ActiveJob] [ActiveStorage::AnalyzeJob] [5ec6cae4-dbb0-43c3-8daa-abe1bcd2da41]   S3 Storage (1506.1ms) Downloaded file from key: s3wp5pr0duyuupidmnlvtvm91d85
2023-01-16T19:23:05.980043465Z stdout F I, [2023-01-16T19:23:04.623369 #1]  INFO -- : [GoodJob] [1] GoodJob shutting down scheduler...
2023-01-16T19:23:05.98003717Z stdout F I, [2023-01-16T19:23:04.622958 #1]  INFO -- : [GoodJob] [1] GoodJob scheduler is shut down.
2023-01-16T19:23:05.980030016Z stdout F I, [2023-01-16T19:23:04.525566 #1]  INFO -- : [GoodJob] [1] [GoodJob::Scheduler(queues=low_priority max_threads=2)-thread-2] Executed GoodJob e7f4cac5-a764-4d87-b100-9d3af8e5d78b
2023-01-16T19:23:05.980022145Z stdout F I, [2023-01-16T19:23:04.525248 #1]  INFO -- : [ActiveJob] [ImportContentTranslationsJob] [e99e7f87-6f04-4376-9d33-7ebf8c8dc894] Performed ImportContentTranslationsJob (Job ID: e99e7f87-6f04-4376-9d33-7ebf8c8dc894) from GoodJob(low_priority) in 55823.37ms
2023-01-16T19:23:05.980015197Z stdout F I, [2023-01-16T19:23:04.425711 #1]  INFO -- : [GoodJob] [1] GoodJob shutting down scheduler...
2023-01-16T19:23:05.980006429Z stdout F I, [2023-01-16T19:23:04.124584 #1]  INFO -- : [ActiveJob] [ActiveStorage::AnalyzeJob] [5ec6cae4-dbb0-43c3-8daa-abe1bcd2da41] Performing ActiveStorage::AnalyzeJob (Job ID: 5ec6cae4-dbb0-43c3-8daa-abe1bcd2da41) from GoodJob(default) enqueued at 2023-01-16T19:12:37Z with arguments: #<GlobalID:0x00007f42417b65f8 @uri=#<URI::GID gid://landfolk/ActiveStorage::Blob/be0e720e-3760-42d6-8c9c-1fb893486c09>>
2023-01-16T19:23:05.979993292Z stdout F I, [2023-01-16T19:23:03.823577 #1]  INFO -- : [GoodJob] Notifier unsubscribed with UNLISTEN

@bensheldon
Owner

The Dashboard should only show records that have an active Advisory Lock on them (to be more accurate, the Notifier creates the record and then takes a lock on it).

The advisory lock is to differentiate records that are left over after a SIGKILL.
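
A rough way to see whether that lock is actually held (a sketch and only a coarse check, since other code can also take advisory locks):

# Rails console: compare worker process records with the advisory locks
# currently granted in Postgres. A healthy worker holds a lock on its own
# GoodJob::Process record; records left over after a SIGKILL do not.
process_count = GoodJob::Process.count
advisory_lock_count = ActiveRecord::Base.connection.select_value(
  "SELECT count(*) FROM pg_locks WHERE locktype = 'advisory' AND granted"
)
puts "GoodJob::Process records: #{process_count}"
puts "granted advisory locks:   #{advisory_lock_count}"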

@bensheldon
Owner

Idea: try removing the health check from your Kube config. That would at least remove the question of whether Kube is restarting the process problematically.

@jgrau
Contributor Author

jgrau commented Jan 16, 2023

Idea: try removing the health check from your Kube config. That would at least remove the question of whether Kube is restarting the process problematically.

Thanks for the idea. The change is on its way to production.

I don't have any experience with pg locks, but could my issue be related to that? Actually, the reason I started diving deep into the good_job setup was that we started to experience multiple workers picking up the same job. That sounds like it could be a locking problem...

@bensheldon
Owner

Aha! I believe you're setting the Active Record parent class incorrectly.

There isn't a config.good_job.active_record_parent_class. It must be assigned directly on the GoodJob constant.

https://github.com/bensheldon/good_job#pgbouncer-compatibility
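
For reference, the linked README section assigns the parent class directly on the GoodJob module, so the initializer above would change to something like this (a sketch, keeping the DirectConnection class from the original config):

# config/initializers/good_job.rb
# Assign the parent class directly on the GoodJob constant;
# config.good_job.active_record_parent_class is not read by GoodJob.
GoodJob.active_record_parent_class = "DirectConnection"

Rails.application.configure do
  config.good_job.logger = Rails.logger
  # ...the rest of the config.good_job.* settings stay as they are...
end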

@jgrau
Contributor Author

jgrau commented Jan 16, 2023

I literally discovered that the moment you wrote that post!

@jgrau
Contributor Author

jgrau commented Jan 16, 2023

Yes, that fixed it! ❤️

[Screenshot: 2023-01-16 at 22:54:55]

@bensheldon thank you so much for all your help and this great gem ❤️

@jgrau jgrau closed this as completed Jan 16, 2023