Work is not being picked up at the expected rate #802
That's no good! Quickly, the thing that sticks out to me is that you're not seeing a Process in production. That tells me that your …
Yeah, something is not healthy. The worker is there and it is working. I am seeing that it is failing health checks, though. Here's my Kubernetes manifest in case that's useful:

containers:
  - name: main
    securityContext: {}
    image: registry.digitalocean.com/landfolk/api:latest
    imagePullPolicy: IfNotPresent
    command:
      - bundle
      - exec
      - good_job
      - start
    ports:
      - name: probe-port
        containerPort: 7001
    startupProbe:
      httpGet:
        path: '/status/started'
        port: probe-port
      failureThreshold: 30
      periodSeconds: 10
    livenessProbe:
      httpGet:
        path: '/status/connected'
        port: probe-port
      failureThreshold: 1
      periodSeconds: 10
    resources:
      requests:
        memory: '2Gi'
        cpu: '1000m'
      limits:
        memory: '2Gi'
        cpu: '1000m'
    env:
      - name: DB_POOL
        value: '18'
      - name: GOOD_JOB_PROBE_PORT
        value: '7001'
    envFrom:
      - configMapRef:
          name: api
      - secretRef:
          name: api
      - secretRef:
          name: aws
      - secretRef:
          name: postgres
      - secretRef:
          name: sentry
Those health checks look pretty suspicious. They aren't simply returning an HTTP error status; they're not even connecting. That's not good. Your URL and port configuration look correct and consistent. I'm stumped! Can you verify locally that the probe port is running? If so, can you open that worker to the internet and see if you can access the health check? My thoughts are: …
I'll try debugging the health checks, but I want to point out that work is being performed. For example, right now it has been running for 80 minutes with just one restart (due to a health check failure).
I could curl the health checks from a pod inside the cluster:
But of course, if Kubernetes is pinging another URL/port then that doesn't mean much...
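A minimal sketch of that in-cluster check from a Rails console (assumptions: the probe server is reachable on localhost from the worker pod and bound to port 7001, matching GOOD_JOB_PROBE_PORT in the manifest above):

    # Hedged sketch: hit the probe endpoints and print the HTTP status codes.
    # "localhost" and port 7001 are assumptions taken from the manifest above.
    require "net/http"

    ["/status/started", "/status/connected"].each do |path|
      response = Net::HTTP.get_response("localhost", path, 7001)
      puts "#{path} => #{response.code}"
    end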
Maybe I misunderstood. The fire I see here is that there aren't any GoodJob::Process records being created. Everything else is smoke. But where I maybe misunderstood was in thinking the health checks were continuously failing, leading to something like: …
But if the health checks are failing because the process is in a bad state, then I'm back to being stumped. Are you able to find logs like "Notifier started LISTENing"? The Notifier is also what creates the GoodJob::Process record for the Dashboard, and also what the …
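One way to check from the database side whether anything is LISTENing (a sketch; it assumes console access to the same Postgres database and that the Notifier's channel is named "good_job", which is what the gem uses):

    # Hedged sketch: look for backends whose most recent statement was a LISTEN.
    # A healthy GoodJob Notifier should show up as an idle connection that last
    # ran something like "LISTEN good_job".
    rows = ActiveRecord::Base.connection.select_all(<<~SQL)
      SELECT pid, application_name, state, query
      FROM pg_stat_activity
      WHERE query ILIKE 'LISTEN%'
    SQL
    rows.each { |row| puts row.inspect }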
Hmm... could this be a dashboard issue? It looks like the GoodJob::Process is there:
Landfolk(production):001> GoodJob::Process.count
=> 1
Landfolk(production):002> GoodJob::Process.first
=>
#<GoodJob::Process:0x00007fc9be61a370
id: "dfdbb125-242a-4f80-8596-1897de9c4a6a",
created_at: Mon, 16 Jan 2023 19:23:49.382429000 UTC +00:00,
updated_at: Mon, 16 Jan 2023 19:23:49.382429000 UTC +00:00,
state:
{"id"=>"dfdbb125-242a-4f80-8596-1897de9c4a6a",
"pid"=>1,
"hostname"=>"api-worker-57d4f68648-dfqrg",
"proctitle"=>"/app/vendor/bundle/ruby/3.1.0/bin/good_job",
"schedulers"=>
["GoodJob::Scheduler(queues=+high_priority max_threads=2)", "GoodJob::Scheduler(queues=low_priority max_threads=2)", "GoodJob::Scheduler(queues=* max_threads=5)"],
"cron_enabled"=>true,
"preserve_job_records"=>true,
"retry_on_unhandled_error"=>false}> Yet the dashboard looks like this: |
Yes, but there's not a lot of info (at least from what I can tell). Here's what happened just prior to the UNLISTEN:
The Dashboard should only show records that have an active advisory lock on them (to be more accurate, the Notifier creates the record and then takes a lock on it). The advisory lock is there to differentiate records that are left over after a SIGKILL.
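A quick way to see whether that lock is actually being held (a sketch; it assumes a Rails console connected to the same database):

    # Hedged sketch: session-level advisory locks show up in pg_locks with
    # locktype = 'advisory'. With a healthy worker, at least one granted row
    # should correspond to the Notifier's lock on the GoodJob::Process record.
    locks = ActiveRecord::Base.connection.select_all(<<~SQL)
      SELECT pid, classid, objid, granted
      FROM pg_locks
      WHERE locktype = 'advisory'
    SQL
    puts locks.to_a

If this comes back empty even though a GoodJob::Process row exists, the session that took the lock is gone, or something between the app and Postgres (for example a transaction-mode pooler) is dropping session state, which is roughly where this thread ends up.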
Idea: try removing the health check from your Kube config. That would at least remove the question of whether Kube is restarting the process problematically.
Thanks for the idea. The change is on its way online. I don't have any experience with pg locks, but could my issue be related to that? Actually, the reason I started diving deep into the good_job setup was that we started to experience multiple workers picking up the same job. That sounds like it could be a problem with locking...
Aha! I believe you're setting the Active Record parent class incorrectly. There isn't a …
https://github.com/bensheldon/good_job#pgbouncer-compatibility
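For readers landing here, the PgBouncer-compatibility section linked above comes down to pointing GoodJob at an Active Record parent class whose connection bypasses transaction-mode pooling. A minimal sketch, assuming a hypothetical database.yml entry named primary_direct that connects straight to Postgres:

    # Hedged sketch; "primary_direct" is a placeholder database.yml configuration,
    # not something from this thread. GoodJob's LISTEN/NOTIFY and advisory locks
    # need a session-stable connection, so this class connects directly to Postgres.
    class GoodJobRecord < ActiveRecord::Base
      self.abstract_class = true
      connects_to database: { writing: :primary_direct }
    end

    # config/initializers/good_job.rb
    GoodJob.active_record_parent_class = "GoodJobRecord"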
I literally discovered that the moment you wrote that post!
Yes, that fixed it! ❤️ @bensheldon, thank you so much for all your help and this great gem ❤️
Hi @bensheldon,

I recently switched from the default queue configuration (*) to having dedicated thread pools for separate queues. I should say that I don't know for sure whether that was the change that introduced what I'm seeing now, but I'm fairly confident that it is.

I'm experiencing that my workers are very slow at picking up new jobs, and it feels like not all the threads are actually working. This is my config/initializers/good_job.rb:

I run in :external execution mode locally and I can see the 1 worker correctly:

But in production I don't see the processes:

Also notice that there's only 1 running job while there are a lot of queued jobs waiting to be picked up. With my configuration I would expect around 7 running jobs (2 from the dedicated threads on the low_priority queue and 5 from the wildcard (*)).

I'm running with one worker instance (replica). I tried to run 3 worker pods and the count of running jobs did not change significantly.

Do you have any clues on what could be going on?
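For reference, a queue configuration that would produce the three schedulers listed in the GoodJob::Process record earlier in the thread might look roughly like the following. This is a reconstruction, not the initializer from the report (its contents did not survive in this transcript); the queue names and thread counts are taken from the scheduler names shown above:

    # Hedged reconstruction of a dedicated-thread-pool setup; queue names and
    # thread counts come from the scheduler list in the process record above.
    # config/initializers/good_job.rb
    Rails.application.configure do
      config.good_job.execution_mode = :external
      config.good_job.queues = "+high_priority:2;low_priority:2;*:5"
    end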