
How should I tune GoodJob settings given my job workload of ~1 million jobs/hr? #1595

Open
salmonsteak1 opened this issue Feb 1, 2025 · 4 comments


@salmonsteak1

Hello, I'm looking to test how GoodJob performs under a relatively heavy workload of around a million jobs executed every hour. My current setup has every job class in its own queue, and I've defaulted the --queue setting to *.

I've set GOOD_JOB_MAX_THREADS to 98, and as advised, I've set the DB pool size to a relatively high 200.
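
For context, the settings above amount to something like this (a simplified sketch only; the initializer form and the env-var defaults are illustrative, and the DB pool of 200 is set separately via the pool key in config/database.yml):

# config/initializers/good_job.rb (simplified sketch)
Rails.application.configure do
  # GoodJob also reads these from the environment; setting them here just makes the values explicit.
  config.good_job.max_threads = Integer(ENV.fetch("GOOD_JOB_MAX_THREADS", "98"))
  config.good_job.queues = ENV.fetch("GOOD_JOB_QUEUES", "*")
end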

Since GoodJob doesn't support PgBouncer in transaction mode, I'm connecting to the DB directly with these configurations. The number of GoodJob executors I'm expecting to run in production (at maximum) is around 1,800. I'm testing this against quite a beefy Postgres instance, but CPU usage peaks at over 60% (of 16 cores) and memory usage is maxed out at 143 GB (which caused the Postgres instance to crash). This doesn't include the web instances I have yet to account for, and no application traffic is being served by this Postgres DB yet.

I was wondering what's the best way to tune GoodJob settings so that it can handle this number of executors, as well as the web instances that will be hitting the Postgres DB. Thank you!

@bensheldon
Owner

GoodJob might not be the right tool at your scale. I always recommend Sidekiq Enterprise when talking about tens of millions of jobs.

That said, most of the performance-tuning advice I can give based on your numbers would be the same: that's too many threads per process. 5-15 threads is more realistic. Leave the DB pool size at 200 (the exact number matters less than it being comfortably large). You'll need to scale horizontally across however many processes it takes.

For GoodJob specifically: pulling from the * queue is most efficient, so that’s good. You should set up GoodJob to delete job records after they perform to reduce the table size.
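
A minimal sketch of that cleanup setting, assuming the usual Rails initializer:

# config/initializers/good_job.rb
Rails.application.configure do
  # Delete finished job records instead of preserving them, so the good_jobs table stays small.
  config.good_job.preserve_job_records = false
end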

If you can give an EXPLAIN ANALYZE of the lock query I can maybe help more. You're pushing the scale pretty hard.
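
For example, from a Rails console (a rough sketch; substitute the actual lock query, with its bind values filled in, for the placeholder statement below):

result = ActiveRecord::Base.connection.execute(<<~SQL)
  EXPLAIN (ANALYZE, BUFFERS)
  SELECT * FROM good_jobs LIMIT 1  -- placeholder; paste the real lock query here
SQL
result.each { |row| puts row["QUERY PLAN"] }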

@salmonsteak1
Author

Thanks for the advice @bensheldon! I'm considering Sidekiq Enterprise too, but it seems like there will be quite a bit of refactoring within the codebase to effectively use the full suite of Sidekiq Enterprise features without ActiveJob (due to some limitations of using ActiveJob with Sidekiq).

I'm actually using Solid Queue right now, and while it works under normal workloads, it doesn't seem too optimized for Postgres, and we're constantly seeing our CPU spike to 100%.

I'm afraid the best option would be to just bite the bullet and move to Sidekiq, since I'm sure that the number of jobs I'm processing will only increase from here.

Regarding the lock query, did you mean this query?

SELECT
  "good_job_processes".*
FROM
  "good_job_processes"
LEFT JOIN
  pg_locks
ON
  pg_locks.locktype = $1
  AND pg_locks.objsubid = $2
  AND pg_locks.classid = ($3 || SUBSTR(MD5($4 || $5 || "good_job_processes"."id"::text), $6, $7))::bit(32)::int
  AND pg_locks.objid = (($8 || SUBSTR(MD5($9 || $10 || "good_job_processes"."id"::text), $11, $12))::bit(64) << $13)::bit(32)::int
WHERE
  ("good_job_processes"."lock_type" = $14
    AND "pg_locks"."locktype" IS NULL
    OR "good_job_processes"."lock_type" IS NULL
    AND "good_job_processes"."updated_at" < $15)
ORDER BY
  "good_job_processes"."id" ASC
LIMIT
  $16

If so, here's the query plan for it:

[Image: screenshot of the EXPLAIN ANALYZE query plan]

@salmonsteak1
Author

Ah, after reading the README again, I think I might have overprovisioned my workers. Could I clarify: if I define GOOD_JOB_MAX_THREADS as 98 and leave --queue at the default, does that mean each executor will have 1 process with 98 threads (and also immediately create 98 connections)? Or is that just a maximum limit, with GoodJob "auto-provisioning" more threads as needed?

@francois

francois commented Feb 5, 2025

Like Ben said, --max-threads 98 means there will be 98 connections to the database and 98 OS threads processing jobs simultaneously. Due to the GVL (Global VM Lock), only 1 Ruby thread at a time can actually run Ruby code. The others will be sitting on their hands, waiting for the GVL to be released so they can acquire it.

You must scale out to multiple processes, each one handling just a portion of the jobs.

This also depends on how long each job lasts. Let's imagine that each job takes 1 second to complete. That means you would need 1 million seconds of execution time. Spread out over an hour, that works out to about 278 jobs executing simultaneously (1,000,000 / 3600 = 277.78). If your jobs last about 5 seconds, you need 5 times as many jobs processing simultaneously, or about 1,400 (5,000,000 / 3600 = 1388.89). Each one of those threads will require a database connection. Your PostgreSQL server will need to be beefy indeed to support that many connections.

In a nutshell, each process has many threads, and each thread can process one job at a time. At any given moment, only one thread in the process will be executing Ruby code; the others will be twiddling their thumbs. If your jobs are I/O-bound (not so easy to determine), the number of threads per process/worker could be increased, but not to "ridiculous" numbers like 98.

Going back to the 1 sec/job case, where we need 278 jobs executing simultaneously: with --max-threads 5, you would need 56 processes, spread out over however many machines you want.

It could be 56 machines each running 1 process, with each process running 5 threads (56 * 1 * 5 = 280 simultaneous jobs). Or it could be 12 machines, each running 10 processes, with each process running 3 threads (12 * 10 * 3 = 360 simultaneous jobs). In these last two sentences, you can replace "processes" with "workers": they mean the same thing. And the "simultaneous jobs" figure? That's the number of PG connections you'll need to support just for the workers.
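
As a quick back-of-the-envelope check of the arithmetic above (a throwaway sketch; all the numbers are the illustrative ones from this comment):

# Sizing sketch for the 1 sec/job scenario (illustrative numbers only).
jobs_per_hour       = 1_000_000
avg_job_seconds     = 1.0
threads_per_process = 5

concurrent_jobs  = (jobs_per_hour * avg_job_seconds) / 3600.0   # ~277.8 jobs in flight
processes_needed = (concurrent_jobs / threads_per_process).ceil # 56 processes

# Each in-flight thread holds one PostgreSQL connection, so the workers alone need roughly:
worker_connections = processes_needed * threads_per_process     # 280 connections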

I hope this helped!

Might I inquire as to what you're running that requires 1M jobs/hour?
