Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Does good_job hold on to advisory locks for finished jobs? #177

Closed
deepakinseattle opened this issue Dec 15, 2020 · 9 comments
Closed

Does good_job hold on to advisory locks for finished jobs? #177

deepakinseattle opened this issue Dec 15, 2020 · 9 comments
Labels
bug Something isn't working

Comments

@deepakinseattle
Copy link

First off, thanks for this Gem. We've been running it under a light workload in production for a few months now, and it's been terrific.

Second, a question, possibly looking for advise or to help troubleshoot. Our configuration/stack looks like the following

What Version
ruby ruby 2.7.2p137 (2020-10-01 revision 5445e04352) [x86_64-linux-musl]
rails 6.0.3.4
good_job 1.3.4
postgresql PostgreSQL 12.3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11), 64-bit
worker bundle exec good_job start #1 process, 5 threads
preserve_job_records GoodJob.preserve_job_records = true
connection pool pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 5 } %> #RAILS_MAX_THREADS is set to 5 as well

Yesterday we began processing a larger volume of jobs (~15K/hour), and a few hours after we started doing that, began to notice the following exception

ActiveRecord::StatementInvalid: PG::OutOfMemory: ERROR:  out of shared memory
HINT:  You might need to increase max_locks_per_transaction.

We bumped the max_locks_per_transactions to 1024 (from the default of 64) to buy us some time to troubleshoot.

Question : It appears GoodJob seems to keep advisory locks around for finished jobs. Is that expected? I would think that besides just the storage overhead, preserve_job_records wouldn't need to hold on to advisory locks.

Appendix

In Rails :

irb(main):001:0> GoodJob::Job.finished.first.advisory_locked?
=> true
irb(main):002:0> GoodJob::Job.finished.advisory_locked.count
=> 247742  #This query took about 3-4 seconds to perform:

Number of locks

database=> select locktype, count(locktype) from pg_locks group by locktype;
   locktype    | count
---------------+--------
 virtualxid    |      5
 object        |      1
 advisory      | 255268
 transactionid |      1
 relation      |     23
(5 rows)

Number of connections

database => select pid as process_id,
       usename as username,
       datname as database_name,
       client_addr as client_address,
       application_name,
       backend_start,
       state,
       state_change
from pg_stat_activity;
 process_id |     username      |   database_name   | client_address |        application_name        |         backend_start         |        state        |         state_change
------------+-------------------+-------------------+----------------+--------------------------------+-------------------------------+---------------------+-------------------------------
      26497 |                   |                   |                |                                | 2020-12-15 02:21:54.315069+00 |                     |
      26500 | redacted          |                   |                |                                | 2020-12-15 02:21:54.315814+00 |                     |
      26501 | redacted_redacted | redacted_redacted | 192.168.29.124 | /usr/local/bundle/bin/good_job | 2020-12-15 02:21:54.320826+00 | idle                | 2020-12-15 02:21:54.394067+00
      26502 | redacted_redacted | redacted_redacted | 192.168.29.124 | GoodJob::Notifier              | 2020-12-15 02:21:54.401358+00 | idle                | 2020-12-15 02:21:54.423085+00
      26789 | redacted_redacted | redacted_redacted | 192.168.29.124 | /usr/local/bundle/bin/good_job | 2020-12-15 02:21:55.35472+00  | active              | 2020-12-15 18:03:24.603505+00
      26803 | redacted_redacted | redacted_redacted | 192.168.29.124 | /usr/local/bundle/bin/good_job | 2020-12-15 02:21:55.547899+00 | active              | 2020-12-15 18:03:24.476213+00
      26804 | redacted_redacted | redacted_redacted | 192.168.29.124 | /usr/local/bundle/bin/good_job | 2020-12-15 02:21:55.85596+00  | idle in transaction | 2020-12-15 18:03:24.663866+00
      26912 | redacted          | redacted          | 127.0.0.1      | PostgreSQL JDBC Driver         | 2020-12-15 02:22:05.433505+00 | idle                | 2020-12-15 18:03:19.285195+00
      27044 | redacted          | redacted          | 127.0.0.1      |                                | 2020-12-15 02:22:07.521112+00 | idle                | 2020-12-15 18:03:24.342632+00
       5033 | redacted_redacted | redacted_redacted | 192.168.48.39  | psql                           | 2020-12-15 18:02:47.951493+00 | active              | 2020-12-15 18:03:24.623074+00
      30076 | redacted_redacted | redacted_redacted | 192.168.29.124 | /usr/local/bundle/bin/good_job | 2020-12-15 02:25:33.759964+00 | active              | 2020-12-15 18:03:24.587388+00
      26495 |                   |                   |                |                                | 2020-12-15 02:21:54.316586+00 |                     |
      26494 |                   |                   |                |                                | 2020-12-15 02:21:54.314331+00 |                     |
      26496 |                   |                   |                |                                | 2020-12-15 02:21:54.314747+00 |                     |
(14 rows)
@bensheldon
Copy link
Owner

Yikes! Advisory locks should only be held for the duration of each job. If you're seeing the advisory locks being held, that is a bug.

I'm fairly certain there are tests that assert that advisory locks are released. The job code that locks it is here:

def self.perform_with_advisory_lock
good_job = nil
result = nil
error = nil
unfinished.priority_ordered.only_scheduled.limit(1).with_advisory_lock do |good_jobs|
good_job = good_jobs.first
# TODO: Determine why some records are fetched without an advisory lock at all
break unless good_job&.owns_advisory_lock?
result, error = good_job.perform
end
[good_job, result, error] if good_job
end

And the advisory lock implementation of #with_advisory_lock is here:

def with_advisory_lock
raise ArgumentError, "Must provide a block" unless block_given?
records = advisory_lock.to_a
begin
yield(records)
ensure
records.each(&:advisory_unlock)
end
end
end

@bensheldon bensheldon added the bug Something isn't working label Dec 15, 2020
@deepakinseattle
Copy link
Author

Ok, thanks, Ben. I'll work on getting a vanilla rails app with a repro and get back to you.

@deepakinseattle
Copy link
Author

@bensheldon, I think I've got a pretty straightforward repro in this repo : https://github.com/deepakinseattle/good_job_locks_repro

It's a stock Rails 6.0.3.4 app with good_job 1.3.4, configured to preserve job records, and operate in the external execution mode

The job itself doesn't do much.

The README of the repo has the steps to reproduce the issue, but the TL;DR is


# Enqueue 100 jobs
irb(main):004:0> 100.times {|i| LocksReproJob.perform_later(SecureRandom.uuid)}

# Check in on them a minute later
irb(main):006:0> puts GoodJob::Job.all.count, GoodJob::Job.finished.count, GoodJob::Job.finished.advisory_locked.count
   (0.5ms)  SELECT COUNT(*) FROM "good_jobs"
   (0.3ms)  SELECT COUNT(*) FROM "good_jobs" WHERE "good_jobs"."finished_at" IS NOT NULL
   (1.0ms)  SELECT COUNT(*) FROM "good_jobs" LEFT JOIN pg_locks ON pg_locks.locktype = 'advisory' AND pg_locks.objsubid = 1 AND pg_locks.classid = ('x' || substr(md5('good_jobs' || "good_jobs"."id"::text), 1, 16))::bit(32)::int AND pg_locks.objid = (('x' || substr(md5('good_jobs' || "good_jobs"."id"::text), 1, 16))::bit(64) << 32)::bit(32)::int WHERE "good_jobs"."finished_at" IS NOT NULL AND "pg_locks"."locktype" IS NOT NULL
100
100
95

# Enqueue another 100 jobs
irb(main):007:0> 100.times {|i| LocksReproJob.perform_later(SecureRandom.uuid)}

# Check in on them a minute later
irb(main):008:0> puts GoodJob::Job.all.count, GoodJob::Job.finished.count, GoodJob::Job.finished.advisory_locked.count
   (0.5ms)  SELECT COUNT(*) FROM "good_jobs"
   (0.3ms)  SELECT COUNT(*) FROM "good_jobs" WHERE "good_jobs"."finished_at" IS NOT NULL
   (1.5ms)  SELECT COUNT(*) FROM "good_jobs" LEFT JOIN pg_locks ON pg_locks.locktype = 'advisory' AND pg_locks.objsubid = 1 AND pg_locks.classid = ('x' || substr(md5('good_jobs' || "good_jobs"."id"::text), 1, 16))::bit(32)::int AND pg_locks.objid = (('x' || substr(md5('good_jobs' || "good_jobs"."id"::text), 1, 16))::bit(64) << 32)::bit(32)::int WHERE "good_jobs"."finished_at" IS NOT NULL AND "pg_locks"."locktype" IS NOT NULL
200
200
186

Let me know if I can provide any more details or help in any other way.

@bensheldon
Copy link
Owner

bensheldon commented Dec 16, 2020

@deepakinseattle Thank you! This is really helpful.

You've helped track down one bug related to locked jobs not being unlocked when they are not preserved #178

But I'm curious because your repro is preserving jobs. I'll dig into that next.

@deepakinseattle
Copy link
Author

Great, thanks for the update, Ben!

@bensheldon
Copy link
Owner

bensheldon commented Dec 16, 2020

@deepakinseattle I found the problem: Postgres 12 changes the behavior of CTEs, which GoodJob uses to lock jobs.

I have PR #179 that addresses the problem (🎉 ) and I also was able to reproduce with your repo and verify that it is fixed (🎉🎉 ).

I can release a patch tomorrow.

@deepakinseattle
Copy link
Author

Great news, Ben, thanks so much! I'm happy to try out the patch and report back if it's helpful to you. I really appreciate the engagement and quick turn around!

@bensheldon
Copy link
Owner

@deepakinseattle I just published GoodJob v1.3.5 that includes the change. Hope it fixes the problem!

@deepakinseattle
Copy link
Author

Ben, tested it out in that vanilla rails app, and things look great. There are exactly 0 advisory locks after the jobs are finished.

Thanks again for the quick turnaround on this. I'd love to buy you a beer or coffee the next time I'm down in the Bay Area (post COVID).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants