
Return to using executor.wrap around Scheduler execution task #99

Merged
merged 3 commits into from
Aug 27, 2020

Conversation

bensheldon
Owner

@bensheldon bensheldon commented Aug 26, 2020

@sj26 I'm seeing some behavior indicating that #97 is losing advisory locks. Would you be able to take a look?

It looks similar to what's described here: rails/rails#26956 (comment)

Off the top of my head, the most potentially-surprising outcome I'd anticipate is that executor.wrap inside with_connection would cause the connection to be returned early

...at least, that's my hypothesis given that the Advisory Locks are connection-based, implying that the connection taking the advisory lock is a different connection than the one trying to release it (the WARNING: you don't own a lock of type ExclusiveLock).
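The suspected failure mode can be simulated in plain Ruby, with no Rails involved. Here a toy pool stands in for ActiveRecord's connection pool, and `release_all` stands in for what the executor's completion hooks do; the names are illustrative only, not GoodJob's actual code:

```ruby
# Toy connection pool: one "checked out" connection per caller, released
# en masse when the executor completes (mimicking the suspected behavior
# of executor.wrap finishing inside with_connection).
class FakePool
  def initialize
    @available = [:conn_a, :conn_b]
    @checked_out = nil
  end

  # Returns the currently checked-out connection, checking one out if needed.
  def current
    @checked_out ||= @available.shift
  end

  # What the executor's completion does: return the held connection to the pool.
  def release_all
    @available.push(@checked_out) if @checked_out
    @checked_out = nil
  end
end

pool = FakePool.new
lock_conn = pool.current     # this "connection" takes the advisory lock
pool.release_all             # executor.wrap completes mid-way and releases it
unlock_conn = pool.current   # the later unlock checks out a connection again

# The unlock lands on a different connection than the lock:
puts lock_conn == unlock_conn  # => false
```

With Postgres session-level advisory locks, that mismatch is exactly what produces `WARNING: you don't own a lock of type ExclusiveLock`.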

$ bin/rspec spec/integration/scheduler_spec.rb

Randomized with seed 533
.WARNING:  you don't own a lock of type ExclusiveLock
WARNING:  you don't own a lock of type ExclusiveLock
WARNING:  you don't own a lock of type ExclusiveLock
WARNING:  you don't own a lock of type ExclusiveLock
F

Failures:

  1) Schedule Integration when there are a large number of jobs pops items off of the queue and runs them
     Failure/Error: expect(RUN_JOBS.size).to eq number_of_jobs

       expected: 500
            got: 501

       (compared using ==)
     # ./spec/integration/scheduler_spec.rb:80:in `block (3 levels) in <top (required)>'
     # ./spec/support/reset_rails_queue_adapter.rb:4:in `block (2 levels) in <top (required)>'
     # ./spec/support/database_cleaner.rb:13:in `block (3 levels) in <top (required)>'
     # ./spec/support/database_cleaner.rb:12:in `block (2 levels) in <top (required)>'

Finished in 7.81 seconds (files took 3.96 seconds to load)
2 examples, 1 failure

Failed examples:

rspec ./spec/integration/scheduler_spec.rb:52 # Schedule Integration when there are a large number of jobs pops items off of the queue and runs them

Randomized with seed 533

@sj26
Contributor

sj26 commented Aug 26, 2020

Ah, hmm, perhaps the executor within activejob is checking-in the connection too early, now. Hmm. Perhaps we need to restore the executor around the perform block only, before it takes the advisory lock.

I'll take a look.

@bensheldon
Owner Author

bensheldon commented Aug 26, 2020

@sj26 I found something interesting/unfortunate:

  • It seems like using with_connection doesn't appropriately retain the same connection, leading to a situation where the advisory lock is owned by a different connection than the one that tries to unlock it.
  • Even when I go back to the old executor.wrap, there is occasionally a record returned from the query that isn't advisory-locked at all. This seems to happen consistently when the scheduler is running hot (1,000 jobs), so FYI there may be some intersecting issues here.

For the latter issue of the query not reliably returning an advisory locked record, I can patch an additional guard clause. It's not ideal with an extra database query, but it's good until I can dig more into that.
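The shape of that guard can be sketched in miniature. Here stand-in `Job` records stub out `owns_advisory_lock?`, which in the real gem would re-query the database to verify the current session holds the lock:

```ruby
# Stand-in job records; `locked` stubs whether this session owns the
# advisory lock (the real check would hit the database).
Job = Struct.new(:id, :locked) do
  def owns_advisory_lock?
    locked
  end
end

jobs = [Job.new(1, true), Job.new(2, false), Job.new(3, true)]

performed = []
jobs.each do |good_job|
  # The extra guard clause: stop rather than perform a job whose
  # advisory lock this session does not verifiably own.
  break unless good_job&.owns_advisory_lock?

  performed << good_job.id
end

puts performed.inspect  # => [1]
```

The cost is one extra database query per fetched record, which is the trade-off described above.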

I do feel confident that returning the executor.wrap addresses the former issue of connection retention.

I'm very fascinated by all of this and appreciate your efforts 🙌

@bensheldon bensheldon changed the title Demonstrate loss of advisory lock with with_connection Return to using executor.wrap around Scheduler execution task Aug 26, 2020
@sj26
Contributor

sj26 commented Aug 26, 2020

Sorry, I ran out of time to look at this yesterday. But I think what you're proposing here is correct.

ActiveRecord installs query cache handling into the executor:

https://github.com/rails/rails/blob/7101489293ece071013417aeed2da3aa1b0927c9/activerecord/lib/active_record/railtie.rb#L244

It releases connections when the executor completes, regardless of whether the connection was already held before the executor starts:

https://github.com/rails/rails/blob/7101489293ece071013417aeed2da3aa1b0927c9/activerecord/lib/active_record/query_cache.rb#L38-L46

So by the time the ActiveJob has returned we have lost the connection which took the advisory lock.

Double-wrapping the executor is harmless: the second wrap (done within ActiveJob) will simply yield:

https://github.com/rails/rails/blob/7101489293ece071013417aeed2da3aa1b0927c9/activesupport/lib/active_support/execution_wrapper.rb#L84
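That reentrancy can be demonstrated with a minimal wrapper modeled loosely on ActiveSupport::ExecutionWrapper (names here are illustrative, not the Rails implementation):

```ruby
# Minimal reentrant execution wrapper: if the current thread is already
# inside a wrap, the nested call just yields; completion hooks (such as
# releasing database connections) run only once, when the outer wrap ends.
class MiniExecutor
  @active = {}
  @completions = 0

  class << self
    attr_reader :completions

    def wrap
      return yield if @active[Thread.current] # nested: passthrough

      @active[Thread.current] = true
      begin
        yield
      ensure
        @active.delete(Thread.current)
        @completions += 1 # stand-in for the completion hooks
      end
    end
  end
end

MiniExecutor.wrap do     # outer wrap, e.g. around the Scheduler task
  MiniExecutor.wrap do   # inner wrap, e.g. inside ActiveJob: a no-op
    # perform the job
  end
end

puts MiniExecutor.completions  # => 1 (hooks ran once, at the very end)
```

So the connection is released only after the outer wrap completes, by which point the advisory lock has already been released on the same connection.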

I would push the executor wrap into the perform method rather than putting it in the future, for clarity about what is being wrapped and why, but this should fix the issue. 👍

This shouldn't reintroduce the deadlock because these perform methods only run for the duration of a job. The issue was the notifier being run in an executor which never returns. As long as we don't re-wrap the notifier I think we should be good. 🙌

@bensheldon
Owner Author

@sj26 I appreciate all of your help. Especially digging into what's happening here and all the improvements 🎉

I would push the executor wrap into the perform method rather than putting it in the future, for clarity about what is being wrapped and why,

I have it in the Scheduler because it felt closer to the responsibility of the thread to set up a proper execution environment, rather than the Performer or Job-level code to be aware that it was running inside of a thread. That said, I'll admit it's perhaps un-YAGNI thinking.

I think it's becoming clear that the querying/locking implementation may need to change, and that might drive my thinking and the code further.

@bensheldon bensheldon merged commit 1218e8f into main Aug 27, 2020
@bensheldon bensheldon deleted the db_connection_leak branch August 27, 2020 03:22
@sj26
Contributor

sj26 commented Aug 27, 2020

Nice, I tried this out locally with my async puma setup and code reload worked without deadlocks. 👍

@gadimbaylisahil
Contributor

break unless good_job&.owns_advisory_lock?

This change broke my implementation in an application with multi-tenancy enabled via Postgres schemas, using the https://github.com/rails-on-services/apartment gem.

I will also try to look into why the advisory lock isn't owned, or at least create an example application with the same setup.

Labels
bug Something isn't working