
Race condition with concurrency control #378

Closed
antulik opened this issue Sep 15, 2021 · 3 comments · Fixed by #433
Labels
documentation Improvements or additions to documentation hacktoberfest Issues that are good for Hacktoberfest participants

Comments

antulik commented Sep 15, 2021

There is a race condition bug where the number of enqueued jobs can exceed the limit. I mentioned the bug in #366.

Steps to reproduce (tested on de2184b):

  1. Apply the patch below
  2. cd spec/test_app and bundle exec good_job start
  3. Repeat step 2, so that 2 processes are running
  4. Ensure the good_jobs table is empty (clear all rows if not)
  5. Wait for cron to enqueue jobs
  6. Observe 2 jobs enqueued while enqueue_limit: 1, perform_limit: 0

Expected: 1 job to be queued.

diff --git a/spec/test_app/config/application.rb b/spec/test_app/config/application.rb
--- a/spec/test_app/config/application.rb	(revision de2184b9c4e85a0bdfdf49b4c13cc15c2f36c4c8)
+++ b/spec/test_app/config/application.rb	(date 1631663503812)
@@ -20,6 +20,7 @@
     # config.middleware.insert_before Rack::Sendfile, ActionDispatch::DebugLocks
     config.log_level = :debug
 
+    config.good_job.enable_cron = true
     config.good_job.cron = {
       example: {
         cron: '*/5 * * * * *', # every 5 seconds
diff --git a/spec/test_app/app/jobs/example_job.rb b/spec/test_app/app/jobs/example_job.rb
--- a/spec/test_app/app/jobs/example_job.rb	(revision de2184b9c4e85a0bdfdf49b4c13cc15c2f36c4c8)
+++ b/spec/test_app/app/jobs/example_job.rb	(date 1631663953153)
@@ -1,10 +1,19 @@
 class ExampleJob < ApplicationJob
+  include GoodJob::ActiveJobExtensions::Concurrency
+
   ExpectedError = Class.new(StandardError)
   DeadError = Class.new(StandardError)
 
   retry_on DeadError, attempts: 3
 
+  good_job_control_concurrency_with(
+    enqueue_limit: 1,
+    perform_limit: 0,
+    key: -> { "key" }
+  )
+
   def perform(type = :success)
+    sleep(2)
     type = type.to_sym
 
     if type == :success

bensheldon commented Sep 15, 2021

Thanks for documenting this. I have an explanation for why it's happening, but unfortunately I don't have a solution, as I think it's inherent in why enqueue_limit was changed to be exclusive of perform_limit in #317.

Here's what's happening:

  1. Process 1: Cron does an enqueue_limit check, sees 0 jobs enqueued and 0 jobs performing, and enqueues Job 1.
  2. Process 1: The Scheduler fetches-and-locks the next record, which is Job 1. Job 1 is now locked and in a "performing" state.
  3. Process 2: Cron does an enqueue_limit check, sees 0 jobs enqueued and 1 job performing, and enqueues Job 2.
  4. Process 1: Within Job 1, Concurrency does a perform_limit check, sees that it's exceeded, and aborts Job 1 to retry later. Job 1 is unlocked and exits the "performing" state.
  5. ...there are now 2 jobs enqueued.
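The interleaving above can be replayed deterministically with a plain in-memory stand-in for the good_jobs table. The names here (jobs, enqueue_allowed) are illustrative, not GoodJob's actual internals; the point is only the check-then-act gap between the count and the insert:

```ruby
jobs = []  # each entry: { id: Integer, state: :enqueued or :performing }

# enqueue_limit is exclusive of performing jobs (per #317), so the
# check only counts jobs in the :enqueued state.
enqueue_limit = 1
enqueue_allowed = lambda do
  jobs.count { |j| j[:state] == :enqueued } < enqueue_limit
end

# 1. Process 1: cron sees 0 enqueued and enqueues Job 1
jobs << { id: 1, state: :enqueued } if enqueue_allowed.call

# 2. Process 1: the scheduler locks Job 1; it is now "performing"
jobs.find { |j| j[:id] == 1 }[:state] = :performing

# 3. Process 2: cron sees 0 enqueued (1 performing) and enqueues Job 2
jobs << { id: 2, state: :enqueued } if enqueue_allowed.call

# 4. Process 1: perform_limit (0) is exceeded, so Job 1 aborts and is
#    unlocked back to "enqueued" to retry later
jobs.find { |j| j[:id] == 1 }[:state] = :enqueued

enqueued = jobs.count { |j| j[:state] == :enqueued }
puts "enqueued jobs: #{enqueued}"  # 2, despite enqueue_limit: 1
```

Each individual check is correct at the moment it runs; it is the interleaving across processes that lets two inserts through.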

The problem lies in the perform_limit check, which happens after a job is already performing; it's abortive rather than preventative.

#317 happened to prevent a race condition on perform_limit's trailing edge, and what I think is being described here is a race condition on perform_limit's leading edge. It's troublesome because I think solving it will be technically complicated.

I will think some more about this; please tell me if what I wrote here helps you understand what's happening.


antulik commented Sep 15, 2021

@bensheldon thanks for explaining, it does make sense.

I can confirm it is correct. The number of extra jobs is linked to the number of worker threads.

I don't have any suggestions for fixing it, but I would recommend mentioning it in the README. E.g.:

Currently enqueue_limit does not guarantee an accurate limit. When jobs are enqueued in parallel (e.g. by cron), the job count can exceed the limit. If you need a strict limit, use total_limit instead.
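Concretely, the suggested workaround could look something like this. This is a sketch assuming GoodJob's total_limit option, which bounds enqueued and performing jobs together under a single count; the job class and key are illustrative:

```ruby
class ExampleJob < ApplicationJob
  include GoodJob::ActiveJobExtensions::Concurrency

  good_job_control_concurrency_with(
    total_limit: 1,          # enqueued + performing, checked as one count
    key: -> { "example-key" }
  )

  def perform
    # ...
  end
end
```

Because total_limit counts jobs in both states, the step-2-to-step-4 window above (where a job temporarily leaves the "enqueued" count while performing) no longer opens a gap for a second enqueue.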

@bensheldon bensheldon added hacktoberfest Issues that are good for Hacktoberfest participants documentation Improvements or additions to documentation labels Oct 1, 2021
bensheldon commented
@antulik thank you for working through this with me. I've updated the readme in #433 to document this behavior.
