
Race condition with concurrency control #378

Closed
antulik opened this issue Sep 15, 2021 · 3 comments · Fixed by #433
Labels
documentation Improvements or additions to documentation hacktoberfest Issues that are good for Hacktoberfest participants

Comments

antulik commented Sep 15, 2021

There is a race condition bug where the number of enqueued jobs can exceed the limit. I mentioned the bug in #366.

Steps to reproduce (tested on de2184b):

  1. Apply the patch below
  2. cd spec/test_app and bundle exec good_job start
  3. Repeat step 2, so that 2 processes are running
  4. Ensure the good_jobs table is empty (clear all rows if not)
  5. Wait for cron to enqueue jobs
  6. Observe 2 jobs enqueued while enqueue_limit: 1, perform_limit: 0

Expected: 1 job to be queued.

diff --git a/spec/test_app/config/application.rb b/spec/test_app/config/application.rb
--- a/spec/test_app/config/application.rb	(revision de2184b9c4e85a0bdfdf49b4c13cc15c2f36c4c8)
+++ b/spec/test_app/config/application.rb	(date 1631663503812)
@@ -20,6 +20,7 @@
     # config.middleware.insert_before Rack::Sendfile, ActionDispatch::DebugLocks
     config.log_level = :debug
 
+    config.good_job.enable_cron = true
     config.good_job.cron = {
       example: {
         cron: '*/5 * * * * *', # every 5 seconds
diff --git a/spec/test_app/app/jobs/example_job.rb b/spec/test_app/app/jobs/example_job.rb
--- a/spec/test_app/app/jobs/example_job.rb	(revision de2184b9c4e85a0bdfdf49b4c13cc15c2f36c4c8)
+++ b/spec/test_app/app/jobs/example_job.rb	(date 1631663953153)
@@ -1,10 +1,19 @@
 class ExampleJob < ApplicationJob
+  include GoodJob::ActiveJobExtensions::Concurrency
+
   ExpectedError = Class.new(StandardError)
   DeadError = Class.new(StandardError)
 
   retry_on DeadError, attempts: 3
 
+  good_job_control_concurrency_with(
+    enqueue_limit: 1,
+    perform_limit: 0,
+    key: -> { "key" }
+  )
+
   def perform(type = :success)
+    sleep(2)
     type = type.to_sym
 
     if type == :success

bensheldon commented Sep 15, 2021

Thanks for documenting this. I have an explanation for why it's happening, but unfortunately I don't have a solution, as I think it's inherent in why enqueue_limit was changed to be exclusive of perform_limit in #317.

Here's what's happening:

  1. Process 1: Cron does an enqueue_limit check, sees 0 jobs enqueued and 0 jobs performing, and enqueues Job 1.
  2. Process 1: The Scheduler fetches-and-locks the next record, which is Job 1. Job 1 is now locked and in a "performing" state.
  3. Process 2: Cron does an enqueue_limit check, sees 0 jobs enqueued and 1 job performing, and enqueues Job 2.
  4. Process 1: Within Job 1, Concurrency does a perform_limit check, sees that it's exceeded, and aborts Job 1 to retry later. Job 1 is unlocked and exits the "performing" state.
  5. ...there are now 2 jobs enqueued.
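The interleaving above can be replayed deterministically with a plain in-memory stand-in for the good_jobs table. The names here (jobs, enqueue_allowed) are illustrative, not GoodJob's actual internals; the point is only the check-then-act gap between the count and the insert:

```ruby
jobs = []  # each entry: { id: Integer, state: :enqueued or :performing }

# enqueue_limit is exclusive of performing jobs (per #317), so the
# check only counts jobs in the :enqueued state.
enqueue_limit = 1
enqueue_allowed = lambda do
  jobs.count { |j| j[:state] == :enqueued } < enqueue_limit
end

# 1. Process 1: cron sees 0 enqueued and enqueues Job 1
jobs << { id: 1, state: :enqueued } if enqueue_allowed.call

# 2. Process 1: the scheduler locks Job 1; it is now "performing"
jobs.find { |j| j[:id] == 1 }[:state] = :performing

# 3. Process 2: cron sees 0 enqueued (1 performing) and enqueues Job 2
jobs << { id: 2, state: :enqueued } if enqueue_allowed.call

# 4. Process 1: perform_limit (0) is exceeded, so Job 1 aborts and is
#    unlocked back to "enqueued" to retry later
jobs.find { |j| j[:id] == 1 }[:state] = :enqueued

enqueued = jobs.count { |j| j[:state] == :enqueued }
puts "enqueued jobs: #{enqueued}"  # 2, despite enqueue_limit: 1
```

Each individual check is correct at the moment it runs; it is the interleaving across processes that lets two inserts through.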

The problem lies in the perform_limit check, which happens after a job is already performing; it's abortive rather than preventative.

#317 happened to prevent a race condition on perform_limit's trailing edge, and what I think is being described here is a race condition on perform_limit's leading edge. It's troublesome because I think solving it will be technically complicated.

I will think some more about this; please tell me if what I wrote here helps you understand what's happening.


antulik commented Sep 15, 2021

@bensheldon thanks for explaining, it does make sense.

I can confirm it is correct. The number of extra jobs is linked to the number of worker threads.

I don't have any suggestions for fixing it, but I would recommend mentioning it in the README. E.g.:

Currently enqueue_limit does not guarantee an accurate limit. When jobs are enqueued in parallel (e.g. by cron), the job count can exceed the limit. If you need a strict limit, use total_limit instead.
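Concretely, the suggested workaround could look something like this. This is a sketch assuming GoodJob's total_limit option, which bounds enqueued and performing jobs together under a single count; the job class and key are illustrative:

```ruby
class ExampleJob < ApplicationJob
  include GoodJob::ActiveJobExtensions::Concurrency

  good_job_control_concurrency_with(
    total_limit: 1,          # enqueued + performing, checked as one count
    key: -> { "example-key" }
  )

  def perform
    # ...
  end
end
```

Because total_limit counts jobs in both states, the step-2-to-step-4 window above (where a job temporarily leaves the "enqueued" count while performing) no longer opens a gap for a second enqueue.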

@bensheldon bensheldon added hacktoberfest Issues that are good for Hacktoberfest participants documentation Improvements or additions to documentation labels Oct 1, 2021
bensheldon commented
@antulik thank you for working through this with me. I've updated the readme in #433 to document this behavior.
