
[12.0][FIX] under pressure #131


Merged

merged 3 commits into OCA:12.0 from 12.0-under-pressure-sbi on Mar 28, 2019

Conversation

sbidoul (Member) commented Mar 8, 2019

Under pressure, i.e. when starting a job takes more than one second, the job runner requeues jobs it attempted to start. This reveals a race condition that was identified in a TODO that has existed since I first wrote the job runner. At the time it did not matter, because the requeueing of started jobs did not exist. Now it needs to be fixed.

To avoid interactions with the ORM cache, the approach is to lock the job record and ensure it is in the correct state before loading it and changing it to the started state in _try_perform_job.

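For illustration, here is a minimal sketch of that locking approach; the helper name _job_is_still_enqueued and the ENQUEUED constant are assumptions made for the example, not the actual queue_job code, and the SQL mirrors the query visible in the review below.

    # Minimal sketch (assumed names), not the merged implementation.
    ENQUEUED = "enqueued"  # assumed constant for the "enqueued" job state

    def _job_is_still_enqueued(env, job_uuid):
        # SELECT ... FOR UPDATE takes a row lock, so a concurrent worker that
        # reaches the same point blocks until this transaction commits.
        env.cr.execute(
            "SELECT state FROM queue_job "
            "WHERE uuid=%s AND state=%s "
            "FOR UPDATE",
            (job_uuid, ENQUEUED),
        )
        # No matching row means another worker already started (or requeued)
        # the job, so this run should simply be skipped.
        return bool(env.cr.fetchall())

With this check done in the same transaction that later sets the job to started, two workers can no longer both observe the job as enqueued.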
sbidoul (Member, Author) commented Mar 8, 2019

The visible effect of that race condition was the same job being run more than once in parallel.

sbidoul (Member, Author) commented Mar 8, 2019

@guewen this may be the root cause of #130, #41 and maybe #120

sbidoul added 2 commits on March 8, 2019:

NotReadableJobError was never raised. _load_job is not necessary because the job is now guaranteed to exist.

Although no issue has been observed in practice, it sounds safer to commit changes to the job state with the same env cr used to load it from the database.
@sbidoul sbidoul force-pushed the 12.0-under-pressure-sbi branch from d483695 to d3e6e92 on March 8, 2019 14:33
'instead of enqueued in /runjob',
job.uuid, job.state)
return

sbidoul (Member, Author):
This is where the race condition occurred: two simultaneous runs of the same job both reach this point in the enqueued state.

env.cr.execute(
    "SELECT state FROM queue_job "
    "WHERE uuid=%s AND state=%s "
    "FOR UPDATE",
guewen (Member):
Do we want to use NOWAIT and catch concurrent errors, to avoid locking the second worker?

sbidoul (Member, Author):
@guewen I thought about that and opted for the simpler code. My thinking is that the benefit is marginal (a nicer log in rare situations) compared to the more complex code. The lock will never be held for long (i.e. only until the job is started, which happens immediately afterwards).

guewen (Member):
Didn't realize it was only until it starts, better to keep it simple yes :)
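
For reference, a sketch of the NOWAIT alternative discussed in this thread, which is not what was merged; the helper name _try_lock_job_nowait is hypothetical and the error handling assumes psycopg2 is the database driver.

    import psycopg2
    from psycopg2 import errorcodes

    def _try_lock_job_nowait(env, job_uuid):
        # FOR UPDATE NOWAIT makes the second worker fail immediately instead
        # of blocking on the row lock held by the first worker.
        try:
            env.cr.execute(
                "SELECT state FROM queue_job "
                "WHERE uuid=%s AND state=%s "
                "FOR UPDATE NOWAIT",
                (job_uuid, "enqueued"),
            )
        except psycopg2.OperationalError as err:
            if err.pgcode == errorcodes.LOCK_NOT_AVAILABLE:
                # Another worker holds the lock; the current transaction is
                # now aborted and must be rolled back before the cursor is
                # reused.
                return False
            raise
        return bool(env.cr.fetchall())

As noted above, the extra handling mainly buys a nicer log message in rare situations, since the plain FOR UPDATE lock is only held until the job is started.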

sbidoul (Member, Author) commented Mar 28, 2019

This is working fine in production.

@guewen guewen merged commit 490f246 into OCA:12.0 Mar 28, 2019
@sbidoul sbidoul deleted the 12.0-under-pressure-sbi branch March 28, 2019 11:56