[12.0][FIX] under pressure #131
Conversation
Under pressure, i.e. when starting a job takes more than one second, the jobrunner requeues the jobs it attempted to start. This reveals a race condition that was identified in a TODO that has existed since I first wrote the job runner. At the time it did not matter, because requeueing of started jobs did not exist. Now it needs to be fixed. To avoid interactions with the ORM cache, the approach is to lock the job record and ensure it is in the correct state before loading it and setting it to the started state in _try_perform_job.
The visible effect of that race condition was the same job being run more than once in parallel.
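To make the approach concrete, here is a minimal sketch of the lock-then-check pattern, written against a bare database cursor rather than the actual addon code; the SQL matches the hunk reviewed below, while the helper name acquire_enqueued_job and its surroundings are hypothetical.

```python
# Minimal sketch, assuming a psycopg2-style cursor on the queue_job table.
# acquire_enqueued_job is a made-up name used only for this illustration.
def acquire_enqueued_job(cr, job_uuid):
    """Lock the job row and verify it is still 'enqueued'.

    Returns True when the row was locked in the expected state, False when a
    concurrent run already moved it on (or the job no longer exists).
    """
    cr.execute(
        "SELECT state FROM queue_job "
        "WHERE uuid=%s AND state=%s "
        "FOR UPDATE",
        (job_uuid, "enqueued"),
    )
    # No row back means another worker won the race: the state already changed
    # or the job is gone, so this run must not start it a second time.
    return cr.fetchone() is not None
```

Because FOR UPDATE waits for the competing transaction to commit and then re-evaluates the WHERE clause, the losing worker sees that the state is no longer enqueued, gets no row back, and skips the job instead of running it a second time.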
NotReadableJobError was never raised. _load_job is not necessary because the job is now guaranteed to exist.
Although no issue has been observed in practice, it sounds safer to commit changes to the job state with the same env.cr that was used to load it from the database.
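As a rough illustration of that point (not the PR's actual code): the state change and the commit both go through the same env.cr that locked the row. The start_job helper below is hypothetical and reuses the acquire_enqueued_job sketch above.

```python
# Hedged sketch only: 'env' stands for an Odoo environment whose cursor
# (env.cr) already loaded and locked the job row; start_job is a made-up
# name, not a function of the queue_job addon.
def start_job(env, job_uuid):
    if not acquire_enqueued_job(env.cr, job_uuid):
        return  # another run already took the job
    env.cr.execute(
        "UPDATE queue_job SET state=%s WHERE uuid=%s",
        ("started", job_uuid),
    )
    # Committing on the same cursor releases the row lock and publishes the
    # 'started' state in one step, so no other cursor can observe a stale
    # 'enqueued' state once the lock is gone.
    env.cr.commit()
```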
Force-pushed from d483695 to d3e6e92.
'instead of enqueued in /runjob',
job.uuid, job.state)
return
This is where the race condition occurred: two simultaneous runs of the same job both reach this point in the enqueued state.
env.cr.execute(
    "SELECT state FROM queue_job "
    "WHERE uuid=%s AND state=%s "
    "FOR UPDATE",
Do we want to use NOWAIT and catch the resulting concurrency errors, to avoid blocking the second worker?
@guewen I thought about that and opted for the simpler code. My thinking is that the benefit is marginal (a nicer log message in rare situations) in exchange for more complex code. The lock will never be held for long (i.e. only until the job is started, which happens immediately after).
I didn't realize the lock is only held until the job starts; better to keep it simple then, yes :)
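For reference, the NOWAIT variant weighed above (and not adopted) would look roughly like this sketch; the function name is hypothetical. psycopg2 reports a failed NOWAIT as an OperationalError with SQLSTATE 55P03, and the failed statement aborts the transaction, so the caller would also have to roll back.

```python
# Sketch of the rejected alternative: fail fast instead of waiting for the
# other worker's short-lived lock. try_acquire_enqueued_job_nowait is a
# made-up name used only for this illustration.
import psycopg2
from psycopg2 import errorcodes


def try_acquire_enqueued_job_nowait(cr, job_uuid):
    try:
        cr.execute(
            "SELECT state FROM queue_job "
            "WHERE uuid=%s AND state=%s "
            "FOR UPDATE NOWAIT",
            (job_uuid, "enqueued"),
        )
    except psycopg2.OperationalError as err:
        if err.pgcode == errorcodes.LOCK_NOT_AVAILABLE:
            # Another worker holds the lock right now: log nicely and give up.
            # The caller must roll back, since the error aborted the transaction.
            return False
        raise
    return cr.fetchone() is not None
```

Compared to the plain FOR UPDATE, this only buys a nicer log line in the rare contended case, at the cost of the extra error handling and rollback, which is the trade-off described above.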
This is working fine in production.