
[12.0][FIX] under pressure #131


Merged

merged 3 commits into OCA:12.0 from 12.0-under-pressure-sbi on Mar 28, 2019

Conversation

sbidoul (Member) commented Mar 8, 2019

Under pressure, i.e. when starting a job takes more than one second, the job runner requeues jobs it attempted to start. This reveals a race condition that was identified in a TODO that has existed since I first wrote the job runner. At the time it did not matter, because the requeueing of started jobs did not exist. Now it needs to be fixed.

To avoid interactions with the ORM cache, the approach is to lock the job record and ensure it is in the correct state before loading it and changing it to the started state in _try_perform_job.

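For illustration, here is a minimal sketch of that locking approach; the helper name _job_is_still_enqueued and the ENQUEUED constant are assumptions made for the example, not the actual queue_job code, and the SQL mirrors the query visible in the review below.

    # Minimal sketch (assumed names), not the merged implementation.
    ENQUEUED = "enqueued"  # assumed constant for the "enqueued" job state

    def _job_is_still_enqueued(env, job_uuid):
        # SELECT ... FOR UPDATE takes a row lock, so a concurrent worker that
        # reaches the same point blocks until this transaction commits.
        env.cr.execute(
            "SELECT state FROM queue_job "
            "WHERE uuid=%s AND state=%s "
            "FOR UPDATE",
            (job_uuid, ENQUEUED),
        )
        # No matching row means another worker already started (or requeued)
        # the job, so this run should simply be skipped.
        return bool(env.cr.fetchall())

With this check done in the same transaction that later sets the job to started, two workers can no longer both observe the job as enqueued.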
sbidoul (Member, Author) commented Mar 8, 2019

The visible effect of that race condition was the same job being run more than once in parallel.

sbidoul (Member, Author) commented Mar 8, 2019

@guewen this may be the root cause of #130, #41 and maybe #120

sbidoul added 2 commits on March 8, 2019:

NotReadableJobError was never raised. _load_job is not necessary because the job is now guaranteed to exist.

Although no issue has been observed in practice, it sounds safer to commit changes to the job state with the same env cr used to load it from the database.
@sbidoul sbidoul force-pushed the 12.0-under-pressure-sbi branch from d483695 to d3e6e92 on March 8, 2019 14:33
'instead of enqueued in /runjob',
job.uuid, job.state)
return

sbidoul (Member, Author):
This is where the race condition occurred: two simultaneous runs of the same job both reach this point in the enqueued state.

env.cr.execute(
    "SELECT state FROM queue_job "
    "WHERE uuid=%s AND state=%s "
    "FOR UPDATE",
guewen (Member):
Do we want to use NOWAIT and catch concurrent errors, to avoid locking the second worker?

sbidoul (Member, Author):
@guewen I thought about that and opted for the simpler code. My thinking is that the benefit is marginal (a nicer log in rare situations) compared to the more complex code. The lock will never be held for long (i.e. only until the job is started, which happens immediately afterwards).

guewen (Member):
Didn't realize it was only until it starts, better to keep it simple yes :)
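
For reference, a sketch of the NOWAIT alternative discussed in this thread, which is not what was merged; the helper name _try_lock_job_nowait is hypothetical and the error handling assumes psycopg2 is the database driver.

    import psycopg2
    from psycopg2 import errorcodes

    def _try_lock_job_nowait(env, job_uuid):
        # FOR UPDATE NOWAIT makes the second worker fail immediately instead
        # of blocking on the row lock held by the first worker.
        try:
            env.cr.execute(
                "SELECT state FROM queue_job "
                "WHERE uuid=%s AND state=%s "
                "FOR UPDATE NOWAIT",
                (job_uuid, "enqueued"),
            )
        except psycopg2.OperationalError as err:
            if err.pgcode == errorcodes.LOCK_NOT_AVAILABLE:
                # Another worker holds the lock; the current transaction is
                # now aborted and must be rolled back before the cursor is
                # reused.
                return False
            raise
        return bool(env.cr.fetchall())

As noted above, the extra handling mainly buys a nicer log message in rare situations, since the plain FOR UPDATE lock is only held until the job is started.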

sbidoul (Member, Author) commented Mar 28, 2019

This is working fine in production.

@guewen guewen merged commit 490f246 into OCA:12.0 Mar 28, 2019
@sbidoul sbidoul deleted the 12.0-under-pressure-sbi branch March 28, 2019 11:56