transitioner: Fix race condition with file_upload_handler #3603

stream1972 · 2020-04-15T07:05:53Z

Description of the Change
Fixes race condition between transitoner and file_upload_handler, leading to following symptoms:

The workunit have two results successfully uploaded, but it's "stuck" and not requested to validate;
User are complaining "why it's not validating";
Eventually, workunit will be validated when what-should-be a "no reply" timeout on one of results expires. which may be days or even weeks.

The problem is caused by incorrect value of transition_time of affected workunit. One of possible, but not limited to, sequences of events leading to race condition in standard scenario (min_quorum=2) is:

Transitioner started processing a workunit which has one completed and one "in progress" result;
Transitioner did a database scan and recorded that WU have one good result;
A result which was "in progress" state has been uploaded;
File_upload_handler updated database with "transition_time=now" to signal changes;
Transitioner is not aware of these changes. It updates transition_time database field again, but in this case with transitioner's default value (in this scenario it'll be "no reply" timeout of ex-"in progress" result).
Finally, we have a workunit with two uploaded results but transition_time set far in the future.

The problem has been widely exposed on my Private GFN Server which have to deal with high amount of short tasks.

Solution
Since transitioner knows original transition_time of workunit it processes, check that it wasn't changed in progress. If it was, something has been updated in background and workunit will be scheduled for immediate retransition on the next scan. This operation must be atomic so it is done at database level (in SQL).

Alternate Designs

Release Notes

codecov · 2020-04-15T07:37:49Z

Codecov Report

Merging #3603 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #3603   +/-   ##
=======================================
  Coverage   18.62%   18.62%           
=======================================
  Files         107      107           
  Lines        8793     8793           
  Branches     1545     1545           
=======================================
  Hits         1638     1638           
  Misses       7041     7041           
  Partials      114      114

TheAspens · 2020-05-11T14:50:50Z

Good catch on the bug. To be clear though it is not the file_upload_handler that updated transition time but it is the scheduler that updates the transition time.
Your changes look good. Thanks.

davidpanderson · 2020-05-12T00:12:47Z

The validator also updates the transition time, possibly to a future time.
I think we need to do a "careful update" there too.

But there's a more basic problem: workunit.transition_time is an integer,
and things can happen in the same second.
We could reduce the error probability to near zero by changing this to a double
and use microsecond timing.

stream1972 · 2020-05-12T07:27:47Z

The validator also updates the transition time, possibly to a future time.
I think we need to do a "careful update" there too.

I didn't noticed problems related to race condition on transition time between transitioneer and validator, but similar problem really exist and happened few times on my server. Lucky it has lesser importance because it's almost unnoticeable by users and will auto-recover itself eventually.

If a validator and transitioneer are processing same workunit at same time, transitioner can see workunit in "logically broken" intermediate state when all results are validated and their state is updated, but workunit.canonical_resultid field is not updated yet (alas, server code almost never uses SQL transactions). Transitoneer have a safety check for this situation and will postpone processing of such WU to a fixed period, 7 or 10 days (don't remember exact value). It does not causes any user complains because workunit is validated normally. It just moves to "assimilated" state with a few days delay, but eventually will do this. May be changing transitioneer safety timeout to lesser value like 24 hours should be sufficient.

But there's a more basic problem: workunit.transition_time is an integer,
and things can happen in the same second.
We could reduce the error probability to near zero by changing this to a double
and use microsecond timing.

No. This field will be either "in the future" (then we wait until future arrives) either "in the past" (then we run transitoneer). Precision is not important here.

transitioner: Fix race condition with file_upload_handler

b3fb4c1

TheAspens merged commit 8ae81c1 into BOINC:master May 11, 2020

stream1972 deleted the stream_for_merge branch May 12, 2020 07:01

AenBleidd added this to the Server Release 1.4.1 milestone Aug 14, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

transitioner: Fix race condition with file_upload_handler #3603

transitioner: Fix race condition with file_upload_handler #3603

stream1972 commented Apr 15, 2020

codecov bot commented Apr 15, 2020

TheAspens commented May 11, 2020

davidpanderson commented May 12, 2020

stream1972 commented May 12, 2020

transitioner: Fix race condition with file_upload_handler #3603

transitioner: Fix race condition with file_upload_handler #3603

Conversation

stream1972 commented Apr 15, 2020

codecov bot commented Apr 15, 2020

Codecov Report

TheAspens commented May 11, 2020

davidpanderson commented May 12, 2020

stream1972 commented May 12, 2020