Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

transitioner: Fix race condition with file_upload_handler #3603

Merged
merged 1 commit into from
May 11, 2020

Conversation

stream1972
Copy link
Contributor

Description of the Change
Fixes race condition between transitoner and file_upload_handler, leading to following symptoms:

  • The workunit have two results successfully uploaded, but it's "stuck" and not requested to validate;
  • User are complaining "why it's not validating";
  • Eventually, workunit will be validated when what-should-be a "no reply" timeout on one of results expires. which may be days or even weeks.

The problem is caused by incorrect value of transition_time of affected workunit. One of possible, but not limited to, sequences of events leading to race condition in standard scenario (min_quorum=2) is:

  • Transitioner started processing a workunit which has one completed and one "in progress" result;
  • Transitioner did a database scan and recorded that WU have one good result;
  • A result which was "in progress" state has been uploaded;
  • File_upload_handler updated database with "transition_time=now" to signal changes;
  • Transitioner is not aware of these changes. It updates transition_time database field again, but in this case with transitioner's default value (in this scenario it'll be "no reply" timeout of ex-"in progress" result).
  • Finally, we have a workunit with two uploaded results but transition_time set far in the future.

The problem has been widely exposed on my Private GFN Server which have to deal with high amount of short tasks.

Solution
Since transitioner knows original transition_time of workunit it processes, check that it wasn't changed in progress. If it was, something has been updated in background and workunit will be scheduled for immediate retransition on the next scan. This operation must be atomic so it is done at database level (in SQL).

Alternate Designs

Release Notes

@codecov
Copy link

codecov bot commented Apr 15, 2020

Codecov Report

Merging #3603 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #3603   +/-   ##
=======================================
  Coverage   18.62%   18.62%           
=======================================
  Files         107      107           
  Lines        8793     8793           
  Branches     1545     1545           
=======================================
  Hits         1638     1638           
  Misses       7041     7041           
  Partials      114      114           

@TheAspens
Copy link
Member

Good catch on the bug. To be clear though it is not the file_upload_handler that updated transition time but it is the scheduler that updates the transition time.
Your changes look good. Thanks.

@TheAspens TheAspens merged commit 8ae81c1 into BOINC:master May 11, 2020
@davidpanderson
Copy link
Contributor

The validator also updates the transition time, possibly to a future time.
I think we need to do a "careful update" there too.

But there's a more basic problem: workunit.transition_time is an integer,
and things can happen in the same second.
We could reduce the error probability to near zero by changing this to a double
and use microsecond timing.

@stream1972 stream1972 deleted the stream_for_merge branch May 12, 2020 07:01
@stream1972
Copy link
Contributor Author

The validator also updates the transition time, possibly to a future time.
I think we need to do a "careful update" there too.

I didn't noticed problems related to race condition on transition time between transitioneer and validator, but similar problem really exist and happened few times on my server. Lucky it has lesser importance because it's almost unnoticeable by users and will auto-recover itself eventually.

If a validator and transitioneer are processing same workunit at same time, transitioner can see workunit in "logically broken" intermediate state when all results are validated and their state is updated, but workunit.canonical_resultid field is not updated yet (alas, server code almost never uses SQL transactions). Transitoneer have a safety check for this situation and will postpone processing of such WU to a fixed period, 7 or 10 days (don't remember exact value). It does not causes any user complains because workunit is validated normally. It just moves to "assimilated" state with a few days delay, but eventually will do this. May be changing transitioneer safety timeout to lesser value like 24 hours should be sufficient.

But there's a more basic problem: workunit.transition_time is an integer,
and things can happen in the same second.
We could reduce the error probability to near zero by changing this to a double
and use microsecond timing.

No. This field will be either "in the future" (then we wait until future arrives) either "in the past" (then we run transitoneer). Precision is not important here.

@AenBleidd AenBleidd added this to the Server Release 1.4.1 milestone Aug 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants