Skip to content
This repository has been archived by the owner on Jan 30, 2020. It is now read-only.

agent,engine: fix bugs in rescheduling for replaced units #1698

Merged

Conversation

dongsupark
Copy link
Contributor

So far in engine reconciler, decide() returns a bool variable to determine whether it's possible to schedule or not. Reschedule was not handled in a proper way.

So let's create a new tri-state for job.JobAction, i.e. JobActionSchedule, JobActionUnschedule, and JobActionReschedule. And return the tri-state variable instead of bool. That way AbleToRun() can check return values from HasConflict() and HasReplace() in a correct manner.

With that, we can fix bugs in rescheduling for replaced units. While the 'Replaces' option has been supported since #1572, the engine didn't actually unschedule units to be replaced. It was a bug.

So let's implement GetReplacedUnit() to expose the replaced unit from AgentState to the engine reconciler. And make the engine reconciler unschedule the replaced unit, and schedule the current unit. The engine scheduler's decision structure needs to also track if the unit needs to be rescheduled, to be used by the scheduling path.

@dongsupark dongsupark self-assigned this Nov 1, 2016
@dongsupark dongsupark force-pushed the dongsu/engine-fix-replace-reschedule branch 2 times, most recently from 30e9570 to aac597e Compare November 3, 2016 18:09
Dongsu Park added 3 commits November 4, 2016 10:52
So far decide() returns a bool variable to determine whether it's
possible to schedule or not. Reschedule was not handled in a proper way.
So let's create a new tri-state for job.JobAction, i.e.
JobActionSchedule, JobActionUnschedule, and JobActionReschedule.
And return the tri-state variable instead of bool.

That way AbleToRun() can check return values from HasConflict() and
HasReplace() in a correct manner.
While the 'Replaces' option has been supported since
coreos#1572, the engine didn't actually
unschedule units to be replaced. It was a bug.

So let's implement GetReplacedUnit() to expose the replaced unit from
AgentState to the engine reconciler. And make the engine reconciler
unschedule the replaced unit, and schedule the current unit.
The engine scheduler's decision structure needs to have a helper for
the rescheduling case, by simply scheduling the replaced unit to a
free machine.
Now that the AbleToRun() returns JobAction instead of bool,
unit tests also need to be changed.
@dongsupark dongsupark force-pushed the dongsu/engine-fix-replace-reschedule branch 2 times, most recently from a060f9d to a78572a Compare November 4, 2016 10:22
@dongsupark
Copy link
Contributor Author

Updated.
Refined the scheduling logic, updated functional tests to do correct tests for unit replacement.
As it's about fixing bugs, I'm going to merge it soon, on Monday next week at latest.

@dongsupark dongsupark closed this Nov 4, 2016
@dongsupark dongsupark reopened this Nov 4, 2016
@dongsupark
Copy link
Contributor Author

Oops, I didn't mean to close. It was just a bug in the github UI displaying nothing even after writing a comment.

For correct tests, TestScheduleReplace() now does the following tests.

* Start 4 units. replace.0 on m0, replace.[12] on m1, and replace-kick0
  on m0. Doing that we could trigger a situation of unit replacement.
* Ensure that machine of kick0 unit is not the same as that of unit 0,
  also that machine of kick0 is the same as the original machine m0.
* Make use of WaitForState() to avoid races with unit state publisher.
@dongsupark dongsupark force-pushed the dongsu/engine-fix-replace-reschedule branch from a78572a to f382b13 Compare November 4, 2016 11:25
@dongsupark dongsupark merged commit 2b2eda2 into coreos:master Nov 7, 2016
@dongsupark dongsupark deleted the dongsu/engine-fix-replace-reschedule branch November 7, 2016 10:44
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant