Trevor King comments
dofinn committed Jul 24, 2020
1 parent c760ac9 commit 4fb2944
Showing 1 changed file with 3 additions and 5 deletions.
8 changes: 3 additions & 5 deletions enhancements/maintenance/machine-maintenance-operator.md
@@ -101,15 +101,13 @@ AWS event types:
### machinemaintenance controller
The machinemaintenance controller will iterate through machine CRs and reconcile identified maintenances. It will be responsible for first validating the state of the cluster prior to executing anything on a target object. Initially, validation will check only whether the cluster is upgrading or a maintenance is already being performed. More use cases can be added as needed.

-If the cluster fails validation (for example is upgrading), the controller will requeue the object and process it again according to its `SyncPeriod` which would currently be proposed at 60 minutes.
+If the cluster fails validation (for example, it is already draining another machine), the controller will requeue the object and try again later.

After cluster validation, the controller will determine whether it is in a maintenance window in which it can execute maintenances (see open question 2). If no maintenance windows are defined, the controller will continue as if it were inside one. If the maintenance window logic is only applicable in OSD, the operator could check whether the cluster is "managed" before expecting these resources.

The event type is then read from the CR and resolved by either deleting the target machine CR (so that the machine-api creates a new one) or raising an alert for manual intervention (master maintenance scheduled).
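
One way to picture the order of these checks is the small decision function below. It is a sketch under stated assumptions, not the operator's implementation: `ClusterState`, `Action`, and `decide` are hypothetical names, and requeueing when outside a maintenance window is an assumption, since the proposal does not say what happens in that case.

```go
package main

import "fmt"

// Action is a hypothetical enumeration of what the controller does next.
type Action string

const (
	ActionRequeue       Action = "requeue"
	ActionDeleteMachine Action = "delete-machine"
	ActionRaiseAlert    Action = "raise-alert"
)

// ClusterState is a hypothetical summary of the checks described in the text.
type ClusterState struct {
	Upgrading             bool
	MaintenanceInProgress bool
	InMaintenanceWindow   bool
}

// decide applies the checks in the order the proposal describes:
// cluster validation, then the maintenance window, then event resolution.
func decide(state ClusterState, targetIsMaster bool) Action {
	// Cluster validation: requeue and try again later on failure.
	if state.Upgrading || state.MaintenanceInProgress {
		return ActionRequeue
	}
	// Assumption: outside a maintenance window the object is also requeued.
	if !state.InMaintenanceWindow {
		return ActionRequeue
	}
	// Master maintenance is not automated; it needs manual intervention.
	if targetIsMaster {
		return ActionRaiseAlert
	}
	// Deleting the machine CR lets the machine-api create a replacement.
	return ActionDeleteMachine
}

func main() {
	ready := ClusterState{InMaintenanceWindow: true}
	fmt.Println(decide(ClusterState{Upgrading: true, InMaintenanceWindow: true}, false))
	fmt.Println(decide(ready, true))
	fmt.Println(decide(ready, false))
}
```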

-This is a very UDP type of approach.
-
-The MMO could also store state of its actions in its on machinemaintenance CR that it would create from a machine CR.
+The MMO will also store state of its actions in its own machinemaintenance CR, which it would create from a machine CR.

### Example machinemaintenance CR

@@ -135,7 +133,7 @@ The MMO would then delete this CR after validating the target machine has been d
An alert will be raised for the following conditions:

* Unable to schedule maintenance during customer defined window and prior to cloud provider maintenance deadline
-* Post maintenance verification failed
+* Post maintenance validation failed
* Node does not return after shutdown
* Node unable to drain
* Node unable to shutdown