Machine-maintenance operator proposal #341
Conversation
@cblecker @jharrington22 @jewzaam @jeremyeder @michaelgugino @derekwaynecarr Could you please provide feedback on this proposal, in particular on whom I have listed as reviewers/approvers. Many thanks.
I think there are some good ideas here, especially looking at the maintenance events in AWS. As noted on the mailing list, the machine-api already retrieves this data via the describeInstances call; we can put that information into the status field of the machine, an annotation, or what have you. The MMO can then react accordingly. IMO, just delete the machine that's going to go down for maintenance well ahead of time.

We also need to consider other cloud providers. For instance, GCP only makes this kind of data available via metadata, gives only a 60-second heads-up, and only if you've read that value from the metadata previously. In the GCP case, they live migrate, so for most hosts there should ideally be no issues. For GPU instances, they give a 60-minute warning and stop the instance; the instance will be recreated elsewhere on restart.

In both the AWS and GCP cases where an instance is stopped and recreated elsewhere, it's not clear what impact that will have on the kubelet. In particular, m5d instances provide NVMe drives which can be partitioned and used for /var/lib/containers (or whatever the mount is) for local container storage. If the contents of the root EBS volume expect there to be a bunch of items in /var/lib/containers, that might cause an issue. Furthermore, since it's a new instance, will the disks even get set up? Ignition is set to run only once. In any case, it's probably best to proactively delete the instance rather than hoping we can restart the host after the maintenance window is complete.
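The "delete well ahead of time" policy suggested above can be sketched as a small decision function. This is a minimal illustration, assuming the machine-api has already copied DescribeInstanceStatus scheduled events into some structure on the machine; the dict shape, field names, and 24-hour lead time are all hypothetical, not the real Machine CR schema.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical shape: one entry per machine, carrying any scheduled
# cloud events the machine-api copied from DescribeInstanceStatus.
def machines_to_delete(machines, now, lead_time=timedelta(hours=24)):
    """Return names of machines to proactively delete because a
    maintenance event starts within `lead_time`."""
    doomed = []
    for m in machines:
        for event in m.get("scheduled_events", []):
            if event["not_before"] - now <= lead_time:
                doomed.append(m["name"])
                break
    return doomed

now = datetime(2020, 1, 1, tzinfo=timezone.utc)
machines = [
    {"name": "worker-a", "scheduled_events": [
        {"code": "system-reboot", "not_before": now + timedelta(hours=6)}]},
    {"name": "worker-b", "scheduled_events": []},
    {"name": "worker-c", "scheduled_events": [
        {"code": "system-maintenance", "not_before": now + timedelta(days=7)}]},
]
print(machines_to_delete(machines, now))  # only worker-a is within 24h
```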
@michaelgugino Thank you for the feedback. It's becoming apparent that juggling the perspectives of running a single unmanaged cluster vs a fleet of managed clusters makes it difficult to define expectations for operators. I now agree with you on extending the machine-api to store maintenance event status. This keeps the core logic solid, and then something like the MMO can extend onto it and apply its own policies. Maybe the MMO would be best suited to strictly handling maintenance events that the machine-api would publish to the machine CRs. Any unhealthy nodes ("stopped" or "unhealthy behind LB") can be remediated with an MHC policy.
What would happen if the MMO detects a maintenance event while an upgrade is underway?
@jeremyeder -> The controller can re-queue the object and reconcile it again according to the SyncPeriod.
@jeremyary how does one detect an upgrade is underway?
assuming wrong Jeremy ^_^ @jeremyeder |
@michaelgugino we should be able to pull the status from PROGRESSING
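The PROGRESSING check mentioned above refers to the Progressing condition on the ClusterVersion object. A sketch of that check, with the object mocked as the JSON shape `oc get clusterversion -o json` returns (this is an illustration, not a client-go implementation):

```python
# Illustrative "is the cluster upgrading?" check against the
# ClusterVersion status conditions.
def cluster_is_upgrading(clusterversion):
    for cond in clusterversion.get("status", {}).get("conditions", []):
        if cond.get("type") == "Progressing":
            return cond.get("status") == "True"
    return False

cv = {"status": {"conditions": [
    {"type": "Available", "status": "True"},
    {"type": "Progressing", "status": "True",
     "message": "Working towards 4.x.y"},
]}}
print(cluster_is_upgrading(cv))  # True while the upgrade is underway
```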
I think this is an interesting idea; I can see it being useful for operators to better tune their OpenShift clusters. I added some suggestions and also some questions I have.
```
maintenance: "in-progress"
```
The MMO would then delete this CR after validating that the target machine has been deleted and a new one created, indicating that the maintenance is completed.
would there be any way to know that the maintenance had completed after the CR is deleted?
my concern is that the maintenance would complete and then the deletion happens, and due to timing something might miss that signal. In these cases is there some backup or audit trail (perhaps events)?
> would there be any way to know that the maintenance had completed after the CR is deleted?
Make the exact same query to find any scheduled events for the target node. If the result != nil, either the maintenance didn't clear or a new one is now there.
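The re-query verification described above can be sketched as follows. `query_scheduled_events` is a hypothetical stand-in for the provider call (e.g. the AWS describeInstances path mentioned earlier); a non-empty result means the maintenance either did not clear or a new one has been scheduled.

```python
# Sketch: after the MachineMaintenance CR is deleted, ask the provider
# again whether anything is still scheduled for the target node.
def maintenance_cleared(query_scheduled_events, node):
    events = query_scheduled_events(node)
    return not events  # nil/empty result means the maintenance cleared

# Fake provider responses for illustration:
responses = {"worker-a": [], "worker-b": [{"code": "system-reboot"}]}
lookup = lambda node: responses.get(node, [])
print(maintenance_cleared(lookup, "worker-a"))  # True: nothing scheduled
print(maintenance_cleared(lookup, "worker-b"))  # False: event still present
```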
Co-authored-by: Michael McCune <msm@opbstudios.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: dofinn. The full list of commands accepted by this bot can be found here.
/retitle Machine-maintenance operator proposal Unpacking the current title's "MMO" acronym.
### Implementation Details/Notes/Constraints [optional]

Constraints:
* This implementation will require the machine-api to query cloud providers for scheduled maintenances and publish them in the machine's CR.
Seems like the current proposal records these in `MachineMaintenance` resources, instead of writing to the `Machine` resource. Makes sense to me, because you don't want to be fighting the machine-API trying to write to the same `Machine` resource, but probably need to update this line to talk about "the machine-maintenance resource".
Thanks. Resolved: c29b1c0
* GCP only allows maintenances to be queried from the node itself -> `curl http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event -H "Metadata-Flavor: Google"`
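Because the metadata endpoint above is only reachable from the node, a node-local poll would be needed. A sketch, with the HTTP fetch stubbed out so the logic is testable off-node; the real call would GET the URL above with the `Metadata-Flavor: Google` header (and could use `?wait_for_change=true` to block until the value changes).

```python
# Sketch of a node-local check of the GCP maintenance-event metadata key.
# GCP reports NONE when nothing is scheduled, or a value such as
# MIGRATE_ON_HOST_MAINTENANCE / TERMINATE_ON_HOST_MAINTENANCE when an
# event is imminent. `fetch` is a stub for the metadata HTTP GET.
def pending_maintenance(fetch):
    return fetch() != "NONE"

print(pending_maintenance(lambda: "NONE"))                           # False
print(pending_maintenance(lambda: "TERMINATE_ON_HOST_MAINTENANCE"))  # True
```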
This operator will iterate through the machineList{} and inspect each machine CR for scheduled maintenances. If a maintenance is found, the controller will validate the state of the cluster prior to performing any maintenance. For example: is the cluster upgrading? Is the cluster already performing a maintenance?
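The validation walk described above could be sketched as a pure function over the machine list. All names here are illustrative; the two pre-flight checks mirror the ones the proposal names (cluster upgrading, maintenance already in progress), and failing machines are requeued rather than acted on.

```python
# Sketch: iterate the machine list and only act on a scheduled
# maintenance if the cluster passes the pre-flight checks.
def reconcile(machines, cluster_upgrading, maintenance_in_progress):
    actions = []
    for m in machines:
        if not m.get("scheduled_maintenance"):
            continue  # nothing to do for this machine
        if cluster_upgrading or maintenance_in_progress:
            actions.append(("requeue", m["name"]))  # retry after SyncPeriod
        else:
            actions.append(("remediate", m["name"]))
            maintenance_in_progress = True  # one maintenance at a time
    return actions

machines = [{"name": "worker-a", "scheduled_maintenance": True},
            {"name": "worker-b", "scheduled_maintenance": True}]
print(reconcile(machines, cluster_upgrading=False,
                maintenance_in_progress=False))
# the first machine is remediated; the second waits its turn
```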
"is the cluster upgrading" should be orthogonal. The limiting conditions are "are we already attempting to cordon/drain another machine in this pool?" and "can we cordon and drain this machine?". Workloads on the machines should be protected by PDBs and such to prevent eviction that would impact cluster health or provided services.
Thanks. Resolved: 4556f9e
These processes will only hold true for infra and worker roles of machines within the cluster.

If a scheduled maintenance is detected for a master node, an alert should be raised.
Why is this? Does scheduled maintenance wipe the current disk or something?
Admins should be aware that a provider maintenance is planned for a control plane node? Give them a chance to be proactive.
Not every cloud provider has the same kinds of maintenance events, nor do they all cause the same degradation on the cluster. Some of the features may need to be toggleable per provider.

Example: [AWS maintenance](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-instances-status-check_sched.html#types-of-scheduled-events) vs [GCP maintenance](https://cloud.google.com/compute/docs/storing-retrieving-metadata#maintenanceevents).
Please explain in the enhancement text the points that readers are expected to take from this comparison.
Thanks. Resolved: c760ac9
### machinemaintenance controller
The machinemaintenance controller will iterate through machine CRs and reconcile identified maintenances. It will be responsible for first validating the state of the cluster before executing anything on a target object. Initially, validation will check only whether the cluster is upgrading or whether a maintenance is already being performed; more use-cases can be added as seen fit.

If the cluster fails validation (for example, it is upgrading), the controller will requeue the object and process it again according to its `SyncPeriod`, currently proposed at 60 minutes.
I think "60 minutes" is overly-specific for an enhancement proposal. Can we just say "... process it again later." and leave the duration as an internal implementation detail?
Thanks. Resolved: 4fb2944
The event type is then sourced from the CR and resolved by either deleting the target machine CR (so the machine-api creates a new one) or raising an alert for manual intervention (master maintenance scheduled).

This is a very UDP type of approach.
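The delete-or-alert resolution just described amounts to a small dispatch on the machine's role. A sketch, with illustrative role names and return values: worker/infra machines get their Machine CR deleted (the machine-api then creates a replacement), while masters only produce an alert for manual intervention.

```python
# Sketch of the resolution step: delete worker/infra machines ahead of
# maintenance, alert on masters. Names and tuples are illustrative.
def resolve(machine):
    if machine["role"] in ("worker", "infra"):
        return ("delete-machine", machine["name"])
    return ("alert", machine["name"])  # master: manual intervention

print(resolve({"name": "worker-a", "role": "worker"}))
print(resolve({"name": "master-0", "role": "master"}))
```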
It's not clear to me what this comparison adds. If you want to talk about a particular feature of this approach, can you talk about it directly in the enhancement text, instead of referencing another protocol and leaving it to the reader to look for parallels?
Removed protocol reference.
An alert will be raised for the following conditions:

* Unable to schedule maintenance during customer-defined window and prior to cloud provider maintenance deadline
* Post maintenance verification failed
Do you define post-maintenance verification in the enhancement? If so, I'm having trouble finding it.
* Post maintenance verification failed
* Node does not return after shutdown
* Node unable to drain
* Node unable to shutdown
I would expect drain/shutdown to be MachineHealthCheck / machine-API operator concerns that a machine-maintenance operator would not need to handle directly.
MMO would only alert if it was the process that triggered the drain. Nodes being unable to perform drain/shutdown would require an alert for manual intervention.
/lifecycle frozen
This proposal is over a year old. As part of recent efforts to clean up old pull requests, I am removing the life-cycle/frozen label to allow it to age out and be closed. If the proposal is still active, please restore the label. /remove-lifecycle frozen
@openshift-bot: Closed this PR.