Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Machine-maintenance operator proposal #341

Closed
wants to merge 17 commits into from
Closed
Changes from 12 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
237 changes: 237 additions & 0 deletions enhancements/maintenance/machine-maintenance-operator.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,237 @@
---
title: machine-maintenance-operator
authors:
- "@dofinn"
reviewers:
- "@cblecker"
- "@jharrington22"
- "@jewzaam"
- "@jeremyeder"
- "@michaelgugino"
- "@derekwaynecarr"
approvers:
- "@michaelgugino"
- "@derekwaynecarr"
- "@bison"
- "@enxebre"
creation-date: 2020-05-26
last-updated: 2020-05-28
status: provisional

# machine-maintenance-operator

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Open Questions
>1. How applicable is this use-case to baremetal installations?

>2. The ability to declare maintenance windows is currently being discussed in OSD. Nothing has been designed in concrete yet, meaning that this proposal has an undesigned dependency. However, development could commence on this operator and have the maintenance functionality retroactively fitted. Is this a blocker for its acceptance?

>3. How applicable will this be to a cloud provider like GCP that performs live migrations of their VMs?

## Summary

This enhancement proposal explores the idea of having a machine-maintenance-operator(MMO) that will inspect each machine CR for scheduled events for each infra/worker machine in the cluster. Upon finding a maintenance for a machine, the MMO will seek to execute this maintenance manually at the earliest convenient time as defined by an administrator.

A POC of the idea can be found [here](https://github.com/dofinn/machine-maintenance-operator).

## Motivation

Currently the machine-api marks a machine as stopped when it has been terminated/stopped by the cloud provider. A provider may terminate/stop an instance when there is an associated maintenance scheduled for it. This is also the case for users manually terminating/stopping machines via the console. The MMO is a proactive approach to executing the required action (for example, delete target machine CR) to enable the machine-api to manage machinesets. This operator would work in conjunction with MachineHealthCheck implementation.

## Requirements
That the machine-api collects scheduled maintenances from cloud providers and posts them in the status of each machine's CR

### Goals

List the specific goals of the proposal. How will we know that this has succeeded?
The machine-maintenance-operator will:
* detect maintenances from machine CRs and execute them manually at their earliest and most convenient time

### Non-Goals

This is not a solution for managing maintenances or machine state for master machine roles.

## Proposal

### User Stories [optional]

#### Story 1
As a customer using OKD/OCP/OSD, I want my cluster/s to be proactive in handling events that are initiated by the cloud provider hosting my cluster. This will assure me that OKD/OCP/OSD accounts for machines in the cluster that require terminating or stopping to be proactively maintained and addressed with intention.

### Implementation Details/Notes/Constraints [optional]

Constraints:
* This implementation will require the machine-api to query cloud providers for scheduled maintenances and publish them in the machine's CR.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the current proposal records these in MachineMaintenance resources, instead of writing to the Machine resource. Makes sense to me, because you don't want to be fighting the machine-API trying to write to the same Machine resource, but probably need to update this line to talk about "the machine-maintenance resource".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Resolved: c29b1c0

* GCP only allows maintenances to be queried from the node itself -> `curl http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event -H "Metadata-Flavor: Google"`

This operator will iterate through the machineList{} and inspect each machine CR for scheduled maintenances. If a maintenance is found, the controller will validate the state of the cluster prior to performing any maintenance. For example; is the cluster upgrading? is the cluster already performing a maintenance?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"is the cluster upgrading" should be orthogonal. The limiting conditions are "are we already attempting to cordon/drain another machine in this pool?" and "can we cordon and drain this machine?". Workloads on the machines should be protected by PDBs and such to prevent eviction that would impact cluster health or provided services.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Resolved: 4556f9e


These processes will only hold true for infra and worker roles of machines within the cluster.

If a scheduled maintenance is detected for a master node, an alert should be raised.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this? Does scheduled maintenance wipe the current disk or something?

Copy link
Contributor Author

@dofinn dofinn Jul 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Admins should be aware that a provider maintenance is planned for a control plane node? Give them a chance to be proactive.


Not every cloud provider has the same kind of maintenances that inherit that same degradation on the cluster. Some of the features may need to be toggleable per provider.
dofinn marked this conversation as resolved.
Show resolved Hide resolved

Example: (AWS maintenance)[https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-instances-status-check_sched.html#types-of-scheduled-events] vs (GCP maintenance)[https://cloud.google.com/compute/docs/storing-retrieving-metadata#maintenanceevents].
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please explain in the enhancement text the points that readers are expected to take from this comparison.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Resolved: c760ac9



### Risks and Mitigations

## Design Details

AWS event types:
* instance-stop
* instance-retirement
* instance-reboot
* system-reboot
* system-maintenance

### machinemaintenance controller
The machinemaintenance controller will iterate through machine CRs and reconcile identified maintenances. It will be responsible for first validating the state of the cluster prior to executing anything on a target object. Validating the state of the cluster will include will initially check for only is the cluster upgrading or is a maintenance already being performed. More use-cases can be added as seen fit.

If the cluster fails validation (for example is upgrading), the controller will requeue the object and process it again according to its `SyncPeriod` which would currently be proposed at 60 minutes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "60 minutes" is overly-specific for an enhancement proposal. Can we just say "... process it again later." and leave the duration as an internal implementation detail?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Resolved: 4fb2944


After cluster validation, the controller will ascertain if its in a maintenance window where it can execute maintenances (See open question 2). In the case that no maintenance windows are defined, the controller will continue as true. If the maintenance window logic is only applicable in OSD, the operator could validate if its "managed" prior to expecting these resources.

The event type is then sourced from the CR and then is resolved by either deleting a target machine CR (so the machine-api creates a new one) or raising an alert for manual intervention (master maintenance scheduled).

This is a very UDP type of approach.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear to me what this comparison adds. If you want to talk about a particular feature of this approach, can you talk about it directly in the enhancement text, instead of referencing another protocol and leaving it to the reader to look for parallels.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed protocol reference.


The MMo could also look to store the state of its actions in its own machinemaintenance CR that it would create from a machine CR.

### Example machinemaintenance CR

```
apiVersion: machinemaintenance.managed.openshift.io/v1alpha1
kind: MachineMaintenance
metadata:
name: mm-dofinn-20201705-blz22-worker-ap-southeast-2a-984h4
namespace: machine-maintenance-operator
spec:
maintenancescheduled: false
eventcode: "instance-stop"
eventid: "instance-event-01d0903276a5d038c"
machineid: "i-04fdc4494938e5c4a"
machinelink: "dofinn-20201705-blz22-worker-ap-southeast-2a-984h4"
status:
maintenance: "in-progress"
```

The MMO would then delete this CR after validating the target machine has been deleted and new one created indicating that the maintenance is completed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would there be any way to know that the maintenance had completed after the CR is deleted?

my concern is that the maintenance would complete and then the deletion happens and due to timing something might miss that signal. in these cases is there some backup or audit trail (perhaps events) ?

Copy link
Contributor Author

@dofinn dofinn Jul 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would there be any way to know that the maintenance had completed after the CR is deleted?

Make the exact same query to find any scheduled events for target node. If != nil, either the maintenance didn't clear or a new one is now there.


### Alerting
An alert will be raised for the following conditions:

* Unable to schedule maintenance during customer defined window and prior to cloud provider maintenance deadline
* Post maintenance verification failed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you define post-maintenance verification in the enhancement? If so, I'm having trouble finding it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

* Node does not return after shutdown
* Node unable to drain
* Node unable to shutdown
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect drain/shutdown to be MachineHealthCheck / machine-API operator concerns that a machine-maintenance operator would not need to handle directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MMO would only alert if it was the process that triggered the drain. Nodes being unable to perform drain/shutdown would require and alert for manual intervention.

* Same event still scheduled for machine after performing maintenance
* Maintenance taking longer then X minutes.

### Test Plan

**Note:** *Section not required until targeted at a release.*

Consider the following in developing a test plan for this enhancement:
- Will there be e2e and integration tests, in addition to unit tests?
- How will it be tested in isolation vs with other components?

No need to outline all of the test cases, just the general strategy. Anything
that would count as tricky in the implementation and anything particularly
challenging to test should be called out.

All code is expected to have adequate tests (eventually with coverage
expectations).

### Graduation Criteria

**Note:** *Section not required until targeted at a release.*

Define graduation milestones.

These may be defined in terms of API maturity, or as something else. Initial proposal
should keep this high-level with a focus on what signals will be looked at to
determine graduation.

Consider the following in developing the graduation criteria for this
enhancement:
- Maturity levels - `Dev Preview`, `Tech Preview`, `GA`
- Deprecation

Clearly define what graduation means.

#### Examples

These are generalized examples to consider, in addition to the aforementioned
[maturity levels][maturity-levels].

##### Dev Preview -> Tech Preview

- Ability to utilize the enhancement end to end
- End user documentation, relative API stability
- Sufficient test coverage
- Gather feedback from users rather than just developers

##### Tech Preview -> GA

- More testing (upgrade, downgrade, scale)
- Sufficient time for feedback
- Available by default

**For non-optional features moving to GA, the graduation criteria must include
end to end tests.**

##### Removing a deprecated feature

- Announce deprecation and support policy of the existing feature
- Deprecate the feature

### Upgrade / Downgrade Strategy

The operator should follow (OLM)[https://github.com/operator-framework/operator-lifecycle-manager] for lifecycle management.

### Version Skew Strategy

How will the component handle version skew with other components?
What are the guarantees? Make sure this is in the test plan.

Consider the following in developing a version skew strategy for this
enhancement:
- During an upgrade, we will always have skew among components, how will this impact your work?
- Does this enhancement involve coordinating behavior in the control plane and
in the kubelet? How does an n-2 kubelet without this feature available behave
when this feature is used?
- Will any other components on the node change? For example, changes to CSI, CRI
or CNI may require updating that component before the kubelet.

## Implementation History

Major milestones in the life cycle of a proposal should be tracked in `Implementation
History`.

## Drawbacks

It is not applicable to baremetal installations.

## Alternatives

Similar to the `Drawbacks` section the `Alternatives` section is used to
highlight and record other possible approaches to delivering the value proposed
by an enhancement.

## Infrastructure Needed [optional]

Use this section if you need things from the project. Examples include a new
subproject, repos requested, github details, and/or testing infrastructure.

Listing these here allows the community to get the process for these resources
started right away.