-
Notifications
You must be signed in to change notification settings - Fork 475
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Machine-maintenance operator proposal #341
Changes from 12 commits
9503ba1
4839d15
9ee258a
c5a5a6a
2fcc04d
dd14020
365d5b1
cd32157
7daa0af
5dd9697
07d7c99
77f35c0
34748a2
c29b1c0
4556f9e
c760ac9
4fb2944
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,237 @@ | ||
--- | ||
title: machine-maintenance-operator | ||
authors: | ||
- "@dofinn" | ||
reviewers: | ||
- "@cblecker" | ||
- "@jharrington22" | ||
- "@jewzaam" | ||
- "@jeremyeder" | ||
- "@michaelgugino" | ||
- "@derekwaynecarr" | ||
approvers: | ||
- "@michaelgugino" | ||
- "@derekwaynecarr" | ||
- "@bison" | ||
- "@enxebre" | ||
creation-date: 2020-05-26 | ||
last-updated: 2020-05-28 | ||
status: provisional | ||
|
||
# machine-maintenance-operator | ||
|
||
## Release Signoff Checklist | ||
|
||
- [ ] Enhancement is `implementable` | ||
- [ ] Design details are appropriately documented from clear requirements | ||
- [ ] Test plan is defined | ||
- [ ] Graduation criteria for dev preview, tech preview, GA | ||
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/) | ||
|
||
## Open Questions | ||
>1. How applicable is this use-case to baremetal installations? | ||
|
||
>2. The ability to declare maintenance windows is currently being discussed in OSD. Nothing has been designed in concrete yet, meaning that this proposal has an undesigned dependency. However, development could commence on this operator and have the maintenance functionality retroactively fitted. Is this a blocker for its acceptance? | ||
|
||
>3. How applicable will this be to a cloud provider like GCP that performs live migrations of their VMs? | ||
|
||
## Summary | ||
|
||
This enhancement proposal explores the idea of having a machine-maintenance-operator(MMO) that will inspect each machine CR for scheduled events for each infra/worker machine in the cluster. Upon finding a maintenance for a machine, the MMO will seek to execute this maintenance manually at the earliest convenient time as defined by an administrator. | ||
|
||
A POC of the idea can be found [here](https://github.com/dofinn/machine-maintenance-operator). | ||
|
||
## Motivation | ||
|
||
Currently the machine-api marks a machine as stopped when it has been terminated/stopped by the cloud provider. A provider may terminate/stop an instance when there is an associated maintenance scheduled for it. This is also the case for users manually terminating/stopping machines via the console. The MMO is a proactive approach to executing the required action (for example, delete target machine CR) to enable the machine-api to manage machinesets. This operator would work in conjunction with MachineHealthCheck implementation. | ||
|
||
## Requirements | ||
That the machine-api collects scheduled maintenances from cloud providers and posts them in the status of each machine's CR | ||
|
||
### Goals | ||
|
||
List the specific goals of the proposal. How will we know that this has succeeded? | ||
The machine-maintenance-operator will: | ||
* detect maintenances from machine CRs and execute them manually at their earliest and most convenient time | ||
|
||
### Non-Goals | ||
|
||
This is not a solution for managing maintenances or machine state for master machine roles. | ||
|
||
## Proposal | ||
|
||
### User Stories [optional] | ||
|
||
#### Story 1 | ||
As a customer using OKD/OCP/OSD, I want my cluster/s to be proactive in handling events that are initiated by the cloud provider hosting my cluster. This will assure me that OKD/OCP/OSD accounts for machines in the cluster that require terminating or stopping to be proactively maintained and addressed with intention. | ||
|
||
### Implementation Details/Notes/Constraints [optional] | ||
|
||
Constraints: | ||
* This implementation will require the machine-api to query cloud providers for scheduled maintenances and publish them in the machine's CR. | ||
* GCP only allows maintenances to be queried from the node itself -> `curl http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event -H "Metadata-Flavor: Google"` | ||
|
||
This operator will iterate through the machineList{} and inspect each machine CR for scheduled maintenances. If a maintenance is found, the controller will validate the state of the cluster prior to performing any maintenance. For example; is the cluster upgrading? is the cluster already performing a maintenance? | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. "is the cluster upgrading" should be orthogonal. The limiting conditions are "are we already attempting to cordon/drain another machine in this pool?" and "can we cordon and drain this machine?". Workloads on the machines should be protected by PDBs and such to prevent eviction that would impact cluster health or provided services. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks. Resolved: 4556f9e |
||
|
||
These processes will only hold true for infra and worker roles of machines within the cluster. | ||
|
||
If a scheduled maintenance is detected for a master node, an alert should be raised. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why is this? Does scheduled maintenance wipe the current disk or something? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Admins should be aware that a provider maintenance is planned for a control plane node? Give them a chance to be proactive. |
||
|
||
Not every cloud provider has the same kind of maintenances that inherit that same degradation on the cluster. Some of the features may need to be toggleable per provider. | ||
dofinn marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
Example: (AWS maintenance)[https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-instances-status-check_sched.html#types-of-scheduled-events] vs (GCP maintenance)[https://cloud.google.com/compute/docs/storing-retrieving-metadata#maintenanceevents]. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Please explain in the enhancement text the points that readers are expected to take from this comparison. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks. Resolved: c760ac9 |
||
|
||
|
||
### Risks and Mitigations | ||
|
||
## Design Details | ||
|
||
AWS event types: | ||
* instance-stop | ||
* instance-retirement | ||
* instance-reboot | ||
* system-reboot | ||
* system-maintenance | ||
|
||
### machinemaintenance controller | ||
The machinemaintenance controller will iterate through machine CRs and reconcile identified maintenances. It will be responsible for first validating the state of the cluster prior to executing anything on a target object. Validating the state of the cluster will include will initially check for only is the cluster upgrading or is a maintenance already being performed. More use-cases can be added as seen fit. | ||
|
||
If the cluster fails validation (for example is upgrading), the controller will requeue the object and process it again according to its `SyncPeriod` which would currently be proposed at 60 minutes. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think "60 minutes" is overly-specific for an enhancement proposal. Can we just say "... process it again later." and leave the duration as an internal implementation detail? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks. Resolved: 4fb2944 |
||
|
||
After cluster validation, the controller will ascertain if its in a maintenance window where it can execute maintenances (See open question 2). In the case that no maintenance windows are defined, the controller will continue as true. If the maintenance window logic is only applicable in OSD, the operator could validate if its "managed" prior to expecting these resources. | ||
|
||
The event type is then sourced from the CR and then is resolved by either deleting a target machine CR (so the machine-api creates a new one) or raising an alert for manual intervention (master maintenance scheduled). | ||
|
||
This is a very UDP type of approach. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's not clear to me what this comparison adds. If you want to talk about a particular feature of this approach, can you talk about it directly in the enhancement text, instead of referencing another protocol and leaving it to the reader to look for parallels. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Removed protocol reference. |
||
|
||
The MMo could also look to store the state of its actions in its own machinemaintenance CR that it would create from a machine CR. | ||
|
||
### Example machinemaintenance CR | ||
|
||
``` | ||
apiVersion: machinemaintenance.managed.openshift.io/v1alpha1 | ||
kind: MachineMaintenance | ||
metadata: | ||
name: mm-dofinn-20201705-blz22-worker-ap-southeast-2a-984h4 | ||
namespace: machine-maintenance-operator | ||
spec: | ||
maintenancescheduled: false | ||
eventcode: "instance-stop" | ||
eventid: "instance-event-01d0903276a5d038c" | ||
machineid: "i-04fdc4494938e5c4a" | ||
machinelink: "dofinn-20201705-blz22-worker-ap-southeast-2a-984h4" | ||
status: | ||
maintenance: "in-progress" | ||
``` | ||
|
||
The MMO would then delete this CR after validating the target machine has been deleted and new one created indicating that the maintenance is completed. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. would there be any way to know that the maintenance had completed after the CR is deleted? my concern is that the maintenance would complete and then the deletion happens and due to timing something might miss that signal. in these cases is there some backup or audit trail (perhaps events) ? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Make the exact same query to find any scheduled events for target node. If != nil, either the maintenance didn't clear or a new one is now there. |
||
|
||
### Alerting | ||
An alert will be raised for the following conditions: | ||
|
||
* Unable to schedule maintenance during customer defined window and prior to cloud provider maintenance deadline | ||
* Post maintenance verification failed | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do you define post-maintenance verification in the enhancement? If so, I'm having trouble finding it. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. |
||
* Node does not return after shutdown | ||
* Node unable to drain | ||
* Node unable to shutdown | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would expect drain/shutdown to be MachineHealthCheck / machine-API operator concerns that a machine-maintenance operator would not need to handle directly. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. MMO would only alert if it was the process that triggered the drain. Nodes being unable to perform drain/shutdown would require and alert for manual intervention. |
||
* Same event still scheduled for machine after performing maintenance | ||
* Maintenance taking longer then X minutes. | ||
|
||
### Test Plan | ||
|
||
**Note:** *Section not required until targeted at a release.* | ||
|
||
Consider the following in developing a test plan for this enhancement: | ||
- Will there be e2e and integration tests, in addition to unit tests? | ||
- How will it be tested in isolation vs with other components? | ||
|
||
No need to outline all of the test cases, just the general strategy. Anything | ||
that would count as tricky in the implementation and anything particularly | ||
challenging to test should be called out. | ||
|
||
All code is expected to have adequate tests (eventually with coverage | ||
expectations). | ||
|
||
### Graduation Criteria | ||
|
||
**Note:** *Section not required until targeted at a release.* | ||
|
||
Define graduation milestones. | ||
|
||
These may be defined in terms of API maturity, or as something else. Initial proposal | ||
should keep this high-level with a focus on what signals will be looked at to | ||
determine graduation. | ||
|
||
Consider the following in developing the graduation criteria for this | ||
enhancement: | ||
- Maturity levels - `Dev Preview`, `Tech Preview`, `GA` | ||
- Deprecation | ||
|
||
Clearly define what graduation means. | ||
|
||
#### Examples | ||
|
||
These are generalized examples to consider, in addition to the aforementioned | ||
[maturity levels][maturity-levels]. | ||
|
||
##### Dev Preview -> Tech Preview | ||
|
||
- Ability to utilize the enhancement end to end | ||
- End user documentation, relative API stability | ||
- Sufficient test coverage | ||
- Gather feedback from users rather than just developers | ||
|
||
##### Tech Preview -> GA | ||
|
||
- More testing (upgrade, downgrade, scale) | ||
- Sufficient time for feedback | ||
- Available by default | ||
|
||
**For non-optional features moving to GA, the graduation criteria must include | ||
end to end tests.** | ||
|
||
##### Removing a deprecated feature | ||
|
||
- Announce deprecation and support policy of the existing feature | ||
- Deprecate the feature | ||
|
||
### Upgrade / Downgrade Strategy | ||
|
||
The operator should follow (OLM)[https://github.com/operator-framework/operator-lifecycle-manager] for lifecycle management. | ||
|
||
### Version Skew Strategy | ||
|
||
How will the component handle version skew with other components? | ||
What are the guarantees? Make sure this is in the test plan. | ||
|
||
Consider the following in developing a version skew strategy for this | ||
enhancement: | ||
- During an upgrade, we will always have skew among components, how will this impact your work? | ||
- Does this enhancement involve coordinating behavior in the control plane and | ||
in the kubelet? How does an n-2 kubelet without this feature available behave | ||
when this feature is used? | ||
- Will any other components on the node change? For example, changes to CSI, CRI | ||
or CNI may require updating that component before the kubelet. | ||
|
||
## Implementation History | ||
|
||
Major milestones in the life cycle of a proposal should be tracked in `Implementation | ||
History`. | ||
|
||
## Drawbacks | ||
|
||
It is not applicable to baremetal installations. | ||
|
||
## Alternatives | ||
|
||
Similar to the `Drawbacks` section the `Alternatives` section is used to | ||
highlight and record other possible approaches to delivering the value proposed | ||
by an enhancement. | ||
|
||
## Infrastructure Needed [optional] | ||
|
||
Use this section if you need things from the project. Examples include a new | ||
subproject, repos requested, github details, and/or testing infrastructure. | ||
|
||
Listing these here allows the community to get the process for these resources | ||
started right away. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like the current proposal records these in
MachineMaintenance
resources, instead of writing to theMachine
resource. Makes sense to me, because you don't want to be fighting the machine-API trying to write to the sameMachine
resource, but probably need to update this line to talk about "the machine-maintenance resource".There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Resolved: c29b1c0