Machine-maintenance operator proposal #341
Conversation
@cblecker @jharrington22 @jewzaam @jeremyeder @michaelgugino @derekwaynecarr Could you please provide feedback on this proposal, in particular on whom I have listed as reviewers/approvers. Many thanks.
I think there are some good ideas here, especially looking at the maintenance events in AWS. As noted on the mailing list, the machine-api already retrieves this data via the describeInstances call; we can put that information into the status field of the machine, an annotation, or what have you. The MMO can then react accordingly. IMO, just delete the machine that's going to go down for maintenance well ahead of time.

We also need to consider other cloud providers. For instance, GCP only makes this kind of data available via metadata, gives only a 60-second heads-up, and only if you've read that value from the metadata previously. In the GCP case, they live migrate, so for most hosts there should ideally be no issues. For GPU instances, they give a 60-minute warning and stop the instance; the instance will be recreated elsewhere on restart.

In both the AWS and GCP cases where an instance is stopped and recreated elsewhere, it's not clear what impact that will have on the kubelet. In particular, m5d instances provide NVMe drives which can be partitioned and used for /var/lib/containers (or whatever the mount is) for local container storage. If the contents of the root EBS volume expect there to be a bunch of items in /var/lib/containers, that might cause an issue. Furthermore, since it's a new instance, will the disks even get set up? Ignition is set to run only once. In any case, it's probably best to proactively delete the instance rather than hoping we can restart the host after the maintenance window is complete.
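The "delete well ahead of time" policy suggested above can be sketched as a small decision function. This is a minimal illustration, assuming the machine-api has already copied DescribeInstanceStatus scheduled events into some structure on the machine; the dict shape, field names, and 24-hour lead time are all hypothetical, not the real Machine CR schema.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical shape: one entry per machine, carrying any scheduled
# cloud events the machine-api copied from DescribeInstanceStatus.
def machines_to_delete(machines, now, lead_time=timedelta(hours=24)):
    """Return names of machines to proactively delete because a
    maintenance event starts within `lead_time`."""
    doomed = []
    for m in machines:
        for event in m.get("scheduled_events", []):
            if event["not_before"] - now <= lead_time:
                doomed.append(m["name"])
                break
    return doomed

now = datetime(2020, 1, 1, tzinfo=timezone.utc)
machines = [
    {"name": "worker-a", "scheduled_events": [
        {"code": "system-reboot", "not_before": now + timedelta(hours=6)}]},
    {"name": "worker-b", "scheduled_events": []},
    {"name": "worker-c", "scheduled_events": [
        {"code": "system-maintenance", "not_before": now + timedelta(days=7)}]},
]
print(machines_to_delete(machines, now))  # only worker-a is within 24h
```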
@michaelgugino Thank you for the feedback. It's becoming apparent that juggling the perspectives of running a single unmanaged cluster vs a fleet of managed clusters makes it difficult to define expectations for operators. I now agree with you on extending the machine-api to store maintenance event status. This keeps the core logic solid, and then something like the MMO can extend onto it and apply its own policies. Maybe the MMO would be best suited to strictly handling maintenance events that the machine-api would publish to the machine CRs. Any unhealthy nodes ("stopped" or "unhealthy behind LB") can be remediated with an MHC policy.
What would happen if the MMO detects a maintenance event while an upgrade is underway?
@jeremyeder -> The controller can re-queue the object and reconcile it again according to the SyncPeriod.
@jeremyary how does one detect an upgrade is underway?
assuming wrong Jeremy ^_^ @jeremyeder |
@michaelgugino we should be able to pull the status from PROGRESSING
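The PROGRESSING check mentioned above refers to the Progressing condition on the ClusterVersion object. A sketch of that check, with the object mocked as the JSON shape `oc get clusterversion -o json` returns (this is an illustration, not a client-go implementation):

```python
# Illustrative "is the cluster upgrading?" check against the
# ClusterVersion status conditions.
def cluster_is_upgrading(clusterversion):
    for cond in clusterversion.get("status", {}).get("conditions", []):
        if cond.get("type") == "Progressing":
            return cond.get("status") == "True"
    return False

cv = {"status": {"conditions": [
    {"type": "Available", "status": "True"},
    {"type": "Progressing", "status": "True",
     "message": "Working towards 4.x.y"},
]}}
print(cluster_is_upgrading(cv))  # True while the upgrade is underway
```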
I think this is an interesting idea; I can see it being useful for operators to better tune their OpenShift clusters. I added some suggestions and also some questions I have.
```
maintenance: "in-progress"
```
The MMO would then delete this CR after validating that the target machine has been deleted and a new one created, indicating that the maintenance is completed.
would there be any way to know that the maintenance had completed after the CR is deleted?
my concern is that the maintenance would complete and then the deletion happens, and due to timing something might miss that signal. In these cases is there some backup or audit trail (perhaps events)?
> would there be any way to know that the maintenance had completed after the CR is deleted?
Make the exact same query to find any scheduled events for the target node. If the result != nil, either the maintenance didn't clear or a new one is now there.
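The re-query verification described above can be sketched as follows. `query_scheduled_events` is a hypothetical stand-in for the provider call (e.g. the AWS describeInstances path mentioned earlier); a non-empty result means the maintenance either did not clear or a new one has been scheduled.

```python
# Sketch: after the MachineMaintenance CR is deleted, ask the provider
# again whether anything is still scheduled for the target node.
def maintenance_cleared(query_scheduled_events, node):
    events = query_scheduled_events(node)
    return not events  # nil/empty result means the maintenance cleared

# Fake provider responses for illustration:
responses = {"worker-a": [], "worker-b": [{"code": "system-reboot"}]}
lookup = lambda node: responses.get(node, [])
print(maintenance_cleared(lookup, "worker-a"))  # True: nothing scheduled
print(maintenance_cleared(lookup, "worker-b"))  # False: event still present
```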
Co-authored-by: Michael McCune <msm@opbstudios.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: dofinn. The full list of commands accepted by this bot can be found here.
/retitle Machine-maintenance operator proposal Unpacking the current title's "MMO" acronym.
### Implementation Details/Notes/Constraints [optional]

Constraints:
* This implementation will require the machine-api to query cloud providers for scheduled maintenances and publish them in the machine's CR.
Seems like the current proposal records these in `MachineMaintenance` resources, instead of writing to the `Machine` resource. Makes sense to me, because you don't want to be fighting the machine-API trying to write to the same `Machine` resource, but probably need to update this line to talk about "the machine-maintenance resource".
Thanks. Resolved: c29b1c0
* GCP only allows maintenances to be queried from the node itself -> `curl http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event -H "Metadata-Flavor: Google"`
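Because the metadata endpoint above is only reachable from the node, a node-local poll would be needed. A sketch, with the HTTP fetch stubbed out so the logic is testable off-node; the real call would GET the URL above with the `Metadata-Flavor: Google` header (and could use `?wait_for_change=true` to block until the value changes).

```python
# Sketch of a node-local check of the GCP maintenance-event metadata key.
# GCP reports NONE when nothing is scheduled, or a value such as
# MIGRATE_ON_HOST_MAINTENANCE / TERMINATE_ON_HOST_MAINTENANCE when an
# event is imminent. `fetch` is a stub for the metadata HTTP GET.
def pending_maintenance(fetch):
    return fetch() != "NONE"

print(pending_maintenance(lambda: "NONE"))                           # False
print(pending_maintenance(lambda: "TERMINATE_ON_HOST_MAINTENANCE"))  # True
```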
This operator will iterate through the machineList{} and inspect each machine CR for scheduled maintenances. If a maintenance is found, the controller will validate the state of the cluster prior to performing any maintenance. For example: is the cluster upgrading? Is the cluster already performing a maintenance?
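The validation walk described above could be sketched as a pure function over the machine list. All names here are illustrative; the two pre-flight checks mirror the ones the proposal names (cluster upgrading, maintenance already in progress), and failing machines are requeued rather than acted on.

```python
# Sketch: iterate the machine list and only act on a scheduled
# maintenance if the cluster passes the pre-flight checks.
def reconcile(machines, cluster_upgrading, maintenance_in_progress):
    actions = []
    for m in machines:
        if not m.get("scheduled_maintenance"):
            continue  # nothing to do for this machine
        if cluster_upgrading or maintenance_in_progress:
            actions.append(("requeue", m["name"]))  # retry after SyncPeriod
        else:
            actions.append(("remediate", m["name"]))
            maintenance_in_progress = True  # one maintenance at a time
    return actions

machines = [{"name": "worker-a", "scheduled_maintenance": True},
            {"name": "worker-b", "scheduled_maintenance": True}]
print(reconcile(machines, cluster_upgrading=False,
                maintenance_in_progress=False))
# the first machine is remediated; the second waits its turn
```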
"is the cluster upgrading" should be orthogonal. The limiting conditions are "are we already attempting to cordon/drain another machine in this pool?" and "can we cordon and drain this machine?". Workloads on the machines should be protected by PDBs and such to prevent eviction that would impact cluster health or provided services.
Thanks. Resolved: 4556f9e
These processes will only hold true for infra and worker roles of machines within the cluster.

If a scheduled maintenance is detected for a master node, an alert should be raised.
Why is this? Does scheduled maintenance wipe the current disk or something?
Admins should be aware that a provider maintenance is planned for a control plane node? Give them a chance to be proactive.
Not every cloud provider has the same kinds of maintenance events, nor do they all cause the same degradation on the cluster. Some of the features may need to be toggleable per provider.

Example: [AWS maintenance](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-instances-status-check_sched.html#types-of-scheduled-events) vs [GCP maintenance](https://cloud.google.com/compute/docs/storing-retrieving-metadata#maintenanceevents).
Please explain in the enhancement text the points that readers are expected to take from this comparison.
Thanks. Resolved: c760ac9
### machinemaintenance controller
The machinemaintenance controller will iterate through machine CRs and reconcile identified maintenances. It will be responsible for first validating the state of the cluster before executing anything on a target object. Initially, validation will check only whether the cluster is upgrading or whether a maintenance is already being performed; more use-cases can be added as seen fit.

If the cluster fails validation (for example, it is upgrading), the controller will requeue the object and process it again according to its `SyncPeriod`, currently proposed at 60 minutes.
I think "60 minutes" is overly-specific for an enhancement proposal. Can we just say "... process it again later." and leave the duration as an internal implementation detail?
Thanks. Resolved: 4fb2944
The event type is then sourced from the CR and resolved by either deleting the target machine CR (so the machine-api creates a new one) or raising an alert for manual intervention (master maintenance scheduled).

This is a very UDP type of approach.
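The delete-or-alert resolution just described amounts to a small dispatch on the machine's role. A sketch, with illustrative role names and return values: worker/infra machines get their Machine CR deleted (the machine-api then creates a replacement), while masters only produce an alert for manual intervention.

```python
# Sketch of the resolution step: delete worker/infra machines ahead of
# maintenance, alert on masters. Names and tuples are illustrative.
def resolve(machine):
    if machine["role"] in ("worker", "infra"):
        return ("delete-machine", machine["name"])
    return ("alert", machine["name"])  # master: manual intervention

print(resolve({"name": "worker-a", "role": "worker"}))
print(resolve({"name": "master-0", "role": "master"}))
```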
It's not clear to me what this comparison adds. If you want to talk about a particular feature of this approach, can you talk about it directly in the enhancement text, instead of referencing another protocol and leaving it to the reader to look for parallels?
Removed protocol reference.
An alert will be raised for the following conditions:

* Unable to schedule maintenance during customer-defined window and prior to cloud provider maintenance deadline
* Post maintenance verification failed
Do you define post-maintenance verification in the enhancement? If so, I'm having trouble finding it.
* Post maintenance verification failed
* Node does not return after shutdown
* Node unable to drain
* Node unable to shutdown
I would expect drain/shutdown to be MachineHealthCheck / machine-API operator concerns that a machine-maintenance operator would not need to handle directly.
MMO would only alert if it was the process that triggered the drain. Nodes being unable to perform drain/shutdown would require an alert for manual intervention.
/lifecycle frozen
This proposal is over a year old. As part of recent efforts to clean up old pull requests, I am removing the life-cycle/frozen label to allow it to age out and be closed. If the proposal is still active, please restore the label. /remove-lifecycle frozen
@openshift-bot: Closed this PR.