Machine-maintenance operator proposal #341

Closed

wants to merge 17 commits into from

Conversation

dofinn
Contributor

@dofinn dofinn commented May 26, 2020

No description provided.

@dofinn
Contributor Author

dofinn commented May 26, 2020

@cblecker @jharrington22 @jewzaam @jeremyeder @michaelgugino @derekwaynecarr Could you please provide feedback on this proposal, in particular those of you I have listed as reviewers/approvers? Many thanks.

@michaelgugino
Contributor

I think there are some good ideas here, especially looking at the maintenance events in AWS. As noted on the mailing list, the machine-api already retrieves this data via the describeInstances call; we can put that information into the status field of the machine, an annotation, or what have you. The MMO can then react accordingly; IMO, just delete the machine that's going to go down for maintenance well ahead of time.
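For reference, a rough sketch of pulling those scheduled events with aws-sdk-go (the instance ID and the idea of copying the events into the Machine status or an annotation are illustrative, not part of this proposal; note the events come back on DescribeInstanceStatus):

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	// Query the scheduled maintenance events for a single instance. The
	// machine-api (or the MMO) could copy these into the Machine status or
	// an annotation for later reconciliation.
	sess := session.Must(session.NewSession())
	svc := ec2.New(sess)

	out, err := svc.DescribeInstanceStatus(&ec2.DescribeInstanceStatusInput{
		InstanceIds: []*string{aws.String("i-0123456789abcdef0")}, // example ID
	})
	if err != nil {
		panic(err)
	}
	for _, st := range out.InstanceStatuses {
		for _, ev := range st.Events {
			// ev.Code is e.g. "system-reboot" or "instance-stop"; NotBefore
			// marks the start of the maintenance window.
			fmt.Printf("%s: %s (not before %s)\n",
				aws.StringValue(ev.Code), aws.StringValue(ev.Description),
				aws.TimeValue(ev.NotBefore))
		}
	}
}
```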

We also need to consider other cloud providers. For instance, GCP only makes this kind of data available via metadata, and they only give a 60-second heads-up, and only if you've looked at this value in the metadata previously. In the case of GCP, they live migrate, so for most hosts, ideally there are no issues. For GPU instances, they give a 60-minute warning and stop the instance. The instance will be recreated elsewhere on restart.
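A minimal sketch of reading that GCP metadata value from on the instance itself (assuming the documented maintenance-event endpoint; the helper name is illustrative):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// maintenanceEvent reads the pending maintenance event from the GCP metadata
// server. It returns "NONE" when nothing is scheduled, or a value such as
// "MIGRATE_ON_HOST_MAINTENANCE" / "TERMINATE_ON_HOST_MAINTENANCE".
func maintenanceEvent() (string, error) {
	req, err := http.NewRequest("GET",
		"http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event", nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("Metadata-Flavor", "Google")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(body)), nil
}

func main() {
	ev, err := maintenanceEvent()
	if err != nil {
		panic(err)
	}
	fmt.Println("maintenance-event:", ev)
}
```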

In both the AWS and GCP cases where an instance is stopped and recreated elsewhere, it's not clear what impact that will have on the kubelet. In particular, m5d instances provide NVMe drives which can be partitioned and utilized on /var/lib/containers (or whatever the mount is) for local container storage. If the stuff on the root EBS expects there to be a bunch of items in /var/lib/containers, that might cause an issue. Furthermore, since it's a new instance, will the disks even get set up? Ignition is set to run only once. In any case, it's probably best to proactively delete the instance rather than hoping we can restart the host after the maintenance window is complete.

@dofinn
Contributor Author

dofinn commented May 27, 2020

@michaelgugino Thank you for the feedback. It's becoming apparent that juggling the perspectives of running a single unmanaged cluster versus a fleet of managed clusters makes it difficult to define expectations for operators.

I now agree with you on extending the machine-api to store maintenance event status. This keeps the core logic solid, and then something like the MMO can extend onto it and apply its own policies. Maybe the MMO would be best suited to strictly handling maintenance events that the machine-api would publish to the machine CRs. Any unhealthy nodes ("stopped" or "unhealthy behind LB") can be remediated with an MHC policy.

@jeremyeder
Contributor

What would happen if the mmo detects a maintenance event while an upgrade is underway?

@dofinn
Contributor Author

dofinn commented Jun 4, 2020

What would happen if the mmo detects a maintenance event while an upgrade is underway?

@jeremyeder -> The controller can re-queue the object and reconcile it again according to the SyncPeriod.

https://github.com/openshift/enhancements/pull/341/files#diff-9576061bff3e7a1922fa6c3ab7de01daR97-R99
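For illustration, a minimal controller-runtime sketch of that re-queue behaviour (the reconciler name and the validation stand-in are hypothetical, not part of the proposal):

```go
package controllers

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// MachineMaintenanceReconciler is an illustrative name only.
type MachineMaintenanceReconciler struct{}

func (r *MachineMaintenanceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	clusterBusy := false // stand-in for "is the cluster upgrading / already in maintenance?"
	if clusterBusy {
		// Re-queue the object and reconcile it again after the sync period.
		return ctrl.Result{RequeueAfter: 60 * time.Minute}, nil
	}
	// ... act on the maintenance event ...
	return ctrl.Result{}, nil
}
```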

@michaelgugino
Contributor

What would happen if the mmo detects a maintenance event while an upgrade is underway?

@jeremyary how does one detect an upgrade is underway?

@jeremyary

What would happen if the mmo detects a maintenance event while an upgrade is underway?

@jeremyary how does one detect an upgrade is underway?

Assuming you tagged the wrong Jeremy ^_^ @jeremyeder

@dofinn
Contributor Author

dofinn commented Jun 9, 2020

@michaelgugino we should be able to pull the status from PROGRESSING:

```
oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.3     True        False         41h     Cluster version is 4.4.3
```
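Something like the following could answer the "is an upgrade underway?" question programmatically. This is a sketch assuming the openshift/client-go config clientset; the helper name is illustrative, and an upgrade is treated as in progress when the ClusterVersion "Progressing" condition is True:

```go
package main

import (
	"context"
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client/config"
)

// clusterUpgrading reports whether the ClusterVersion "Progressing" condition is True.
func clusterUpgrading() (bool, error) {
	cfg, err := config.GetConfig()
	if err != nil {
		return false, err
	}
	client, err := configclient.NewForConfig(cfg)
	if err != nil {
		return false, err
	}
	cv, err := client.ConfigV1().ClusterVersions().Get(context.TODO(), "version", metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	for _, cond := range cv.Status.Conditions {
		if cond.Type == configv1.OperatorProgressing {
			return cond.Status == configv1.ConditionTrue, nil
		}
	}
	return false, nil
}

func main() {
	upgrading, err := clusterUpgrading()
	if err != nil {
		panic(err)
	}
	fmt.Println("upgrading:", upgrading)
}
```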

Contributor

@elmiko elmiko left a comment

I think this is an interesting idea; I can see it being useful for operators to better tune their OpenShift clusters. I added some suggestions and also some questions I have.

```yaml
maintenance: "in-progress"
```

The MMO would then delete this CR after validating that the target machine has been deleted and a new one created, indicating that the maintenance is completed.
Contributor

Would there be any way to know that the maintenance had completed after the CR is deleted?

My concern is that the maintenance would complete and then the deletion happens, and due to timing something might miss that signal. In these cases is there some backup or audit trail (perhaps events)?

Contributor Author

@dofinn dofinn Jul 6, 2020

Would there be any way to know that the maintenance had completed after the CR is deleted?

Make the exact same query to find any scheduled events for the target node. If the result != nil, either the maintenance didn't clear or a new one is now there.
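A rough sketch of that verification step on AWS (hypothetical helper, reusing the aws-sdk-go types from the earlier example): re-query the scheduled events and treat any remaining entry as "not cleared":

```go
package maintenance

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// maintenanceCleared re-queries the scheduled events for the target instance.
// A non-empty list means the maintenance did not clear, or a new event has
// since been scheduled.
func maintenanceCleared(svc *ec2.EC2, instanceID string) (bool, error) {
	out, err := svc.DescribeInstanceStatus(&ec2.DescribeInstanceStatusInput{
		InstanceIds: []*string{aws.String(instanceID)},
	})
	if err != nil {
		return false, err
	}
	for _, st := range out.InstanceStatuses {
		if len(st.Events) > 0 {
			return false, nil
		}
	}
	return true, nil
}
```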

Co-authored-by: Michael McCune <msm@opbstudios.com>
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: dofinn
To complete the pull request process, please assign imcleod
You can assign the PR to them by writing /assign @imcleod in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

dofinn and others added 6 commits July 6, 2020 13:40
Co-authored-by: Michael McCune <msm@opbstudios.com>
Co-authored-by: Michael McCune <msm@opbstudios.com>
Co-authored-by: Michael McCune <msm@opbstudios.com>
Co-authored-by: Michael McCune <msm@opbstudios.com>
Co-authored-by: Michael McCune <msm@opbstudios.com>
Co-authored-by: Michael McCune <msm@opbstudios.com>
@wking
Member

wking commented Jul 6, 2020

/retitle Machine-maintenance operator proposal

Unpacking the current title's "MMO" acronym.

@openshift-ci-robot openshift-ci-robot changed the title MMO enhancement proposal Machine-maintenance operator proposal Jul 6, 2020
### Implementation Details/Notes/Constraints [optional]

Constraints:
* This implementation will require the machine-api to query cloud providers for scheduled maintenances and publish them in the machine's CR.
Member

Seems like the current proposal records these in MachineMaintenance resources, instead of writing to the Machine resource. Makes sense to me, because you don't want to be fighting the machine-API trying to write to the same Machine resource, but you probably need to update this line to talk about "the machine-maintenance resource".

Contributor Author

Thanks. Resolved: c29b1c0

* This implementation will require the machine-api to query cloud providers for scheduled maintenances and publish them in the machine's CR.
* GCP only allows maintenances to be queried from the node itself -> `curl http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event -H "Metadata-Flavor: Google"`

This operator will iterate through the machineList{} and inspect each machine CR for scheduled maintenances. If a maintenance is found, the controller will validate the state of the cluster prior to performing any maintenance. For example: is the cluster upgrading? Is the cluster already performing a maintenance?
Member

"is the cluster upgrading" should be orthogonal. The limiting conditions are "are we already attempting to cordon/drain another machine in this pool?" and "can we cordon and drain this machine?". Workloads on the machines should be protected by PDBs and such to prevent eviction that would impact cluster health or provided services.

Contributor Author

Thanks. Resolved: 4556f9e


These processes will only hold true for infra and worker roles of machines within the cluster.

If a scheduled maintenance is detected for a master node, an alert should be raised.
Member

Why is this? Does scheduled maintenance wipe the current disk or something?

Contributor Author

@dofinn dofinn Jul 24, 2020

Admins should be aware that a provider maintenance is planned for a control plane node, giving them a chance to be proactive.


Not every cloud provider has the same kinds of maintenance events, nor do they cause the same degradation on the cluster. Some of the features may need to be toggleable per provider.

Example: [AWS maintenance](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-instances-status-check_sched.html#types-of-scheduled-events) vs [GCP maintenance](https://cloud.google.com/compute/docs/storing-retrieving-metadata#maintenanceevents).
Member

Please explain in the enhancement text the points that readers are expected to take from this comparison.

Contributor Author

Thanks. Resolved: c760ac9

### machinemaintenance controller
The machinemaintenance controller will iterate through machine CRs and reconcile identified maintenances. It will be responsible for first validating the state of the cluster prior to executing anything on a target object. Validating the state of the cluster will initially check only whether the cluster is upgrading or whether a maintenance is already being performed. More use-cases can be added as seen fit.

If the cluster fails validation (for example, it is upgrading), the controller will requeue the object and process it again according to its `SyncPeriod`, which is currently proposed at 60 minutes.
Member

I think "60 minutes" is overly-specific for an enhancement proposal. Can we just say "... process it again later." and leave the duration as an internal implementation detail?

Contributor Author

Thanks. Resolved: 4fb2944


The event type is then sourced from the CR and resolved by either deleting the target machine CR (so the machine-api creates a new one) or raising an alert for manual intervention (master maintenance scheduled).

This is a very UDP type of approach.
Member

It's not clear to me what this comparison adds. If you want to talk about a particular feature of this approach, can you talk about it directly in the enhancement text, instead of referencing another protocol and leaving it to the reader to look for parallels?

Contributor Author

Removed protocol reference.

An alert will be raised for the following conditions:

* Unable to schedule maintenance during the customer-defined window and prior to the cloud provider maintenance deadline
* Post maintenance verification failed
Member

Do you define post-maintenance verification in the enhancement? If so, I'm having trouble finding it.

Contributor Author

* Post maintenance verification failed
* Node does not return after shutdown
* Node unable to drain
* Node unable to shutdown
Member

I would expect drain/shutdown to be MachineHealthCheck / machine-API operator concerns that a machine-maintenance operator would not need to handle directly.

Contributor Author

The MMO would only alert if it was the process that triggered the drain. Nodes being unable to perform drain/shutdown would require an alert for manual intervention.

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 30, 2020
@dofinn
Contributor Author

dofinn commented Oct 31, 2020

/lifecycle frozen

@openshift-ci-robot openshift-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 31, 2020
@dhellmann
Contributor

This proposal is over a year old. As part of recent efforts to clean up old pull requests, I am removing the lifecycle/frozen label to allow it to age out and be closed. If the proposal is still active, please restore the label.

/remove-lifecycle frozen

@openshift-ci openshift-ci bot removed the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Sep 21, 2021
@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 19, 2021
@openshift-bot

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 26, 2021
@openshift-bot

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Nov 2, 2021
@openshift-ci
Contributor

openshift-ci bot commented Nov 2, 2021

@openshift-bot: Closed this PR.

In response to this:

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
