Add APIs for machine phases #531

Merged on Nov 28, 2018 (3 commits)


hardikdr
Member

What this PR does / why we need it: This PR adds the necessary machine phases and states to the machine-api stack. Machine phases and states are a way of representing the lifecycle of machines. The PR is based on the following proposal doc: proposal

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

Release note:

Added APIs for machine phases

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 10, 2018
// MachineRunning should be set when machine has joined the cluster and running successfully.
MachineRunning MachinePhase = "Running"

// MachineRunning should be set when machine is being deleted.


Replace MachineRunning with MachineTerminating

Member Author

Done, thanks for pointing that out.

@scruplelesswizard

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 11, 2018

@scruplelesswizard scruplelesswizard left a comment


Just a nit from me

)

// MachineState is the current status of the last performed operation.
type MachineState string


nit: MachineOperationState would be a better name as it represents the state of the operation, not of the machine

Member Author

Yes, that makes sense, thanks.
Updating the occurrences wherever needed.

Member Author

done.

@hardikdr hardikdr force-pushed the machine-phase-api branch 2 times, most recently from 5f36ff3 to 7426220 Compare October 11, 2018 09:24
@alvaroaleman
Member

Background for the failing tests: The timeout for the test controlplane is too low, so it often fails to start: 2018/10/11 09:27:09 timeout waiting for process kube-apiserver to start

This already happened multiple times on different PRs. For the time being the workaround is to hit /retest until they pass.

I've created a PR in kubebuilder to get this configurable so we can increase the timeout: kubernetes-sigs/controller-runtime#169 It will however take some time until we can use that, because even when that PR is merged we have to wait for the next release.

@hardikdr
Member Author

/retest

@hardikdr
Member Author

Thanks for the information, @alvaroaleman. I will retry after some time.

@scruplelesswizard

/retest

@alvaroaleman
Member

@hardikdr Actually, the failures here are real; check out the logs.

@hardikdr
Member Author

@alvaroaleman Yes, I just corrected it and test-check seems to be passing. Thanks for the help.

@davidewatson
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 14, 2018
@alvaroaleman
Member

@hardikdr Can you quickly explain what this allows us to do that we currently cannot do?

Because:

Is there anything I missed in the above list? Is there anything I got wrong in the above list?

@hardikdr
Member Author

Yes sure.
The idea is to basically define the life-cycle of the machines in a descriptive manner.
I see use cases mainly from 2 broad categories: Expressiveness and Consumption.

1. Expressiveness:

  • It would be nice to have a framework that expresses the different lifecycle events of machines as accurately as possible.

  • Taking a subtle example:
    A machine gets into the Failed phase. There could be multiple reasons behind it, for instance:

    • The kubelet has not contacted the API server for the last 10 minutes.
    • Or disk pressure has been high for the last ~30 minutes, so apps are suffering.
    • Or machine creation itself has failed, due to an error from the cloud provider or, say, a misconfiguration of the software stack on the machine.

It seems it will be great to express all three scenarios in a structured fashion.

  • For the first two cases, we could set MStatus.LastOperation to Health-check and use MStatus.LastOperation.State/Description to state clearly that it failed because the kubelet did not respond in the first case and because disk pressure was high in the second.

  • The third case is different in that creation itself failed. LastOperation could then be set to Create, and LastOperation.State/Description could say either:

    • that the cloud provider could not create the machine,
    • or that the cloud provider created the machine but the kubelet could not register itself.
  • Essentially, the proposed framework offers a well-equipped means of expressing the current status of the machine. On top of that, the LastOperation field provides the latest operation history of the machine status.
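
As a rough illustration of the three scenarios above, the following Go sketch records each one in a LastOperation-like value. It is not the API added by this PR: the struct layout, the `failedOperation` helper, and the `"Failed"` state string are assumptions made for the example. Only the operation names Health-check and Create come from the comment above.

```go
package main

import (
	"fmt"
	"time"
)

// LastOperation is a simplified, hypothetical stand-in for the structure
// discussed above; the field set here is an assumption.
type LastOperation struct {
	Type        string    // which operation ran last, e.g. "Health-check" or "Create"
	State       string    // outcome of that operation
	Description string    // human-readable reason
	LastUpdated time.Time // when the record was written
}

// failedOperation is a hypothetical helper that builds a failed-operation record.
func failedOperation(opType, description string, at time.Time) LastOperation {
	return LastOperation{Type: opType, State: "Failed", Description: description, LastUpdated: at}
}

func main() {
	now := time.Now()
	scenarios := []LastOperation{
		failedOperation("Health-check", "kubelet has not contacted the API server for 10 minutes", now),
		failedOperation("Health-check", "disk pressure has been high for ~30 minutes", now),
		failedOperation("Create", "cloud provider could not create the machine", now),
	}
	for _, op := range scenarios {
		fmt.Printf("%s %s: %s\n", op.Type, op.State, op.Description)
	}
}
```

The point of the sketch is that the first two records share an operation type and differ only in the description, while the third differs in the operation type itself.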

2. Consumption:

  • A machine can go through many different phases during its lifespan, and the MachineSet controller or the user might want to take different actions depending on the exact phase.
    For instance,
  • MachineSet controller might not want to delete the machine with Unknown phase, but rather give some more time to machine to connect back within timeout period.
  • MachineSet controller might want to delete the machine immediately and replace it with a newer one if it is in the Failed phase, assuming the machine has already been given enough time to join back.
  • MachineSet controller might want to ignore machines in the Standby phase in the case of bare metal, and more.
  • Essentially, the set of phases should be well defined so that the MachineSet controller or another higher-level controller can take the right actions.
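
The consumption cases above could be sketched as a small decision function. This is only an illustration under stated assumptions: `MachineRunning` and its value appear in the diff under review, but the other phase constant names, the `Action` type, and `actionFor` are hypothetical and do not describe what the MachineSet controller actually does.

```go
package main

import "fmt"

// MachinePhase mirrors the string type from the diff under review.
type MachinePhase string

const (
	MachineRunning MachinePhase = "Running"
	// The constant names below are assumptions; only the phase names
	// Unknown, Failed and Standby are mentioned in the comment above.
	MachineUnknown MachinePhase = "Unknown"
	MachineFailed  MachinePhase = "Failed"
	MachineStandby MachinePhase = "Standby"
)

// Action is a hypothetical description of what a higher-level controller does next.
type Action string

const (
	ActionNone    Action = "none"
	ActionWait    Action = "wait"
	ActionReplace Action = "replace"
	ActionIgnore  Action = "ignore"
)

// actionFor maps a phase to the action described in the list above.
func actionFor(phase MachinePhase) Action {
	switch phase {
	case MachineUnknown:
		return ActionWait // give the machine time to connect back
	case MachineFailed:
		return ActionReplace // delete and replace with a new machine
	case MachineStandby:
		return ActionIgnore // e.g. bare-metal machines kept in reserve
	default:
		return ActionNone
	}
}

func main() {
	for _, p := range []MachinePhase{MachineRunning, MachineUnknown, MachineFailed, MachineStandby} {
		fmt.Printf("%s -> %s\n", p, actionFor(p))
	}
}
```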

To answer the question: Can you quickly explain what this allows us to do that we currently cannot do?

  • I would say, in current model we do not have means of expressing the phases in detailed and structured way - as described in use case 1 above.
  • And also we do not have well-defined set of inputs for machineSet/other-controllers to take the right actions based on the current status of the machine. - use case 2.

The machine-shared-controller and machine-external-controller would still require a contract for updating the machineStatus, but that's a separate thread. This turned out to be a long one, but I hope it makes sense :) Feedback is most welcome.

@alvaroaleman
Member

Thanks for the detailed answer, @hardikdr !

It seems it will be great to express all three scenarios in a structured fashion.

We can derive all this information from .Status.NodeRef: failed machine creation from the fact that the .Status.NodeRef does not exist or does not point to a valid node, the two other examples you mentioned are available as conditions on the Node object.

MachineSet controller might not want to delete the machine with Unknown phase, but rather give some more time to machine to connect back within timeout period.

I don't think it should be the concern of the MachineSet controller to re-create cloud provider instances when the node gets lost or never connects, that is something the Machine controller does/should do. And again, all this info here is available from the Machine object or the referenced Node object.

I would say, in current model we do not have means of expressing the phases in detailed and structured way - as described in use case 1 above

Can you explain why this is needed? For users we can use events, for other controllers do you have an example use-case where the information available via the .Status.NodeRef is not sufficient? Because to me it feels like we are duplicating the Node object and its features here (conditions, heartbeats)

And also we do not have well-defined set of inputs for machineSet/other-controllers to take the right actions based on the current status of the machine

Basically the same question as above: do you have an example of a controller that needs this? The MachineSet controller only needs to know whether a node exists and whether it is healthy, and this works perfectly fine with the existing .Status.NodeRef
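
For readers following the argument, here is a minimal sketch of what deriving health from the node reference might look like. The `Machine` and `Node` types are heavily simplified assumptions (the real NodeRef is an object reference and real node conditions are a list, not a map), and `deriveHealth` is a hypothetical helper. `Ready` and `DiskPressure` are existing Kubernetes node condition types.

```go
package main

import "fmt"

// Node is a hypothetical, simplified node: condition type -> status
// ("True", "False" or "Unknown").
type Node struct {
	Conditions map[string]string
}

// Machine is a hypothetical, simplified machine that keeps only the
// name of the node it refers to; nil means no node is referenced.
type Machine struct {
	NodeRef *string
}

// deriveHealth infers a coarse health summary from the node reference
// and the referenced node's conditions.
func deriveHealth(m Machine, nodes map[string]Node) string {
	if m.NodeRef == nil {
		return "no node reference"
	}
	node, ok := nodes[*m.NodeRef]
	if !ok {
		return "node reference does not point to a known node"
	}
	if node.Conditions["DiskPressure"] == "True" {
		return "disk pressure"
	}
	if node.Conditions["Ready"] != "True" {
		return "not ready"
	}
	return "healthy"
}

func main() {
	name := "node-a"
	nodes := map[string]Node{
		"node-a": {Conditions: map[string]string{"Ready": "Unknown", "DiskPressure": "False"}},
	}
	fmt.Println(deriveHealth(Machine{}, nodes))               // no node reference
	fmt.Println(deriveHealth(Machine{NodeRef: &name}, nodes)) // not ready
}
```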

@hardikdr
Member Author

Thanks for the comments, @alvaroaleman.

We can derive all this information from .Status.NodeRef: failed machine creation from the fact that the .Status.NodeRef does not exist or does not point to a valid node,

Yes, NodeRef is essential and provides a lot of information, but I am not sure there is a way to tell from NodeRef's existence alone whether a creation, deletion, or other operation has failed or succeeded on a particular machine. Deriving such conclusions only from whether NodeRef is present, or whether it points to the wrong machine, seems less desirable.

I don't think it should be the concern of the MachineSet controller to re-create cloud provider instances when the node gets lost or never connects, that is something the Machine controller does/should do.

I would rather expect the machine controller to be mostly responsible for creating and deleting machines and for reporting the right health status via the MachineStatus field. The MachineSet controller should then look for Failed machines and try to replace them. This is from the perspective that the MachineSet controller should only make sure the desired number of healthy machines exists and take the necessary steps, and should not race with the machine controller when machines are recreated.

Can you explain why this is needed? For users we can use events, for other controllers do you have an example use-case where the information available via the .Status.NodeRef is not sufficient?

Fundamentally, NodeRef is bound by the possible values available in the Node object. We would definitely want to expand our use cases to include new phases such as Draining, Standby (for bare metal), and more.

@davidewatson
Contributor

MachineOperationType and MachineOperationState can be useful for GUIs built on top of the ClusterAPI. To determine whether an upgrade is complete, we currently compare Version.Kubelet and Version.ControlPlane with the version reported by the Node. This is somewhat limited, since kubelet versions are not the only reason for an upgrade.

Others have suggested using Events instead. However, can events be lost, for example if a controller reboots?

From a user perspective I am not sure there is a more reliable way to answer these questions. Is the machine still upgrading, is it being deleted, etc? This is valuable for user feedback within a GUI.

@roberthbailey
Contributor

As discussed during the meeting today, we will merge this in ~24 hours unless there are objections before then.

/approve
/hold

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Oct 31, 2018
@scruplelesswizard

/retest

pkg/apis/cluster/v1alpha1/machine_types.go: four review threads (outdated, resolved)
@hardikdr
Member Author

@chaosaffe Thanks for the suggestions. All of them look good to me, and I have made the necessary changes.

@scruplelesswizard

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 22, 2018
// specific machine. It should also convey the state of the latest-operation for example if
// it is still on-going, failed or completed successfully.
// +optional
LastOperation LastOperation `json:"lastOperation,omitempty"`
Member

Please make this a pointer, as it's an optional struct.

Member Author

yes sure, done.
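
One reason a pointer matters for an optional struct: with Go's encoding/json, the `omitempty` option never omits a struct value, but it does omit a nil pointer. The sketch below shows the difference using simplified, hypothetical types (`statusByValue`, `statusByPointer`, and a one-field `LastOperation`); it is not the code from this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// LastOperation is a simplified stand-in with a single field.
type LastOperation struct {
	Description string `json:"description,omitempty"`
}

// statusByValue embeds the optional struct by value.
type statusByValue struct {
	LastOperation LastOperation `json:"lastOperation,omitempty"`
}

// statusByPointer holds the optional struct behind a pointer.
type statusByPointer struct {
	LastOperation *LastOperation `json:"lastOperation,omitempty"`
}

// marshal is a small hypothetical helper that returns the JSON as a string.
func marshal(v interface{}) string {
	b, err := json.Marshal(v)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	fmt.Println(marshal(statusByValue{}))   // {"lastOperation":{}}
	fmt.Println(marshal(statusByPointer{})) // {}
}
```

With the value field, an unset LastOperation still serializes as an empty object; with the pointer, the key disappears entirely, which matches the `+optional` marker on the field.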

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 22, 2018
@alvaroaleman
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 22, 2018
@hardikdr
Member Author

/retest

@roberthbailey
Contributor

It looks like we've pretty much reached consensus on this PR. Let's make sure there are no further comments or objections at the next meeting and then merge it if all looks good.

Contributor

@sidharthsurana sidharthsurana left a comment


minor naming suggestion, otherwise lgtm

properties:
description:
type: string
lastUpdateTime:
Contributor

Nit: Can we rename lastUpdateTime to lastUpdated, to be consistent with similar fields in other places and objects?

Member Author

done, thanks.

Co-Authored-By: hardikdr <hardik.dodiya@sap.com>
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 27, 2018
@sidharthsurana
Contributor

/LGTM

@k8s-ci-robot
Contributor

@sidharthsurana: changing LGTM is restricted to assignees, and only kubernetes-sigs/cluster-api repo collaborators may be assigned issues.

In response to this:

/LGTM

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@scruplelesswizard

scruplelesswizard commented Nov 28, 2018 via email

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 28, 2018
@roberthbailey
Contributor

/approve

@roberthbailey
Contributor

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 28, 2018
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hardikdr, roberthbailey

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 5a27d98 into kubernetes-sigs:master Nov 28, 2018
@derekwaynecarr
Contributor

I apologize for the late question, but is there a write-up that summarizes a response to Daniel's feedback here:
https://docs.google.com/document/d/12TsBPn1lfMk50_yydzbXNZ9PT-8-88o4tSNi7eqPCVg/edit?disco=AAAACWAG9QI

Similar to him, I worry about the use of a single phase, based on experience with pods.

@roberthbailey
Contributor

@hardikdr -- can you answer @derekwaynecarr's question? I know that you had a chance to connect with @lavalamp at KubeCon and chat about phases / conditions / etc.

jayunit100 pushed a commit to jayunit100/cluster-api that referenced this pull request Jan 31, 2020
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.