
Control plane node join #463

Merged

Conversation

@ashish-amarnath (Contributor) commented Dec 25, 2018

What this PR does / why we need it:
Implements support for control plane node join

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #413

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

Release note:

cluster-api-provider-aws now supports creating clusters with HA control plane using clusterctl phases.
Steps to create clusters with HA control plane are documented <Link to readme generated as a fix for #414 >

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 25, 2018
@k8s-ci-robot (Contributor) commented

Hi @ashish-amarnath. Thanks for your PR.

I'm waiting for a kubernetes-sigs or kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Dec 25, 2018
@ashish-amarnath (Contributor, Author) commented

@chuckha Thanks for the pointers on #413
I took a stab at this to get the scaffolding and initial ideas in place.
Publishing this PR to get some feedback on the approach.

/assign @chuckha
cc @detiber

@vincepri (Member) left a comment


Thanks for taking the lead on this! Left some early comments.

Inline review comments (resolved) on: pkg/apis/awsprovider/v1alpha1/types.go, pkg/cloud/aws/actuators/machine/actuator.go, pkg/cloud/aws/actuators/machine_scope.go, pkg/cloud/aws/actuators/scope.go, pkg/cloud/aws/services/ec2/instances.go
@chuckha (Contributor) left a comment


I think a little more discussion around why a counter is needed would help me understand this code more

Inline review comments (resolved) on pkg/cloud/aws/actuators/machine/actuator.go
@ashish-amarnath (Contributor, Author) commented Jan 6, 2019

Current status:

Two scenarios of HA control plane support have been tested (see Testing below).

ToDo:

Developer notes:

Reference: https://github.com/kubernetes/enhancements/blob/master/keps/sig-cluster-lifecycle/0015-kubeadm-join-control-plane.md#add-a-new-control-plane-instance-dynamic-workflow

Testing

Test 1

  • Test control plane node join: creating control plane machines one at a time.
  • Bootstrap a cluster with a single master machine. Once the cluster has an established control plane:
  • Create 2 additional control plane machines one at a time: kubectl apply -f <ValidMachineYaml> for each machine.
    The 2 machines should join the cluster as masters. Verify this by:
  • kubectl get no should show the two new machines with master roles
  • kubectl -n kube-system get po should show 3 replicas of the control plane pods running (one per control plane machine)
  • The 3 machines should have the correct security group and should be registered with the API server load balancer.

Test 2

  • Test control plane node join: creating multiple control plane machines at once.
  • Bootstrap a cluster with a single master machine. Once the cluster has an established control plane:
  • Create 2 additional control plane machines at once: kubectl apply -f <ValidMachineListYaml>
    The 2 machines should join the cluster as masters. Verify this by:
  • kubectl get no should show the two new machines with master roles
  • kubectl -n kube-system get po should show 3 replicas of the control plane pods running (one per control plane machine)
  • The 3 machines should have the correct security group and should be registered with the API server load balancer (verification commands are sketched after this list).
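For reference, a minimal sketch of the verification steps shared by both tests. The manifest name, the workload-cluster kubeconfig path, and the load balancer name are all assumptions, not values from this PR:

# Join the additional control plane machines (manifest name is illustrative).
kubectl apply -f machines.yaml

# The new machines should register with master roles.
kubectl --kubeconfig ./kubeconfig get nodes

# One replica of each control plane pod per control plane machine.
kubectl --kubeconfig ./kubeconfig -n kube-system get pods -l tier=control-plane

# Each instance should be in service behind the API server load balancer
# (classic ELB; the name <cluster-name>-apiserver is an assumption).
aws elb describe-instance-health --load-balancer-name <cluster-name>-apiserver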

@chuckha (Contributor) left a comment


This is definitely some good progress. Once the style errors are fixed I'll make another pass. Let me know when it's ready for testing and I'll run it through its paces.

Inline review comments (resolved) on: pkg/cloud/aws/actuators/machine/actuator.go, pkg/cloud/aws/services/ec2/instances.go
@@ -55,6 +55,49 @@ nodeRegistration:
EOF

kubeadm init --config /tmp/kubeadm.yaml

# Bundle the control plane PKI and publish it as a cluster secret so that
# joining control plane nodes can retrieve it.
tar -cvzf /etc/kubernetes/pki.tar.gz /etc/kubernetes/pki/*
kubectl -n kube-system --kubeconfig /etc/kubernetes/admin.conf create secret generic kubeadm-certs --from-file=/etc/kubernetes/pki.tar.gz
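For context, a hedged sketch of the join-side counterpart (not part of this diff): fetch the kubeadm-certs secret created above, restore the PKI directory, and then join as a control plane node. The kubeconfig, endpoint, token, and hash placeholders are illustrative, and the sketch assumes the joining host has credentials to read the secret:

# Restore the PKI bundle published by the init node.
kubectl -n kube-system --kubeconfig <admin-kubeconfig> \
  get secret kubeadm-certs -o jsonpath='{.data.pki\.tar\.gz}' \
  | base64 -d > /tmp/pki.tar.gz
tar -xzf /tmp/pki.tar.gz -C /

# Join as an additional control plane node (kubeadm v1.13 flag).
kubeadm join <lb-endpoint>:6443 --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --experimental-control-plane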
Member left a comment:

This is an interesting approach for managing the secrets, but it would probably be better if we took a different approach. I can see a couple of different ways of solving this without adding a secret post-init:

  • pre-generate the etcd CA keypair, front-proxy CA keypair, and SA (service account) keypair in the cluster controller and store them in the Cluster ProviderSpec
    • inject them into the init config like we are currently doing with the CA cert/key
    • pre-write the keypairs to the filesystem before calling kubeadm join
    • if other certificates are needed prior to calling kubeadm join, pre-generate them using kubeadm phases; kubeadm join should only really require the CAs, and it would be a bug otherwise (see the sketch after this list)
  • Instead of modifying the cluster actuator, it would be possible to fetch the required certificates using the Kubernetes client from within the machine actuator, then inject the certificates using the init script on the host prior to running kubeadm join.
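As a point of reference, a sketch of what pre-generating only the CAs and the SA key pair with kubeadm phases could look like (kubeadm v1.13 phase syntax; the certificates land in the default /etc/kubernetes/pki):

kubeadm init phase certs ca               # cluster CA cert/key
kubeadm init phase certs etcd-ca          # etcd CA cert/key
kubeadm init phase certs front-proxy-ca   # front-proxy CA cert/key
kubeadm init phase certs sa               # service account key pair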

Contributor left a comment:

I like the simplicity of this approach. Can you elaborate on why you think it would be better if we took a different approach?

I'm guessing security issues? Secrets aren't actually secret, and it would be far too easy to get at these secrets and do bad things. But it's not that different if we store the values in the cluster provider spec: that's still plaintext secret storage in etcd.

Member left a comment:

I would expect operators to use envelope encryption for secrets in their bootstrap/management clusters, so storing CA certificates as secrets is justifiable, less so as values in the provider spec.

We can't pre-generate the non-authoritative certs yet because we don't know the IP addresses of the nodes beforehand; this is particularly important for etcd. It is reasonable, though, to pre-generate the certificate authorities and the service account key pair.

For a solution that also supports autoscaling groups, the ideal way would be to use the instance identity document as a host attestation mechanism, where a verifiable CSR is made that the controller can sign (the attestation inputs are sketched below).
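For concreteness, the attestation inputs already exist on every EC2 instance via the instance metadata service; the controller-side verification and CSR-signing flow is left out of this sketch:

# Signed instance identity document describing this instance.
curl -s http://169.254.169.254/latest/dynamic/instance-identity/document
# PKCS7 signature over the document, verifiable against AWS public certificates.
curl -s http://169.254.169.254/latest/dynamic/instance-identity/pkcs7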

Member left a comment:

So, there are a couple of different reasons why I'm suggesting adding it to the provider spec:

  • It automatically provides an avenue for users to override the CA and SA keypairs with custom values
  • While we are embedding the certificates in the spec today, we can later update the spec to allow either embedding or referencing an external source

I want to make sure we are considering UX when we start to look at more secure private key storage. By allowing direct embedding of the cert material, we avoid having to pre-generate any additional secrets as part of the workflow, and once the spec is updated to support remote references for secure material (k8s secrets, or leveraging something AWS-native), we can have the actuator move the cert material to the proper place as part of the workflow, or even have an admission webhook do the job to avoid ever storing the cert material in etcd.

Member left a comment:

Happy with this approach. So we need to change the spec a bit now, at least, to hold all the key pairs we need for the HA control plane?

Member left a comment:

I agree with the UX and extensibility aspects. However, IMO, we have a dependency on kubeadm here for generating the certs when the first control plane node runs init. From reading the kubeadm docs, there seems to be a way to skip cert generation and run the other "phases". By doing this we would be deviating from the standard kubeadm init flow, which, IMO, is OK as long as we call that out for UX.

@ashish-amarnath If the keypair files already exist on disk in the default locations that kubeadm would use for creating them, and are not close to expiration, they will be picked up and used by kubeadm without having to run phases (see the sketch after this comment).

Also, kubeadm will start handling the cert distribution for control plane join in v1.14; more info in kubernetes/kubeadm#1200. With this in mind, and given we aren't GA yet, I think the cert distribution fix I have should be a stop-gap until we move to 1.14. WDYT?

Even if kubeadm handles the certificate copying for v1.14+, we would still benefit from letting users specify custom keypairs here.
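A minimal sketch of relying on that pick-up behavior: pre-write the CA and SA material to kubeadm's default locations before running kubeadm init or join. The /tmp/certs staging directory, and how the files get there, are assumptions:

# kubeadm reuses existing, valid keypairs found at its default paths.
install -d -m 0755 /etc/kubernetes/pki/etcd
cp /tmp/certs/ca.crt /tmp/certs/ca.key /etc/kubernetes/pki/
cp /tmp/certs/etcd-ca.crt /etc/kubernetes/pki/etcd/ca.crt
cp /tmp/certs/etcd-ca.key /etc/kubernetes/pki/etcd/ca.key
cp /tmp/certs/front-proxy-ca.crt /tmp/certs/front-proxy-ca.key /etc/kubernetes/pki/
cp /tmp/certs/sa.key /tmp/certs/sa.pub /etc/kubernetes/pki/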

Contributor (Author) left a comment:

I've decoupled the certs into their individual secrets so that each can be updated independently of the others (roughly as sketched below).
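A sketch of what the decoupled layout might look like; the secret names here are assumptions, not necessarily what this PR uses:

# One secret per piece of cert material, so each can rotate independently.
kubectl -n kube-system create secret tls cluster-ca \
  --cert=/etc/kubernetes/pki/ca.crt --key=/etc/kubernetes/pki/ca.key
kubectl -n kube-system create secret tls etcd-ca \
  --cert=/etc/kubernetes/pki/etcd/ca.crt --key=/etc/kubernetes/pki/etcd/ca.key
kubectl -n kube-system create secret tls front-proxy-ca \
  --cert=/etc/kubernetes/pki/front-proxy-ca.crt --key=/etc/kubernetes/pki/front-proxy-ca.key
kubectl -n kube-system create secret generic service-account-keypair \
  --from-file=/etc/kubernetes/pki/sa.key --from-file=/etc/kubernetes/pki/sa.pub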

Contributor left a comment:

+1, thanks for the reasoning @detiber. Totally agree that changing the spec is the right way to go.

Contributor (Author) left a comment:

So, I propose that we keep the spec change out of this PR and get the control plane machine join scenario worked out. I'll follow this PR with the spec change.

Contributor (Author) left a comment:

Created #492 to address this.

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 11, 2019
@ashish-amarnath ashish-amarnath changed the title [WIP] Control plane node join Control plane node join Jan 14, 2019
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 14, 2019
@ashish-amarnath (Contributor, Author) commented

/ok-to-test

@k8s-ci-robot (Contributor) commented

@ashish-amarnath: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/ok-to-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@justaugustus (Member) commented

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 14, 2019
@chuckha (Contributor) commented Jan 17, 2019

/ok-to-test

@chuckha (Contributor) commented Jan 17, 2019

/lgtm
/appove

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 17, 2019
@chuckha (Contributor) commented Jan 17, 2019

let's type better

/approve

@k8s-ci-robot (Contributor) commented

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ashish-amarnath, chuckha

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 17, 2019
@k8s-ci-robot k8s-ci-robot merged commit b8ef509 into kubernetes-sigs:master Jan 17, 2019
@ashish-amarnath ashish-amarnath deleted the control-plane-node-join branch January 17, 2019 20:04