Calico self hosted integration #124
Conversation
heschlie commented Dec 6, 2016
- Migrated Calico to self hosted install
- Updated Calico versions
Current coverage is 72.65% (diff: 100%)

@@             master    #124    diff @@
==========================================
  Files             4       4
  Lines          1126    1415    +289
  Methods           0       0
  Messages          0       0
  Branches          0       0
==========================================
+ Hits            781    1028    +247
- Misses          259     279     +20
- Partials         86     108     +22
@@ -298,7 +241,7 @@ coreos:
    http://localhost:8080/api/v1/nodes/$(hostname)"
    {{end}}

    {{if .Experimental.EphemeralImageStorage.Enabled}}
    {{if .Experimental.EphemeralImageStorage.Enabled}}mount
Unnecessary mount
added to the end of line?
Now how did that get there...
Hi @heschlie, thanks for the pull request 👍 I don't have a decent knowledge of calico self-hosting therefore would you mind:
Hi @mumoshu, you are most welcome! Self hosting simply means having Kubernetes manage Calico itself, instead of doing so with systemd units or by manually managing containers. This is achieved in a few ways. We use a DaemonSet to run Calico node and install the CNI binaries; since it runs on every node, it ensures Calico is deployed everywhere, and in the same manner. This also lets us manage the CNI config from a single place. There are no announcements, we have simply been migrating our Kubernetes installations to self hosted, as it is a bit easier to manage and I believe it is the preferred way for Kubernetes (maybe @caseydavenport could chime in with more info there). I have a similar PR open with coreos-kubernetes here: You can find more info on Calico self hosted installs here: http://docs.projectcalico.org/v1.6/getting-started/kubernetes/installation/hosted/
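For readers unfamiliar with the hosted approach, below is a rough sketch of what a self-hosted calico-node DaemonSet can look like. The image tags, etcd endpoint, and exact container layout are illustrative assumptions, not the exact manifest added in this PR:

# Illustrative sketch only, not the manifest from this PR.
# One calico-node pod runs per node; the install-cni container copies the CNI
# binaries and config onto the host so every node is configured identically.
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: calico-node
  namespace: kube-system
  labels:
    k8s-app: calico-node
spec:
  template:
    metadata:
      labels:
        k8s-app: calico-node
    spec:
      hostNetwork: true
      containers:
        - name: calico-node
          image: quay.io/calico/node:v1.0.0              # version is an example
          env:
            - name: ETCD_ENDPOINTS
              value: "https://etcd.example.internal:2379" # placeholder endpoint
          securityContext:
            privileged: true
        - name: install-cni
          image: quay.io/calico/cni:v1.5.6               # version is an example
          command: ["/install-cni.sh"]
          volumeMounts:
            - name: cni-bin-dir
              mountPath: /host/opt/cni/bin
            - name: cni-net-dir
              mountPath: /host/etc/cni/net.d
      volumes:
        - name: cni-bin-dir
          hostPath:
            path: /opt/cni/bin
        - name: cni-net-dir
          hostPath:
            path: /etc/cni/net.d

Because the DaemonSet is just another Kubernetes object, upgrades and config changes go through the API server instead of per-host systemd units.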
annotations:
  scheduler.alpha.kubernetes.io/critical-pod: ''
  scheduler.alpha.kubernetes.io/tolerations: |
    [{"key": "dedicated", "value": "master", "effect": "NoSchedule" },
Does this assume controller nodes are tainted like kubectl taint <node> dedicated=master:NoSchedule while they're Schedulable?
Currently, kube-aws created controller nodes are not tainted like that, and are Unschedulable.
Ah, if I recall correctly, daemonsets don't respect Unschedulable, so it doesn't matter.
I'm still wondering about the need to add a toleration like this, though.
I believe you are correct here with regards to daemonsets; this was mainly about keeping our manifest consistent across deployments. We should be able to remove this if it is an issue.
I like consistency 😄 Just curious but in which deployment do you taint master nodes like this and then make pods tolerate it?
I think kops taints the masters by default, emphasis on the think!
kubeadm also does this by default.
@heschlie @caseydavenport Thanks to your help, I've successfully spotted these 🙇
kops: https://github.com/kubernetes/kops/blob/6c66d18a9c360eac836fb1baf335c09c8597d8e4/protokube/pkg/protokube/tainter.go#L59
kubeadm: https://github.com/kubernetes/kubernetes.github.io/blob/master/docs/getting-started-guides/kubeadm.md#24-initializing-your-master
kubernetes: kubernetes/kubernetes#33530
Please leave the [{"key": "dedicated", "value": "master", "effect": "NoSchedule" }, part as is.
I'm going to make kube-aws adapt to it, i.e. master nodes will be schedulable and tainted, shortly 👍
Sorry for the back and forth, but would you please make the key node.alpha.kubernetes.io/role
rather than dedicated, according to kubernetes/kubernetes#36272?
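To make the requested change concrete, here is a sketch of the tolerations annotation with the renamed key; the surrounding fields are taken from the diff above, and the exact final value may differ:

annotations:
  scheduler.alpha.kubernetes.io/critical-pod: ''
  # "dedicated" swapped for the role-based key suggested in kubernetes/kubernetes#36272
  scheduler.alpha.kubernetes.io/tolerations: |
    [{"key": "node.alpha.kubernetes.io/role", "value": "master", "effect": "NoSchedule"}]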
@@ -233,9 +197,6 @@ coreos:
    [Service]
    Type=oneshot
    ExecStartPre=/usr/bin/bash -c "while sleep 1; do if /usr/bin/curl --insecure -s -m 20 -f https://127.0.0.1:10250/healthz > /dev/null ; then break ; fi; done"
    {{ if .UseCalico }}
    ExecStartPre=/usr/bin/systemctl is-active calico-node
    {{ end }}
Any idea how we could hold off/delay running cfn-signal until self-hosted calico becomes ready?
We can probably do something with calicoctl; let me dig into it a bit, and if so I'll push an update to do it.
@heschlie We might be able to get away with a calicoctl node status
here.
@mumoshu I added in the new tolerations and added a new
Note to self: Not "depends on" but "wants" #150
@heschlie Hi, thanks for your support here! Running our E2E test suite revealed that this seems to cause controller nodes to fail while launching. AFAICS, the docker image referenced by the manifest doesn't seem to exist. Could you point me to a valid image, or would you mind fixing it?
@mumoshu I'm not seeing this problem; when I set up the cluster, all of the Calico related pods come online without any issues. I've verified the image is up and tagged on dockerhub as well.
@mumoshu I've gone and updated the images anyway so they now match Calico v2.0 instead of v1.6 (we just released v2.0 this week). Could you give it another shot and let me know?
@heschlie Thanks for the quick follow up! I've investigated it a bit further. Now, it seems to be hanging up retrying
I guess you can reproduce this by enabling the experimental waitSignal feature:

experimental:
  waitSignal:
    enabled: true
Running
During the waitsignal we instead download the calicoctl binary and run it, as opposed to using a docker container
@mumoshu I've sorted the WaitSignal stuff; there is a process namespace issue with running in a docker container. We can work around it, but I found just downloading the binary to be more elegant, let me know if that is an issue. If it all checks out on your end I'll squash the commits!
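For illustration, here is a hedged sketch of how that gate could look in cloud-config form, assuming calicoctl has already been downloaded to /opt/bin; the unit name, binary paths, and retry interval are assumptions, not the exact change in this PR:

# Hypothetical cloud-config fragment: hold the CloudFormation signal until the
# local calico-node reports ready. Unit name and binary paths are illustrative.
units:
  - name: cfn-signal.service
    command: start
    content: |
      [Service]
      Type=oneshot
      # "calicoctl node status" exits non-zero while calico-node is not yet running
      ExecStartPre=/usr/bin/bash -c "until /opt/bin/calicoctl node status; do echo 'waiting for calico-node'; sleep 3; done"
      ExecStart=/opt/bin/cfn-signal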
@@ -723,7 +903,7 @@ write_files:
    }
    {{ end }}

    - path: /srv/kubernetes/manifests/kube-dns-autoscaler-de.yaml
    - path: /srv/kubernetes/manifests/kube-dns-rc.yaml
Thanks for rebasing, but this should be kube-dns-autoscaler-de.yaml, as kube-dns has been migrated from rc to de and is at https://github.com/coreos/kube-aws/pull/124/files#diff-a6019b6709ad6c3c74954ac000740bc5R942!
@mumoshu scratch the part about the
The "SecurityGroups": [
"",
"",
{
"Ref": "SecurityGroupWorker"
}
], Is there an environment variable I need to set to have this filled out? |
@heschlie Thanks for your continuous efforts on this 🙇 That's odd. AFAICS, security groups are populated here and/or here if and only if the
For me, they're at minimum:
And I tend to run the
Heya, here are my env vars:

export KUBE_AWS_KEY_NAME=<my-key>
export KUBE_AWS_KMS_KEY_ARN="arn:aws:kmsmy-arn-key"
export KUBE_AWS_DOMAIN=testing.heschlie.com
export KUBE_AWS_REGION=us-east-1
export KUBE_AWS_AVAILABILITY_ZONE="us-east-1b"
export KUBE_AWS_HOSTED_ZONE_ID=<zone-id>
export KUBE_AWS_AZ_1="us-east-1b"
export KUBE_AWS_USE_CALICO=true
export KUBE_AWS_CLUSTER_NAME="kubeawstest1"
export KUBE_AWS_S3_DIR_URI="s3://schlie-kube-aws"
export DOCKER_REPO=quay.io/mumoshu/
export SSH_PRIVATE_KEY=/home/heschlie/.ssh/id_dsa
KUBE_AWS_DEPLOY_TO_EXISTING_VPC=1 \
KUBE_AWS_CLUSTER_AUTOSCALER_ENABLED=1 \
KUBE_AWS_NODE_POOL_INDEX=1 \
KUBE_AWS_AWS_NODE_LABELS_ENABLED=1 \
KUBE_AWS_NODE_LABELS_ENABLED=1 \
KUBE_AWS_WAIT_SIGNAL_ENABLED=1 \
KUBE_AWS_AWS_ENV_ENABLED=1 \
KUBE_AWS_USE_CALICO=true \
KUBE_AWS_CLUSTER_NAME=kubeawstest1 sh -c './run all'

A couple of the env vars are duplicated there, but I don't see that causing any issues.
@heschlie I don't see any suspicious points for now, but could you try again without
@mumoshu Thanks, that seemed to do it. I've had one successful run; going to try again to see if I can get it into a failed state. As @caseydavenport mentioned, it does look like an issue with kube-proxy; hopefully I can get a bit more insight into what is wrong once I get it into a failed state.
@mumoshu I'm having trouble getting a cluster into a failed state; my E2E tests seem to be passing without issue on my branch. If you still have a cluster in a failed state, or can get one into a failed state, can you get some info for me?
@heschlie, just for my education, am I right that this calico-node integration affects network policies only, but networking itself is still done with flanneld?
@redbaron When using canal (Calico + Flannel), Calico will enforce network policies, and also configure the container<->host networking (i.e. the container's veth and corresponding route) through its CNI plugin, and flannel will perform host<->host networking.
Why is flanneld needed then? Hosts can speak to each other just fine.
@redbaron Sorry, I should have been clearer. Calico is used to get traffic from the container to the host it lives on and vice-versa. You may want to check out this repo: https://github.com/projectcalico/canal
Bumped the policy controller to 0.5.2 to get a NoneType bugfix in
@caseydavenport Thanks for your support here 🙇 @heschlie There's no way to know what had been happening on my cluster before, but anyway the conformance test is now passing! I'm merging this. It has already been a month since you submitted this PR!
@@ -367,6 +313,10 @@ write_files:
      owner: root:root
      content: |
        #!/bin/bash -e
        {{ if .UseCalico }}
        /bin/bash /opt/bin/populate-tls-calico-etcd
        /usr/bin/docker run --rm --net=host -v /srv/kubernetes/manifests:/host/manifests {{.HyperkubeImageRepo}}:{{.K8sVer}} /hyperkube kubectl apply -f /host/manifests/calico.yaml
Why not curl -XPOST like the rest of the script? Or maybe it's better to change the curl calls to kubectl?
/usr/bin/cp /srv/kubernetes/manifests/calico-policy-controller.yaml /etc/kubernetes/manifests |
The /srv/kubernetes/manifests/calico-policy-controller.yaml file seems to be left in the userdata but doesn't seem to be used.
content: |
  #!/bin/bash -e
  /usr/bin/curl -H "Content-Type: application/json" -XPOST --data-binary @"/srv/kubernetes/manifests/calico-system.json" "http://127.0.0.1:8080/api/v1/namespaces"
Looks like /srv/kubernetes/manifests/calico-system.json is left in the userdata and not used anywhere.
Just discovered that k8s doesn't support rolling updates of DaemonSets.
@redbaron yeah, currently what you can do is update the DaemonSet and then do a manual update of each pod. There is a proposal expected to land in v1.6 for server-side DaemonSet rolling updates: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/daemonset-update.md
@redbaron also, there is this daemonset upgrade controller if you are OK patching in the meantime.
* coreos/master: (132 commits)
  fix: Spot Fleet doesn't support the t2 instance family
  Fix node pools on master
  Allow option to disable certificates management (kubernetes-retired#243)
  Bump to k8s 1.5.2
  Update README.md
  Update ROADMAP.md
  Update ROADMAP.md
  Update ROADMAP.md
  Update the inline documentation in cluster.yaml
  typo
  Don't fail sed if some files are missing
  Workaround systemd issues with oneshot autorestarts
  etcd static IP addressing overhaul
  Calico self hosted integration (kubernetes-retired#124)
  Fix lint.
  bugfix for a typo in install-kube-system scripts
  Update README.md
  fix(e2e): Correctly wait for a node pool stack for deletion
  Don't require key-name param during cluster init
  Propagate SSHAuthorizedKeys to nodepools
  ...
feat: Calico self hosted integration

* Migrated Calico to self hosted install
* Updated Calico to v2.0 versions
* Bumped the policy controller to 0.5.2 to get a NoneType bugfix in