This repository has been archived by the owner on Sep 4, 2021. It is now read-only.

Updating Calico to hosted install #768

Merged
merged 1 commit into from Dec 13, 2016

Conversation

@heschlie (Contributor) commented Dec 1, 2016

In order to get the hosted install to work we had to make some changes to the deployment, the most disruptive being that we need to write to hyperkube's /opt/cni/bin directory, so when Calico is enabled we mount that directory from the host into the hyperkube pod. Our calico/cni:v1.5.2 image deploys the flannel, host-local, and loopback CNI binaries alongside the calico binaries.

Some other minor changes were:

  • Added the pod-cidr to the proxy pods to get around some issues with Vagrant networking
  • Use docker to run hyperkube with kubectl to deploy the calico.yaml manifest
  • Removed all systemd Calico configuration
  • Use one CNI conf for all nodes
  • Touched up the Vagrantfiles to give the VMs a bit more CPU/memory

We will want to update the remaining docs to use the same changes, but I wanted this reviewed before going through that process.
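For context, the key change boils down to the conditional shown in the diff further down; a minimal sketch of it follows (the empty `else` branch here is an assumption added for illustration):

```sh
# When Calico is enabled, mount the host's /opt/cni/bin into the hyperkube pod
# so the install-cni container can drop the Calico CNI binaries there.
if [ "${USE_CALICO}" = "true" ]; then
    export CALICO_OPTS="--volume cni-bin,kind=host,source=/opt/cni/bin \
    --mount volume=cni-bin,target=/opt/cni/bin"
else
    export CALICO_OPTS=""
fi
```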

@@ -0,0 +1,185 @@
# This ConfigMap is used to configure a self-hosted Calico installation.
Contributor

Can we combine this into a single manifest for both single / multi nodes? I think the only difference should be the etcd_endpoints, correct? Would really like to see one manifest.
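To illustrate the point: the node-specific part of the manifest is the etcd_endpoints value in the Calico ConfigMap, so a single manifest templated on that value could cover both installs. A minimal sketch, assuming the ConfigMap name and keys of the stock self-hosted Calico manifest and an illustrative file path:

```sh
# Hypothetical sketch: only etcd_endpoints differs between single- and multi-node,
# so one manifest parameterized on ETCD_ENDPOINTS could serve both.
cat << EOF > /srv/kubernetes/manifests/calico-config.yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: calico-config
  namespace: kube-system
data:
  etcd_endpoints: "${ETCD_ENDPOINTS}"
EOF
```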

Contributor Author

I was thinking the same, but my concern is does it matter that the file would be in the multi-node install? I'm thinking no?

Contributor Author

Pointed the single node to the same file and removed the extra, spun up the single node to make sure it went ok, looks good.

@@ -149,7 +159,7 @@ EOF
cat << EOF > $TEMPLATE
[Unit]
Requires=network-online.target
After=network-online.target
After=network-online.targetETCDENDPOINTS
Contributor

typo?

Contributor Author

Indeed...

# We need to overwrite this for a hosted Calico install
if [ "${USE_CALICO}" = "true" ]; then
export CALICO_OPTS="--volume cni-bin,kind=host,source=/opt/cni/bin \
--mount volume=cni-bin,target=/opt/cni/bin"
Contributor

kube-flannel uses /etc/cni, I think we should do the same thing here: https://github.com/coreos/flannel/blob/master/Documentation/kube-flannel.yml#L67

Contributor Author

That looks to be for the CNI conf files, not the CNI binaries, correct? This is needed so our install-cni container can drop the calico binaries into hyperkube.

Contributor

ack, right.

v.gui = false
end
end

config.vm.provider :virtualbox do |vb|
vb.cpus = 1
vb.cpus = 2
Contributor

Is this necessary / a safe assumption?

Contributor Author

I had run into some stability issues with one CPU, but that may have been while I was trying to run the policy controller on the master. I'll retest and revert back if possible.

@@ -161,6 +162,9 @@ Vagrant.configure("2") do |config|

controller.vm.provision :file, :source => CONTROLLER_CLOUD_CONFIG_PATH, :destination => "/tmp/vagrantfile-user-data"
controller.vm.provision :shell, :inline => "mv /tmp/vagrantfile-user-data /var/lib/coreos-vagrant/", :privileged => true

controller.vm.provision :file, :source => CALICO_MANIFEST, :destination => "/tmp/calico.yaml"
controller.vm.provision :shell, :inline => "mkdir -p /srv/kubernetes/manifests && mv /tmp/calico.yaml /srv/kubernetes/manifests/", :privileged => true
Contributor

Ideally we're not relying on vagrant for anything beyond machine provisioning (this isn't completely true even now, but hopefully not adding more external configuration). What do you think about in-lining the calico manifest in the script (gated on `USE_CALICO`)?

Contributor Author

I was initially trying to avoid putting it into the script, but if that would be preferred I don't see a problem with that.

curl --silent -H "Content-Type: application/json" -XPOST -d"$(cat /srv/kubernetes/manifests/calico-system.json)" "http://127.0.0.1:8080/api/v1/namespaces/" > /dev/null

# Deploy Calico
docker run --rm --net=host -v /srv/kubernetes/manifests:/host/manifests $HYPERKUBE_IMAGE_REPO:$K8S_VER /hyperkube kubectl apply -f /host/manifests/calico.yaml
Contributor

This looks like it will get run on every boot - the kubectl create should fail "already exists", which is fine - but if the exit code propagates to the docker run, then this could cause the script to exit (I'm just not sure about the actual behavior -- ignore if this has been tested / is safe).
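For what it's worth, docker run does propagate the container's exit code, so if the apply step ever returned non-zero on a re-run, one possible guard (not part of the PR, just a hedged sketch) would be:

```sh
# Hypothetical guard: tolerate a non-zero exit from the apply step so a re-run
# on boot does not abort the rest of the script.
docker run --rm --net=host -v /srv/kubernetes/manifests:/host/manifests \
    $HYPERKUBE_IMAGE_REPO:$K8S_VER /hyperkube kubectl apply -f /host/manifests/calico.yaml \
    || echo "calico.yaml apply returned non-zero (possibly already applied); continuing"
```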

@heschlie (Contributor Author), Dec 7, 2016

This appears to fail on reboot long before it gets to this point. Specifically it fails here:

local REQUIRED=('ADVERTISE_IP' 'POD_NETWORK' 'ETCD_ENDPOINTS' 'SERVICE_IP_RANGE' 'K8S_SERVICE_IP' 'DNS_SERVICE_IP' 'K8S_VER' 'HYPERKUBE_IMAGE_REPO' 'USE_CALICO')

As ETCD_ENDPOINTS seems to be unset on reboot.

Contributor Author

But on further review, it should not report a failure if it did get to this point.

Contributor

Opened #775 to track this

@aaronlevy (Contributor)

Couple minor comments/questions

@heschlie changed the title from "WIP: Updating Calico to hosted install" to "Updating Calico to hosted install" on Dec 9, 2016
@heschlie (Contributor Author) commented Dec 9, 2016

I believe I hit all the necessary places. The kube-aws docs aren't pulled from here anymore, correct?

@@ -12,6 +12,7 @@ $update_channel = "alpha"

CLUSTER_IP="10.3.0.1"
NODE_IP = "172.17.4.99"
NODE_VCPUS = 2
Contributor

Did this turn out to be a requirement? It's probably generally fine to do (it wouldn't work on a single-core machine -- not sure how common that would be). If we are keeping this, can we make it configurable in the multi-node installation too?

echo "Waiting for Kubernetes API..."
until curl --silent "http://127.0.0.1:8080/version"
do
sleep 5
done
echo
echo "K8S: Calico Policy"
curl --silent -H "Content-Type: application/json" -XPOST -d"$(cat /srv/kubernetes/manifests/calico-system.json)" "http://127.0.0.1:8080/api/v1/namespaces/" > /dev/null
docker run --rm --net=host -v /srv/kubernetes/manifests:/host/manifests $HYPERKUBE_IMAGE_REPO:$K8S_VER /hyperkube kubectl apply -f /host/manifests/calico.yaml
Contributor

If possible, can we switch this to rkt? We require rkt already due to running the kubelet via kubelet-wrapper, but if you want to use rkt as your runtime, we do not require the use of docker.

It should be mostly equivalent, but to implement --rm takes a little more coordination.

When executing rkt run, add:
--uuid-file-save=/var/run/calico-install.uuid

Then after execution
/usr/bin/rkt rm --uuid-file=/var/run/calico-install.uuid

Contributor Author

I'm getting `device or resource busy` when running rm against that pod; the full set of commands:

/usr/bin/rkt run --uuid-file-save=/var/run/calico-install.uuid \
                        --net=host \
                        --volume manifests,kind=host,source=/etc/kubernetes/manifests \
                        --mount volume=manifests,target=/host/manifests \
                        $HYPERKUBE_IMAGE_REPO:$K8S_VER --exec=/hyperkube -- kubectl apply -f /host/manifests/calico.yaml
/usr/bin/rkt rm --uuid-file=/var/run/calico-install.uuid

Any suggestions? Should I just leave the pod in an exited state?

Contributor

The one other trick that might be worth trying is rkt status --wait $uuid between the run and rm

If a wait doesn't help, I'd be happy to dig a bit deeper
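Put together, the suggested sequence would look roughly like this (a sketch only, mirroring the commands above; as the next comment notes, the approach was set aside because of the bind-mount issue):

```sh
# Suggested rkt sequence: run the one-shot apply pod, wait for it to exit, then remove it.
/usr/bin/rkt run --uuid-file-save=/var/run/calico-install.uuid \
    --net=host \
    --volume manifests,kind=host,source=/etc/kubernetes/manifests \
    --mount volume=manifests,target=/host/manifests \
    $HYPERKUBE_IMAGE_REPO:$K8S_VER --exec=/hyperkube -- kubectl apply -f /host/manifests/calico.yaml
/usr/bin/rkt status --wait "$(cat /var/run/calico-install.uuid)"
/usr/bin/rkt rm --uuid-file=/var/run/calico-install.uuid
```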

Contributor

Disregard my last comment, I remembered the actual issue here. This is caused by a shared bind mount of /var/lib/rkt (created by the rkt fly kubelet) interacting poorly with rkt rm: rkt/rkt#3181

There's unfortunately not a good fix yet 😦

@aaronlevy (Contributor)

Couple more minor comments.

/cc @robszumski and @joshix for docs changes

@aaronlevy (Contributor)

Based on @euank's feedback (#768 (comment)) -- sorry to ask this, but you can probably just switch back to using docker for now and add a TODO for switching to rkt once the issue is resolved.

@heschlie (Contributor Author)

No problem, thanks for the fast feedback :)

@aaronlevy (Contributor)

@heschlie can you squash the cleanup commits where appropriate and we can get this merged?

--volume cni-bin,kind=host,source=/opt/cni/bin \
--mount volume=cni-bin,target=/opt/cni/bin
```
- Add `ExecStartPre=/usr/bin/mkdir -p /opt/cni/bin`
Member

I wish there was a cleaner way to describe all of this, but I'm at a loss
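For readers following along, here is one way the pieces in the excerpt above could be wired together. This is a hedged sketch only: the drop-in path, file name, and the RKT_OPTS environment variable are assumptions based on kubelet-wrapper conventions, not text from the docs being reviewed.

```sh
# Hypothetical kubelet drop-in combining the extra rkt volume/mount with the
# ExecStartPre that creates /opt/cni/bin before the kubelet starts.
mkdir -p /etc/systemd/system/kubelet.service.d
cat << EOF > /etc/systemd/system/kubelet.service.d/10-calico-cni.conf
[Service]
Environment="RKT_OPTS=--volume cni-bin,kind=host,source=/opt/cni/bin --mount volume=cni-bin,target=/opt/cni/bin"
ExecStartPre=/usr/bin/mkdir -p /opt/cni/bin
EOF
systemctl daemon-reload
```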

* Connects containers to the flannel overlay network, which enables the "one IP per pod" concept.
* Enforces network policy created through the Kubernetes policy API, ensuring pods talk to authorized resources only.

When creating `/etc/systemd/system/calico-node.service`:
Finally policy agent is the last major piece of the calico.yaml. It monitors the API for changes related to network policy and configures Calico to implement that policy.
Member

I think you can drop "finally" here.

The policy agent is the last major piece of the Calico system.

Contributor

It should also read "The policy controller"

@heschlie (Contributor Author)

commits squashed!

@aaronlevy (Contributor) left a comment

lgtm

@aaronlevy (Contributor)

@heschlie looks like there are some conflicts now. Can you rebase your changes?

@aaronlevy (Contributor) left a comment

removing approval until conflicts resolved

@heschlie (Contributor Author)

@aaronlevy and done!

@@ -31,11 +31,19 @@ export K8S_SERVICE_IP=10.3.0.1
export DNS_SERVICE_IP=10.3.0.10

# Whether to use Calico for Kubernetes network policy.
export USE_CALICO=false
export USE_CALICO=true
Contributor

Can we leave this as default off for now? It also defaults to off in multi-node installations, and I'd like to not also change default behavior in this large of a PR.

Contributor Author

Yep, sorry about that -- it should have been set to false; I must have missed it. Amended the last commit.

@aaronlevy (Contributor)

Thanks. One last minor comment to address - then squash and we should be good.

@aaronlevy (Contributor)

Perfect. Can you squash that last commit?


```sh
$ curl -H "Content-Type: application/json" -XPOST -d'{"apiVersion":"v1","kind":"Namespace","metadata":{"name":"kube-system"}}' "http://127.0.0.1:8080/api/v1/namespaces"
>>>>>>> 625c7d7... Updating Calico to hosted install
Contributor

Some rebase cruft left here.

In order to get the hosted install to work we had to make some changes to
the deployment, the most disruptive being that we need to write to
hyperkube's `/opt/cni/bin` directory, so when Calico is enabled we mount
that directory to the host.

Some other minor changes were: adding the pod-cidr to the proxy pods to
get around some issues with Vagrant networking.
We use docker to run hyperkube with kubectl to deploy the calico.yaml
manifest; this should be changed to rkt once the noted bug is fixed.
Removed all systemd Calico configuration.
Using one CNI conf for all nodes.

Updated bare metal docs to self-hosted Calico install

Updated upgrade doc; noted this new install is not compatible with the old
systemd install method.
@aaronlevy (Contributor) left a comment

lgtm
