
Update calico to use the correct CIDR for pods #2768

Merged: @chrislovecnm merged 1 commit into kubernetes:master on Jul 6, 2017

Conversation

@ottoyiu (Contributor) commented on Jun 16, 2017:

This PR relates to the conversation in #1171.

For reference, here are the default values for the individual components:

  kubeAPIServer:
    serviceClusterIPRange: 100.64.0.0/13
  kubeControllerManager:
    # I do not think we need this with CNI
    clusterCIDR: 100.96.0.0/11
  kubeDNS:
    serverIP: 100.64.0.10
  kubeProxy:
    clusterCIDR: 100.96.0.0/11
  kubelet:
    clusterDNS: 100.64.0.10
    networkPluginName: cni
    nonMasqueradeCIDR: 100.64.0.0/10
  # these values are the base subnets that everything is setup from
  # I do not believe that these should overlap
  nonMasqueradeCIDR: 100.64.0.0/10
  serviceClusterIPRange: 100.64.0.0/13

Currently, we use .NonMasqueradeCIDR in the wrong fashion. We should use .KubeControllerManager.ClusterCIDR instead to prevent IP collisions with Service IPs.

Note: .NonMasqueradeCIDR is the cluster's base subnet and should not be used directly by the components, since the component ranges carved out of it would overlap.
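
In practice this means the calico addon manifest should source CALICO_IPV4POOL_CIDR from .KubeControllerManager.ClusterCIDR. A minimal sketch of the relevant env entry follows; the surrounding manifest and exact template file are assumptions, not a copy of the kops template:

# Sketch only: env entry on the calico-node container in the addon template.
env:
  - name: CALICO_IPV4POOL_CIDR
    value: "{{ .KubeControllerManager.ClusterCIDR }}"  # previously templated from {{ .NonMasqueradeCIDR }}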

Scenarios Tested

New Cluster Creation for k8s 1.6 - PASS

Deployed a new Kubernetes 1.6.4 cluster using this branch with the changes. Pods are now allocated IPs within the 100.96.0.0/11 range (as defined by .KubeControllerManager.ClusterCIDR rather than .NonMasqueradeCIDR).

Tested:

  • Pod to Service IP (100.64.0.10/kube-dns and example nginx service)
  • Pod to Pod
  • Pod to External (google.com)
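
Roughly, these checks amount to commands like the following; the pod and service names are placeholders, not the exact ones used during testing:

# Pod to Service IP (kube-dns at 100.64.0.10 and an example nginx service)
kubectl exec -it <test-pod> -- nslookup kubernetes.default 100.64.0.10
kubectl exec -it <test-pod> -- wget -qO- http://<nginx-service-cluster-ip>
# Pod to Pod
kubectl exec -it <test-pod> -- ping -c 3 <other-pod-ip>
# Pod to External
kubectl exec -it <test-pod> -- wget -qO- http://google.com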

Observation:

  • Nodes are assigned a Pod CIDR within the .KubeControllerManager.ClusterCIDR range as expected
  • Pods are being assigned IPs in the proper range
  • Service IP range remains unchanged

Cluster Upgrade for k8s 1.6 (cluster first created using kops-1.6.2)

Deployed a new Kubernetes 1.6.4 cluster using kops-1.6.2, then ran kops built from this branch:

  • kops update cluster --yes to change the CIDR.
    Observation: All pods are still functional and running with the existing Pod IPs. Pod to Service IP, Pod to Pod, Pod to External (google.com) tested.

Upgrade strategy attempts

  1. kops rolling-update --force --yes to do a rolling restart - FAIL
    Observation: Existing Pods continue to function with their old Pod IPs (which may or may not lie within the newly defined range) while nodes are being cycled. There is a period of several minutes where DNS is not responsive in Pods before it recovers; it is unclear whether this is related to the CIDR change, and it seems more related to the "rolling" update itself. New Nodes are still assigned Pod CIDRs from the old range. The Service IP range remains unchanged.

  2. Deleting running calico/node pods from the calico DaemonSet - FAIL
    kubectl get pods --namespace kube-system | grep calico-node | awk '{print $1}' | xargs kubectl delete pod --namespace kube-system

Observation: Existing Pods continue to function with their old Pod IPs. New Pods rescheduled onto existing Nodes still get Pod IPs in the old range, and new Nodes are still assigned Pod CIDRs from the old range.

Failure Investigation

Calico only uses CALICO_IPV4POOL_CIDR to create a default IPv4 pool if no pool exists already:
https://github.com/projectcalico/calicoctl/blob/v1.3.0/calico_node/startup/startup.go#L463

Because of this, we need to run two Jobs that execute calicoctl manually to migrate to the new CIDR: one to create the new IPv4 pool that we want, and one to delete the existing pool that we no longer want. These are to be applied after running one of the upgrade strategies listed above:

# This ConfigMap is used to configure a self-hosted Calico installation.
kind: ConfigMap
apiVersion: v1
metadata:
  name: calico-config-ippool
  namespace: kube-system
data:
  # The default IP Pool to be created for the cluster.
  # Pod IP addresses will be assigned from this pool.
  ippool.yaml: |
      apiVersion: v1
      kind: ipPool
      metadata:
        cidr: 100.96.0.0/11
      spec:
        ipip:
          enabled: true
          mode: cross-subnet
        nat-outgoing: true
---
## This manifest deploys a Job which adds a new ippool to calico
apiVersion: batch/v1
kind: Job
metadata:
  name: configure-calico-ippool
  namespace: kube-system
  labels:
    k8s-app: calico
    role.kubernetes.io/networking: "1"
spec:
  template:
    metadata:
      name: configure-calico-ippool
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      hostNetwork: true
      serviceAccountName: calico
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      - key: CriticalAddonsOnly
        operator: Exists
      restartPolicy: OnFailure
      containers:
        # Writes basic configuration to datastore.
        - name: configure-calico
          image: calico/ctl:v1.2.1
          args:
          - apply
          - -f
          - /etc/config/calico/ippool.yaml
          volumeMounts:
            - name: config-volume
              mountPath: /etc/config
          env:
            # The location of the etcd cluster.
            - name: ETCD_ENDPOINTS
              valueFrom:
                configMapKeyRef:
                  name: calico-config
                  key: etcd_endpoints
      volumes:
        - name: config-volume
          configMap:
            name: calico-config-ippool
            items:
              - key: ippool.yaml
                # path makes the file land at /etc/config/calico/ippool.yaml, matching the calicoctl args above
                path: calico/ippool.yaml
---
## This manifest deploys a Job which deletes the old ippool from calico
apiVersion: batch/v1
kind: Job
metadata:
  # Distinct name so this Job does not collide with the pool-creation Job above.
  name: delete-calico-ippool
  namespace: kube-system
  labels:
    k8s-app: calico
    role.kubernetes.io/networking: "1"
spec:
  template:
    metadata:
      name: delete-calico-ippool
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      hostNetwork: true
      serviceAccountName: calico
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      - key: CriticalAddonsOnly
        operator: Exists
      restartPolicy: OnFailure
      containers:
        # Deletes the old IP pool from the datastore.
        - name: configure-calico
          image: calico/ctl:v1.2.1
          args:
          - delete
          - ipPool
          - 100.64.0.0/10
          env:
            # The location of the etcd cluster.
            - name: ETCD_ENDPOINTS
              valueFrom:
                configMapKeyRef:
                  name: calico-config
                  key: etcd_endpoints

By doing that, new Pods get IPs in the correct range, while existing Pods keep their current IPs and continue to function.
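
To confirm the migration, checks along these lines can be used (calicoctl v1.x syntax, run wherever ETCD_ENDPOINTS points at Calico's etcd; the exact invocation may differ by environment):

# Only the new 100.96.0.0/11 pool should remain after both Jobs complete.
calicoctl get ipPool
# Newly scheduled Pods should receive addresses from the new range.
kubectl get pods --all-namespaces -o wide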

Operations on k8s 1.5 using the same tests as above

  • New Cluster Creation for k8s 1.5 - PASS
  • Cluster Upgrade for k8s 1.5 - SAME ISSUE AS DESCRIBED ABOVE WITH K8S 1.6


@k8s-ci-robot added the needs-ok-to-test label (indicates a PR that requires an org member to verify it is safe to test) on Jun 16, 2017
@k8s-ci-robot (Contributor) commented:

Hi @ottoyiu. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with @k8s-bot ok to test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot added the cncf-cla: yes label (indicates the PR's author has signed the CNCF CLA) on Jun 16, 2017
@ottoyiu force-pushed the calico_cidr branch 2 times, most recently from e145d8c to 37e3cb5, on June 16, 2017 at 21:06
@chrislovecnm (Contributor) left a review comment:

Comment for you

@@ -353,7 +353,8 @@ func (b *BootstrapChannelBuilder) buildManifest() (*channelsapi.Addons, map[stri
 	if b.cluster.Spec.Networking.Calico != nil {
 		key := "networking.projectcalico.org"
-		version := "2.1.1"
+		// 2.1.1-kops.1 = 2.1.1 with CIDR change
+		version := "2.1.1-kops.1"
Inline review comment (Contributor) on the version change:

I think you will need 2.1.2-kops.1 not 2.1.1-kops.1

@chrislovecnm (Contributor) commented:

@k8s-bot ok to test

@k8s-ci-robot removed the needs-ok-to-test label (indicates a PR that requires an org member to verify it is safe to test) on Jun 16, 2017
@justinsb (Member) commented:

Do we know what happens when we change this value?

@ottoyiu (Contributor, Author) commented on Jun 22, 2017:

@justinsb: do you mean what happens when a user modifies the .KubeControllerManager.ClusterCIDR value, or what happens when we switch from .NonMasqueradeCIDR to .KubeControllerManager.ClusterCIDR?

@ottoyiu changed the title from "WIP: Update calico to use the correct CIDR for pods" to "Update calico to use the correct CIDR for pods" on Jun 27, 2017
@ottoyiu (Contributor, Author) commented on Jun 27, 2017:

@justinsb @chrislovecnm I have tested these changes. The upgrade/migration path for an existing cluster to the new non-overlapping CIDR requires a manual step (as shown in the main PR notes), which some may see as sub-optimal. However, it could also be seen as ideal, since this change has zero effect on existing running clusters unless the manual step is run. I would like to hear what you think is the proper approach here. I'm leaning towards simply documenting the migration step for existing clusters, for operators who deem the migration necessary.

@chrislovecnm (Contributor) commented:

@ottoyiu I think documenting and printing a warning on screen may be helpful for the user. Any other ideas?

@chrislovecnm (Contributor) commented:

@ottoyiu can we get release notes like #2875?

@chrislovecnm (Contributor) left a review:

👍

@chrislovecnm merged commit 24c0bc1 into kubernetes:master on Jul 6, 2017
@ottoyiu (Contributor, Author) commented on Jul 10, 2017:

@chrislovecnm just got back from vacation. What's the proper way to contribute a release note? Should I submit a PR on justinsb's release-notes branch?

ottoyiu added commits to ottoyiu/kops referencing this pull request (Jul 11-13, 2017): Also document the migration procedure necessary for existing calico clusters
chrislovecnm added a commit that referenced this pull request on Jul 17, 2017: Release notes for Calico Pod CIDR changes made in #2768
@blakebarnett commented:

@ottoyiu See #3018 and #3019; we probably need to turn off cross-subnet mode as the default, or any users doing this migration will break their clusters.

@ottoyiu (Contributor, Author) commented on Jul 24, 2017:

@blakebarnett will fix #3018 ASAP; an oversight on my part, my apologies. #3019 seems to be a much trickier problem to solve, since it's tied to the calico addon version and kops already thinks calico has been upgraded... will document it in the ticket instead.

aknuds1 pushed a commit to aknuds1/kops that referenced this pull request on Aug 25, 2017: Also document the migration procedure necessary for existing calico clusters