This repository has been archived by the owner on Sep 30, 2020. It is now read-only.

Bump Kubernetes version to v1.11.3 #1459

Merged
merged 9 commits into from
Sep 30, 2018

Conversation

mumoshu
Contributor

@mumoshu mumoshu commented Sep 28, 2018

No description provided.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Sep 28, 2018
@mumoshu mumoshu added this to the v0.12.0 milestone Sep 28, 2018
@codecov-io

codecov-io commented Sep 28, 2018

Codecov Report

Merging #1459 into master will not change coverage.
The diff coverage is n/a.


@@          Coverage Diff           @@
##           master   #1459   +/-   ##
======================================
  Coverage    38.1%   38.1%           
======================================
  Files          74      74           
  Lines        4559    4559           
======================================
  Hits         1737    1737           
  Misses       2580    2580           
  Partials      242     242
Impacted Files Coverage Δ
core/controlplane/config/config.go 63.34% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e8d6bea...53c9a27. Read the comment docs.

@mumoshu
Contributor Author

mumoshu commented Sep 28, 2018

Since v1.11.0, the kubelet flags related to the rkt container runtime have been removed, so kube-aws controller nodes fail to run the kubelet due to unknown-flag errors for flags like --rkt-path.

It seems no one had noticed this until recently.

I'm removing rkt container runtime support as part of this PR, so that kube-aws also works with k8s 1.11.

For more info see kubernetes/website#9538.
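
To illustrate, the breakage and the fix look roughly like the following kubelet unit fragment. This is a simplified sketch, not the actual kube-aws cloud-config, and the exact flag set is an assumption; the point is that the rkt-specific flags no longer exist in kubelet v1.11:

# before: kubelet v1.11 rejects the removed rkt flags (e.g. --rkt-path) at startup
ExecStart=/usr/lib/coreos/kubelet-wrapper \
  --container-runtime=rkt \
  --rkt-path=/usr/bin/rkt \
  --rkt-stage1-image=coreos.com/rkt/stage1-coreos \
  ...

# after: drop the rkt-specific flags and keep the docker runtime
ExecStart=/usr/lib/coreos/kubelet-wrapper \
  --container-runtime=docker \
  ...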

kubelet v1.11 no longer supports the flags required by the rkt runtime. Without removing them, I was unable to start kubelets on kube-aws controller nodes.
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Sep 28, 2018
@mumoshu
Contributor Author

mumoshu commented Sep 28, 2018

Hmm, so my cluster is now failing while setting up controller nodes, because flanneld is trying to access the apiserver via the service IP. flanneld in a pod shouldn't try to reach the apiserver via the k8s API service IP, because that IP isn't available without flanneld (chicken-and-egg problem!)

master|kubeaws1 core@ip-10-0-1-147 ~ $ docker logs 04b989f2548f
I0928 13:40:06.514045       1 main.go:474] Determining IP address of default interface
I0928 13:40:06.514268       1 main.go:487] Using interface with name eth0 and address 10.0.1.147
I0928 13:40:06.514283       1 main.go:504] Defaulting external address to interface address (10.0.1.147)
E0928 13:40:36.515469       1 main.go:231] Failed to create SubnetManager: error retrieving pod spec for 'kube-system/canal-master-lf8lp': Get https://10.3.0.1:443/api/v1/namespaces/kube-system/pods/canal-master-lf8lp: dial tcp 10.3.0.1:443: i/o timeout
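
The change I'm trying is to stop the controller-node flannel container from going through the cluster IP at all. A minimal sketch of the idea, with a placeholder endpoint, assuming the canal manifest's flannel container and flanneld's --kube-subnet-mgr/--kube-api-url flags (as noted further down, this turned out to be unnecessary and was reverted):

containers:
- name: kube-flannel
  command:
  - /opt/bin/flanneld
  args:
  - --ip-masq
  - --kube-subnet-mgr
  # point the subnet manager straight at the local apiserver instead of the
  # 10.3.0.1 cluster IP, which is unreachable before pod networking is up
  - --kube-api-url=https://localhost:443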

@mumoshu
Contributor Author

mumoshu commented Sep 28, 2018

This is how flanneld on controller nodes runs on my k8s 1.10 cluster:

I0707 11:13:02.224207       1 main.go:474] Determining IP address of default interface
I0707 11:13:02.224577       1 main.go:487] Using interface with name eth0 and address 10.0.1.197
I0707 11:13:02.224594       1 main.go:504] Defaulting external address to interface address (10.0.1.197)
I0707 11:13:02.233789       1 kube.go:130] Waiting 10m0s for node controller to sync
I0707 11:13:02.233819       1 kube.go:283] Starting kube subnet manager
I0707 11:13:03.233959       1 kube.go:137] Node controller sync successful

It was somehow trying to contact the k8s apiserver via the service IP. That's a chicken-and-egg problem!
@mumoshu
Contributor Author

mumoshu commented Sep 28, 2018

Then, calico on the controller is trying to access the k8s API via the service IP. I believe calico-node on controller nodes, as they have hostNetwork: true, shouldn't rely on the service IP.

$ docker logs 5ed9a9d5bcbf
ls: /calico-secrets: No such file or directory
Wrote Calico CNI binaries to /host/opt/cni/bin
CNI plugin version: v3.1.3
/host/secondary-bin-dir is non-writeable, skipping
                "k8s_api_root": "https://__KUBERNETES_SERVICE_HOST__:__KUBERNETES_SERVICE_PORT__",
                "k8s_api_root": "https://10.3.0.1:__KUBERNETES_SERVICE_PORT__",
CNI config: {
    "name": "k8s-pod-network",
    "cniVersion": "0.3.1",
    "plugins": [
        {
            "type": "calico",
            "log_level": "info",
            "mtu": 8951,
            "datastore_type": "kubernetes",
            "nodename": "ip-10-0-1-252.ap-northeast-1.compute.internal",
            "ipam": {
                "type": "host-local",
                "subnet": "usePodCidr"
            },
            "policy": {
                "type": "k8s",
                "k8s_auth_token": "__SERVICEACCOUNT_TOKEN__"
            },
            "kubernetes": {
                "k8s_api_root": "https://10.3.0.1:443",
                "kubeconfig": "/etc/kubernetes/cni/net.d/calico-kubeconfig"
            }
        },
        {
            "type": "portmap",
            "capabilities": {"portMappings": true},
            "snat": true,
            "externalSetMarkChain": "KUBE-MARK-MASQ"
        }
    ]
}
Created CNI config 10-calico.conflist
Done configuring CNI.  Sleep=true

canal on controller nodes was somehow trying to contact the k8s apiserver via the service IP. They run on the host network, so I think they shouldn't rely on the service IP.
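
The calico side of the same idea is to stop the generated CNI config from pointing at the cluster IP. A hypothetical sketch of the kubernetes section of the CNI config template in the canal-config ConfigMap, with a placeholder direct endpoint instead of the substituted __KUBERNETES_SERVICE_HOST__ value (like the flannel change, this was ultimately reverted once kube-proxy was confirmed to handle the routing):

"kubernetes": {
    "k8s_api_root": "https://localhost:443",
    "kubeconfig": "/etc/kubernetes/cni/net.d/calico-kubeconfig"
}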
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 28, 2018
@mumoshu
Contributor Author

mumoshu commented Sep 28, 2018

@davidmccormick Hey! Would you mind reviewing my changes regarding self-hosted canal (calico/flannel)?

@mumoshu
Contributor Author

mumoshu commented Sep 28, 2018

Hoping this is the last error I need to fix 😃

Sep 28 14:48:07 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: Attempt 9 failed! Trying again in 3 seconds...
Sep 28 14:48:10 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: NAME          STATUS    AGE
Sep 28 14:48:10 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: kube-system   Active    10m
Sep 28 14:48:11 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: secret/kiam-server-tls configured
Sep 28 14:48:12 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: secret/kiam-agent-tls configured
Sep 28 14:48:13 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: daemonset.extensions/kiam-server unchanged
Sep 28 14:48:13 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: service/kiam-server unchanged
Sep 28 14:48:13 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: serviceaccount/kiam-server unchanged
Sep 28 14:48:13 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: clusterrole.rbac.authorization.k8s.io/kiam-read configured
Sep 28 14:48:13 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: clusterrolebinding.rbac.authorization.k8s.io/kiam-server configured
Sep 28 14:48:13 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: daemonset.extensions/kiam-agent unchanged
Sep 28 14:48:14 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: namespace/kube-system configured
Sep 28 14:48:14 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: daemonset.extensions/kube-proxy unchanged
Sep 28 14:48:14 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: daemonset.extensions/kube-node-drainer-ds unchanged
Sep 28 14:48:15 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: No resources found.
Sep 28 14:48:15 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: Error from server (NotFound): apiservices.apiregistration.k8s.io "v1beta1.metrics.k8s.io" not found
Sep 28 14:48:16 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: serviceaccount/canal unchanged
Sep 28 14:48:16 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: serviceaccount/flannel unchanged
Sep 28 14:48:16 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: clusterrole.rbac.authorization.k8s.io/calico configured
Sep 28 14:48:16 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: clusterrole.rbac.authorization.k8s.io/flannel configured
Sep 28 14:48:16 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: clusterrolebinding.rbac.authorization.k8s.io/flannel configured
Sep 28 14:48:16 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: clusterrolebinding.rbac.authorization.k8s.io/canal-flannel configured
Sep 28 14:48:16 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: clusterrolebinding.rbac.authorization.k8s.io/canal-calico configured
Sep 28 14:48:17 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: configmap/canal-config unchanged
Sep 28 14:48:17 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: daemonset.extensions/canal-master configured
Sep 28 14:48:17 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: daemonset.extensions/canal-node configured
Sep 28 14:48:17 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: error: error validating
Sep 28 14:51:01 ip-10-0-1-95.ap-northeast-1.compute.internal retry[20924]: error: error validating "/srv/kubernetes/manifests/canal.yaml": error validating data: ValidationError(CustomResourceDefinition): unknown field "description" in io.k8s.apiextensions-apiserver.pkg.apis.apiextensions.v1beta1.CustomResourceDefinition; if you choose to ignore these errors, turn validation off with --validate=false
Sep 28 14:48:17 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: Attempt 10 failed and there are no more attempts left!
Sep 28 14:48:17 ip-10-0-1-95.ap-northeast-1.compute.internal systemd[1]: install-kube-system.service: Main process exited, code=exited, status=1/FAILURE
Sep 28 14:48:17 ip-10-0-1-95.ap-northeast-1.compute.internal systemd[1]: install-kube-system.service: Failed with result 'exit-code'.
Sep 28 14:48:17 ip-10-0-1-95.ap-northeast-1.compute.internal systemd[1]: Failed to start install-kube-system.service.

@mumoshu
Contributor Author

mumoshu commented Sep 28, 2018

Reading aws/amazon-vpc-cni-k8s#142, it now sounds like I can just remove those redundant fields from the calico yaml.
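
Assuming the CRDs in canal.yaml follow the upstream Calico manifests of that era, the offending field is a top-level description on each CustomResourceDefinition, which schema validation against a 1.11 apiserver rejects (older versions apparently ignored it). A sketch of the removal, using one of the Calico CRDs as an example:

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
description: Calico Felix Configuration    # <- the unknown field; delete this line
metadata:
  name: felixconfigurations.crd.projectcalico.org
spec:
  scope: Cluster
  group: crd.projectcalico.org
  version: v1
  names:
    kind: FelixConfiguration
    plural: felixconfigurations
    singular: felixconfiguration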

@davidmccormick
Contributor

Sure, will have a look on Monday morning! I would probably take a little look at kube-proxy, as it is responsible for allowing access to the service IPs; flannel only handles the pod IPs. Access to services should be fine even before flannel is up if kube-proxy is running properly. Have a great weekend! 😀

@mumoshu
Contributor Author

mumoshu commented Sep 28, 2018

@davidmccormick Thanks!

Access to services should be fine even before flannel is up if kube-proxy is running properly.

I was under the impression that, before flanneld forms the overlay (pod) network, kube-proxy had nowhere to route the k8s API's service IP. Good to know!

@mumoshu
Contributor Author

mumoshu commented Sep 29, 2018

Just ran iptables-save, and it looks like you're definitely correct: the kubernetes service's cluster IP can be accessed from the host, being routed to the host IP:

$ sudo iptables-save | grep default/kube
-A KUBE-SEP-ZGLHPGREO32CJZAK -s 10.0.1.162/32 -m comment --comment "default/kubernetes:https" -j KUBE-MARK-MASQ
-A KUBE-SEP-ZGLHPGREO32CJZAK -p tcp -m comment --comment "default/kubernetes:https" -m tcp -j DNAT --to-destination 10.0.1.162:443
-A KUBE-SERVICES ! -s 10.2.0.0/16 -d 10.3.0.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.3.0.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-SVC-NPX46M4PTMTKRN6Y
-A KUBE-SVC-NPX46M4PTMTKRN6Y -m comment --comment "default/kubernetes:https" -j KUBE-SEP-ZGLHPGREO32CJZAK
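
Reading the rules top-down: a connection to 10.3.0.1:443 matches the KUBE-SERVICES rule, jumps to KUBE-SVC-NPX46M4PTMTKRN6Y, which selects the single endpoint chain KUBE-SEP-ZGLHPGREO32CJZAK, and that DNATs the traffic to the controller's host IP 10.0.1.162:443, so no overlay network is involved at all. A quick check from the host (hypothetical command; any HTTP status, even 401/403, proves the packet reached the apiserver):

$ curl -k -sS -o /dev/null -w '%{http_code}\n' https://10.3.0.1:443/version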

@mumoshu
Contributor Author

mumoshu commented Sep 29, 2018

After reverting the --kube-api-url fix from flanneld and calico, I ran docker ps -a on the failing node. It didn't show any kube-proxy running.

Looking into controller-manager logs by running docker logs, I saw:

I0929 12:41:39.583719       1 node_lifecycle_controller.go:972] Controller detected that some Nodes are Ready. Exiting master disruption mode.
E0929 12:42:11.631265       1 daemon_controller.go:1023] pods "kube-proxy-" is forbidden: error looking up service account kube-system/kube-proxy: serviceaccount "kube-proxy" not found
E0929 12:42:11.631295       1 daemon_controller.go:286] kube-system/kube-proxy failed with : pods "kube-proxy-" is forbidden: error looking up service account kube-system/kube-proxy: serviceaccount "kube-proxy" not found
I0929 12:42:11.631384       1 event.go:221] Event(v1.ObjectReference{Kind:"DaemonSet", Namespace:"kube-system", Name:"kube-proxy", UID:"df787fd6-c3e4-11e8-a093-06dcad7d4bf4", APIVersion:"apps/v1", ResourceVersion:"219", FieldPath:""}): type: 'Warning' reason: 'FailedCreate' Error creating: pods "kube-proxy-" is forbidden: error looking up service account kube-system/kube-proxy: serviceaccount "kube-proxy" not found

Who should create the kube-proxy serviceaccount?

@mumoshu
Contributor Author

mumoshu commented Sep 29, 2018

@mumoshu
Contributor Author

mumoshu commented Sep 29, 2018

Yes, and no - this is a composite issue.

On the failing controller node, install-kube-system had failed due to the issue fixed by b8e6187:

Sep 29 12:41:55 ip-10-0-0-18.ap-northeast-1.compute.internal retry[2884]: daemonset.extensions/canal-node configured
Sep 29 12:41:55 ip-10-0-0-18.ap-northeast-1.compute.internal retry[2884]: error: error validating "/srv/kubernetes/manifests/canal.yaml": error validating data: ValidationError(CustomResourceDef>
Sep 29 12:41:55 ip-10-0-0-18.ap-northeast-1.compute.internal retry[2884]: Attempt 10 failed and there are no more attempts left!
Sep 29 12:41:55 ip-10-0-0-18.ap-northeast-1.compute.internal systemd[1]: install-kube-system.service: Main process exited, code=exited, status=1/FAILURE
Sep 29 12:41:55 ip-10-0-0-18.ap-northeast-1.compute.internal systemd[1]: install-kube-system.service: Failed with result 'exit-code'.
Sep 29 12:41:55 ip-10-0-0-18.ap-northeast-1.compute.internal systemd[1]: Failed to start install-kube-system.service.

This resulted in install-kube-system exiting before it created the kube-proxy serviceaccount.

So, b8e6187 seems to be necessary, but a9b1ba9 and fe72a0f could be reverted.
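
For reference, the ordering fix amounts to having install-kube-system apply the serviceaccount manifests before the daemonsets that reference them. The manifest below is a minimal sketch of the kube-proxy ServiceAccount the daemon controller was complaining about:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-proxy
  namespace: kube-system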

@mumoshu
Contributor Author

mumoshu commented Sep 29, 2018

Yay! It works now.

@mumoshu
Contributor Author

mumoshu commented Sep 29, 2018

@davidmccormick I'm merging this anyway. I'd greatly appreciate it if you could review it next week.

Also, thanks a lot for the awesome guidance towards the fix!

@mumoshu mumoshu merged commit c113c7e into kubernetes-retired:master Sep 30, 2018
kevtaylor pushed a commit to HotelsDotCom/kube-aws that referenced this pull request Jan 9, 2019
* Bump Kubernetes version to v1.11.3

* fix: remove rkt container runtime support

kubelet v1.11 no longer supports the flags required by the rkt runtime. Without removing them, I was unable to start kubelets on kube-aws controller nodes.

* fix calico-related CRDs installation

* serviceaccounts should be created before anything else

ref kubernetes-retired#1459 (comment)