This repository has been archived by the owner on Sep 30, 2020. It is now read-only.

Bump Kubernetes version to v1.11.3 #1459

Merged
merged 9 commits into from
Sep 30, 2018

Conversation

mumoshu
Contributor

@mumoshu mumoshu commented Sep 28, 2018

No description provided.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Sep 28, 2018
@mumoshu mumoshu added this to the v0.12.0 milestone Sep 28, 2018
@codecov-io

codecov-io commented Sep 28, 2018

Codecov Report

Merging #1459 into master will not change coverage.
The diff coverage is n/a.


@@          Coverage Diff           @@
##           master   #1459   +/-   ##
======================================
  Coverage    38.1%   38.1%           
======================================
  Files          74      74           
  Lines        4559    4559           
======================================
  Hits         1737    1737           
  Misses       2580    2580           
  Partials      242     242
Impacted Files Coverage Δ
core/controlplane/config/config.go 63.34% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e8d6bea...53c9a27. Read the comment docs.

@mumoshu
Contributor Author

mumoshu commented Sep 28, 2018

Since v1.11.0, the kubelet flags related to the rkt container runtime have been removed, so kube-aws controller nodes fail to run the kubelet due to unknown-flag errors for flags like --rkt-path.

It seems no one had noticed this until recently.

I'm removing rkt container runtime support as part of this PR, so that kube-aws also works with k8s 1.11.

For more info see kubernetes/website#9538.
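
To illustrate, the breakage and the fix look roughly like the following kubelet unit fragment. This is a simplified sketch, not the actual kube-aws cloud-config, and the exact flag set is an assumption; the point is that the rkt-specific flags no longer exist in kubelet v1.11:

# before: kubelet v1.11 rejects the removed rkt flags (e.g. --rkt-path) at startup
ExecStart=/usr/lib/coreos/kubelet-wrapper \
  --container-runtime=rkt \
  --rkt-path=/usr/bin/rkt \
  --rkt-stage1-image=coreos.com/rkt/stage1-coreos \
  ...

# after: drop the rkt-specific flags and keep the docker runtime
ExecStart=/usr/lib/coreos/kubelet-wrapper \
  --container-runtime=docker \
  ...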

kubelet v1.11 no longer supports the flags required by the rkt runtime. Without removing them, I was unable to start kubelets on kube-aws controller nodes.
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Sep 28, 2018
@mumoshu
Contributor Author

mumoshu commented Sep 28, 2018

Hmm, so my cluster is now failing while setting up controller nodes, because flanneld is trying to access the apiserver via the service IP. flanneld in a pod shouldn't try to reach the apiserver via the k8s API service IP, because that IP isn't available without flanneld (chicken-and-egg problem!)

master|kubeaws1 core@ip-10-0-1-147 ~ $ docker logs 04b989f2548f
I0928 13:40:06.514045       1 main.go:474] Determining IP address of default interface
I0928 13:40:06.514268       1 main.go:487] Using interface with name eth0 and address 10.0.1.147
I0928 13:40:06.514283       1 main.go:504] Defaulting external address to interface address (10.0.1.147)
E0928 13:40:36.515469       1 main.go:231] Failed to create SubnetManager: error retrieving pod spec for 'kube-system/canal-master-lf8lp': Get https://10.3.0.1:443/api/v1/namespaces/kube-system/pods/canal-master-lf8lp: dial tcp 10.3.0.1:443: i/o timeout
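
The change I'm trying is to stop the controller-node flannel container from going through the cluster IP at all. A minimal sketch of the idea, with a placeholder endpoint, assuming the canal manifest's flannel container and flanneld's --kube-subnet-mgr/--kube-api-url flags (as noted further down, this turned out to be unnecessary and was reverted):

containers:
- name: kube-flannel
  command:
  - /opt/bin/flanneld
  args:
  - --ip-masq
  - --kube-subnet-mgr
  # point the subnet manager straight at the local apiserver instead of the
  # 10.3.0.1 cluster IP, which is unreachable before pod networking is up
  - --kube-api-url=https://localhost:443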

@mumoshu
Contributor Author

mumoshu commented Sep 28, 2018

This is how flanneld on controller nodes runs on my k8s 1.10 cluster:

I0707 11:13:02.224207       1 main.go:474] Determining IP address of default interface
I0707 11:13:02.224577       1 main.go:487] Using interface with name eth0 and address 10.0.1.197
I0707 11:13:02.224594       1 main.go:504] Defaulting external address to interface address (10.0.1.197)
I0707 11:13:02.233789       1 kube.go:130] Waiting 10m0s for node controller to sync
I0707 11:13:02.233819       1 kube.go:283] Starting kube subnet manager
I0707 11:13:03.233959       1 kube.go:137] Node controller sync successful

It was somehow trying to contact the k8s apiserver via the service IP. That's a chicken-and-egg problem!
@mumoshu
Contributor Author

mumoshu commented Sep 28, 2018

Then, calico on the controller is trying to access the k8s API via the service IP. I believe calico-node on controller nodes, as they have hostNetwork: true, shouldn't rely on the service IP.

$ docker logs 5ed9a9d5bcbf
ls: /calico-secrets: No such file or directory
Wrote Calico CNI binaries to /host/opt/cni/bin
CNI plugin version: v3.1.3
/host/secondary-bin-dir is non-writeable, skipping
                "k8s_api_root": "https://__KUBERNETES_SERVICE_HOST__:__KUBERNETES_SERVICE_PORT__",
                "k8s_api_root": "https://10.3.0.1:__KUBERNETES_SERVICE_PORT__",
CNI config: {
    "name": "k8s-pod-network",
    "cniVersion": "0.3.1",
    "plugins": [
        {
            "type": "calico",
            "log_level": "info",
            "mtu": 8951,
            "datastore_type": "kubernetes",
            "nodename": "ip-10-0-1-252.ap-northeast-1.compute.internal",
            "ipam": {
                "type": "host-local",
                "subnet": "usePodCidr"
            },
            "policy": {
                "type": "k8s",
                "k8s_auth_token": "__SERVICEACCOUNT_TOKEN__"
            },
            "kubernetes": {
                "k8s_api_root": "https://10.3.0.1:443",
                "kubeconfig": "/etc/kubernetes/cni/net.d/calico-kubeconfig"
            }
        },
        {
            "type": "portmap",
            "capabilities": {"portMappings": true},
            "snat": true,
            "externalSetMarkChain": "KUBE-MARK-MASQ"
        }
    ]
}
Created CNI config 10-calico.conflist
Done configuring CNI.  Sleep=true

canal on controller nodes was somehow trying to contact the k8s apiserver via the service IP. They run on the host network, so I think they shouldn't rely on the service IP.
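
The calico side of the same idea is to stop the generated CNI config from pointing at the cluster IP. A hypothetical sketch of the kubernetes section of the CNI config template in the canal-config ConfigMap, with a placeholder direct endpoint instead of the substituted __KUBERNETES_SERVICE_HOST__ value (like the flannel change, this was ultimately reverted once kube-proxy was confirmed to handle the routing):

"kubernetes": {
    "k8s_api_root": "https://localhost:443",
    "kubeconfig": "/etc/kubernetes/cni/net.d/calico-kubeconfig"
}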
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 28, 2018
@mumoshu
Contributor Author

mumoshu commented Sep 28, 2018

@davidmccormick Hey! Would you mind reviewing my changes regarding self-hosted canal (calico/flannel)?

@mumoshu
Contributor Author

mumoshu commented Sep 28, 2018

Hoping this is the last error I need to fix 😃

Sep 28 14:48:07 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: Attempt 9 failed! Trying again in 3 seconds...
Sep 28 14:48:10 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: NAME          STATUS    AGE
Sep 28 14:48:10 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: kube-system   Active    10m
Sep 28 14:48:11 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: secret/kiam-server-tls configured
Sep 28 14:48:12 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: secret/kiam-agent-tls configured
Sep 28 14:48:13 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: daemonset.extensions/kiam-server unchanged
Sep 28 14:48:13 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: service/kiam-server unchanged
Sep 28 14:48:13 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: serviceaccount/kiam-server unchanged
Sep 28 14:48:13 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: clusterrole.rbac.authorization.k8s.io/kiam-read configured
Sep 28 14:48:13 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: clusterrolebinding.rbac.authorization.k8s.io/kiam-server configured
Sep 28 14:48:13 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: daemonset.extensions/kiam-agent unchanged
Sep 28 14:48:14 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: namespace/kube-system configured
Sep 28 14:48:14 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: daemonset.extensions/kube-proxy unchanged
Sep 28 14:48:14 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: daemonset.extensions/kube-node-drainer-ds unchanged
Sep 28 14:48:15 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: No resources found.
Sep 28 14:48:15 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: Error from server (NotFound): apiservices.apiregistration.k8s.io "v1beta1.metrics.k8s.io" not found
Sep 28 14:48:16 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: serviceaccount/canal unchanged
Sep 28 14:48:16 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: serviceaccount/flannel unchanged
Sep 28 14:48:16 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: clusterrole.rbac.authorization.k8s.io/calico configured
Sep 28 14:48:16 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: clusterrole.rbac.authorization.k8s.io/flannel configured
Sep 28 14:48:16 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: clusterrolebinding.rbac.authorization.k8s.io/flannel configured
Sep 28 14:48:16 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: clusterrolebinding.rbac.authorization.k8s.io/canal-flannel configured
Sep 28 14:48:16 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: clusterrolebinding.rbac.authorization.k8s.io/canal-calico configured
Sep 28 14:48:17 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: configmap/canal-config unchanged
Sep 28 14:48:17 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: daemonset.extensions/canal-master configured
Sep 28 14:48:17 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: daemonset.extensions/canal-node configured
Sep 28 14:48:17 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: error: error validating
Sep 28 14:51:01 ip-10-0-1-95.ap-northeast-1.compute.internal retry[20924]: error: error validating "/srv/kubernetes/manifests/canal.yaml": error validating data: ValidationError(CustomResourceDefinition): unknown field "description" in io.k8s.apiextensions-apiserver.pkg.apis.apiextensions.v1beta1.CustomResourceDefinition; if you choose to ignore these errors, turn validation off with --validate=false
Sep 28 14:48:17 ip-10-0-1-95.ap-northeast-1.compute.internal retry[10497]: Attempt 10 failed and there are no more attempts left!
Sep 28 14:48:17 ip-10-0-1-95.ap-northeast-1.compute.internal systemd[1]: install-kube-system.service: Main process exited, code=exited, status=1/FAILURE
Sep 28 14:48:17 ip-10-0-1-95.ap-northeast-1.compute.internal systemd[1]: install-kube-system.service: Failed with result 'exit-code'.
Sep 28 14:48:17 ip-10-0-1-95.ap-northeast-1.compute.internal systemd[1]: Failed to start install-kube-system.service.

@mumoshu
Contributor Author

mumoshu commented Sep 28, 2018

Reading aws/amazon-vpc-cni-k8s#142, it now sounds like I can just remove those redundant fields from the calico yaml.
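
Assuming the CRDs in canal.yaml follow the upstream Calico manifests of that era, the offending field is a top-level description on each CustomResourceDefinition, which schema validation against a 1.11 apiserver rejects (older versions apparently ignored it). A sketch of the removal, using one of the Calico CRDs as an example:

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
description: Calico Felix Configuration    # <- the unknown field; delete this line
metadata:
  name: felixconfigurations.crd.projectcalico.org
spec:
  scope: Cluster
  group: crd.projectcalico.org
  version: v1
  names:
    kind: FelixConfiguration
    plural: felixconfigurations
    singular: felixconfiguration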

@davidmccormick
Contributor

Sure, will have a look on Monday morning! I would probably take a little look at kube-proxy, as it is responsible for allowing access to the service IPs; flannel only handles the pod IPs. Access to services should be fine even before flannel is up if kube-proxy is running properly. Have a great weekend! 😀

@mumoshu
Contributor Author

mumoshu commented Sep 28, 2018

@davidmccormick Thanks!

Access to services should be fine even before flannel is up if kube-proxy is running properly.

I was under the impression that, before flanneld forms the overlay (pod) network, kube-proxy had nowhere to route the k8s API's service IP. Good to know!

@mumoshu
Contributor Author

mumoshu commented Sep 29, 2018

Just ran iptables-save, and it looks like you're definitely correct: the kubernetes service's cluster IP can be accessed from the host, being routed to the host IP:

$ sudo iptables-save | grep default/kube
-A KUBE-SEP-ZGLHPGREO32CJZAK -s 10.0.1.162/32 -m comment --comment "default/kubernetes:https" -j KUBE-MARK-MASQ
-A KUBE-SEP-ZGLHPGREO32CJZAK -p tcp -m comment --comment "default/kubernetes:https" -m tcp -j DNAT --to-destination 10.0.1.162:443
-A KUBE-SERVICES ! -s 10.2.0.0/16 -d 10.3.0.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.3.0.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-SVC-NPX46M4PTMTKRN6Y
-A KUBE-SVC-NPX46M4PTMTKRN6Y -m comment --comment "default/kubernetes:https" -j KUBE-SEP-ZGLHPGREO32CJZAK
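
Reading the rules top-down: a connection to 10.3.0.1:443 matches the KUBE-SERVICES rule, jumps to KUBE-SVC-NPX46M4PTMTKRN6Y, which selects the single endpoint chain KUBE-SEP-ZGLHPGREO32CJZAK, and that DNATs the traffic to the controller's host IP 10.0.1.162:443, so no overlay network is involved at all. A quick check from the host (hypothetical command; any HTTP status, even 401/403, proves the packet reached the apiserver):

$ curl -k -sS -o /dev/null -w '%{http_code}\n' https://10.3.0.1:443/version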

@mumoshu
Contributor Author

mumoshu commented Sep 29, 2018

After reverting the --kube-api-url fix from flanneld and calico, I ran docker ps -a on the failing node. It didn't show any kube-proxy running.

Looking into controller-manager logs by running docker logs, I saw:

I0929 12:41:39.583719       1 node_lifecycle_controller.go:972] Controller detected that some Nodes are Ready. Exiting master disruption mode.
E0929 12:42:11.631265       1 daemon_controller.go:1023] pods "kube-proxy-" is forbidden: error looking up service account kube-system/kube-proxy: serviceaccount "kube-proxy" not found
E0929 12:42:11.631295       1 daemon_controller.go:286] kube-system/kube-proxy failed with : pods "kube-proxy-" is forbidden: error looking up service account kube-system/kube-proxy: serviceaccount "kube-proxy" not found
I0929 12:42:11.631384       1 event.go:221] Event(v1.ObjectReference{Kind:"DaemonSet", Namespace:"kube-system", Name:"kube-proxy", UID:"df787fd6-c3e4-11e8-a093-06dcad7d4bf4", APIVersion:"apps/v1", ResourceVersion:"219", FieldPath:""}): type: 'Warning' reason: 'FailedCreate' Error creating: pods "kube-proxy-" is forbidden: error looking up service account kube-system/kube-proxy: serviceaccount "kube-proxy" not found

Who should create the kube-proxy serviceaccount?

@mumoshu
Contributor Author

mumoshu commented Sep 29, 2018

@mumoshu
Contributor Author

mumoshu commented Sep 29, 2018

Yes, and no - this is a composite issue.

On the failing controller node, install-kube-system had failed due to the issue fixed by b8e6187:

Sep 29 12:41:55 ip-10-0-0-18.ap-northeast-1.compute.internal retry[2884]: daemonset.extensions/canal-node configured
Sep 29 12:41:55 ip-10-0-0-18.ap-northeast-1.compute.internal retry[2884]: error: error validating "/srv/kubernetes/manifests/canal.yaml": error validating data: ValidationError(CustomResourceDef>
Sep 29 12:41:55 ip-10-0-0-18.ap-northeast-1.compute.internal retry[2884]: Attempt 10 failed and there are no more attempts left!
Sep 29 12:41:55 ip-10-0-0-18.ap-northeast-1.compute.internal systemd[1]: install-kube-system.service: Main process exited, code=exited, status=1/FAILURE
Sep 29 12:41:55 ip-10-0-0-18.ap-northeast-1.compute.internal systemd[1]: install-kube-system.service: Failed with result 'exit-code'.
Sep 29 12:41:55 ip-10-0-0-18.ap-northeast-1.compute.internal systemd[1]: Failed to start install-kube-system.service.

This resulted in install-kube-system exiting before it created the kube-proxy serviceaccount.

So, b8e6187 seems to be necessary, but a9b1ba9 and fe72a0f could be reverted.
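
For reference, the ordering fix amounts to having install-kube-system apply the serviceaccount manifests before the daemonsets that reference them. The manifest below is a minimal sketch of the kube-proxy ServiceAccount the daemon controller was complaining about:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-proxy
  namespace: kube-system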

@mumoshu
Contributor Author

mumoshu commented Sep 29, 2018

Yay! It works now.

@mumoshu
Contributor Author

mumoshu commented Sep 29, 2018

@davidmccormick I'm merging this anyway. I'd greatly appreciate it if you could review it next week.

Also, thanks a lot for the awesome guidance towards the fix!

@mumoshu mumoshu merged commit c113c7e into kubernetes-retired:master Sep 30, 2018
kevtaylor pushed a commit to HotelsDotCom/kube-aws that referenced this pull request Jan 9, 2019
* Bump Kubernetes version to v1.11.3

* fix: remove rkt container runtime support

kubelet v1.11 no longer supports the flags required by the rkt runtime. Without removing them, I was unable to start kubelets on kube-aws controller nodes.

* fix calico-related CRDs installation

* serviceaccounts should be created before anything else

ref kubernetes-retired#1459 (comment)