CNI not ready on particular instance group #9530

Closed
javierlga opened this issue Jul 8, 2020 · 14 comments · Fixed by #9547
Comments


javierlga commented Jul 8, 2020

1. What kops version are you running? The command kops version will display this information.

Version 1.17.0

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

Server Version: v1.17.7
Client Version: v1.17.0

3. What cloud provider are you using?
Amazon Web Services

4. What commands did you run? What is the simplest way to reproduce this issue?
Created a new cluster, a new secret for the SSH key, and finally applied the changes:

kops create -f cluster.yaml
kops create secret --name test.k8s.local sshpublickey admin -i ~/.ssh/id_rsa.pub
kops update cluster --yes

5. What happened after the commands executed?
All the resources defined in the YAML file were created: a group of three masters and two different instance groups. However, the nodes from one instance group were not ready. Looking at the kubelet logs I could find this:

Kubelet:

Jul 08 20:10:59 ip-172-31-7-154 kubelet[2303]: E0708 20:10:59.830034    2303 kubelet.go:2183] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized

kubeAPIServer logs:

I0708 20:42:59.131972       1 healthz.go:191] [+]ping ok
[+]log ok
[+]etcd ok
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/bootstrap-controller ok
[-]poststarthook/rbac/bootstrap-roles failed: reason withheld
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/start-cluster-authentication-info-controller ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/apiservice-openapi-controller ok
healthz check failed

The CNI provider is kube-router.

6. What did you expect to happen?
All the nodes from both instance groups to bootstrap as expected, so that the cluster could be validated.
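
To be precise, by "validate" I mean the standard kops check for this cluster:

kops validate cluster --name test.k8s.local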

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

---
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  name: test.k8s.local
spec:
  kubeAPIServer:
    admissionControl:
      - NamespaceLifecycle
      - LimitRanger
      - ServiceAccount
      - DefaultStorageClass
      - DefaultTolerationSeconds
      - NodeRestriction
      - ResourceQuota
      - AlwaysPullImages
    auditLogMaxAge: 30
    auditLogMaxBackups: 10
    auditLogMaxSize: 100
    auditLogPath: /var/log/kube-apiserver-audit.log
    disableBasicAuth: true
    runtimeConfig:
      autoscaling/v2beta1: "true"
  kubeControllerManager:
    horizontalPodAutoscalerSyncPeriod: 10s
  kubelet:
    anonymousAuth: false
    authorizationMode: Webhook
    authenticationTokenWebhook: true
    featureGates:
      ServiceNodeExclusion: "true"
    evictionHard: memory.available<800Mi
    kubeReserved:
      cpu: 500m
      ephemeral-storage: 1Gi
      memory: 1.6Gi
    kubeReservedCgroup: /podruntime.slice
    readOnlyPort: 0
    systemReserved:
      cpu: 500m
      ephemeral-storage: 1Gi
      memory: 2.4Gi
    systemReservedCgroup: /system.slice
    tlsCipherSuites:
      - TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
      - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
      - TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305
      - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
      - TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305
      - TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
      - TLS_RSA_WITH_AES_256_GCM_SHA384
      - TLS_RSA_WITH_AES_128_GCM_SHA256
  fileAssets:
    - content: |
        [Unit]
        Description=Limited resources slice for Kubernetes services
        Documentation=man:systemd.special(7)
        DefaultDependencies=no
        Before=slices.target
        Requires=-.slice
        After=-.slice
      name: podruntime-slice
      path: /etc/systemd/system/podruntime.slice
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - instanceGroup: master-us-west-2c-1
      name: "1"
    - instanceGroup: master-us-west-2c-2
      name: "2"
    - instanceGroup: master-us-west-2c-3
      name: "3"
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - instanceGroup: master-us-west-2c-1
      name: "1"
    - instanceGroup: master-us-west-2c-2
      name: "2"
    - instanceGroup: master-us-west-2c-3
      name: "3"
    memoryRequest: 100Mi
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeDNS:
    provider: CoreDNS
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.17.7
  networkCIDR: 172.31.0.0/16
  networkID: vpc-XXXXXXXX
  networking:
    kuberouter: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 172.31.0.0/20
    id: subnet-XXXXXXXX
    name: us-west-2c
    type: Public
    zone: us-west-2c
  topology:
    dns:
      type: Public
    masters: public
    nodes: public
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: test.k8s.local
  name: master-us-west-2c-1
spec:
  detailedInstanceMonitoring: true
  machineType: m3.large
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-west-2c-1
  role: Master
  subnets:
  - us-west-2c
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: test.k8s.local
  name: master-us-west-2c-2
spec:
  detailedInstanceMonitoring: true
  machineType: m3.large
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-west-2c-2
  role: Master
  subnets:
  - us-west-2c
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: test.k8s.local
  name: master-us-west-2c-3
spec:
  detailedInstanceMonitoring: true
  machineType: m3.large
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-west-2c-3
  role: Master
  subnets:
  - us-west-2c
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: test.k8s.local
  name: nodes
spec:
  detailedInstanceMonitoring: true
  machineType: c5.2xlarge
  maxSize: 30
  minSize: 4
  role: Node
  nodeLabels:
    kops.k8s.io/instancegroup: nodes
    alpha.service-controller.kubernetes.io/exclude-balancer: "true"
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: ""
    k8s.io/cluster-autoscaler/k8s1.xxxx.com: ""
  subnets:
  - us-west-2c
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: test.k8s.local
  name: ingress-nodes
spec:
  detailedInstanceMonitoring: true
  additionalSecurityGroups:
  - sg-XXXXXXXX
  externalLoadBalancers:
  - targetGroupArn: arn:aws:elasticloadbalancing:us-west-2:XXXXXXXXXXXX:targetgroup/XXXXX-80/XXXXXX
  - targetGroupArn: arn:aws:elasticloadbalancing:us-west-2:XXXXXXXXXXXX:targetgroup/XXXXX-443/XXXXXX
  - targetGroupArn: arn:aws:elasticloadbalancing:us-west-2:XXXXXXXXXXXX:targetgroup/XXXXX-80/XXXXXX
  - targetGroupArn: arn:aws:elasticloadbalancing:us-west-2:XXXXXXXXXXXX:targetgroup/XXXXX-443/XXXXXX
  machineType: c5.4xlarge
  maxSize: 3
  minSize: 3
  nodeLabels:
    dedicated: ingress
    kops.k8s.io/instancegroup: ingress-nodes
  role: Node
  subnets:
  - us-west-2c
  taints:
  - dedicated=ingress:NoSchedule

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else do we need to know?
No

@olemarkus
Member

Any chance you can provide a spec that reproduces this issue with a bit less custom stuff? That would make it easier to diagnose the issue.

@hakman
Member

hakman commented Jul 9, 2020

Also, can you check the kops-configuration, kubelet and kube-proxy on one of the nodes that don't join the cluster?

@javierlga
Author

Any chance you can provide a spec that reproduces this issue with a bit less custom stuff? That would make it easier to diagnose the issue.

Sure, I have updated the cluster spec above, removing the additional user data key.

Also, can you check the kops-configuration, kubelet and kube-proxy on one of the nodes that don't join the cluster?

Latest kops-configuration logs (I have actually been checking them since the beginning and didn't find any errors):

Jul 09 15:41:33 ip-172-31-12-153 nodeup[1030]: I0709 15:41:33.678800    1030 http.go:77] Downloading "https://github.com/kubernetes/kops/releases/download/v1.17.0/images-protokube.tar.gz"
Jul 09 15:41:36 ip-172-31-12-153 nodeup[1030]: I0709 15:41:36.350879    1030 files.go:100] Hash matched for "/var/cache/nodeup/sha256:c02010d7c8812b5a40fe725a6a0cb0ed2b84d6ed1986c1144b77309be7027344_https___artifacts_k8s_io_binaries_kops_1_17_0_images_protokube_tar_gz": sha256:c02010d7c8812b5a40fe725a6a0cb0ed2b84d6ed1986c1144b77309be7027344
Jul 09 15:41:36 ip-172-31-12-153 nodeup[1030]: I0709 15:41:36.350910    1030 load_image.go:109] running command docker load -i /var/cache/nodeup/sha256:c02010d7c8812b5a40fe725a6a0cb0ed2b84d6ed1986c1144b77309be7027344_https___artifacts_k8s_io_binaries_kops_1_17_0_images_protokube_tar_gz
Jul 09 15:41:40 ip-172-31-12-153 nodeup[1030]: I0709 15:41:40.564515    1030 executor.go:103] Tasks: 60 done / 60 total; 0 can run
Jul 09 15:41:40 ip-172-31-12-153 nodeup[1030]: I0709 15:41:40.564537    1030 context.go:91] deleting temp dir: "/tmp/deploy238242106"
Jul 09 15:41:40 ip-172-31-12-153 nodeup[1030]: success
Jul 09 15:41:40 ip-172-31-12-153 systemd[1]: Started Run kops bootstrap (nodeup).

The kubelet service is running, but the journalctl logs show:

Jul 09 15:58:33 ip-172-31-12-153 kubelet[1986]: E0709 15:58:33.842735    1986 kubelet.go:2183] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Jul 09 15:58:32 ip-172-31-12-153 kubelet[1986]: W0709 15:58:32.273841    1986 cni.go:237] Unable to update cni config: no networks found in /etc/cni/net.d/
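
For reference, a quick way to confirm on the node itself that no CNI configuration was written is to list the directories kubelet is pointed at (the same paths as in the --cni-conf-dir and --cni-bin-dir flags below):

ls -la /etc/cni/net.d/ /opt/cni/bin/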

Daemon arguments are:

--anonymous-auth=false
--authentication-token-webhook=true
--authorization-mode=Webhook
--cgroup-root=/
--client-ca-file=/srv/kubernetes/ca.crt
--cloud-provider=aws
--cluster-dns=100.64.0.10
--cluster-domain=cluster.local
--enable-debugging-handlers=true
--eviction-hard=memory.available<800Mi
--feature-gates=ServiceNodeExclusion=true
--hostname-override=ip-172-31-12-153.us-west-2.compute.internal
--kube-reserved-cgroup=/podruntime.slice
--kube-reserved=cpu=500m,ephemeral-storage=1Gi,memory=1.6Gi
--kubeconfig=/var/lib/kubelet/kubeconfig
--network-plugin=cni
--non-masquerade-cidr=100.64.0.0/10
--pod-infra-container-image=k8s.gcr.io/pause-amd64:3.0
--pod-manifest-path=/etc/kubernetes/manifests
--read-only-port=0
--register-schedulable=true
--register-with-taints=dedicated=ingress:NoSchedule
--system-reserved-cgroup=/system.slice
--system-reserved=cpu=500m,ephemeral-storage=1Gi,memory=2.4Gi
--tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_GCM_SHA256
--v=2
--volume-plugin-dir=/usr/libexec/kubernetes/kubelet-plugins/volume/exec/
--cni-bin-dir=/opt/cni/bin/
--cni-conf-dir=/etc/cni/net.d/

Regarding kube-proxy, kube-router replaces its functionality.
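
As a side note on that: kops is expected to skip deploying kube-proxy when kuberouter networking is selected. Expressed explicitly in the cluster spec (just a sketch for reference, not something this cluster needed to set), that would look like:

spec:
  networking:
    kuberouter: {}
  kubeProxy:
    enabled: false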

@hakman
Member

hakman commented Jul 9, 2020

The "network plugin is not ready: cni config uninitialized" part is misleading. This is an effect, not the cause. There should be other cause for nodes not being able to register, either in kubelet or in kube-router logs.

@javierlga
Author

The "network plugin is not ready: cni config uninitialized" part is misleading. This is an effect, not the cause. There should be other cause for nodes not being able to register, either in kubelet or in kube-router logs.

Yes, I agree. It's really strange that it only happens on a particular instance group.

@hakman
Member

hakman commented Jul 9, 2020

By any chance, is it the group with the taint?

@javierlga
Author

By any chance, is it the group with the taint?

Yes, it has one:

  taints:
  - dedicated=ingress:NoSchedule

However, I'm not completely sure this is the issue, because we have another cluster with the same config (apart from the Kubernetes version, 1.15.3) and it's running with no issues.
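
One quick way to compare the two clusters (assuming the add-on pods carry the k8s-app=kube-router label, which may differ) is to list node taints next to where the kube-router pods actually landed:

kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
kubectl -n kube-system get pods -l k8s-app=kube-router -o wide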

@javierlga
Author

javierlga commented Jul 9, 2020

Well, at the end of the day, the issue was with the taint. I just removed it and the nodes are now ready and have joined the cluster.

This is quite interesting: as I previously mentioned, the other cluster has almost the same configuration and it works with no issues. I thought taints only affected workloads created after the cluster is validated.

@hakman
Member

hakman commented Jul 10, 2020

Thanks for the update @javierlga.
/close

@k8s-ci-robot
Contributor

@hakman: Closing this issue.

In response to this:

Thanks for the update @javierlga.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@javierlga
Author

javierlga commented Jul 10, 2020

/reopen

I have decided to reopen the issue because I don't think this is the expected behavior: adding a taint to a particular instance group shouldn't affect the node bootstrapping process, in this case the CNI.

k8s-ci-robot reopened this Jul 10, 2020
@k8s-ci-robot
Contributor

@javierlga: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@johngmyers
Member

Your expectation would not be correct. Taints affect all pods that don't tolerate them. But the kuberouter tolerations do seem a bit narrow.
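
For context, a DaemonSet only lands on a node carrying a taint like dedicated=ingress:NoSchedule if its pod template has a matching or blanket toleration. A minimal sketch of a blanket toleration (not necessarily what the kops kube-router add-on ships) would be:

tolerations:
- operator: Exists

A toleration with operator: Exists and no key matches every taint, which is the usual pattern for CNI DaemonSets that must run on every node.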

@javierlga
Author

Your expectation would not be correct. Taints affect all pods that don't tolerate them. But the kuberouter tolerations do seem a bit narrow.

Yes, that's correct; to be honest, I haven't checked the kube-router tolerations.
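
If it helps, the tolerations the add-on currently ships can be read straight off the DaemonSet (assuming it is named kube-router in kube-system):

kubectl -n kube-system get ds kube-router -o jsonpath='{.spec.template.spec.tolerations}'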
