
Controller is not aware of the number of available IP addresses #18

Closed

arun-gupta opened this issue Dec 20, 2017 · 7 comments

@arun-gupta

The controller is not aware of how many IP addresses are available to be assigned to pods. It tries to assign an IP address to a new pod and then fails reactively; it should have this information proactively.

Created a 2-node t2.medium cluster:

kops create cluster \
--name example.cluster.k8s.local \
--zones us-east-1a,us-east-1b,us-east-1c \
--networking amazon-vpc-routed-eni \
--node-size t2.medium \
--kubernetes-version 1.8.4 \
--yes

Created a Deployment using the configuration file:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 3 
  template:
    metadata:
      labels:
        app: nginx 
    spec:
      containers:
      - name: nginx 
        image: nginx:1.12.1
        ports: 
        - containerPort: 80
        - containerPort: 443

Scaled the replicas:

kubectl scale --replicas=30 deployment/nginx-deployment
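
Pod status can then be checked with, for example:

kubectl get pods -l app=nginx -o wide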

30 pods are expected in the cluster (2 nodes * 3 ENIs per t2.medium * 5 secondary IPs per ENI), but only 27 pods become available. Three pods are always stuck in the ContainerCreating state.

A similar cluster was created with m4.2xlarge. 120 pods are expected in the cluster (2 nodes * 4 ENIs per m4.2xlarge * 15 IPs per ENI), but only 109 pods are available.

More details about one of the stuck pods (it is scheduled, but never becomes Ready):

$ kubectl describe pod/nginx-deployment-745df977f7-tndc7
Name:           nginx-deployment-745df977f7-tndc7
Namespace:      default
Node:           ip-172-20-68-116.ec2.internal/172.20.95.9
Start Time:     Wed, 20 Dec 2017 14:32:53 -0800
Labels:         app=nginx
                pod-template-hash=3018953393
Annotations:    kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"default","name":"nginx-deployment-745df977f7","uid":"96335f7f-e5d5-11e7-bc3a-0a41...
                kubernetes.io/limit-ranger=LimitRanger plugin set: cpu request for container nginx
Status:         Pending
IP:             
Created By:     ReplicaSet/nginx-deployment-745df977f7
Controlled By:  ReplicaSet/nginx-deployment-745df977f7
Containers:
  nginx:
    Container ID:   
    Image:          nginx:1.12.1
    Image ID:       
    Ports:          80/TCP, 443/TCP
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:        100m
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-7wp6n (ro)
Conditions:
  Type           Status
  Initialized    True 
  Ready          False 
  PodScheduled   True 
Volumes:
  default-token-7wp6n:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-7wp6n
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.alpha.kubernetes.io/notReady:NoExecute for 300s
                 node.alpha.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age                   From                                    Message
  ----     ------                  ----                  ----                                    -------
  Normal   Scheduled               55m                   default-scheduler                       Successfully assigned nginx-deployment-745df977f7-tndc7 to ip-172-20-68-116.ec2.internal
  Normal   SuccessfulMountVolume   55m                   kubelet, ip-172-20-68-116.ec2.internal  MountVolume.SetUp succeeded for volume "default-token-7wp6n"
  Warning  FailedCreatePodSandBox  55m (x8 over 55m)     kubelet, ip-172-20-68-116.ec2.internal  Failed create pod sandbox.
  Normal   SandboxChanged          5m (x919 over 55m)    kubelet, ip-172-20-68-116.ec2.internal  Pod sandbox changed, it will be killed and re-created.
  Warning  FailedSync              10s (x1017 over 55m)  kubelet, ip-172-20-68-116.ec2.internal  Error syncing pod

More details about the exact steps are at https://gist.github.com/arun-gupta/87f2c9ff533008f149db6b53afa73bd0

@ofiliz

ofiliz commented Dec 21, 2017

Another option: Kubelet has a flag called --max-pods that lets you configure how many pods can be scheduled on that node. Since the maximum number of IP addresses per node is well-known based on its instance type, we can start Kubelet with --max-pods=[max_IP_addresses - number_of_IP_addresses_used_by_infra].
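
For illustration, a minimal sketch of that arithmetic at node bootstrap, using the published t2.medium limits (3 ENIs with 6 IPv4 addresses each, one of which is the ENI's primary); the infra IP count is an assumed placeholder, not a measured value:

MAX_ENIS=3        # ENI limit for a t2.medium
IPS_PER_ENI=6     # IPv4 addresses per ENI, including the primary
INFRA_IPS=2       # assumption: IPs consumed by infra; adjust per cluster
MAX_PODS=$(( MAX_ENIS * (IPS_PER_ENI - 1) - INFRA_IPS ))
kubelet --max-pods="${MAX_PODS}" ...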

@lstoll

lstoll commented Dec 21, 2017

A problem we had with max pods is that it counts pods in the host network namespace as well, which don't need an IP. So a bunch of system DaemonSets were eating into the already limited pod budget without actually consuming an address. To work around that in our AWS CNI plugin, we ended up adding a taint to the node to essentially mark it as full, and deleting pods that are failing. My takeaway has always been "there are options, and none of them are great". Getting kubernetes/kubernetes#20177 upstream would probably be the best outcome.
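
For illustration, marking a node as full with a taint could look like the following; the taint key here is a hypothetical example, not necessarily the one the plugin uses:

kubectl taint nodes ip-172-20-68-116.ec2.internal example.com/ip-exhausted=true:NoSchedule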

@liwenwu-amazon
Contributor

Will investigate whether we can use a scheduler extension (kubernetes/kubernetes#13580) for this.
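
For context, a scheduler extender is wired in through the kube-scheduler policy config. A minimal sketch, assuming a hypothetical extender service at 127.0.0.1:8888 that filters out nodes with no free IPs (the URL, path, and port are made up for illustration):

{
  "kind": "Policy",
  "apiVersion": "v1",
  "extenders": [
    {
      "urlPrefix": "http://127.0.0.1:8888/ip-filter",
      "filterVerb": "filter",
      "enableHttps": false
    }
  ]
}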

@deiwin

deiwin commented Oct 19, 2018

One problem with the current --max-pods solution is that it doesn't account for IPs in "cooling mode". Even if --max-pods is set correctly, when a lot of pods are deleted and created in quick succession in a cluster with high IP address utilization, the scheduler can schedule pods onto nodes that don't actually have an IP available: not all IPs may be assigned, but all of the unassigned IPs may be in "cooling mode" and therefore still unavailable.

@liwenwu-amazon
Contributor

@deiwin, right now the "cooling period" is 30 seconds, so in the worst case a pod will get an IP from the datastore within 30 seconds.

@deiwin

deiwin commented Oct 22, 2018

Yes, but in our experience even 30 seconds is enough to cause many CronJob failures on a cluster with high IP address utilization. We have many CronJobs that run every minute with a short activeDeadlineSeconds (usually either 30 or 50 seconds), and these mis-schedulings cause them to fail with DeadlineExceeded.
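
For illustration, a minimal CronJob of that shape (the name and image are placeholders, not taken from our actual workloads):

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: every-minute-job
spec:
  schedule: "* * * * *"
  jobTemplate:
    spec:
      activeDeadlineSeconds: 30
      template:
        spec:
          containers:
          - name: job
            image: busybox
            command: ["sh", "-c", "echo done"]
          restartPolicy: Never

If the pod spends the whole 30-second cooling period waiting for an IP, the deadline expires and the Job fails with DeadlineExceeded.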

@jaypipes
Contributor

Note that, unfortunately, the upstream issue that tracks this (kubernetes/kubernetes#5507) is in lifecycle/frozen and priority/backlog. I'm going to close this issue out with a plea for folks to comment on the upstream issue so we can find a path forward that makes IP addresses a concrete resource that is consumed by pods and tracked like other resources (CPU, memory, etc.).
