Reduce Startup Latency by moving the init container to a regular container #2137

Closed
wants to merge 1 commit into from

Conversation

bwagner5
Contributor

@bwagner5 bwagner5 commented Nov 11, 2022

What type of PR is this?

  • feature

Which issue does this PR fix:
awslabs/amazon-eks-ami#1099

What does this PR do / Why do we need it:

  • The VPC CNI controls when a node becomes usable for pod scheduling and execution so the startup time of the CNI should be optimized to bring nodes online faster. This PR improves the startup time of the VPC CNI by >50% (from 9 sec to 4 sec) by moving the init container to a regular container in the pod and synchronizing the vpc-cni container via a shared emptyDir volume. This allows the containers to initialize in parallel and avoids the kubelet latency of switching between the init container and the regular container.
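The synchronization through a shared emptyDir volume could look roughly like the sketch below. It is not the PR's actual code: the function names, the marker-file approach, and the timeout are assumptions made for illustration. The idea is that the former init container writes a marker file into the shared volume when its setup is done, and the main container polls for that file before it installs the CNI config.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: names and paths are assumptions, not the PR's code.

# Called by the former init container once its setup has finished.
signal_init_done() {
  local marker="$1"
  touch "$marker"
}

# Called by the main container before it installs the CNI config.
# Polls once per second until the marker exists or the timeout elapses.
wait_for_init() {
  local marker="$1" timeout_secs="${2:-30}" waited=0
  until [ -f "$marker" ]; do
    if [ "$waited" -ge "$timeout_secs" ]; then
      echo "timed out waiting for $marker" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
}
```

Because both containers start together, the main container can do its own early work in parallel and only blocks at the point where it calls `wait_for_init`.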

If an issue # is not available please add repro steps and logs from IPAMD/CNI showing the issue:

Testing done on this change:

I installed the changes into my cluster (1.22) via helm and ran tests on the current production version (v1.12.0) and my change. Below are representative samples across 10 runs.

The "VPC CNI Plugin Initialized" event is when the CNI config file is moved to the /etc/cni/net.d dir, signaling to the container runtime that networking is ready.
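As a rough illustration of how that event can be observed, the sketch below checks whether a CNI config file is present in a directory. The function name is hypothetical, and the check covers only `.conf` and `.conflist` files, which is an assumption about what the runtime looks for.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the directory holds at least one
# CNI config file (*.conf or *.conflist).
cni_config_present() {
  local dir="$1" f
  for f in "$dir"/*.conf "$dir"/*.conflist; do
    # An unmatched glob stays literal, so -e filters it out.
    if [ -e "$f" ]; then
      return 0
    fi
  done
  return 1
}
```

On a node, `cni_config_present /etc/cni/net.d` would start succeeding once the config file has been moved into place.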

Previous aws-node startup timing (9 secs):

|     Event                     | Timestamp | t  |
|-------------------------------|-----------|----|
| VPC CNI Init Container Starts | 05:29:42  | 0s |
| AWS Node Container Starts     | 05:29:49  | 7s |
| VPC CNI Plugin Initialized    | 05:29:51  | 9s |

After this change startup timing (4 secs):

|     Event                     | Timestamp | t  |
|-------------------------------|-----------|----|
| AWS Node Container Starts     | 05:30:30  | 0s |
| VPC CNI Init Container Starts | 05:30:32  | 2s |
| VPC CNI Plugin Initialized    | 05:30:34  | 4s |
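The elapsed times and the ">50%" figure can be re-derived from the timestamps in the two tables. The helper names below are hypothetical; the timestamps are the ones reported above.

```shell
#!/usr/bin/env bash
# Converts HH:MM:SS to seconds since midnight.
# The 10# prefix forces base 10 so values like "09" are not read as octal.
to_secs() {
  local h m s
  IFS=: read -r h m s <<< "$1"
  echo $((10#$h * 3600 + 10#$m * 60 + 10#$s))
}

# Seconds between two HH:MM:SS timestamps on the same day.
elapsed() {
  echo $(( $(to_secs "$2") - $(to_secs "$1") ))
}

before=$(elapsed 05:29:42 05:29:51)   # first container start -> plugin initialized
after=$(elapsed 05:30:30 05:30:34)    # first container start -> plugin initialized
echo "before=${before}s after=${after}s reduction=$(( (before - after) * 100 / before ))%"
# prints: before=9s after=4s reduction=55%
```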

Automation added to e2e:

Will this PR introduce any new dependencies?:

No

Will this break upgrades or downgrades. Has updating a running cluster been tested?:

I have not tested updating a running cluster, but the changes only affect the startup sequence.

Does this change require updates to the CNI daemonset config files to work?:

Yes, this change requires moving the previous init container into the regular containers list in the DaemonSet, adding an emptyDir volume, and mounting that volume in both containers.
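The shape of that DaemonSet change might look like the fragment printed by the sketch below. This is illustrative only and is not the chart's actual manifest: the container names come from the logs in this PR, while the volume name `init-sync` and the mount path `/init-sync` are hypothetical.

```shell
#!/usr/bin/env bash
# Prints an illustrative pod-spec fragment, not the real manifest.
# The volume name "init-sync" and mount path "/init-sync" are hypothetical.
print_pod_spec_fragment() {
  cat <<'EOF'
spec:
  template:
    spec:
      containers:
        - name: aws-node
          volumeMounts:
            - name: init-sync
              mountPath: /init-sync
        - name: aws-vpc-cni-init   # previously listed under initContainers
          volumeMounts:
            - name: init-sync
              mountPath: /init-sync
      volumes:
        - name: init-sync
          emptyDir: {}
EOF
}
```

With both containers in the regular list, each aws-node pod reports two containers instead of one.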

Does this PR introduce any user-facing change?:

No


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@bwagner5 bwagner5 requested a review from a team as a code owner November 11, 2022 06:15
@jayanthvn
Contributor

jayanthvn commented Nov 11, 2022

Need to run https://github.com/aws/amazon-vpc-cni-k8s/blob/master/Makefile#L342 and copy the new manifests here - https://github.com/aws/amazon-vpc-cni-k8s/tree/master/config/master.

Self-managed add-on upgrades/downgrades will reapply the manifests, but we need to confirm the behavior for managed add-on upgrades/downgrades.

@achevuru
Contributor

Did we validate the IPv6 test suite? If not, can we do that as well..

@bwagner5
Contributor Author

Did we validate the IPv6 test suite? If not, can we do that as well..

> cd test/integration/ipv6
> ginkgo -v --fail-on-pending -- \
 --cluster-kubeconfig=$KUBECONFIG \
 --cluster-name=$CLUSTER_NAME \
 --aws-region=$AWS_REGION \
 --aws-vpc-id=$VPC_ID \
 --ng-name-label-key=ng \
 --ng-name-label-val=ipv6
...
Ran 7 of 7 Specs in 485.351 seconds
SUCCESS! -- 7 Passed | 0 Failed | 0 Pending | 0 Skipped
PASS

Ginkgo ran 1 suite in 8m11.459192652s
Test Suite Passed

Contributor

@jdn5126 jdn5126 left a comment

Changes overall look valid to me, and IPv6 integration tests passing gives more confidence. Can you confirm upgrade/downgrade is successful? That's my only other concern.

scripts/entrypoint.sh (outdated, resolved)
@jdn5126
Contributor

jdn5126 commented Nov 17, 2022

@bwagner5 can you update this branch? I can approve after that. As for coverage here, it sounds like you have run all of the ginkgo suites, and the integration suite will be run on merge, so if there are any issues with that, we can revert.

@jayanthvn
Contributor

jayanthvn commented Nov 17, 2022

@jdn5126 - We need to verify MAO upgrades/downgrades before merging. Also we need to get the manifest related changes reviewed by MAO team.

@bwagner5
Contributor Author

I have tested upgrade and downgrade:

> helm repo list
NAME                	URL
karpenter           	https://charts.karpenter.sh
grafana-charts      	https://grafana.github.io/helm-charts
prometheus-community	https://prometheus-community.github.io/helm-charts
eks                 	https://aws.github.io/eks-charts

> helm show chart eks/aws-vpc-cni
apiVersion: v1
appVersion: v1.12.0
description: A Helm chart for the AWS VPC CNI
home: https://github.com/aws/amazon-vpc-cni-k8s
icon: https://raw.githubusercontent.com/aws/eks-charts/master/docs/logo/aws.png
keywords:
- eks
- cni
- networking
- vpc
maintainers:
- email: jayanthvn@users.noreply.github.com
  name: Jayanth Varavani
  url: https://github.com/jayanthvn
name: aws-vpc-cni
sources:
- https://github.com/aws/amazon-vpc-cni-k8s
version: 1.2.0

## Initial install (v1.12.0)
> helm upgrade --install aws-node -n kube-system eks/aws-vpc-cni --set image.region=us-east-2 --set init.image.region=us-east-2

## Observed all pods coming up healthy
> kubectl get pods -A
NAMESPACE     NAME                        READY   STATUS    RESTARTS   AGE
karpenter     karpenter-b5dd75db5-nktbk   1/1     Running   0          17h
karpenter     karpenter-b5dd75db5-q9txh   1/1     Running   0          17h
kube-system   aws-node-7sghv              1/1     Running   0          4s
kube-system   aws-node-qb5s2              1/1     Running   0          18s
kube-system   coredns-5948f55769-9cqk7    1/1     Running   0          4d17h
kube-system   coredns-5948f55769-np758    1/1     Running   0          4d18h
kube-system   kube-proxy-gw5dh            1/1     Running   0          4d16h
kube-system   kube-proxy-qgkvz            1/1     Running   0          4d16h

> kubectl logs po/aws-node-7sghv -n kube-system -c aws-vpc-cni-init
Copying CNI plugin binaries ...
Configure rp_filter loose...
net.ipv4.conf.eth0.rp_filter = 2
2
net.ipv4.tcp_early_demux = 1
CNI init container done

> kubectl logs po/aws-node-7sghv -n kube-system -c aws-node
Defaulted container "aws-node" out of: aws-node, aws-vpc-cni-init (init)
{"level":"info","ts":"2022-11-22T17:15:41.815Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2022-11-22T17:15:41.816Z","caller":"entrypoint.sh","msg":"Install CNI binaries.."}
{"level":"info","ts":"2022-11-22T17:15:41.828Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2022-11-22T17:15:41.829Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
{"level":"info","ts":"2022-11-22T17:15:42.834Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-22T17:15:43.859Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-22T17:15:43.878Z","caller":"entrypoint.sh","msg":"Copying config file ... "}
{"level":"info","ts":"2022-11-22T17:15:43.881Z","caller":"entrypoint.sh","msg":"Successfully copied CNI plugin binary and config file."}
{"level":"info","ts":"2022-11-22T17:15:43.882Z","caller":"entrypoint.sh","msg":"Foregrounding IPAM daemon ..."}

## Upgraded to new version
> helm upgrade --install aws-node -n kube-system charts/aws-vpc-cni/ --set image.override=332273710158.dkr.ecr.us-east-2.amazonaws.com/vpc-cni:both --set init.image.override=332273710158.dkr.ecr.us-east-2.amazonaws.com/vpc-cni:both-init --set env.CLUSTER_ENDPOINT=https://70ea879ae73a1a4348bfa116a3ff8368.gr7.us-east-2.eks.amazonaws.com

> kubectl get pods -A
NAMESPACE     NAME                        READY   STATUS    RESTARTS   AGE
karpenter     karpenter-b5dd75db5-nktbk   1/1     Running   0          17h
karpenter     karpenter-b5dd75db5-q9txh   1/1     Running   0          17h
kube-system   aws-node-6gxhx              2/2     Running   0          2m12s
kube-system   aws-node-gdpm4              2/2     Running   0          119s
kube-system   coredns-5948f55769-9cqk7    1/1     Running   0          4d18h
kube-system   coredns-5948f55769-np758    1/1     Running   0          4d18h
kube-system   kube-proxy-gw5dh            1/1     Running   0          4d17h
kube-system   kube-proxy-qgkvz            1/1     Running   0          4d17h

> kubectl logs po/aws-node-6gxhx -n kube-system -c aws-vpc-cni-init
Copying CNI plugin binaries ...
Configure rp_filter loose...
net.ipv4.conf.eth0.rp_filter = 2
2
net.ipv4.tcp_early_demux = 1
CNI init container done

> kubectl logs po/aws-node-6gxhx -n kube-system -c aws-node
{"level":"info","ts":"2022-11-22T17:29:17.403Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2022-11-22T17:29:17.404Z","caller":"entrypoint.sh","msg":"Install CNI binaries.."}
{"level":"info","ts":"2022-11-22T17:29:17.416Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2022-11-22T17:29:17.417Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
{"level":"info","ts":"2022-11-22T17:29:18.422Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-22T17:29:19.427Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-22T17:29:19.444Z","caller":"entrypoint.sh","msg":"Copying config file ... "}
{"level":"info","ts":"2022-11-22T17:29:19.448Z","caller":"entrypoint.sh","msg":"Successfully copied CNI plugin binary and config file."}
{"level":"info","ts":"2022-11-22T17:29:19.449Z","caller":"entrypoint.sh","msg":"Foregrounding IPAM daemon ..."}

> kubectl scale deploy/inflate --replicas=1
> kubectl get pods
NAMESPACE     NAME                        READY   STATUS    RESTARTS   AGE
default       inflate-75b4f74469-2vk5m    1/1     Running   0          97s
> kubectl scale deploy/inflate --replicas=0

## Downgraded to v1.12.0
> helm upgrade --install aws-node -n kube-system eks/aws-vpc-cni --set image.override=602401143452.dkr.ecr.us-east-2.amazonaws.com/amazon-k8s-cni:v1.12.0 --set init.image.override=602401143452.dkr.ecr.us-east-2.amazonaws.com/amazon-k8s-cni-init:v1.12.0

> kubectl get pods -A
NAMESPACE     NAME                        READY   STATUS    RESTARTS   AGE
karpenter     karpenter-b5dd75db5-nktbk   1/1     Running   0          17h
karpenter     karpenter-b5dd75db5-q9txh   1/1     Running   0          17h
kube-system   aws-node-626jk              1/1     Running   0          6s
kube-system   aws-node-rdnnz              1/1     Running   0          19s
kube-system   coredns-5948f55769-9cqk7    1/1     Running   0          4d18h
kube-system   coredns-5948f55769-np758    1/1     Running   0          4d18h
kube-system   kube-proxy-gw5dh            1/1     Running   0          4d17h
kube-system   kube-proxy-qgkvz            1/1     Running   0          4d17h

> kubectl logs po/aws-node-626jk -n kube-system -c aws-vpc-cni-init
Copying CNI plugin binaries ...
Configure rp_filter loose...
net.ipv4.conf.eth0.rp_filter = 2
2
net.ipv4.tcp_early_demux = 1
CNI init container done

> kubectl logs po/aws-node-626jk -n kube-system -c aws-node
{"level":"info","ts":"2022-11-22T17:36:27.712Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2022-11-22T17:36:27.713Z","caller":"entrypoint.sh","msg":"Install CNI binaries.."}
{"level":"info","ts":"2022-11-22T17:36:27.725Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2022-11-22T17:36:27.726Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
{"level":"info","ts":"2022-11-22T17:36:28.732Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-22T17:36:29.762Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-22T17:36:29.779Z","caller":"entrypoint.sh","msg":"Copying config file ... "}
{"level":"info","ts":"2022-11-22T17:36:29.783Z","caller":"entrypoint.sh","msg":"Successfully copied CNI plugin binary and config file."}
{"level":"info","ts":"2022-11-22T17:36:29.784Z","caller":"entrypoint.sh","msg":"Foregrounding IPAM daemon ..."}

> kubectl scale deploy/inflate --replicas=1
> kubectl get pods
NAMESPACE     NAME                        READY   STATUS    RESTARTS   AGE
default       inflate-75b4f74469-mljzt    1/1     Running   0          40s
> kubectl scale deploy/inflate --replicas=0
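
The manual pod checks in this session could be scripted. A minimal sketch, assuming the default column layout of `kubectl get pods -A` shown above; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Reads `kubectl get pods -A` output on stdin. Succeeds only if at least one
# aws-node pod is listed and every aws-node pod has all containers ready
# (READY n/n) with STATUS Running.
all_aws_node_ready() {
  awk '
    $2 ~ /^aws-node-/ {
      seen = 1
      split($3, r, "/")
      if (r[1] != r[2] || $4 != "Running") bad = 1
    }
    END { exit ((seen && !bad) ? 0 : 1) }
  '
}
```

Usage: `kubectl get pods -A | all_aws_node_ready && echo "aws-node healthy"`.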

securityContext:
containers:
- name: aws-vpc-cni-init
image: "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-init:v1.12.0-11-g69122330"
Contributor

@jayanthvn what should the tag be set to when making a change like this? If we keep it at v1.12.0, then the manifest and image are incompatible, but I doubt we want to release a new tag. Do we guarantee that anyone cloning this should be able to run kubectl apply -f config/master/aws-k8s-cni.yaml?

Contributor

Yeah, we will have to build an RC image and use that tag here, since both the manifest and the entrypoint are changing.

Contributor

@jdn5126 jdn5126 left a comment

Thanks for updating! After addressing these changes, can you also squash and rebase so that this is one commit? Since this is an important change, I think we want a clean, easy-to-read history here.

scripts/entrypoint.sh (outdated, resolved)
scripts/init.sh (outdated, resolved)
config/master/aws-k8s-cni.yaml (resolved)
@jdn5126 jdn5126 removed this from the v1.12.1 milestone Dec 12, 2022
@jdn5126
Contributor

jdn5126 commented Dec 12, 2022

We have to hold off on this change until we can resolve issues with the MAO upgrade path 😞

@github-actions

This pull request is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale Issue or PR is stale label Feb 11, 2023
@bryantbiggs
Member

Shhhh bot, not stale

@jdn5126 jdn5126 removed the stale Issue or PR is stale label Feb 11, 2023
@github-actions

This pull request is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale Issue or PR is stale label Apr 13, 2023
@armenr

armenr commented Apr 13, 2023

No, please Mr. GitHub-bot. Still not stale 😬

@jdn5126 jdn5126 removed the stale Issue or PR is stale label Apr 13, 2023
@github-actions

This pull request is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale Issue or PR is stale label Jun 13, 2023
@alam0rt

alam0rt commented Jun 13, 2023

Would love to see this merged. Looks like it needs a rebase tho

@github-actions github-actions bot removed the stale Issue or PR is stale label Jun 14, 2023
@github-actions

This pull request is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale Issue or PR is stale label Aug 13, 2023
@jayanthvn
Contributor

/not stale

@github-actions github-actions bot removed the stale Issue or PR is stale label Aug 15, 2023
@github-actions

This pull request is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale Issue or PR is stale label Oct 14, 2023
@github-actions

Pull request closed due to inactivity.

@github-actions github-actions bot closed this Oct 29, 2023
Labels
stale Issue or PR is stale
8 participants