
[EKS] : Reduction in EKS cluster creation time #1227

Open
kirtichandak opened this issue Jan 15, 2021 · 41 comments
Labels
EKS Amazon Elastic Kubernetes Service

Comments

@kirtichandak

kirtichandak commented Jan 15, 2021

EKS cluster control plane provisioning time currently averages 15 minutes. We’ll use this issue to track the ongoing improvements we are making to reduce the creation time.

Which service(s) is this request for?
EKS

kirtichandak added the Proposed (Community submitted issue) and EKS (Amazon Elastic Kubernetes Service) labels and removed the Proposed (Community submitted issue) label on Jan 15, 2021
@heidemn

heidemn commented Jan 16, 2021

Reducing the upgrade time would also be nice, especially since an upgrade today involves several manual steps (see #600).
When I recently upgraded to 1.15 and 1.16, I remember something like 40–45 minutes per control plane upgrade until EKS reported that the upgrade had fully finished.

If upgrades could be faster, with the same reliability, this would be great. And it might be even more important than the creation time, assuming that clusters are upgraded several times in their lifetime.

@billinghamj

Upgrading from 1.18 to 1.19 today took 47 minutes for the control plane according to Terraform logs, then 34 minutes to upgrade a single very small nodegroup (7 nodes), and then you have to manually update core-dns, kube-proxy and aws-cni.

So overall you're talking an absolute minimum of 1.5 hours if you're watching the thing like a hawk and not wasting any time with gaps. This does seem a little crazy and unsustainable :\

I do hope it doesn't keep getting worse with future versions too...
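For reference, a minimal sketch of that full sequence as CLI steps (cluster and nodegroup names are placeholders; the waiters and eksctl helpers shown are one way to script it, not an official workflow):

# 1. Upgrade the control plane and wait until the cluster is ACTIVE again
aws eks update-cluster-version --name my-cluster --kubernetes-version 1.19
aws eks wait cluster-active --name my-cluster

# 2. Upgrade the managed nodegroup
aws eks update-nodegroup-version --cluster-name my-cluster --nodegroup-name my-nodes
aws eks wait nodegroup-active --cluster-name my-cluster --nodegroup-name my-nodes

# 3. The manual add-on updates mentioned above
eksctl utils update-coredns    --cluster my-cluster --approve
eksctl utils update-kube-proxy --cluster my-cluster --approve
eksctl utils update-aws-node   --cluster my-cluster --approve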

@mreferre

@billinghamj thanks for the feedback. You are being heard. On a tangent (and not as a means to respond to your specific question/need), I am wondering if you have considered using Fargate in your EKS deployments. Among other advantages, one thing that would be interesting in the context of this thread is that there are no nodes to upgrade, and that AWS embeds, as part of the Fargate service, components you don't need to care about (kubelet/proxy, CNI, log routers, etc.). In your extremely specific example you would "just" have had to upgrade the control plane and core-dns. Just curious whether you took Fargate into account and, if you did, what made you stick with "regular" EC2 worker nodes.
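For anyone who hasn't tried it, a rough sketch of what routing a namespace to Fargate looks like (cluster name, profile name, role ARN and namespace below are placeholders):

aws eks create-fargate-profile \
  --cluster-name my-cluster \
  --fargate-profile-name apps \
  --pod-execution-role-arn arn:aws:iam::111122223333:role/eks-fargate-pod-role \
  --selectors namespace=apps
# Pods created in the "apps" namespace are then scheduled onto Fargate,
# so there are no worker nodes to patch or upgrade for that workload.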

@heidemn

heidemn commented Feb 21, 2021

@mreferre in my opinion, running Fargate in EKS is not cost-effective:

  • EKS adds overhead cost for the control plane,
  • Fargate makes compute power more expensive, compared to EC2.

I don't think there's much of a use case to run the main EKS workloads in Fargate.
Maybe it can be used for small tools (e.g. cron jobs), but not for apps that need a lot of CPU.

Side note: Price reductions of any kind are always welcome :-)

@mreferre

Thanks for the feedback @heidemn. The raw compute costs of Fargate are (on average) only roughly 20% more expensive than standard EC2 prices (after a substantial price reduction we announced a while back).

Ironically, I think that for very tiny/small workloads Fargate isn't very cost effective, given that the smallest pod size is 0.25 vCPU/512MB of memory (and in most cases it would be more convenient to consolidate many tiny workloads on EC2 instances). However, for larger workloads, assuming your pod utilization is high, Fargate may become cost effective pretty quickly, given there is no worker-node waste (most K8s clusters are utilized at only a fraction of their full capacity, which you are paying for). If you have a real-life example where you concluded that Fargate was more expensive, could you please share it (even offline, I can be reached at mreferre @ amazon dot com)?

Also, I did not mean this to become a distraction from the original question in this thread.

Thanks!

PS: here are a few more considerations around EKS/Fargate economics. I'd like to understand where these assumptions aren't correct (we want to learn more about practical cases where these assumptions are not applicable).
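To make the utilization argument concrete, a back-of-the-envelope sketch (all numbers below are made-up placeholders, not real AWS prices; only the ~20% premium and the utilization figure drive the comparison):

ec2_per_vcpu_hour=0.05   # hypothetical EC2 price per vCPU-hour
fargate_premium=1.2      # ~20% premium mentioned above
utilization=0.5          # fraction of provisioned EC2 capacity actually used
# Effective cost per *used* vCPU-hour:
echo "EC2:     $(echo "$ec2_per_vcpu_hour / $utilization" | bc -l)"      # 0.10
echo "Fargate: $(echo "$ec2_per_vcpu_hour * $fargate_premium" | bc -l)"  # 0.06

Under these assumptions the 20% premium works out cheaper than paying for a half-idle node; the conclusion flips as node utilization rises.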

@heidemn

heidemn commented Feb 21, 2021

Thanks, that blog post is definitely helpful. I will give Fargate on EKS a try soon.

To close this (off-)topic:
What I could imagine is that starting a Pod might be slower on Fargate than on EC2 (if the instance is already running).
But the benefits of better isolation and not having to maintain servers are definitely not bad.

@billinghamj

billinghamj commented Feb 21, 2021

Our services tend to use around 30MB RAM, are IO bound, and we run hundreds/thousands of them on our non-prod cluster.

Even when using exclusively ARM instance types on spot, due to the pods-per-node limits, EKS is doing pretty poorly for us right now cost-wise (with Rancher, we ran the entire thing on a single instance with no performance issues). Obviously Fargate would exacerbate this massively

Aside from that, we generally are happier having a bit more control, and want as close as possible to a "plain vanilla" K8s setup. History has told us not to trust AWS too much when it comes to behind-the-scenes magic. Before managed spot instances were available, we were quite happy with self-managed nodes too

On principle, lack of ARM support is also a complete non-starter for us. We want to push for that future hard, so we're voting with our feet
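(Side note on the pods-per-node limit mentioned above: with the default VPC CNI settings, the cap comes from ENI/IP capacity rather than CPU or RAM. A quick sketch of the usual formula, with a small instance as the example:)

# Default VPC CNI limit (without prefix delegation):
#   max_pods = ENIs * (IPv4 addresses per ENI - 1) + 2
enis=3; ips_per_eni=6                       # e.g. a t3.medium-class instance
echo $(( enis * (ips_per_eni - 1) + 2 ))    # -> 17 pods per node

Which is why a fleet of tiny 30MB pods can force far more nodes than the CPU/RAM numbers alone would suggest.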

@mreferre

@billinghamj this makes a lot of sense. Thanks for the feedback.

@kirtichandak
Author

kirtichandak commented Feb 25, 2021

We have now introduced a change for 1.19 clusters that reduces control plane creation time by 40%, enabling you to create an EKS control plane in 9 minutes on average. The improvement is also coming to other supported Kubernetes versions in a few weeks.

@mveitas

mveitas commented Feb 25, 2021

@kirtichandak Does this just apply to cluster creation or are some of these improvements going to be seen in upgrading a cluster version?

@kirtichandak
Author

kirtichandak commented Mar 19, 2021

The change to reduce control plane creation time by 40% is now available for all EKS supported versions. This enables you to create an EKS control plane in 9 minutes on average.

We are currently working on reducing this time further and we'll keep using this issue to track upcoming improvements.
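If you want to reproduce these numbers yourself, one minimal way to time bare control-plane creation (role ARN and subnet IDs are placeholders; this deliberately excludes nodegroups, add-ons and any IaC overhead):

time ( \
  aws eks create-cluster \
    --name timing-test \
    --role-arn arn:aws:iam::111122223333:role/eks-cluster-role \
    --resources-vpc-config subnetIds=subnet-aaaa,subnet-bbbb && \
  aws eks wait cluster-active --name timing-test \
)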

@kkapoor1987

@kirtichandak Is this improvement specific to certain regions?

@stevehipwell

Is there any news (or another issue) on improving the control plane upgrade time? Taking in excess of 45 minutes to upgrade a managed system that can be created from scratch in less than 10 minutes isn't great, and in practice is almost unworkable.

@mhulscher

We have now introduced a change for 1.19 clusters that reduces control plane creation time by 40%, enabling you to create an EKS control plane in 9 minutes on average. The improvement is also coming to other supported Kubernetes versions in a few weeks.

Has there been a regression? It seems that all my cluster creations at 1.21 take approximately 14 minutes :(

@przemolb

przemolb commented Nov 4, 2021

We have now introduced a change for 1.19 clusters that reduces control plane creation time by 40%, enabling you to create an EKS control plane in 9 minutes on average. The improvement is also coming to other supported Kubernetes versions in a few weeks.

Has there been a regression? It seems that all my cluster creations at 1.21 take approximately 14 minutes :(

This is my observation as well.

@samsen1

samsen1 commented Dec 7, 2021

I just created a 1.21 cluster and it took 14 minutes and 10 seconds to complete.

@przemolb

przemolb commented Dec 7, 2021

I think the 14-minute time is for both the control plane and the worker nodes.

@gbvanrenswoude

What would also help greatly with EKS cluster rollouts is allowing concurrent operations on the cluster, e.g. creating two Fargate profiles, enabling control plane logging, and adding an OIDC provider at the same time. Currently we have to use waiters in our CloudFormation stack code to create all these things sequentially.
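To illustrate: because EKS currently rejects a new cluster-level operation while another one is still in progress, scripts end up as a strictly sequential chain like this (names and role ARN are placeholders):

aws eks create-fargate-profile --cluster-name my-cluster --fargate-profile-name team-a \
  --pod-execution-role-arn arn:aws:iam::111122223333:role/eks-fargate-pod-role \
  --selectors namespace=team-a
aws eks wait fargate-profile-active --cluster-name my-cluster --fargate-profile-name team-a

# Only once the first profile is ACTIVE can the second one be created
aws eks create-fargate-profile --cluster-name my-cluster --fargate-profile-name team-b \
  --pod-execution-role-arn arn:aws:iam::111122223333:role/eks-fargate-pod-role \
  --selectors namespace=team-b
aws eks wait fargate-profile-active --cluster-name my-cluster --fargate-profile-name team-b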

@mikestef9
Contributor

Our focus recently has been reducing the time for cluster updates. We are in the process of rolling out changes that will reduce cluster version upgrade time down to ~12 minutes. After that completes, we'll roll out the same improved update workflow for OIDC provider associations and KMS encryption updates.
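(For anyone timing these flows: the KMS encryption update referred to here is the associate-encryption-config operation. A sketch of starting one and polling its status, with placeholder names and key ARN:)

update_id=$(aws eks associate-encryption-config \
  --cluster-name my-cluster \
  --encryption-config '[{"resources":["secrets"],"provider":{"keyArn":"arn:aws:kms:us-east-1:111122223333:key/placeholder"}}]' \
  --query 'update.id' --output text)

# Poll until the status reaches Successful (or Failed)
aws eks describe-update --name my-cluster --update-id "$update_id" --query 'update.status'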

@cdharma

cdharma commented Mar 22, 2022

I just created a 1-node cluster (1.21) in Oregon; it took 21 minutes end-to-end.

@armenr

armenr commented May 4, 2022

We have now introduced a change for 1.19 clusters that reduces control plane creation time by 40%, enabling you to create an EKS control plane in 9 minutes on average. The improvement is also coming to other supported Kubernetes versions in a few weeks.

Has there been a regression? It seems that all my cluster creations at 1.21 take approximately 14 minutes :(

Seeing the same thing on our end. We're halfway into the idea of moving to Rancher entirely. :-\

Unfortunately for us, we're so invested in EKS and all the work we put into it that sunk cost keeps me from getting the OK from our CTO to make the jump.

@matti

matti commented May 4, 2022

Seeing creation times over 20 minutes all the time

@stevehipwell

@mikestef9 do you have any progress to report on the encryption upgrade times? We're seeing this consistently take over an hour for EKS v1.22 clusters.

@matti

matti commented Aug 18, 2022

16 minutes to create the cluster and then another 10 minutes to get any nodes in.

@hangtime79

I have been following this thread for almost 2 years and it does not appear EKS creation times have gotten any better. I consistently see 15–20 minutes. Being able to spin up and tear down EKS clusters quickly would go a long way toward getting people to use it more. The overhead of running a cluster for extended periods of time makes it so much of a headache that many of our clients decide it's better to use competing tech.

@przemolb

I suspect AWS is not motivated to really reduce the time; if they did, people would start creating EKS clusters, doing their work, and tearing them down.

@armenr

armenr commented Sep 30, 2022

@przemolb - I used to work at AWS. I can vouch for the fact that I never heard a single product team, or a single engineer ever say "let's do << X >> in order to lock the customer in/make it hard for them to achieve <<Y>>."

I think a primary reason for the slowdown in cluster provisioning is probably the Auto Scaling API. The Auto Scaling API and Auto Scaling Groups are notoriously OLD and SLOW.

Two other components in the EKS stack that slow things down are: 1/ OIDC Provider creation/association, 2/ KMS Encryption & key association

IF you want to be able to create/destroy K8s clusters VERY quickly, while still benefiting from all the nice things EKS provides, you can create an EKS cluster, configure it to use karpenter instead of cluster-autoscaler and then COMBINE that architecture with VCluster.

This is the architecture I've developed for our Dev & QA environments.

  1. Create a "Dev" EKS cluster
  2. Install/configure Karpenter (it's straightforward and not difficult if you follow the docs and know what you're doing)
  3. Use VCluster to schedule/deploy "nested clusters" inside of it

When idle (or when no workloads/environments are scheduled), the Dev/QA EKS clusters run just 1 single server (cheap).

When we need to "spin up" a new "cluster" for a specific workload (or because we need another "environment" for testing), I use VCluster to schedule that into the EKS cluster.

Because Karpenter is blazing-fast, those nodes and containers usually come up within 58 seconds (yes, we timed them).

When done, we tear them down in seconds, and the EKS cluster goes back to just idling on a single small node.

This way, you can "create a cluster" - via VCluster - in less than 2 minutes, then destroy it in just a few minutes as well. The EKS cluster that hosts the other "nested" VCluster(s) can have a static nodegroup with 1 small node in it, to keep it running all the time, but keep it very cheap (monthly costs) as well.
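For anyone curious, a rough sketch of the vcluster part of that workflow (assuming the vcluster CLI is installed and kubectl is pointed at the host EKS cluster; names are placeholders):

# Spin up a nested "cluster" inside the host EKS cluster
vcluster create dev-env-1 --namespace dev-env-1

# Point kubectl at the virtual cluster and deploy whatever needs testing
vcluster connect dev-env-1 --namespace dev-env-1

# Tear it down in seconds; Karpenter then scales the host nodes back down
vcluster delete dev-env-1 --namespace dev-env-1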

@przemolb

Thanks @armenr for the hint about VCluster; it seems like a really good workaround for the time required to spin up a new EKS cluster.

@armenr

armenr commented Oct 13, 2022

In case this helps or is useful to anyone:

Launching EKS clusters in us-west-2 is averaging ~12 minutes this week.

@stevehipwell

Launching EKS v1.23 clusters in eu-west-1 is also averaging ~12 minutes this week.

@mikestef9
Contributor

Appreciate the patience here, changes have now been rolled out for reduction in encryption and OIDC association time.

@armenr

armenr commented Nov 10, 2022

@mikestef9 - Thanks for updating all of us!

From my cursory tests, I believe I see an improvement. It clocks in at around 9 minutes 30 seconds.

Using the /examples/karpenter code in the EKS Blueprints repo (https://github.com/aws-ia/terraform-aws-eks-blueprints):

module.eks_blueprints.module.aws_eks.aws_eks_cluster.this[0]: Still creating... [10s elapsed]
module.eks_blueprints.module.aws_eks.aws_eks_cluster.this[0]: Still creating... [20s elapsed]
module.eks_blueprints.module.aws_eks.aws_eks_cluster.this[0]: Still creating... [30s elapsed]
...
module.eks_blueprints.module.aws_eks.aws_eks_cluster.this[0]: Creation complete after 9m38s [id=stelthos-karpenteris]

@24601

24601 commented Nov 27, 2022

us-east-1, twenty five (25) minutes and counting. Nothing special, no special configs, etc.

It must take real innovation to be this bad. But, hey, Uncle Jeff went to space in a really awkward cowboy hat that we all are paying for so I shouldn't complain, right?

And, no, it's not a violation of the community guidelines to make candid, non-personal comments about objective measures of poor performance and absolutely insane operational & capital priorities of majority shareholders, upper management, etc.

@matti

matti commented Dec 10, 2022

no change, still ~14-15mins (eu-north-1, 1.24, eksctl) for the cluster. then node groups etc take loooong as well.

2022-12-10 14:02:32 [ℹ]  deploying stack "eksctl-test-3-cluster"
2022-12-10 14:03:02 [ℹ]  waiting for CloudFormation stack "eksctl-test-3-cluster"
2022-12-10 14:03:32 [ℹ]  waiting for CloudFormation stack "eksctl-test-3-cluster"
...
2022-12-10 14:12:34 [ℹ]  waiting for CloudFormation stack "eksctl-test-3-cluster"
2022-12-10 14:13:34 [ℹ]  waiting for CloudFormation stack "eksctl-test-3-cluster"
2022-12-10 14:15:38 [ℹ]  daemonset "kube-system/aws-node" restarted

@eduanb

eduanb commented Feb 2, 2023

For comparison (minutes:seconds):

  • GKE: 7:55
  • AKS: 3:45
  • EKS: 15:00
  • DOKS: 5:10
  • LKS: 1:46
  • Scaleway: 6:52
  • IBM Cloud: 54:39
  • OVH: 7:47

Ignoring the IBM outlier, EKS is 2.7 times slower than the average.
https://medium.com/@elliotgraebert/comparing-the-top-eight-managed-kubernetes-providers-2ae39662391b

@hangtime79

hangtime79 commented Apr 10, 2023

I have been following this thread for almost 2 years and it does not appear EKS creation times have gotten any better. I consistently see 15–20 minutes. Being able to spin up and tear down EKS clusters quickly would go a long way toward getting people to use it more. The overhead of running a cluster for extended periods of time makes it so much of a headache that many of our clients decide it's better to use competing tech.

Seven months since my last post and 27 months since the ticket was opened; still hasn't gotten better. Still sitting in the 15 min range.

@armenr

armenr commented Apr 11, 2023

@matti - if it helps at all, we avoid some of the additional timing overhead of Node Group(s) creation by just spinning up 1 "base" nodegroup (with either 1 or 2 nodes max)...and then installing Karpenter, and relying on Karpenter to take care of everything else we'd typically delegate to EKS Node Groups. In the abstract, we use different Karpenter Provisioners + Node Templates together to achieve the same logical separation provided by having different/multiple EKS Node Groups.

For everyone else: AFAIK, Node Groups rely on AWS ASGs under the hood...and the ASG mechanism is a notoriously slow - and also old - bit of plumbing...which is exactly why Karpenter was built as an alternative to using Cluster AutoScaler + ASGs.

EKS creation times have appeared to improve incrementally (by ~3 minutes from our experience, in us-west-2), but are still pretty bad when compared to AKS or Digital Ocean. To be quite honest, even as a former Amazonian, it kind of raises some eyebrows when you consider that Digital Ocean manages to provision hosted Kubernetes clusters faster than AWS can.
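A minimal sketch of the "base" nodegroup approach described above (names, role ARN, subnets and instance type are placeholders; the Karpenter install itself is out of scope here):

aws eks create-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name base \
  --node-role arn:aws:iam::111122223333:role/eks-node-role \
  --subnets subnet-aaaa subnet-bbbb \
  --scaling-config minSize=1,maxSize=2,desiredSize=1 \
  --instance-types t3.medium
# Karpenter (installed onto this base capacity) then provisions everything
# else on demand, instead of creating additional managed nodegroups.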

@matti

matti commented Apr 11, 2023

@armenr interesting pick'n'mix approach, thank you for sharing.

@matti

matti commented Dec 15, 2023

Latest eksctl, 11 minutes to create the cluster and then all the spices on top of it.

2023-12-15 13:53:02 [ℹ]  building cluster stack "eksctl-t-2-cluster"
2023-12-15 13:53:03 [ℹ]  deploying stack "eksctl-t-2-cluster"
2023-12-15 13:53:33 [ℹ]  waiting for CloudFormation stack "eksctl-t-2-cluster"
2023-12-15 13:54:03 [ℹ]  waiting for CloudFormation stack "eksctl-t-2-cluster"
2023-12-15 13:55:03 [ℹ]  waiting for CloudFormation stack "eksctl-t-2-cluster"
2023-12-15 13:56:04 [ℹ]  waiting for CloudFormation stack "eksctl-t-2-cluster"
2023-12-15 13:57:04 [ℹ]  waiting for CloudFormation stack "eksctl-t-2-cluster"
2023-12-15 13:58:05 [ℹ]  waiting for CloudFormation stack "eksctl-t-2-cluster"
2023-12-15 13:59:05 [ℹ]  waiting for CloudFormation stack "eksctl-t-2-cluster"
2023-12-15 14:00:05 [ℹ]  waiting for CloudFormation stack "eksctl-t-2-cluster"
2023-12-15 14:01:05 [ℹ]  waiting for CloudFormation stack "eksctl-t-2-cluster"
2023-12-15 14:02:06 [ℹ]  waiting for CloudFormation stack "eksctl-t-2-cluster"
2023-12-15 14:04:10 [ℹ]  daemonset "kube-system/aws-node" restarted

@matti

matti commented Jun 5, 2024

Latest eksctl, 10 minutes to create a 1.30 EKS cluster.

@rothgar

rothgar commented Jun 5, 2024

I don't think sharing provisioning times on this issue is helpful, because timings are highly dependent on region, EKS version, configuration, and whatever AWS is doing behind the scenes.

Sharing numbers emails everyone watching this 3-year-old issue and doesn't provide Amazon any more information than they already have. Even when provisioning times are adequately fast (which means different things for different people), there will still be other things (scaling, upgrades, etc.) that will be slower than they should be.

Many of the past improvements to provisioning speed have reverted to old numbers, or are disingenuous because they don't count worker nodes or only measure specific regions and configurations.

I don't think this issue sitting in the "Working on it" project section for 3 years earns trust with customers; it should either be moved back to researching or closed as won't fix.
