[EKS]: Reduction in EKS cluster creation time #1227
Reducing the upgrade time would also be nice, especially since the upgrade today involves several manual steps, see #600. If upgrades could be faster, with the same reliability, this would be great. And it might be even more important than the creation time, assuming that clusters are upgraded several times in their lifetime.
Upgrading from 1.18 to 1.19 today took 47 mins for the control plane according to Terraform logs, then 34 mins to upgrade a single very small nodegroup (7 nodes), and then you have to manually update core-dns, kube-proxy and aws-cni. So overall you're talking an absolute minimum of 1.5 hours if you're watching the thing like a hawk and not wasting any time with gaps. This does seem a little crazy and unsustainable :\ I do hope it doesn't keep getting worse with future versions too...
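For readers who haven't hit that last step: a minimal sketch of one common way to refresh the default add-ons after a control plane upgrade, using eksctl. The cluster name and region are placeholders, and the exact commands depend on your tooling and EKS version.

```sh
# Placeholder cluster name/region; refresh the default add-ons after upgrading the control plane.
eksctl utils update-kube-proxy --cluster my-cluster --region eu-west-1 --approve
eksctl utils update-coredns    --cluster my-cluster --region eu-west-1 --approve
eksctl utils update-aws-node   --cluster my-cluster --region eu-west-1 --approve
```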
@billinghamj thanks for the feedback. You are being heard. On a tangent (and not as a means to respond to your specific question/need), I am wondering if you have considered using Fargate in your EKS deployments. Among other advantages, one thing that would be interesting in the context of this thread is that there are no nodes to upgrade, and that AWS embeds, as part of Fargate, the service components you don't need to care about (kubelet/kube-proxy, CNI, log routers, etc.). In your very specific example you would "just" have had to upgrade the control plane and core-dns. Just curious whether you took Fargate into account and, if you did, what made you stick with "regular" EC2 worker nodes.
@mreferre in my opinion, running Fargate in EKS is not cost-effective:
I don't think there's much of a use case to run the main EKS workloads in Fargate. Side note: Price reductions of any kind are always welcome :-)
Thanks for the feedback @heidemn. The raw compute costs of Fargate are (on average) only roughly 20% higher than standard EC2 prices (after a substantial price reduction we announced a while back). Ironically, I think that for very tiny/small workloads Fargate isn't very cost-effective, given that the smallest pod size is 0.25 vCPU/512MB of memory (in most cases it would be more convenient to consolidate many tiny workloads on EC2 instances). However, for larger workloads, assuming your pod utilization is high, Fargate may become cost-effective pretty quickly because there is no worker-node waste (most K8s clusters are utilized at only a fraction of their full capacity, which you are paying for). If you have a real-life example where you concluded that Fargate was more expensive, could you please share it (even offline, I can be reached at mreferre @ amazon dot com)? Also, I did not mean this to become a distraction from the original question in this thread. Thanks! PS: here are a few more considerations around EKS/Fargate economics. I'd like to understand where these assumptions aren't correct (we want to learn more about practical cases where they don't apply).
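To make the utilization argument concrete, here is a rough back-of-the-envelope sketch. All prices are assumed placeholders (very roughly in line with historical us-east-1 on-demand pricing), not authoritative figures; check current AWS pricing before drawing any conclusions.

```sh
# All prices below are assumptions for illustration only, not current AWS pricing.
# Workload: pods requesting a total of 8 vCPU and 16 GiB of memory, for one hour.
awk 'BEGIN {
  fargate_vcpu_hr = 0.04048   # assumed Fargate per-vCPU-hour price
  fargate_gb_hr   = 0.004445  # assumed Fargate per-GB-hour price
  m5_large_hr     = 0.096     # assumed on-demand m5.large (2 vCPU / 8 GiB)

  fargate  = 8 * fargate_vcpu_hr + 16 * fargate_gb_hr  # pay only for what the pods request
  ec2_full = 4 * m5_large_hr                           # 4x m5.large, perfectly bin-packed
  ec2_half = 8 * m5_large_hr                           # same pods on nodes running ~50% empty

  printf "Fargate: $%.3f/h   EC2 @100%% utilization: $%.3f/h   EC2 @50%% utilization: $%.3f/h\n",
         fargate, ec2_full, ec2_half
}'
```

Under these assumed numbers, fully requested Fargate capacity costs about the same per hour as perfectly bin-packed EC2 capacity, while half-empty nodes cost roughly twice as much, which is the trade-off described above.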
Thanks, that blog post is definitely helpful. I will give Fargate on EKS a try soon. To close this (off-)topic:
Our services tend to use around 30MB RAM, are IO bound, and we run hundreds/thousands of them on our non-prod cluster. Even when using exclusively ARM instance types on spot, due to the pods-per-node limits, EKS is doing pretty poorly for us right now cost-wise (with Rancher, we ran the entire thing on a single instance with no performance issues). Obviously Fargate would exacerbate this massively.

Aside from that, we are generally happier having a bit more control, and want as close as possible to a "plain vanilla" K8s setup. History has told us not to trust AWS too much when it comes to behind-the-scenes magic. Before managed spot instances were available, we were quite happy with self-managed nodes too.

On principle, the lack of ARM support is also a complete non-starter for us. We want to push hard for that future, so we're voting with our feet.
@billinghamj this makes a lot of sense. Thanks for the feedback.
We have now introduced a change for 1.19 clusters that reduces control plane creation time by 40%, enabling you to create an EKS control plane in 9 minutes on average. The improvement is also coming to the other supported Kubernetes versions in a few weeks.
@kirtichandak Does this just apply to cluster creation, or are some of these improvements also going to be seen when upgrading a cluster version?
The change that reduces control plane creation time by 40% is now available for all EKS supported versions. This enables you to create an EKS control plane in 9 minutes on average. We are currently working on reducing this time further, and we'll keep using this issue to track upcoming improvements.
@kirtichandak Is this improvement specific to certain regions?
Is there any news (or another issue) on improving the control plane upgrade time? Taking in excess of 45 mins to upgrade a managed system that can be created from scratch in less than 10 mins isn't great and in practice is almost unworkable.
Has there been a regression? It seems that all my cluster creations at 1.21 take approximately 14 minutes :(
This is my observation as well.
I'm in the process of creating a 1.21 cluster and it took 14 mins and 10 secs to complete.
I think the 14-minute time is for both the control plane and the worker nodes.
What would also help greatly with EKS cluster rollouts would be allowing concurrent operations on the cluster, e.g. creating 2 Fargate profiles, enabling control plane logging, and adding an OIDC provider at the same time. Currently we have to use waiters in our CloudFormation stack code to create all these things sequentially.
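Until concurrent mutations are supported, the sequential pattern looks roughly like the sketch below when driven from the AWS CLI (the cluster name, profile names, and role ARN are placeholders); CloudFormation users express the same ordering with DependsOn between resources.

```sh
# Placeholder names/ARNs. Each mutating call has to finish before the next one starts,
# because the cluster only allows one in-flight update at a time.
aws eks create-fargate-profile --cluster-name my-cluster \
  --fargate-profile-name profile-a \
  --pod-execution-role-arn arn:aws:iam::111122223333:role/eks-fargate-pod-role \
  --selectors namespace=team-a
aws eks wait fargate-profile-active --cluster-name my-cluster --fargate-profile-name profile-a

aws eks create-fargate-profile --cluster-name my-cluster \
  --fargate-profile-name profile-b \
  --pod-execution-role-arn arn:aws:iam::111122223333:role/eks-fargate-pod-role \
  --selectors namespace=team-b
aws eks wait fargate-profile-active --cluster-name my-cluster --fargate-profile-name profile-b

# Only then enable control plane logging (another exclusive update).
aws eks update-cluster-config --name my-cluster \
  --logging '{"clusterLogging":[{"types":["api","audit"],"enabled":true}]}'
```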
Our focus recently has been on reducing the time for cluster updates. We are in the process of rolling out changes that will reduce cluster version upgrade time down to ~12 minutes. After that completes, we'll roll out the same improved update workflow for OIDC provider associations and KMS encryption updates.
I just created a 1-node cluster (1.21) in Oregon; it took 21 min end-to-end.
Seeing the same thing on our end. We're halfway into the idea of moving to Rancher entirely. :-\ Unfortunately for us, we're so invested in EKS and all the work we put into it that sunk cost keeps me from getting the OK from our CTO to make the jump.
Seeing creation times over 20 minutes all the time.
@mikestef9 do you have any progress to report on the encryption upgrade times? We're seeing this consistently take over an hour for EKS v1.22 clusters.
16 mins to create the cluster and then another 10 mins to get any nodes in.
I have been following this thread for almost 2 years and it does not appear EKS has gotten any better in creation times. I consistently see 15-20 mins. Being able to spin up and tear down EKS clusters quickly would go a long way toward getting people to use it more. The overhead of running a cluster for extended periods of time makes it so much of a headache that many of our clients decide it's better to use competing tech.
I suspect AWS is not motivated to really reduce the time - if they did, then people would start creating EKS clusters, doing their work, and tearing them down.
@przemolb - I used to work at AWS. I can vouch for the fact that I never heard a single product team, or a single engineer, ever say "let's do […]".

I think a primary reason for the slowdown in cluster provisioning is probably the Auto Scaling API. The Auto Scaling API and Auto Scaling Groups are notoriously OLD and SLOW. Two other components in the EKS stack that slow things down are: 1/ OIDC provider creation/association, 2/ KMS encryption & key association.

IF you want to be able to create/destroy K8s clusters VERY quickly, while still benefiting from all the nice things EKS provides, you can create an EKS cluster, configure it to use Karpenter for node provisioning, and run nested VClusters inside it. This is the architecture I've developed for our Dev & QA environments.

When idle (or when no workloads/environments are scheduled), the Dev/QA EKS clusters run just a single server (cheap). When we need to "spin up" a new "cluster" for a specific workload (or because we need another "environment" for testing), I use VCluster to schedule that into the EKS cluster. Because Karpenter is blazing-fast, those nodes and containers usually come up within 58 seconds (yes, we timed them). When done, we tear them down in seconds, and the EKS cluster goes back to just idling on a single small node.

This way, you can "create a cluster" - via VCluster - in less than 2 minutes, then destroy it in just a few minutes as well. The EKS cluster that hosts the other "nested" VCluster(s) can have a static nodegroup with 1 small node in it, to keep it running all the time, but also keep it very cheap (monthly costs).
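A minimal sketch of that workflow with the vcluster CLI, assuming Karpenter is already installed in the host EKS cluster (the virtual cluster and namespace names are hypothetical):

```sh
# Hypothetical names; assumes the vcluster CLI and a long-running host EKS cluster with Karpenter.
# Create a disposable "virtual cluster" inside the host cluster.
vcluster create dev-feature-x --namespace team-dev

# Run commands against the virtual cluster (Karpenter adds host nodes on demand).
vcluster connect dev-feature-x --namespace team-dev -- kubectl get pods -A

# Tear it down in seconds when finished; the host cluster keeps idling on its small node.
vcluster delete dev-feature-x --namespace team-dev
```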
Thanks @armenr for the hint about VCluster - it seems like a really good workaround for the time required to spin up a new EKS cluster.
In case this helps or is useful to anyone: launching EKS clusters in […]
Launching EKS v1.23 clusters in […]
Appreciate the patience here; changes have now been rolled out that reduce encryption and OIDC association times.
@mikestef9 - Thanks for updating all of us! From my cursory tests, I believe I see improvement. It clocks in at around ~9 minutes 30 seconds. Using the […]
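For anyone who wants to reproduce this kind of measurement (this is not necessarily the configuration used above; the name and region are placeholders), one simple approach is to time a control-plane-only creation with eksctl:

```sh
# Placeholder name/region; measures control plane creation only, with no nodegroup.
time eksctl create cluster --name timing-test --region us-west-2 --without-nodegroup

# Clean up afterwards.
eksctl delete cluster --name timing-test --region us-west-2
```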
us-east-1, twenty-five (25) minutes and counting. Nothing special, no special configs, etc. It must take real innovation to be this bad. But, hey, Uncle Jeff went to space in a really awkward cowboy hat that we are all paying for, so I shouldn't complain, right? And, no, it's not a violation of the community guidelines to make candid, non-personal comments about objective measures of poor performance and the absolutely insane operational and capital priorities of majority shareholders, upper management, etc.
No change, still ~14-15 mins (eu-north-1, 1.24, eksctl) for the cluster. Then node groups etc. take loooong as well.
For comparison: […]
Seven months since my last post and 27 months since the ticket was opened; it still hasn't gotten better. Still sitting in the 15 min range.
@matti - if it helps at all, we avoid some of the additional timing overhead of Node Group(s) creation by just spinning up 1 "base" nodegroup (with either 1 or 2 nodes max)...and then installing Karpenter, and relying on Karpenter to take care of everything else we'd typically delegate to EKS Node Groups. In the abstract, we use different Karpenter Provisioners + Node Templates together to achieve the same logical separation provided by having different/multiple EKS Node Groups.

For everyone else: AFAIK, Node Groups rely on AWS ASGs under the hood...and the ASG mechanism is a notoriously slow - and also old - bit of plumbing...which is exactly why Karpenter was built as an alternative to using Cluster Autoscaler + ASGs.

EKS creation times have appeared to improve incrementally (by ~3 minutes in our experience, in us-west-2), but are still pretty bad when compared to AKS or DigitalOcean. To be quite honest, even as a former Amazonian, it kind of raises some eyebrows when you consider that DigitalOcean manages to provision hosted Kubernetes clusters faster than AWS can.
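For reference, a minimal sketch of the Provisioner-era setup described above (the discovery tag value and CPU limit are assumptions, and newer Karpenter releases have since renamed these CRDs to NodePool/EC2NodeClass):

```sh
# Assumes Karpenter's v1alpha5-era CRDs are installed in the host cluster.
# The discovery tag value "my-cluster" and the limits are placeholders.
kubectl apply -f - <<'EOF'
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
  limits:
    resources:
      cpu: "100"
  providerRef:
    name: default
  ttlSecondsAfterEmpty: 30   # reclaim empty capacity quickly
---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  subnetSelector:
    karpenter.sh/discovery: my-cluster
  securityGroupSelector:
    karpenter.sh/discovery: my-cluster
EOF
```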
@armenr interesting pick'n'mix approach, thank you for sharing.
Latest eksctl, 11 minutes to create the cluster, and then all the spices on top of it.
Latest eksctl, 10 minutes to create a 1.30 EKS cluster.
I don't think sharing provisioning times on this issue is helpful, because it is highly dependent on region, EKS version, configuration, and whatever AWS is doing behind the scenes. Sharing numbers emails everyone watching this 3-year-old issue and doesn't provide Amazon any more information than they already have. Even when provisioning times are adequately fast (which means different things for different people), there will still be other things (scaling, upgrades, etc.) that will be slower than they should be. Many of the past improvements to provisioning speed have reverted back to old numbers, or are disingenuous by not counting worker nodes or by only measuring specific regions and configurations. I don't think this issue sitting in the "Working on it" project section for 3 years earns trust with customers; it should either be moved back to "Researching" or be closed as won't fix.
EKS cluster control plane provisioning time currently averages 15 minutes. We’ll use this issue to track the ongoing improvements we are making to reduce the creation time.
Which service(s) is this request for?
EKS