ECS ENI Density Increases #7
Comments
Will this benefit EKS workers also?
@MarcusNoble EKS uses secondary IPs, so it already allows for a much higher pod density on each node.
@MarcusNoble Can you please tell us more about your EKS pods-per-node density requirements?
It'd be great if we could make use of some of the smaller instance types (in terms of CPU and memory) but still benefit from being able to run a large number of pods. When we were picking the right instance type, we had to provision far more resources than we need because of the IP limitation, balanced against the cost of running more, smaller instances.
Yes please, this would be valuable: increasing container density on ECS/EKS (whether IP- or port-based). Having a one-pager of max containers per instance flavor would be useful too.
An acceptable level of ENI density would be about 1 ENI per 0.5 vCPU, scaling linearly with instance size, rather than only every other size as it is today.
I would say 1 ENI per 0.5 vCPU would be on the low end. Honestly, at that rate we probably still wouldn't bother with awsvpc networking mode. We regularly run 10-16 tasks on hosts with as few as 2 vCPUs.
I would point out that on other providers this limit is not in place. So, coming in with purely Kubernetes familiarity, I expected the usual hard-coded limit of 110 pods per node. This one caught us a bit off guard. We started migrating from GCP and chose machines in AWS as close to the same size as we could. We began the migration and suddenly pods weren't starting. It was only because we happened to remember reading about IPs per ENI that we were able to figure this out. I can definitely understand the context switching for the CPU and other factors being an issue with traditional EC2, but with much smaller jobs running it would be nice to at least be able to acknowledge these risks and proceed anyway, especially with EKS, where we can set (and are responsible for setting) resource requirements to let Kubernetes best schedule across our node capacity.
I can explain a good use case for this. We currently have an EKS cluster on AWS and an AKS cluster on Azure. On the Azure cluster we run many small pods (approx. 80 per node): they are so small that they can easily fit on the equivalent of an m5.xlarge. Unfortunately, the m5.xlarge allows only 59 pods per node (of which at least 2 are needed by the system itself). So we are basically using the Azure cluster for cost optimization.
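For readers wondering where per-node limits like the one above come from: with the default VPC CNI, the pods-per-node ceiling follows from the instance's ENI count and per-ENI IP limit. Below is a minimal sketch of the commonly cited formula; the per-instance ENI/IP limits are assumptions taken from AWS's published tables and may change, so treat the numbers as illustrative.

```python
# Sketch of the VPC CNI pods-per-node ceiling:
#   max_pods = enis * (ipv4_per_eni - 1) + 2
# (each ENI reserves one primary IP; the +2 accounts for host-networked pods).
# The limits below are assumptions taken from AWS's published ENI tables.

INSTANCE_LIMITS = {
    # instance type: (max ENIs, IPv4 addresses per ENI)
    "t3.medium": (3, 6),
    "m5.large": (3, 10),
    "m5.xlarge": (4, 15),
}

def max_pods(instance_type: str) -> int:
    enis, ips_per_eni = INSTANCE_LIMITS[instance_type]
    return enis * (ips_per_eni - 1) + 2

for itype in INSTANCE_LIMITS:
    print(f"{itype}: ~{max_pods(itype)} pods per node")
# m5.xlarge works out to ~58, in line with the figure quoted above and far
# below the Kubernetes default of 110 pods per node.
```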
Any news on when we can expect an update? We are planning to move workloads to ECS using awsvpc but are currently blocked by this issue. We could use the bridge networking mode for now, but it would be good to know whether an update to this issue is imminent or rather something for next year (both are fine, but information on this would be great).
@peterjuras We are currently actively working on this feature. Targeting a release soon, this year. @emanuelecasadio Please note this issue tracks the progress of ENI density increases for ECS. We are also working on networking improvements for EKS, just not as part of this issue.
@ofiliz Does this mean "calendar year" (i.e., 2019)? We were initially under the impression this feature would be shipping months ago. Until it does ship, awsvpc (and thus App Mesh) is not usable for us.
I second this. I struggle to see App Mesh working for the majority of use cases with ECS given the current ENI limitations and the sole support for awsvpc networking mode. It's a shame there is so much focus on EKS support when Kubernetes already has tons of community support and tooling around service-mesh architectures. Meanwhile today, for ECS, all service-mesh deployments have to be more or less home-rolled due to limited support. I've been patiently waiting, but I'm about to just roll Linkerd out across all of our clusters, because the feature set of App Mesh as it stands is still very limited and this ENI density issue is a non-starter for us. It seems App Mesh was prematurely announced: it only went GA 6 months after the announcement, and it is still effectively unusable for any reasonably sophisticated ECS deployment.
AWS tends to release services as soon as they are useful for some subset of their intended customer base. If you are running reasonably memory-heavy containers then, depending on the instance type you use, you won't hit the ENI limits when using awsvpc networking. While this is a problem for you (and myself), there are clearly some people for whom this is useful, so it makes sense to release it to them before solving the much harder problem of ENI density, or reworking awsvpc networking on ECS to use secondary IPs as EKS does, with network policies on top of security groups.

There's certainly a nice level of simplicity in awsvpc networking: each task gets its own ENI, so you can use AWS networking primitives such as security groups natively. EKS' use of secondary IPs for pods sits on top of the already well-established network policies used by overlay networks in Kubernetes, but for a lot of people this is more complexity than necessary. I personally prefer the simplicity of ECS over Kubernetes for exactly these kinds of decisions.
I've said this before in multiple places.
That's pretty outrageous speculation there. Whatever you do, you're still restricted by the physical limitations of the actual tin, and part of that ENI-per-core thing is just because that's how instances are divided up as part of the physical kit. Even if the networking is entirely virtualised or offloaded, there's still some cost to it, and AWS needs to be able to portion that out to every user of the tin as fairly as possible.
True @tomelliff, but it would lift this entire problem to a different scale.
@joshuabaird @mancej Yes, this calendar year, coming soon. We appreciate your feedback. We are aware that this issue impacts App Mesh on ECS and are working hard to increase the task density without requiring our customers to make any disruptive changes or lose any functionality that they expect from VPC networking using awsvpc mode.
Hi everyone: I'm on the product management team for ECS. We're going to be running an early access period for this feature soon, prior to general availability. If you're interested in participating, please email me at bsheck [at] amazon with your AWS account ID(s). I'll ensure your accounts get access and follow up with more specific instructions when the early access period opens.
With the Amazon ECS agent v1.28.0 released today, support for high-density awsvpc tasks was announced. What's the new limit? Is it more ENIs per EC2 instance? More IP addresses per ENI? Thanks!
@mfortin The agent release today is staged in anticipation of opening up the feature for general availability relatively soon. At that point, we'll publish all the documentation with the various ENI increases on a per-instance basis, and I'll report back here at that time.
@Bensign I sent you an email last month from my corporate email asking to be part of the beta test; we love being guinea pigs ;) If you prefer, I can make this request more official through our TAM.
@abby-fuller Is this limited to the specific families listed in the docs, or does it also include sub-families like c5d?
It is currently limited to the specific instance types listed in the docs. We are working on adding additional instance types.
How does this work? Is there any reason why we wouldn't opt into this mode? Are there any limitations?
Is this actually working for anyone? I have the account setting defined and am running the newest ECS AMI (with the 1.28.1 ECS agent, etc.), but I still can only run 3 tasks on an m5.2xlarge. I don't see that the trunk interface is being provisioned. Talking to support now, but I think they may be stumped as well.
An update: I enabled …
@joshuabaird Yup. I had the same issue. You need to enable the …
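The setting referenced in the truncated comments above is not named in this thread; below is a minimal sketch of how one might check whether ENI trunking is in effect, assuming it is the `awsvpcTrunking` ECS account setting (please confirm against the current ECS documentation).

```python
# Minimal sketch: check the effective ECS account settings with boto3.
# Assumption: the ENI trunking opt-in is the "awsvpcTrunking" account setting,
# and it must be enabled for the principal that registers the container
# instances (not only for the root/console user).
import boto3

ecs = boto3.client("ecs")

resp = ecs.list_account_settings(name="awsvpcTrunking", effectiveSettings=True)
for setting in resp.get("settings", []):
    print(setting["name"], setting["value"], setting.get("principalArn", "(account default)"))
```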
Does this apply just to ECS, or also to EKS? I was directed here by a couple of AWS solution architects before this was closed, and was under the impression it would be usable by EKS as well. The announcement doesn't mention it, though.
Hi @geekgonecrazy, this feature is currently only for ECS. Do you want more pods per node in EKS? Or do you want VPC security groups for each EKS pod? If you can tell us more about your requirements, we can suggest solutions or consider adding such a feature in our roadmap.
To quote my initial comment here 4 months ago: on every other provider we can use the Kubernetes default of 110 pods per node. With EKS we have to get a machine with more interfaces and far more specs than we need just to get 110 pods per node.
Are there any plans to also bring this to the smallest instance types (e.g. t2/t3.micro)? I would mainly plan on using this feature for DEV environments, where we would bin-pack as much as possible; on production environments I don't see as much need.
@ofiliz we have a workload running on a different cloud provider that we would like to move to EKS, but the fact that we cannot allocate 110 pods on a t3.medium or t3.large node is a no-go for us.
@geekgonecrazy @emanuelecasadio Thanks for your feedback. We are working on significantly improving the EKS pods-per-node density, as well as adding other exciting new networking features. We have created a new item in our EKS roadmap: #398
ENI trunking doesn't work when opting in via the console as a non-root user. You would need to opt in as the root user via the console, or run the following command as the root or a non-root user.
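The command itself is not reproduced in this thread. As a hedged sketch, the account-wide opt-in might look like the following with boto3, again assuming the `awsvpcTrunking` setting name; defer to the ECS documentation for the authoritative procedure.

```python
# Hedged sketch of the ENI trunking opt-in via the ECS account-settings API.
# Assumption: the relevant setting is "awsvpcTrunking"; run with credentials
# that are permitted to change account settings.
import boto3

ecs = boto3.client("ecs")

# Account-wide default: applies to all IAM users/roles that don't override it
# (the console opt-in as root is reported above to have the same effect).
ecs.put_account_setting_default(name="awsvpcTrunking", value="enabled")

# Alternatively, opt in only the principal that launches container instances:
# ecs.put_account_setting(name="awsvpcTrunking", value="enabled")
```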
ENI trunking doesn't work for instances launched in a shared VPC subnet: https://docs.aws.amazon.com/vpc/latest/userguide/vpc-sharing.html The instances fail to register to the cluster when launched in a shared VPC with the ENI trunking feature enabled.
Bumping @peterjuras question - will you ever support Running at least …
Due to technical constraints with how ENI trunking works, we do not currently have plans to support t2 and t3.
15 months later, and ENI density remains a major issue for ECS users. Microservice architectures fundamentally need a high density of tasks per CPU to make sense, or else one ends up with ridiculous costs. We need 1 or 2 vCPU instances which can host, say, 10 tasks. Please come up with a way of enabling this, even if it has a performance impact. Customers need to be able to run light microservices with a high density.
I know this probably isn't helpful, but what about an m5.large? 2 vCPUs, supports 10 ENIs with ENI trunking, but it does cost slightly more than a t3.small.
We looked at that, but the commercials do not make sense (EU West 1):

Instance | vCPU | ECU | Memory | Storage | On-demand price | ECS task limit
t3.small | 2 | Variable | 2 GiB | EBS only | $0.0228 per hour | Max 2 ECS tasks
m5.large | 2 | 10 | 8 GiB | EBS only | $0.107 per hour | Max 10 ECS tasks

So the t3.small works out to $0.0114 per task per hour, where each task gets 1 vCPU and 1 GB RAM, while the m5.large works out to about $0.0107 per task per hour, where each task gets only 0.2 vCPU and 0.8 GB RAM. So I'd go with the t3.small, pay $0.50 extra a month per task, and get 5X the CPU. Hence the m5.large is too expensive for what it is compared to using lots of smaller nodes.
Have you considered Fargate? You can go as small as 0.25 vCPU per task, and there are no limits on task density per se because you are not selecting or managing EC2 instances at all with Fargate.
We actually used to run everything on Fargate but found it did not perform nearly as well as plain EC2 instances (there are lots of discussions on this topic) and the pricing was horrific. Also, adding Fargate to the analysis:

t3.small | 2 | Variable | 2 GiB | EBS only | $0.0228 per hour | Max 2 ECS tasks

So the t3.small is $0.0114 per task per hour, where each task gets 1 vCPU and 1 GB RAM ... Fargate is 4X more expensive than t3.small!!!
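To make the comparison concrete, here is a small sketch of the per-task cost arithmetic being discussed. The instance prices are the eu-west-1 figures quoted above; the Fargate per-vCPU and per-GB rates are assumptions for illustration only, so substitute current regional pricing before drawing conclusions.

```python
# Sketch of the per-task cost arithmetic from the comments above.
# Instance prices are the eu-west-1 on-demand figures quoted in this thread;
# the Fargate rates below are assumptions for illustration only.

def per_task_cost(hourly_price: float, max_tasks: int) -> float:
    """Hourly cost per task when the instance is packed to its task limit."""
    return hourly_price / max_tasks

t3_small = per_task_cost(0.0228, 2)   # 2-task awsvpc limit without trunking
m5_large = per_task_cost(0.107, 10)   # 10-task limit with ENI trunking

# Hypothetical Fargate rates (assumptions): per vCPU-hour and per GB-hour.
FARGATE_VCPU_HOUR, FARGATE_GB_HOUR = 0.04048, 0.004445
fargate_task = 1 * FARGATE_VCPU_HOUR + 1 * FARGATE_GB_HOUR  # 1 vCPU, 1 GB task

print(f"t3.small : ${t3_small:.4f} per task-hour")   # ~0.0114
print(f"m5.large : ${m5_large:.4f} per task-hour")   # ~0.0107
print(f"Fargate  : ${fargate_task:.4f} per task-hour"
      f" (~{fargate_task / t3_small:.1f}x t3.small)")
```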
You also have to remember that the t3.small only gives you a baseline of about 20% of each vCPU, not whole vCPUs. So when you factor that into your calculation, it's actually slightly more expensive than an m5.large. Plus, you probably need to assign less EBS storage to one m5.large instance running 10 tasks than you do for 10 t3.small instances. The T-series instance types make a lot of sense for fractional usage of a whole instance, but when you start factoring in being able to run multiple concurrent tasks with ECS, this becomes less important, as you're already able to use fractional computing by assigning more tasks.
Also, if these are microservice instances that could stand to be replaced within 2 minutes, have you considered Spot instances? They're oftentimes 5x+ cheaper than on-demand. Even though they can go away with a 2-minute warning, we've found capacity to be remarkably stable. In practice, if you mix instance types and availability zones, you are very unlikely to be left without capacity when one instance type becomes unavailable, and the Spot allocation strategy uses extra knowledge to try to avoid instance types that are likely to be low on capacity. Also, capacity of instance types in EC2 is set on a per instance-type, per-AZ basis, so even if they run out of m5.xlarge capacity, there'll still be m5.large (for example). And if you're worried about complete instance-type exhaustion, you could use something like https://github.com/AutoSpotting/AutoSpotting, which will start everything as on-demand and swap it for Spot. So even if Spot instances do get exhausted, it would replace capacity with on-demand instances.
@bcluap I came here to say that your pricing analysis needs some considerations, but @waynerobinson already hinted at that. As far as Fargate is concerned, comparing a t3.small 1 vCPU to a Fargate 1 vCPU is not apples to apples unless you factor in the t3 bursting price. All in all, Fargate raw capacity cost is usually around 20% more expensive than similarly spec'ed EC2 costs. Of course this does not take into account the operational savings Fargate can allow customers to achieve (this is a blog centered around EKS/Fargate, but many of the considerations are similar for ECS/Fargate as well). Don't get me wrong: if your workload pattern is such that you can take advantage of the bursting characteristics of T3, and it allows you to burst on an as-needed basis without having to pay for the ENTIRE discrete resources you have available, T3 is the way to go. However, if you are sizing your tasks for full utilization of discrete resources, then M5 (and possibly Fargate, depending on your specific workload patterns) may be cheaper, along with more ENI flexibility. As usual, it depends.
For 24/7 stable workloads, using Fargate is just throwing money away. The operational savings of Fargate only shine if you have to provision capacity dynamically or only for a period of time. Otherwise, you configure your EC2 infrastructure once and then run your tasks 24/7 for a much lower cost than Fargate.

It has been said before that support for T-family instances was not in the plans, but as @mreferre also said, specific use cases for these types of instances are more common than use cases where we occupy 100% of the resources all the time. Any workload aiming for 24/7 availability but with real usage only from 8am to 6pm in a specific time zone should use burstable instance types, and this should cover a big portion (maybe more than half) of all cloud usage. In the end, I don't think anyone should be dictating which instance types should be used for anyone else's workload. It's their own choice, and they will be billed for their choices.

What I want to know is whether there really is a technical problem which cannot be solved that is preventing Amazon from enabling ENI trunking on T2/T3/T4g instances, or whether it is just a business decision forcing customers to use non-burstable instances at higher cost. If there really is a technical blocker, I would like to know what it is, if possible. I have asked a similar question on a related open issue, #1094, but have received no response yet.
Instances running in awsvpc networking mode will have greater allotments of ENIs, allowing for greater task densities.