
ECS ENI Density Increases #7

Closed · abby-fuller opened this issue Nov 28, 2018 · 53 comments

Labels
ECS Amazon Elastic Container Service

Comments

@abby-fuller
Contributor

abby-fuller commented Nov 28, 2018

Instances will receive greater allotments of ENIs, allowing for higher densities of tasks running in awsvpc networking mode.
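As background, each awsvpc task currently consumes its own ENI, so task density is bounded by the instance's ENI limit. A quick way to look that limit up (the instance type below is just an example):

```sh
# Show the ENI limit that currently caps awsvpc task density on an instance type.
# m5.2xlarge is only an example; one ENI is always reserved for the primary interface.
aws ec2 describe-instance-types \
  --instance-types m5.2xlarge \
  --query 'InstanceTypes[0].NetworkInfo.MaximumNetworkInterfaces' \
  --output text
```

With 4 ENIs on an m5.2xlarge and one reserved for the primary interface, that leaves room for only 3 awsvpc tasks without trunking.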

@abby-fuller abby-fuller added the ECS Amazon Elastic Container Service label Nov 28, 2018
@Bensign Bensign changed the title ENI Trunking Support for ECS ENI Density Increases for ECS Dec 5, 2018
@Akramio Akramio changed the title ENI Density Increases for ECS ECS ENI Density Increases Dec 5, 2018
@MarcusNoble

Will this benefit EKS workers also?

@FernandoMiguel

@MarcusNoble EKS uses secondary IPs, so it already allows for a much higher pod density per node.
If we get higher ENI density on EC2, maybe it can benefit EKS too, but right now this is a much smaller issue there than it is for ECS.

@ofiliz

ofiliz commented Dec 13, 2018

@MarcusNoble Can you please tell us more about your EKS pods-per-node density requirements?

@MarcusNoble

It'd be great if we could make use of some of the smaller instance types (in terms of CPU and memory) but still benefit from being able to run a large number of pods. When we were picking the right instance type, we had to choose far more resources than we needed because of the IP limitation, once we balanced it against the cost of running more, smaller instances.

@jpoley

jpoley commented Dec 15, 2018

Yes please, this would be valuable: increasing container density on ECS/EKS (no matter whether IP- or port-based). A one-pager listing the maximum containers per instance flavor would be useful too.

@jespersoderlund

An acceptable level of ENI density would be about 1 ENI / 0.5 vCPU, scaling linearly with instance size rather than only at every other size step as it does today.

@mancej

mancej commented Jan 5, 2019

An acceptable level of ENI density would be about 1 ENI / 0.5 vCPU, scaling linearly with instance size rather than only at every other size step as it does today.

I would say 1 ENI / 0.5 vCPU would be on the low end. Honestly, at that rate we probably still wouldn't bother with awsvpc networking mode. We regularly run 10-16 tasks on hosts with as few as 2 vCPUs.

@geekgonecrazy

I would point out that on other providers this limit is not in place. So, coming in with purely a k8s background, I expected the only limit to be the hard-coded 110 pods per node.

This one caught us a bit off guard. We started migrating from GCP and chose machines in AWS as close in size as we could to what we had. We started the migration and suddenly pods weren't starting.

It was only because we happened to remember reading about IPs per ENI that we were able to figure this out.

I can definitely understand context switching on the CPU and other factors being an issue with traditional EC2. But with much smaller jobs running, it would be nice to at least be able to acknowledge those risks and do it anyway.

Especially with EKS, where we can (and are responsible for) setting resource requests to let k8s schedule as well as possible across our node capacity.

@emanuelecasadio

emanuelecasadio commented Mar 29, 2019

I can explain a good use case for this. We currently have an EKS cluster on AWS and an AKS cluster on Azure.

On the Azure cluster we run many small pods (approx. 80 pods per node): they are so small that they easily fit on the equivalent of an m5.xlarge. Unfortunately, the m5.xlarge allows only 59 pods per node (of which at least 2 are needed by the system itself).

So we are basically using the Azure cluster for cost optimization.
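For reference, the pods-per-node numbers here fall out of the VPC CNI's IP-address math. A rough sketch of the commonly used formula, maxPods = ENIs × (IPv4 addresses per ENI − 1) + 2, pulling both inputs from the EC2 API (the instance type is just an example):

```sh
# Estimate the default VPC CNI pod capacity for an instance type.
# Assumed formula: maxPods = ENIs * (IPv4 addresses per ENI - 1) + 2
aws ec2 describe-instance-types \
  --instance-types m5.xlarge \
  --query 'InstanceTypes[0].NetworkInfo.[MaximumNetworkInterfaces,Ipv4AddressesPerInterface]' \
  --output text | awk '{ print $1 * ($2 - 1) + 2 }'
```

For m5.xlarge (4 ENIs with 15 IPv4 addresses each) this works out to the roughly 58-59 pods per node mentioned above.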

@peterjuras

Any news on when we can expect an update? We are planning to move workloads to ECS using awsvpc but are currently blocked by this issue. We could use bridge networking mode for now, but for that it would be good to know whether an update to this issue is imminent or rather something for next year (both are fine, but information on this would be great).

@ofiliz

ofiliz commented Apr 7, 2019

@peterjuras We are currently actively working on this feature. Targeting a release soon, this year.

@emanuelecasadio Please note this issue tracks the progress of ENI density increases for ECS. We are also working on networking improvements for EKS, just not as part of this issue.

@joshuabaird

@ofiliz Does this mean "calendar year" (i.e., 2019)? We were initially under the impression this feature would be shipping months ago. Until it does ship, awsvpc (and thus App Mesh) is not usable for us.

@mancej

mancej commented Apr 8, 2019

@ofiliz Does this mean "calendar year" (i.e., 2019)? We were initially under the impression this feature would be shipping months ago. Until it does ship, awsvpc (and thus App Mesh) is not usable for us.

I second this, I struggle to see AppMesh working for the majority of use cases with ECS given the current ENI limitations and sole support for awsvpc networking mode. It's a shame there is so much focus on EKS support when K8s already has tons of community support and tooling around service-mesh architectures. Meanwhile today, for ECS, all service-mesh deployments have to be more or less home-rolled due to limited support.

I've been patiently waiting, but I'm about to just roll Linkerd out across all of our clusters because the feature set of AppMesh as is right now is still very limited, and this ENI density issue is a non-starter for us. It seems AppMesh was prematurely announced, since it's just now GA 6 months after announcement, and is still effectively unusable for any reasonably sophisticated ECS deployments.

@tomelliff

AWS tends to release services as soon as they are useful for some subset of the intended customer base. If you are running reasonably memory-heavy containers then, depending on the instance type you use, you won't hit the ENI limits when using awsvpc networking.

While this is a problem for you (and for me), there are clearly people for whom this is already useful, so it makes sense to release it to them before solving the much harder problem of ENI density, or reworking awsvpc networking on ECS to use secondary IPs as EKS does, with network policies on top of security groups.

There's certainly a nice level of simplicity in awsvpc networking: each task gets its own ENI, so you can use AWS networking primitives such as security groups natively. EKS' use of secondary IPs for pods sits on top of the already well-established network policies used by overlay networks in Kubernetes, but for a lot of people that is more complexity than necessary.

I personally prefer the simplicity of ECS over Kubernetes for exactly these types of decisions.
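To make the "security groups natively per task" point concrete, here's a minimal sketch of an awsvpc service where subnets and a security group attach directly to each task's ENI (the cluster, task definition, subnet, and security group IDs are placeholders):

```sh
# Create an ECS service in awsvpc mode; the security group applies per task ENI.
# All identifiers below are placeholders.
aws ecs create-service \
  --cluster my-cluster \
  --service-name my-service \
  --task-definition my-task:1 \
  --desired-count 2 \
  --launch-type EC2 \
  --network-configuration 'awsvpcConfiguration={subnets=[subnet-0abc1234],securityGroups=[sg-0abc1234]}'
```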

@FernandoMiguel

I've said this before in multiple places:
having a native SG per ENI is a huge benefit for any org.
Powered by Nitro technology, it should be possible to create a new instance family that removes the ENI-per-vCPU/core limit that currently constrains EC2.

@tomelliff

That's pretty outrageous speculation there.

Whatever you do, you're still restricted by the physical limitations of the actual tin, and part of that ENI-per-core limit is simply how instances are divided up across the physical kit. Even if the networking is entirely virtualised or offloaded, there's still some cost to it, and AWS needs to be able to portion that out to every user of the tin as fairly as possible.

@FernandoMiguel

True @tomelliff, but it would lift this entire problem to a different scale.

@ofiliz

ofiliz commented Apr 17, 2019

@joshuabaird @mancej Yes, this calendar year, coming soon. We appreciate your feedback. We are aware that this issue impacts App Mesh on ECS, and we are working hard to increase task density without requiring our customers to make any disruptive changes or lose any functionality they expect from VPC networking in awsvpc mode.

@Bensign

Bensign commented Apr 17, 2019

Hi everyone: I'm on the product management team for ECS. We're going to be doing an early access period soon for this feature prior to being generally available.

If you're interested in participating, please email me at bsheck [at] amazon with your AWS account ID(s). I'll ensure your accounts get access and follow up with more specific instructions when the early access period opens up.

@mfortin

mfortin commented May 16, 2019

With the Amazon ECS agent v1.28.0 released today, support for high-density awsvpc tasks was announced. What's the new limit? Is it more ENIs per EC2 instance? More IP addresses per ENI?
We have instances running as many as 120 tasks on them, and we're wondering where the limit is now.

Thanks!

@Bensign

Bensign commented May 16, 2019

@mfortin The agent release today is staged in anticipation of opening the feature up for general availability relatively soon. At that point, we'll publish the documentation with the various ENI increases on a per-instance basis, and I'll report back here.

@mfortin

mfortin commented May 16, 2019

@Bensign I sent you an email last month from my corporate email asking to be part of the beta test; we love being guinea pigs ;) If you prefer, I can make this request more official through our TAM.

@abby-fuller
Contributor Author

shipped: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-account-settings.html
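A minimal sketch of opting in, assuming the account setting named in the linked docs (run with credentials allowed to change account defaults; see the comments below about the root user):

```sh
# Enable ENI trunking as the account-wide default, then confirm the effective setting.
aws ecs put-account-setting-default --name awsvpcTrunking --value enabled
aws ecs list-account-settings --effective-settings --name awsvpcTrunking
```

Note that only container instances registered after opting in should pick up a trunk ENI; instances that were already registered won't.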

@FernandoMiguel

@abby-fuller is this limited to the specific families listed on the docs, or does it also include sub families like c5d?

@coultn

coultn commented Jun 6, 2019

@abby-fuller is this limited to the specific families listed on the docs, or does it also include sub families like c5d?

It is currently limited to the specific instance types listed in the docs. We are working on adding additional instance types.

@joshuabaird

https://docs.aws.amazon.com/AmazonECS/latest/developerguide/container-instance-eni.html

@sargun

sargun commented Jun 8, 2019

How does this work? Is there any reason why we wouldn't opt into this mode? Are there any limitations?

@joshuabaird

Is this actually working for anyone? I have the account setting defined and am running the newest ECS AMI (with the 1.28.1 ECS agent, etc.), but I can still only run 3 tasks on an m5.2xlarge. I don't see the trunk interface being provisioned. Talking to support now, but I think they may be stumped as well.
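In case it helps anyone debugging the same thing, one way to check whether a trunk ENI was actually attached (untested sketch; the instance ID is a placeholder):

```sh
# List the ENIs attached to a container instance along with their interface types;
# a trunked instance should show an extra interface whose type is "trunk".
aws ec2 describe-network-interfaces \
  --filters Name=attachment.instance-id,Values=i-0123456789abcdef0 \
  --query 'NetworkInterfaces[].[NetworkInterfaceId,InterfaceType,Description]' \
  --output table
```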

@joshuabaird

An update: I enabled awsvpcTrunking for the account using a non-root account/role. This role was also used to provision the ECS container instance and the ECS service, but ENI trunking was still not working/available. We then logged into the ECS console using the root account and enabled the setting (which sets the default setting for the entire account). After doing this, ENI trunking started working as expected.

@iwarshak

@joshuabaird Yup. I had the same issue. You need to enable awsvpcTrunking as the root user. It's not obvious.

@geekgonecrazy

geekgonecrazy commented Jun 11, 2019

Does this apply just to ECS, or also to EKS? I was directed here by a couple of AWS solutions architects before this was closed, and was under the impression it would be usable by EKS as well. The announcement doesn't mention it, though.

@ofiliz

ofiliz commented Jun 11, 2019

Hi @geekgonecrazy, this feature is currently only for ECS. Do you want more pods per node in EKS? Or do you want VPC security groups for each EKS pod? If you can tell us more about your requirements, we can suggest solutions or consider adding such a feature in our roadmap.

@geekgonecrazy

@ofiliz

I would point out that on other providers this limit is not in place. So, coming in with purely a k8s background, I expected the only limit to be the hard-coded 110 pods per node.

This one caught us a bit off guard. We started migrating from GCP and chose machines in AWS as close in size as we could to what we had. We started the migration and suddenly pods weren't starting.

It was only because we happened to remember reading about IPs per ENI that we were able to figure this out.

I can definitely understand context switching on the CPU and other factors being an issue with traditional EC2. But with much smaller jobs running, it would be nice to at least be able to acknowledge those risks and do it anyway.

Especially with EKS, where we can (and are responsible for) setting resource requests to let k8s schedule as well as possible across our node capacity.

To quote my initial comment here 4 months ago.

On every other provider we can use the k8s default of 110 pods per node. With EKS we have to get a machine with more interfaces and far more specs than we need just to reach 110 pods per node.

@peterjuras

Are there any plans to also bring this to the smallest instance types (e.g. t2/t3.micro)? I would mainly plan on using this feature for DEV environments, where we would bin-pack as much as possible; on production environments I don't see as much need.

@emanuelecasadio

emanuelecasadio commented Jul 10, 2019

@ofiliz we have a workload running on a different cloud provider that we would like to move to EKS, but the fact that we cannot allocate 110 pods on a t3.medium or t3.large node is a no-go for us.

@ofiliz

ofiliz commented Jul 18, 2019

@geekgonecrazy @emanuelecasadio Thanks for your feedback. We are working on significantly improving the EKS pods-per-node density, as well as adding other exciting new networking features. We have created a new item in our EKS roadmap: #398

@mailjunze

ENI trunking doesn't work when opting in via the console as a non-root user. You would need to opt in as the root user via the console, or run the following command as a root or non-root user:
aws ecs put-account-setting-default --name awsvpcTrunking --value enabled --region
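If setting the account-wide default isn't desired, the per-principal variant should also work (the ARN below is a placeholder; specifying the account root sets the default for every user and role that doesn't explicitly override it):

```sh
# Opt a specific principal (here, the account root) in to ENI trunking.
aws ecs put-account-setting --name awsvpcTrunking --value enabled \
  --principal-arn arn:aws:iam::123456789012:root
```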

@pradeepkamath007

ENI trunking doesn't work for instances launched in a shared VPC subnet: https://docs.aws.amazon.com/vpc/latest/userguide/vpc-sharing.html

The instances fail to register to the cluster when launched in a shared VPC with the ENI trunking feature enabled.

@tomaszdudek7

Bumping @peterjuras's question - will you ever support the t2/t3 family?

Running at least the c5 family on dev/qa/preprod environments costs way too much.

@coultn

coultn commented Oct 22, 2019

Due to technical constraints with how ENI trunking works, we do not currently have plans to support t2 and t3.

@bcluap

bcluap commented Jan 21, 2021

15 months later and ENI density remains a major issue for ECS users. Microservice architectures fundamentally need a high density of tasks per CPU to make sense or else one ends up with ridiculous costs. We need 1 or 2 VCPU instances which can host say 10 tasks. Please come up with a way of enabling this even if it has a performance impact. Customers need to be able to run light microservices with a high density.

@joshuabaird

I know this probably isn't helpful, but what about an m5.large? 2 vCPUs, supports 10 ENIs with ENI trunking, but it does cost slightly more than a t3.small.

@bcluap

bcluap commented Jan 22, 2021

We looked at that but the commercials do not make sense (EU West 1):

| Instance | vCPU | ECU | Memory | Storage | On-demand price | Max ECS tasks |
| --- | --- | --- | --- | --- | --- | --- |
| t3.small | 2 | Variable | 2 GiB | EBS only | $0.0228 per hour | 2 |
| m5.large | 2 | 10 | 8 GiB | EBS only | $0.107 per hour | 10 |

So t3.small is $0.0114 per task per hour, where each task gets 1 vCPU and 1 GB RAM.
So m5.large is $0.0107 per task per hour, where each task gets 0.2 vCPU and 0.8 GB RAM.

So I'd go with t3.small, pay an extra $0.50 a month per task, and get 5X the CPU. Hence the m5.large is too expensive for what it is compared to using lots of smaller nodes.

@coultn

coultn commented Jan 22, 2021

15 months later and ENI density remains a major issue for ECS users. Microservice architectures fundamentally need a high density of tasks per CPU to make sense or else one ends up with ridiculous costs. We need 1 or 2 VCPU instances which can host say 10 tasks. Please come up with a way of enabling this even if it has a performance impact. Customers need to be able to run light microservices with a high density.

Have you considered Fargate? You can go as small as 0.25 vCPU per task and there are no limits on task density per se because you are not selecting or managing EC2 instances at all with Fargate.
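For anyone weighing that option, a minimal sketch of a small Fargate-sized task definition (0.25 vCPU / 0.5 GB; the family, image, and role ARN below are placeholders):

```sh
# Register a small Fargate task definition; cpu/memory use ECS units (256 = 0.25 vCPU, 512 MiB).
aws ecs register-task-definition \
  --family tiny-microservice \
  --requires-compatibilities FARGATE \
  --network-mode awsvpc \
  --cpu 256 --memory 512 \
  --execution-role-arn arn:aws:iam::123456789012:role/ecsTaskExecutionRole \
  --container-definitions '[{"name":"app","image":"nginx:latest","essential":true}]'
```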

@bcluap

bcluap commented Jan 22, 2021

We actually used to run everything on fargate but found it did not perform nearly as well as plain EC2 instances (there are lots of discussions on this topic) and pricing was horrific.

Also, adding fargate to the analysis:

| Instance | vCPU | ECU | Memory | Storage | On-demand price | Max ECS tasks |
| --- | --- | --- | --- | --- | --- | --- |
| t3.small | 2 | Variable | 2 GiB | EBS only | $0.0228 per hour | 2 |
| m5.large | 2 | 10 | 8 GiB | EBS only | $0.107 per hour | 10 |

Fargate: $0.04048 per vCPU-hour and $0.004445 per GB-hour of RAM

So t3.small is $0.0114 per task per hour, where each task gets 1 vCPU and 1 GB RAM.
So m5.large is $0.0107 per task per hour, where each task gets 0.2 vCPU and 0.8 GB RAM.
So Fargate is $0.044925 per task per hour, where each task gets 1 vCPU and 1 GB RAM.

... Fargate is 4X more expensive than t3.small!!!

@waynerobinson

You also have to remember that a t2.small only gives you about 20% of 1 vCPU, not a whole vCPU. So when you factor that into your calculation, it's actually slightly more expensive than an m5.large.

Plus you probably need to assign less EBS storage to one m5.large instance running 10 tasks than you do for 10 t2.small instances.

The T*-series instance types make a lot of sense for fractional usage of a whole instance, but when you start factoring in being able to run multiple concurrent tasks with ECS, this becomes less important as you're already able to use fractional computing by assigning more tasks.

@waynerobinson

Also, if these are microservice instances that could stand to be replaced within 2 minutes, have you considered Spot instances? They're often times 5x+ cheaper than on-demand.

Even with the fact these can go away with a 2 minute warning, we've found capacity to be remarkably stable.

In practice, if you mix instance types and availability zones, you are very unlikely to ever be left without capacity when one instance type becomes unavailable, and the Spot allocation strategy uses extra knowledge to try to avoid instance types that are likely to be low on capacity.

Also, capacity of instance types in EC2 is set on a per-instance-type, per-AZ basis, so even if they run out of capacity for m5.xlarge, there'll still be m5.large (for example).

And if you're worried about complete instance-type exhaustion, you could use something like https://github.com/AutoSpotting/AutoSpotting, which will start everything as on-demand and swap it for Spot. So even if Spot capacity does get exhausted, it would replace capacity with on-demand instances.

@mreferre

@bcluap I came here to say that your pricing analysis needs some additional considerations, but @waynerobinson already hinted at that. As far as Fargate is concerned, comparing a t3.small vCPU to a Fargate vCPU isn't an apples-to-apples comparison unless you factor in the t3 bursting price. All in all, Fargate raw capacity usually costs around 20% more than similarly spec'ed EC2 capacity. Of course, this does not take into account the operational savings Fargate can allow customers to achieve (there is a blog centered around EKS/Fargate, but many of the considerations are similar for ECS/Fargate as well).

Don't get me wrong: if your workload pattern is such that you can take advantage of the bursting characteristics of T3, and it lets you burst on an as-needed basis without having to pay for the ENTIRE discrete resources you have available, then T3 is the way to go. However, if you are sizing your tasks for full utilization of discrete resources, then M5 (and possibly Fargate, depending on your specific workload patterns) may be cheaper, along with giving more ENI flexibility.

As usual, it depends.

@kgns

kgns commented Sep 15, 2021

For 24/7 stable workloads, using Fargate is just throwing money away. The operational savings of Fargate only shine if you have to provision capacity dynamically or only for a period of time. Otherwise, you configure your EC2 infrastructure once and then run your tasks 24/7 at a much lower cost than Fargate.

It has been said before that support for T-family instances was not in the plans, but as @mreferre also said, the specific use case for these instance types is more common than use cases where we occupy 100% of resources all the time. Any workload aiming for 24/7 availability but with real usage only from 8am to 6pm in a specific time zone should use burstable instance types. And this should cover a big portion (maybe more than half) of all cloud usage.

In the end, I don't think anyone should be dictating which instance types are used for anyone else's workload. It's their own choice in the end, and they will be billed for their choices.

What I want to know is whether there really is an unsolvable technical problem preventing Amazon from enabling ENI trunking on T2/T3/T4g instances, or whether it is just a business decision forcing customers onto non-burstable instances at higher cost. If there really is a technical blocker, I would like to know what it is, if possible.

I have asked a similar question on a related open issue, #1094, but have received no response yet.
