AlterNAT at scale Questions #108

Open
ghost opened this issue Jul 12, 2024 · 5 comments

Comments


ghost commented Jul 12, 2024

Our Dev/Staging environments, quite frankly, suck: they see low traffic at best, and in testing we encountered no issues at all (this is good).

When we tried to simulate a production environment with AlterNAT, we noticed the instance would start to drop traffic once we reached higher Lambda execution volumes. We hit the instance's PPS limit and packets were dropped.
Increasing the NAT instance class seemed to help.

Our Production environment is a different beast.
The vast majority of our NAT traffic is from Lambda executions. (Occasionally bursting past 300,000 executions per minute)

I'm concerned about hitting a PPS limit and having drops in production.

Since you stated you send several PB of traffic, I'm going to guess your traffic is a lot more than ours (it would make sense).

Our short-lived Lambdas (for SSO, DynamoDB lookups, small API requests) are all quick in and out, but our long-running Lambdas can run upwards of 6 minutes (data continues to flow to/from the browser during this time, so the PPS does not stop).

Without going into specifics:

  • Do you notice any dropped packets / throttling with a c6gn.8xlarge?
  • I'm going to guess that your NAT instances are in every private subnet, and thus you can spread the load across multiple subnets.

All of our Lambdas use a single subnet with a NAT Gateway in that subnet. Unfortunately, I cannot spread the load across subnets, as re-engineering the architecture is not feasible until winter 2025.

(This has been transferred from an email with @bwhaley for public visibility and comments)

Member

bwhaley commented Jul 12, 2024

Thanks for the question.

We have observed non-zero values in the ENA metrics, namely bw_{in,out}_allowance_exceeded and pps_allowance_exceeded. The values have been so low relative to the total bandwidth/pps that we've just ignored them and counted on TCP to sort it out. Some number of the affected packets are probably queued/delayed, but some number are dropped and will result in a retransmit.
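For anyone who wants to check these counters on their own NAT instance, here is a minimal sketch of pulling them out of `ethtool -S` output. The counter names are the ones mentioned above; the interface name (`eth0`), the helper names, and the sample values are assumptions for illustration only, and this is not part of Alternat.

```python
import re
import subprocess

# ENA allowance counters mentioned in this thread.
ALLOWANCE_COUNTERS = (
    "bw_in_allowance_exceeded",
    "bw_out_allowance_exceeded",
    "pps_allowance_exceeded",
)


def parse_ethtool_stats(text):
    """Parse `ethtool -S <iface>` output into a {counter_name: int} dict."""
    stats = {}
    for line in text.splitlines():
        # Counter lines look like "     pps_allowance_exceeded: 340".
        match = re.match(r"\s*([\w.\[\]-]+):\s*(-?\d+)\s*$", line)
        if match:
            stats[match.group(1)] = int(match.group(2))
    return stats


def allowance_exceeded(stats):
    """Keep only the allowance counters, defaulting to 0 when one is absent."""
    return {name: stats.get(name, 0) for name in ALLOWANCE_COUNTERS}


def read_interface_stats(iface="eth0"):
    """Run ethtool on the host. The interface name is an assumption."""
    result = subprocess.run(
        ["ethtool", "-S", iface], capture_output=True, text=True, check=True
    )
    return parse_ethtool_stats(result.stdout)


# Made-up sample output, used here so the sketch runs without ethtool.
SAMPLE_OUTPUT = """NIC statistics:
     tx_timeout: 0
     bw_in_allowance_exceeded: 12
     bw_out_allowance_exceeded: 0
     pps_allowance_exceeded: 340
     queue_0_tx_cnt: 1000000
"""

print(allowance_exceeded(parse_ethtool_stats(SAMPLE_OUTPUT)))
```

On a real instance you would call `allowance_exceeded(read_interface_stats("eth0"))` periodically and compare the deltas against total packet counts, since the counters are cumulative.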

These most likely happen because of microbursts in traffic, which are pretty hard to avoid. You can keep upsizing instances until you get to the max to see if that resolves it. Upsizing will definitely help, as you observed and as AWS states, such as in this article. If it doesn't, fixing microbursts can be challenging. This article mentions some advanced strategies for mitigating bursts if you cannot scale horizontally (e.g. in your case, since you have the constraint of a single subnet/route table), but those approaches are not going to work with Lambdas.
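To illustrate why microbursts are easy to miss, here is a small sketch with synthetic packet timestamps. It shows a one-second average that looks modest while a 10 ms window carries half the packets. The timestamps, the window size, and the function name are all made up for illustration; this does not model how AWS actually enforces its allowances.

```python
def max_packets_in_window(timestamps_us, window_us):
    """Largest number of packets in any half-open window [t, t + window_us)."""
    ts = sorted(timestamps_us)
    best = 0
    start = 0
    for end in range(len(ts)):
        # Shrink from the left until the window spans less than window_us.
        while ts[end] - ts[start] >= window_us:
            start += 1
        best = max(best, end - start + 1)
    return best


# Synthetic one-second trace in integer microseconds (made-up data):
# 500 packets packed into the first 10 ms, then 500 spread over the rest.
burst = [i * 20 for i in range(500)]               # 0 .. 9,980 us
steady = [10_000 + i * 1_980 for i in range(500)]  # 10,000 .. 998,020 us
trace = burst + steady

per_second = max_packets_in_window(trace, 1_000_000)
per_10ms = max_packets_in_window(trace, 10_000)

print(f"packets in the busiest 1 s window:   {per_second}")
print(f"packets in the busiest 10 ms window: {per_10ms}")
print(f"burst rate extrapolated to 1 s:      {per_10ms * 100} pps")
```

The per-second figure is 1,000 packets, but the busiest 10 ms window holds 500 of them, which corresponds to a 50,000 pps rate for that instant. A limit enforced at a finer granularity than your monitoring interval can therefore be exceeded without the averages ever showing it.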

If you haven't already seen it, this article discusses how you can measure pps limits if you want to test different instance types. You may also be able to set up packet captures and look for retransmits to see how widespread the problem actually is.
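As a rough starting point for comparing instance types, here is a sketch that derives average packets per second from two snapshots of `/proc/net/dev`. This is not the method from the linked article; the interface name, the snapshot values, and the helper names are assumptions, and an average over several seconds will smooth out short bursts.

```python
def parse_proc_net_dev(text):
    """Return {iface: (rx_packets, tx_packets)} from /proc/net/dev content."""
    counters = {}
    # The first two lines of /proc/net/dev are column headers.
    for line in text.splitlines()[2:]:
        name, _, rest = line.partition(":")
        fields = rest.split()
        if len(fields) >= 10:
            # Field 1 is received packets, field 9 is transmitted packets.
            counters[name.strip()] = (int(fields[1]), int(fields[9]))
    return counters


def packets_per_second(before, after, seconds):
    """Average (rx_pps, tx_pps) between two (rx_packets, tx_packets) samples."""
    rx_pps = (after[0] - before[0]) / seconds
    tx_pps = (after[1] - before[1]) / seconds
    return rx_pps, tx_pps


HEADER = (
    "Inter-|   Receive                           |  Transmit\n"
    " face |bytes packets errs drop fifo frame compressed multicast"
    "|bytes packets errs drop fifo colls carrier compressed\n"
)
# Made-up snapshots taken 5 seconds apart.
BEFORE = HEADER + "  eth0: 1000 100 0 0 0 0 0 0 2000 200 0 0 0 0 0 0\n"
AFTER = HEADER + "  eth0: 9000 600 0 0 0 0 0 0 8000 1200 0 0 0 0 0 0\n"

rx_pps, tx_pps = packets_per_second(
    parse_proc_net_dev(BEFORE)["eth0"],
    parse_proc_net_dev(AFTER)["eth0"],
    seconds=5,
)
print(f"rx: {rx_pps:.0f} pps, tx: {tx_pps:.0f} pps")
```

On a live host you would read `/proc/net/dev` twice with a `time.sleep()` in between and pass the real elapsed time as `seconds`.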

I opened #107 to make it easier to expose these metrics with Alternat.

Member

bwhaley commented Jul 17, 2024

@thedoomcrewinc Does this help at all? Are you going to try some larger instances or anything as a next step?

Author

ghost commented Jul 17, 2024

@bwhaley Apologies for the delayed update.

We're in a push for our Back to School effort, and I won't be able to test this until after Aug 1st. I'll update shortly thereafter.

Author

ghost commented Aug 16, 2024

Follow up as promised:

After testing various instance classes and size combos up to a c7gn.16xlarge, we determined that the sweet spot was a c6gn.8xlarge instance.

We too observed non-zero values. However, as you observed and commented, we can safely say we are not worried about the impact.

Microbursts do occur, but in general we don't worry about them.

I'll report back in September, after the vast majority of schools are back in session and our traffic levels have stabilized at the new "normal".

Member

bwhaley commented Aug 22, 2024

Thanks, I appreciate that you're following up here!
