Fix excessive receive buffer allocation #1611
Conversation
Force-pushed from ac6e864 to e58224e
});
let mut iovs = unsafe { iovs.assume_init() };
let iovs = unsafe { slice_assume_init_mut(&mut iovs[..recv_count]) };
If GRO is available, then recv_count is 1. But AIUI, in a single iovec, you can only receive messages from a single source. So this means that in each syscall you can receive messages from at most one remote.
Would it make more sense to split this more dynamically? E.g. if we detect burst activity, create 1 big iovec with 64 GRO segments, but if activity is more spread out, use many iovecs with one GRO segment each.
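A rough illustration of that idea (the names, the burst heuristic, and the 32-slot batch size are hypothetical, not quinn's actual code): keep the total buffer budget constant, but trade iovec count against GRO segments per iovec.

```rust
/// Hypothetical layout choice for a single receive syscall.
struct RecvLayout {
    /// Distinct remotes we can receive from in one recvmmsg() call.
    iovec_count: usize,
    /// GRO segments each iovec must be sized to hold.
    segments_per_iovec: usize,
}

const MAX_GRO_SEGMENTS: usize = 64;
const BATCH_SIZE: usize = 32; // assumed recvmmsg batch size

fn choose_layout(bursty: bool, gro_supported: bool) -> RecvLayout {
    if gro_supported && bursty {
        // A few senders transmitting hard: one big iovec sized for a full
        // 64-segment GRO batch, maximizing coalescing per syscall.
        RecvLayout { iovec_count: 1, segments_per_iovec: MAX_GRO_SEGMENTS }
    } else {
        // Many lightly active senders: many small iovecs, one segment each,
        // so a single syscall can drain several remotes.
        RecvLayout { iovec_count: BATCH_SIZE, segments_per_iovec: 1 }
    }
}
```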
Good catch; this is probably a regression for endpoints with many active connections. However, per #1354, if we're using GRO we must accept exactly 64 segments, or else we'll get inadvertent packet truncation. I guess we'll have to bite the bullet and lower the max datagram size...
Why is the current default 64k? Is this useful outside of communication on localhost?
The default is from RFC 9000, though we're free to pick something else. Whatever value we choose will impose a hard upper limit on MTU detection, and it'd be nice to Just Work on networks with jumbo frames. However, AFAIK such networks are rare, so I'm willing to drop this to a best-case internet MTU and let jumbo frame users adjust their configurations if necessary.
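For concreteness, a back-of-the-envelope comparison. The 32-slot batch size and the 64 KiB datagram size below are assumptions, chosen because they reproduce the ~128 MiB and ~4 MiB figures quoted in this PR; the 64-segment count is from #1354.

```rust
const BATCH_SIZE: usize = 32;       // assumed recvmmsg batch size
const MAX_GRO_SEGMENTS: usize = 64; // per #1354

fn main() {
    let max_datagram = 64 * 1024; // the current 64 KiB default
    let internet_mtu = 1_500;     // rough best-case internet MTU

    // Sizing every batch slot for a full GRO batch at 64 KiB (the reported bug):
    println!("{} MiB", BATCH_SIZE * MAX_GRO_SEGMENTS * max_datagram >> 20); // 128 MiB
    // One GRO-sized slot per syscall, as in this PR:
    println!("{} MiB", MAX_GRO_SEGMENTS * max_datagram >> 20); // 4 MiB
    // The same slot if the default dropped to an internet MTU:
    println!("{} KiB", MAX_GRO_SEGMENTS * internet_mtu >> 10); // ~93 KiB
}
```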
Do we have to allocate per segment, or can we allocate one large buffer, which would allow some jumbo frames provided that most frames are smaller?
If we receive a GRO batch consisting of many jumbo frame packets, and we did not allocate enough space for all of them, packets may be truncated (i.e. lost) per #1354.
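To put rough numbers on that risk (the 1500-byte and 9000-byte sizes below are purely illustrative assumptions):

```rust
const MAX_GRO_SEGMENTS: usize = 64;

fn main() {
    // A shared buffer sized for "typical" 1500-byte segments...
    let shared_buf = MAX_GRO_SEGMENTS * 1_500;  // 96_000 bytes
    // ...cannot absorb a worst-case batch of 9000-byte jumbo frames.
    let jumbo_batch = MAX_GRO_SEGMENTS * 9_000; // 576_000 bytes delivered
    // Per #1354 the excess is truncated rather than redelivered, so roughly
    // 53 of the 64 packets in such a batch would be lost.
    println!("bytes lost: {}", jumbo_batch - shared_buf); // 480_000
}
```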
Consensus in the implementers' Slack seems to be that recvmmsg isn't adding a lot here, i.e. we probably won't see much regression from doing one syscall per sender as in this PR. This is in principle workload dependent... it's tempting to leverage that to simplify our APIs a bunch, though.
Yeah, sounds like ditching recvmmsg() could make sense.
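For reference, a minimal sketch of what a one-recvmsg()-per-sender receive path could look like. This assumes the libc crate, omits the UDP_GRO segment-size cmsg and most error handling, and is not quinn's actual implementation.

```rust
use std::mem;
use std::net::UdpSocket;
use std::os::unix::io::AsRawFd;

const MAX_GRO_SEGMENTS: usize = 64; // per #1354

/// Receive one (possibly GRO-coalesced) batch from a single remote.
fn recv_one(sock: &UdpSocket, max_datagram_size: usize) -> std::io::Result<usize> {
    // The buffer must hold a full worst-case GRO batch, or the tail segments
    // get silently truncated.
    let mut buf = vec![0u8; MAX_GRO_SEGMENTS * max_datagram_size];
    let mut iov = libc::iovec {
        iov_base: buf.as_mut_ptr().cast(),
        iov_len: buf.len(),
    };
    let mut addr: libc::sockaddr_storage = unsafe { mem::zeroed() };
    let mut hdr: libc::msghdr = unsafe { mem::zeroed() };
    hdr.msg_name = (&mut addr as *mut libc::sockaddr_storage).cast();
    hdr.msg_namelen = mem::size_of::<libc::sockaddr_storage>() as libc::socklen_t;
    hdr.msg_iov = &mut iov;
    hdr.msg_iovlen = 1;
    // (control buffer for the UDP_GRO segment-size cmsg omitted for brevity)

    let n = unsafe { libc::recvmsg(sock.as_raw_fd(), &mut hdr, 0) };
    if n < 0 {
        Err(std::io::Error::last_os_error())
    } else {
        Ok(n as usize) // total bytes across all coalesced segments
    }
}
```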
Allocation logic previously considered a GRO batch to be one datagram, rather than (when supported) 64. This led to an unintended ~128MiB allocation, now scaled back down to ~4MiB.
Force-pushed from e58224e to 130de25
Closing this, as we decided to merge #1615 instead.
Allocation logic previously considered a GRO batch to be one datagram, rather than (when supported) 64. This led to an unintended 128MiB allocation, now scaled back down to ~4MiB.
Fixes #1608.