-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-4963: Kube-proxy Services Acceleration #4964
base: master
Are you sure you want to change the base?
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: aojea The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/assign @thockin @danwinship I leave up to you if is worth to go through the feature gate process if this is an opt-in option |
I like this proposal. Giving users the choice to opt in makes a lot of sense and reduces risk. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I do like the idea of using flowtables in Kubernetes, though I'm not sure we've figured out all of the details of how it makes the most sense yet...
|
||
Users will be able to opt-in to Service traffic acceleration by passing a CEL expression using the flag `--accelerated-interface-expression` or the configuration option `AcceleratedInterfaceExpression` to match the network interfaces in the node that are subjet to Service traffic acceleration. The absence of a CEL expression disables the feature. | ||
|
||
Kube-proxy will create a `flowtable` in the kube-proxy table with the name `kube-proxy-flowtable` and will monitor the network interfaces in the node to populate the `flowtable` with the interfaces that match the configured CEL expression. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At some point we need to investigate the performance implications of this, since it seems clear that this isn't really how flowtables were meant to be used, and there could potentially be things that are O(n^2)
in the length of devices
.
(Also, I wouldn't be surprised if it turned out that recreating the flowtable with an updated devices
value discards all existing flow offload
entries. This isn't horrible since the offload will just get recreated on the next packet, but it's not ideal and suggests maybe we should be working with the netfilter team to optimize this all a bit more...)
(This is maybe a topic for "Risks and Mitigations".)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I also need to consult this with the netfilter folks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At some point we need to investigate the performance implications of this, since it seems clear that this isn't really how flowtables were meant to be used, and there could potentially be things that are
O(n^2)
in the length ofdevices
.
Regarding "since it seems clear that this isn't really how flowtables were meant to be used", could you expand on that? I'm unsure what the intended use of flowtables is, and how this use case is unintended.
I'm assuming this may be written down somewhere, so happy if you point me at a readme or similar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not really written down, and I don't know exactly what the model is. But if they expected people to set up flows between every pair of devices on the node, then they wouldn't have required you to enumerate specific devices, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But if they expected people to set up flows between every pair of devices on the node, then they wouldn't have required you to enumerate specific devices, right?
AFAIK the flowtable (cache table) is per device, so this may be linking the specific flow to the specific device
|
||
### Upgrade / Downgrade Strategy | ||
|
||
kube-proxy reconcile the nftables rules so the rules will be reconciled during startup and added or removed depending on how kube-proxy is configured. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW when downgrading, any existing flow table entries will be preserved, so any Service connections that remain up across a kube-proxy downgrade would still be offloaded and there'd be no way to un-offload them with the older kube-proxy.
(That could be mitigated by recommending that people restart kube-proxy with the feature disabled before downgrading.)
(Except that wouldn't work either; you need to delete the flowtable at startup if the feature is disabled.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, I need to document and understand this better, if there is a way to cleanup we already have a kube-proxy cleanup
command so I wonder if we can reuse it for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was talking about downgrading to a release that doesn't know anything about flowtables, but I guess that's only a problem for people using it during Alpha.
Anyway, you just need to make it so that if you start nftables kube-proxy in a version that knows about flowtables, but with flowtables disabled, then it should ensure the flow table doesn't exist when it does its first sync rather than ensuring that it does exist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@danwinship I see now in the code that we flush chain by chain instead of the whole table when starting ... is there any reason for doing that? flushing the whole table allow to handle downgrade/upgrade of the table schema without further considerations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I first wrote the code, I thought nft flush table ip kube-proxy
would flush the contents of sets too, which would mean that restarting kube-proxy would break any active affinity. But it turns out that it doesn't actually do that anyway; flushing a table just flushes (but does not delete) every chain within the table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then, should we change kube-proxy to flush the entire table?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
flush still keeps the flowtable , so it is not valid
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to think if is possible to have an integrity check 🤔 and do a full resync
Use the kernel flowtables infrastructure to allow kube-proxy users to accelerate service traffic. Change-Id: Iee638c8e86a4d17ddbdb30901b4fb4fd20e7dbda
keps/sig-network/4963-kube-proxy-services-acceleration/README.md
Outdated
Show resolved
Hide resolved
|
||
The [elephant flow detection is a complex topic with a considerable number of literature about it](https://scholar.google.pt/scholar?q=elephant+flow+detection&hl=es&as_sdt=0&as_vis=1&oi=scholart). For our use case we proposa a more simplistic approach based on the number of packets, so it can give us good trade off between performance improvement and safety, we want to avoid complex heuristics and have a more predictible and easy to think about behavior based on the lessones learned from [KEP-2433 Topology aware hints](https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/2433-topology-aware-hints#proportional-cpu-heuristic). | ||
|
||
We chose 20 as the number of packets as the threshold to offload, based on existing thresold used by Cisco systems in [Cisco Application Centric Infrastructure](https://scholar.google.com/scholar_lookup?hl=en&publication_year=2018&author=G.+Tam&title=Cisco+Application+Centric+Infrastructure), Cisco referred to an elephant flow if the flow contains more than 15 packet sizes, i.e., short flow is less than 15 packets, we add 5 packets of buffer to be on the safe side. This means that, using TCP as an example, and assuming an MTU of 1500 bytes and removing the overhead of the TCP headers (that can vary from 20-60 bytes, use 40 for this example), offloading will benefit workloads that transmit more than: TCP Payload * Number of packets = 1460 bytes/packet * 20 = 29200 bytes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems sensible, and I like that it uses the lessons of users to inform how Kubernetes could operate.
What I'm wondering, is there a disadvantage to stream of 21 packets in length?
I assume this implementation will mean that packets 1-20 will go via the normal path, then packet 21 will be offloaded to the fast path.
I'm just wondering what happens, to performance, if the stream is offloaded to the fast path just before it ends.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm just wondering what happens, to performance, if the stream is offloaded to the fast path just before it ends.
The great thing of IP is that is independent of the path, same as you go to google.com through a lot of devices and links, from one pod or the other that the packet takes does not matter unless there are operations on the packet ... but that should be extraordinary and that is why we proposal a knob to let users to completely disable this behavior
If you see the netperf test below, they are exercising:
- no offload
- offload after connection established
- offload after packet number 20
and the performance improvement is still considerable, since only a max of 29200 bytes or 20 packets are hitting the slow path ... of course the larger your stream of data the closer to the asymptotic limit
Co-authored-by: Adrian Moisey <adrian@changeover.za.net>
/lgtm |
One-line PR description: Service acceleration for Kube-proxy using netfilter’s flowtable infrastructure
Issue link: Kube-proxy Services Acceleration #4963
Other comments: See prototype implementation in [WIP] KEP-4963: Kube-proxy Services Acceleration kubernetes#128392