Proposal to allow mtu changes #926
Conversation
### Goals

* Allow changing the MTU post-install on OVN Kubernetes.
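For context, a purely hypothetical sketch of what such a post-install change could look like from the administrator's side, assuming the MTU ends up being exposed through the CNO operator configuration; the field path and value shown are illustrative assumptions:

```bash
# Hypothetical administrator workflow, not a committed interface: change the
# cluster network MTU through the CNO operator configuration.
oc patch network.operator.openshift.io cluster --type=merge \
  -p '{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"mtu":1400}}}}'
```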
I think we had wanted to do this for openshift-sdn too?
I think we do, but my understanding was that it would not be at the same time and not for 4.10 anyway, so I did not cover it in this enhancement, partly because of time constraints and partly because I don't know anything about openshift-sdn.
3. Once all the previous pods are finished, deploy another set of pods on every
   node that will handle the actual change of the MTU. Wait for them to be
   ready and running.
4. Cordon every node; we don't want pods created from this point on.
Cordoning every node simultaneously may have drastically bad failure modes... we'll need discussion with other teams about this to make sure this is OK.
Is it because it is done simultaneously? Could you give an example of a likely failure? Would it be better to document it as a precondition to the procedure instead of as part of it?
Well, for example, CNO does not tolerate the unschedulable
taint, so if something bad happened and it exited/crashed/was killed, it would not be possible to restart it, and then the MTU change process would be stalled.
I changed it to remove the need for cordoning, as it might bring more problems to the table than we need.
But now the procedure is to always restart ovnkube-node before changing the running pods' MTU, regardless of whether we are decreasing or increasing the MTU. There is an increased chance of temporary disruption when decreasing the MTU, but overall it is a more robust procedure.
5. If any of these steps failed, the pod will exit with code 1; if all were
   successful, it will exit with code 0.
does it make any attempt at rollback?
No, should we? Assuming rollback would fail as well at that point.
Actually, no rollback because we will reboot the node on failure anyway?
Assuming rollback would fail as well at that point.
Well, like, maybe it changes the pod MTUs but then fails to change the bridge MTU. It could probably roll back the pod MTU changes in that case...
Actually, no rollback because we will reboot the node on failure anyway?
Does it say that somewhere?
Step 10. I also changed it so that the node interfaces' MTU is changed by the restart of ovnkube-node, so this specific pod will only change the pods' MTU.
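To make that pod-MTU step more tangible, here is a rough, hypothetical sketch of how such a pod could adjust the interfaces of running pods; the namespace path, interface name, and target value are illustrative assumptions, not what the enhancement specifies:

```bash
#!/bin/sh
# Hypothetical sketch only: adjust the MTU of running pods' interfaces from a
# privileged, host-networked pod. Assumes pod network namespaces are exposed
# under /var/run/netns and that the pod-side interface is eth0.
NEW_MTU=1400   # illustrative target value

for ns in /var/run/netns/*; do
    [ -e "$ns" ] || continue
    # Set the pod-side interface MTU inside each pod network namespace.
    nsenter --net="$ns" ip link set dev eth0 mtu "$NEW_MTU" || exit 1
done
exit 0
```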
Since nothing happens atomically, especially across all nodes/pods in a cluster, things will be inconsistent for a time and that must be addressed. Any changes to MTU will ONLY be used by new connections. Existing TCP [1] connections, node-node, node-pod, pod-pod, will continue to use the negotiated MSS [2]. For existing TCP [1] connections, best case, changing MTU will not change a thing. The MSS that was negotiated when the connection was established will continue to be used. If you change MTU to a larger value, no problem. If you change to a smaller value, an ICMP Type 3 Code 4 will result (best case) or traffic will be black holed (worst case). This is the simple case. The term pod can be ambiguous, e.g. a pod on the host network is also a pod. For simplicity:

In general, increasing MTU could work. As the change rolls out, existing connections are not impacted. New connections where one side has the larger MTU and the other does not yet have the larger value will continue to use MIN(MSS from remote, local MTU). This assumes the order described for increase of "physical", br-ex, veth of OVS and veth of pod. (Note: Dan's earlier comment about not changing the node at all, leaving it up to the user to do, still results in what I describe, but the "roll out" is longer and manual. Thus the time to converge is longer.) It also assumes the other node(s)' "physical", br-ex, veth of OVS and veth of pod are already larger; if not, the packet [3] will be dropped and ICMP generated.

Since at least one node has to go first, the cluster will fill up with Route Cache Entries (RCE) due to the ICMP Type 3 Code 4 from all the nodes that haven't got the memo to use the larger value. This continues until the cluster converges on the new settings. It may black hole traffic. I'm still drinking on this.

I suggest that all ICMP Type 3 Code 4 messages be dropped (via iptables rules added as part of the enhancement) until we think we are done with the rollout. Just dropping DF=1 packets is enough. Also, sending ICMP telling every node/pod that the MTU should be what we are trying to get it to be only wastes cycles (and executes code paths not often used).

Decreasing MTU on the underlay is problematic. As before, existing connections will ignore the change. When a datagram from an existing connection meets a now lower MTU on an interface on the path, an ICMP Type 3 Code 4 will result [3]. It's possible that control traffic is black holed "long enough" to make e.g. the apiserver not ready. On the overlay, there will be some churn (as described above) but things should be ok eventually.

[1] The same is true for e.g. SCTP.
[2] The MSS can change after the fact due to PMTU discovery as I describe.
[3] Assuming DF=1, which is typical, or IP fragmentation will be performed.
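As a concrete illustration of the suggested mitigation, a minimal sketch of the kind of iptables rule meant above, dropping fragmentation-needed ICMP for the duration of the rollout; the chain and placement are assumptions for illustration, not part of the proposal:

```bash
# Hypothetical mitigation sketch: drop ICMP Type 3 Code 4 ("fragmentation
# needed") while the MTU rollout is in progress.
iptables -I FORWARD -p icmp --icmp-type fragmentation-needed -j DROP

# Once the cluster has converged on the new MTU, remove the rule again so
# Path MTU discovery works normally.
iptables -D FORWARD -p icmp --icmp-type fragmentation-needed -j DROP
```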
   node that will handle the actual change of the MTU. Wait for them to be
   ready and running.
4. Cordon every node; we don't want pods created from this point on.
5. Ensure via machine-config-operator that, upon reboot, configure-ovs will
@cybertron FYI just in case there is any interaction with keyfiles placed by MCO with #817
I don't think this will be a problem, at least in terms of day 1 or MCO. The day 1 configuration will be baked into the image and not managed by MCO, so after deployment if any changes are made MCO won't overwrite them.
The place where it could be an issue is kubernetes-nmstate. Any keyfiles that nmstate writes are at risk of being overwritten if they're modified outside nmstate. If the MTU of an interface configured by kubernetes-nmstate needs to be modified, it probably needs to be done via nmstate.
I'm not sure that's actually a problem right now as I don't think we support nmstate modifying the configuration on the OVNK interface anyway, but I believe they were looking to add support for that upstream so it might be in the future.
- Progressing: false
- Degraded: false

The steps to change the MTU performed by pods of previous step 3 are:
should this pod detect if there are MTU problems after migration is complete, and post an event or something to indicate if it was successful or not?
What do you mean by detect?
If OpenShift has a verification procedure to health-check deployments, then we can probably suggest in the documentation running it after this procedure.
I mean that in your step 5:
5. Once all the previous pods finish successfully, deploy other set of pods with
`restartPolicy: Never` on every node that will handle the actual change of
the MTU (explained in more detail below). Wait for them to be ready and
running.
So you are going to deploy another set of pods that do the configuration. I'm wondering if these pods will remain for some time after configuration and if they can run a healthcheck until all of the nodes are finished updating MTU. The check could be pinging from this pod to other "configuration pods" on other nodes with max MTU. If it doesn't come up after some time, then maybe an event can be posted or something to indicate to the user that MTU change failed.
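For illustration, such a check run from inside one of the configuration pods could look roughly like the following sketch; the MTU value, header overhead, and the way peer IPs are passed in are all illustrative assumptions:

```bash
# Hypothetical post-change health check: verify that full-MTU packets reach
# the configuration pods on other nodes.
NEW_MTU=1500                 # illustrative target MTU
PAYLOAD=$((NEW_MTU - 28))    # subtract IPv4 (20 B) + ICMP (8 B) headers

for peer in "$@"; do         # peer pod IPs passed in as arguments
    # -M do sets DF=1 so oversized packets are rejected instead of fragmented.
    if ! ping -c 3 -W 2 -M do -s "$PAYLOAD" "$peer" > /dev/null; then
        echo "MTU check to $peer failed" >&2
        exit 1
    fi
done
echo "MTU check passed for all peers"
```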
Usually when you do a new deployment you run some verification to check that the deployment has been done correctly and that the cluster is healthy. If this exists for OpenShift we can suggest in the documentation to run it again. Otherwise, it would probably be better to have a different set of pods with a liveness probe or the like rather than adding this to these specific pods.
we have the network check target pods, but I was thinking something specifically scoped to the MTU change to give the user a signal that the MTU update worked as part of the MTU update process itself. Like the pods that you launch for doing the MTU upgrade exit successfully and log some message like MTU upgrade complete, or if they check network connectivity and something is now broken, they either crash or post an event to their pod saying MTU upgrade problem. If you think it's not necessary then that's fine to ignore.
I would probably then use the network check target pods and enhance that for any specific MTU verification we think we need to do. Do you know where I can check them out?
These MTU change pods only change the MTU of pods, which is an operation for which we should know definitively if it succeeded or not, and is only one step of a 3 step process which also includes changing the host sdn interfaces MTU and the host external interfaces MTU, so I feel that a final verification of the MTU in these pods could be out of place.
I am working on a section to analyze this.
The OVS side of any veth pair should have an MTU of 65000, similar to what we appear to do for geneve (and vxlan):
Then we never have to change them. There is no reason to deal with MTU issues inside a bridge/switch. If the datagram is really too large, the other side of the veth pair (e.g. the pod) will have the L3 knowledge to know what to do. It might make sense to do this first, as part of a z-stream, so that this enhancement doesn't have to worry about the lack of an atomic change of both the pod veth and the "OVS veth".
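For reference, a hedged example of how the OVS-side MTU could be raised with standard OVS tooling; the interface name is a placeholder:

```bash
# Hypothetical illustration: request a large MTU on the OVS side of a veth
# pair, similar to what is already done for the geneve/vxlan interfaces.
ovs-vsctl set Interface <ovs-side-veth> mtu_request=65000
```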
I think this is generally a good idea so we don't have to worry about managing it, but it won't help with traffic itself.
If we are sending
### Non goals

* Change the MTU without service disruption.
Suggested change:
- * Change the MTU without service disruption.
+ * Change the MTU with absolutely no service disruption.
9. Set the new MTU value to the applied-cluster config map AND wait for pods of
   step 3 to complete successfully.
step 5
10. If any of the previous steps (8,9) failed, reboot the node, wait for the
    kubelet to be reporting as Ready again.
If any of the previous steps (8,9) failed on any nodes, drain and reboot the failed nodes, one at a time, and wait for each one to be reporting as Ready again.
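For illustration, the per-node recovery sequence being suggested could look roughly like this; the exact flags and the timeout are illustrative, not part of the proposal:

```bash
# Hypothetical per-node recovery, applied to failed nodes one at a time.
oc adm drain <node-name> --ignore-daemonsets --delete-emptydir-data --force

# Reboot the node (e.g. from its console or via `oc debug node/<node-name>`),
# then wait for it to report Ready again before moving on to the next node.
oc wait --for=condition=Ready "node/<node-name>" --timeout=30m
oc adm uncordon <node-name>
```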
11. Upon completion, set conditions to:
    - Progressing: false
    - Degraded: false
- Upgradeable: true
- Degraded: true
Update the operator configuration status with a description of the problem.
At this point the process is interrupted and we require manual intervention.
8. Force a rollout of the ovnkube-node daemonset. This will ensure
So this means the ovnkube upgrade is totally out of sync with the pod-level upgrades, and every node needs to wait for every other node to finish its pod-level upgrades before any of them can do the ovnkube-level upgrade.
A better approach might be: instead of having CNO force a re-rollout of the DaemonSet, just have the step 5 pod kill the local ovnkube-node process, forcing it to be restarted.
And then it could even choose to do that step before or after the pod-level fixes, depending on which direction the MTU is changing in...
The thing is that if we restart ovn-kube after the step 5 pod changes the MTUs, there is a window in between where new pods may be created with the old MTU.
That's why we restart ovn-kube first: we'll do the roll-out with max unavailability so that it is quick, and then we let the step 5 pod proceed with the MTU changes. Yes, it is out of sync, but hopefully quick enough.
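For illustration, the two mechanisms discussed here could look roughly like this; the namespace and daemonset names are the usual OVN-Kubernetes ones, and the local-kill variant is only a sketch of the reviewer's idea:

```bash
# Option in the current proposal: CNO forces a rollout of the daemonset.
oc -n openshift-ovn-kubernetes rollout restart daemonset/ovnkube-node
oc -n openshift-ovn-kubernetes rollout status daemonset/ovnkube-node

# Reviewer's alternative: the step 5 pod kills only the local ovnkube-node
# process so just that node's pod restarts (assumes a privileged, host-PID
# pod; the process name is an assumption for illustration).
pkill -x ovnkube
```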
An administrator should be able to change the cluster network MTU through
CNO configuration change. This would encompass the following tasks:

##### Implement a pod that changes the actual MTU on running pods
So again, the parts talking about implementation details don't belong in "User Stories". And they're redundant with what you've already said, so you can just remove them.
I will move them to a specific section. The thing is that there is no way to map (user) stories here to (non-user) stories in Jira.
### Risks and Mitigations

* If unexpected problems ocurr this procedure, the mitigation is an automated
Suggested change:
- * If unexpected problems ocurr this procedure, the mitigation is an automated
+ * If unexpected problems occur during this procedure, the mitigation is an automated
There are circumstances that prevent an endpoint from being aware of the actual
MTU to a destination, which depends on Path MTU discovery and specific ICMP
`FRAG_NEEDED` messages:
which in general seem to not work over OVS
## Alternatives

### New ovn-k setting: `routable-mtu`
So this sounds better than the proposed solution... why aren't we doing it this way?
A double rolling reboot seemed unacceptable, but I don't know what the latest stance on it is. Perhaps @vpickard can comment on this.
Yes, I was concerned that 2 reboots would not be acceptable from a customer perspective. @mcurry-rh What are your thoughts on having to perform 2 reboots to change the mtu?
Replicating the feedback we got from @mcurry-rh:
not ideal...acceptable...MTU adjustment is a rare event, so 2 reboots, while painful, is not fatal and achieves the objective
So the second alternative is based on already-available node maintenance knowledge; it is simpler to implement and a safer approach all around, while the main alternative is more efficient at the cost of that safety. We could also prototype it.
@abhat @trozet @knobunc @dcbw we would need to make a call on this. Do you have any opinion?
So, if the cluster is actually "broken" because of the bad MTU, then having to do two reboots isn't that bad since you're probably not running anything useful anyway.
And if it's not broken, then the MTU change probably isn't urgent, and the procedure doesn't actually require that the two reboots happen back-to-back; they could happen 24 hours apart or something. (Right? The cluster is stable/consistent in the inter-reboot phase?) So we could even just make it so that the CNO doesn't initiate any rolling reboots itself, it just does:
- CNO makes the initial change to mtu/routable-mtu
- CNO observes nodes until it sees that every node has rebooted (for whatever reason) and is using the changed configuration.
- CNO makes the second change to mtu/routable-mtu
- CNO observes nodes until it sees that every node has rebooted and is using the changed configuration.
- CNO updates the operator status accordingly
So then the admin could schedule two sets of rolling reboots on consecutive nights, or even just make the config change and then forget about it, and the first change would complete the next time they did a z-stream update and the second change would complete after the next update after that.
(Right? The cluster is stable/consistent in the inter-reboot phase?)
Yes.
"So, if the cluster is actually "broken" because of the bad MTU" -- That is not always a safe assumption. One case we had was where a customer had a large, running cluster and wanted to add new nodes. But the new nodes were on OpenShift and they needed to drop the MTU to allow for the VxLAN header in the OSP networking. I assume most cases will be like that, otherwise they could just reinstall...
"So, if the cluster is actually "broken" because of the bad MTU" -- That is not always a safe assumption.
Hence the "if"
since the actual interfaces MTU did not change they will not drop traffic
coming from other nodes.
* Set in ovn-config a `mtu` equal to `routable-mtu` or replace `mtu` with the
  `routing-mtu` value and remove the latter.
I'm not sure what routing-mtu is here?
That should be routable-mtu
traffic drop is expected.

Increase example:
* Set in ovn-config the actual `mtu` as `routable-mtu` and a new `mtu` setting
This is confusing... I would suggest just using some numbers in your example.
Prototyped it in ovn-kubernetes/ovn-kubernetes#2654, perhaps the description I gave there is easier to understand:
routable-mtu setting is introduced to facilitate a procedure allowing to
change the MTU on a running cluster with minimum service disruption.
Given current and target MTU values:
1. Set mtu to the higher MTU value and routable-mtu to the lower
   MTU value.
2. Do a rolling reboot. As a node restarts, routable-mtu is set on
   all appropriate routes while interfaces have mtu configured.
   The node will effectively use the lower routable-mtu for outgoing
   traffic, but be able to handle incoming traffic up to the higher
   mtu.
3. Change the MTU on all interfaces not handled by ovn-k to the target
   MTU value. Since the MTU effectively used in the cluster is the lower
   one, this has no impact on traffic.
4. Set mtu to the target MTU value and unset routable-mtu.
5. Do a rolling reboot. As a node restarts, the target MTU value is set
   on the interfaces and the routes are reset to default MTU values.
   Since the MTU effectively used on other nodes of the cluster is the
   lower one but they are able to handle the higher one, this has no
   impact on traffic.
routable-mtu is set as the MTU for the following routes:
* pod default route
* non link scoped management port route
* services route
* link scoped node routes
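To make the mechanism concrete, here is a hedged sketch of what the intermediate state of step 2 boils down to on a node; the interface, gateway, and MTU values are illustrative only, and the real implementation applies this to the pod, management-port, and service routes listed above rather than the host default route:

```bash
# Hypothetical illustration of the mtu / routable-mtu split on one node:
# the interface accepts the higher MTU, while the route caps what we send.
ip link set dev br-ex mtu 9000                            # higher "mtu"
ip route replace default via 10.0.0.1 dev br-ex mtu 1400  # lower "routable-mtu"

# After steps 4/5, the route-level cap is removed and only the interface MTU
# remains in effect.
ip route replace default via 10.0.0.1 dev br-ex
```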
Closed in favor of #963 which expands on the alternative of performing the MTU change through rolling reboots.
This is a reboot of #603 restricted to MTU changes.
/cc @mccv1r0
For MTU changes
/cc @trozet
For OVN kubernetes
/cc @danwinship @knobunc @dcbw
For general thoughts