
Proposal to allow mtu changes #926

Closed
wants to merge 1 commit into from

Conversation

@jcaamano (Contributor) commented Oct 7, 2021

This is a reboot of #603 restricted to MTU changes.

/cc @mccv1r0
For MTU changes
/cc @trozet
For OVN kubernetes
/cc @danwinship @knobunc @dcbw
For general thoughts

openshift-ci bot commented Oct 7, 2021

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign trozet after the PR has been reviewed.
You can assign the PR to them by writing /assign @trozet in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jcaamano (Contributor, Author) commented Oct 7, 2021

And of course, as the initial author
/cc @juanluisvaladas

@openshift-ci openshift-ci bot requested a review from juanluisvaladas October 7, 2021 17:33

### Goals

* Allow changing the MTU post-install on OVN Kubernetes.
Contributor

I think we had wanted to do this for openshift-sdn too?

Contributor Author

I think we do, but my understanding was that it would not be at the same time, and not for 4.10 anyway, so I did not cover it in this enhancement, mostly because of time constraints and because I don't know anything about openshift-sdn.

3. Once all the previous pods are finished, deploy another set of pods on every
node that will handle the actual change of the MTU. Wait for them to be
ready and running.
4. Cordon every node; we don't want pods created from this point on.
Contributor

Cordoning every node simultaneously may have drastically bad failure modes... we'll need to discuss this with other teams to make sure it's OK.

Contributor Author

Is it because it's done simultaneously? Could you give a likely example? Would it be better to document it as a precondition of the procedure instead of as part of it?

Contributor

Well, for example, CNO does not tolerate the unschedulable taint, so if something bad happened and it exited/crashed/was killed, it would not be possible to restart it, and then the MTU change process would be stalled.

Contributor Author

I changed it to remove the need for cordoning, as it might bring more problems to the table than we need.

But now the procedure is to always restart ovnkube-node before changing the running pods' MTU, regardless of whether we are decreasing or increasing the MTU. There is an increased chance of temporary disruption when decreasing the MTU, but overall it is a more robust procedure.

Comment on lines 108 to 109
5. If any of these steps failed, the pod will exit with code 1; if all were
successful, it will exit with code 0.
Contributor

does it make any attempt at rollback?

Contributor Author

No, should we? Assuming rollback would fail as well at that point.

Contributor Author

Actually, no rollback because we will reboot the node on failure anyway?

Contributor

Assuming rollback would fail as well at that point.

Well, like, maybe it changes the pod MTUs but then fails to change the bridge MTU. It could probably roll back the pod MTU changes in that case...

Actually, no rollback because we will reboot the node on failure anyway?

Does it say that somewhere?

Contributor Author

Step 10. I also changed it so that the node interfaces' MTU is changed by the restart of ovnkube-node, so this specific pod will only change the pods' MTU.
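For illustration, a minimal sketch of what such a per-node pod could do to change the MTU of running pods; the netns discovery path, interface name and MTU value are assumptions for the example, not details from the enhancement:

```bash
#!/bin/bash
# Hypothetical sketch: enter each pod network namespace from the host and set
# the new MTU on the pod interface. Assumes named netns mounts under
# /var/run/netns (as CRI-O creates) and a pod interface called eth0.
set -euo pipefail
NEW_MTU=1400   # placeholder target value

for netns in /var/run/netns/*; do
    name=$(basename "$netns")
    # Skip namespaces that have no eth0 (e.g. infra or host-network namespaces).
    if ip netns exec "$name" ip link show eth0 > /dev/null 2>&1; then
        ip netns exec "$name" ip link set dev eth0 mtu "$NEW_MTU"
    fi
done
```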

@mccv1r0 commented Oct 12, 2021

Since nothing happens atomically, especially across all nodes/pods in a cluster, things will be inconsistent for a time and must be addressed. Any changes to MTU will ONLY be used by new connections. Existing TCP [1] connections, node-node, node-pod, pod-pod will continue to use the negotiated MSS [2]

For existing TCP [1] connections, best case, changing MTU will not change a thing. The MSS that was negotiated when the connection was established will continue to be used. If you change MTU to a larger value, no problem. If you change to a smaller value, an ICMP Type 3 Code 4 will result (best case) or traffic will be black holed (worst case). This is the simple case.
OCP lives in the not so simple case: change MTU on overlay while leaving the MTU of the node (underlay) as is. Tunneled traffic might be too big/small for existing settings.

The term pod can be ambiguous, e.g. a pod on the host network is also a pod. For simplicity:

  • when I refer to nodes, I mean anything using the host network interfaces, even one that connects to overlay such as ovn-k8s-mp0
  • when I refer to pods, I mean anything just using the overlay

In general, increasing MTU could work. As the change rolls out, existing connections are not impacted. New connections where one side has the larger MTU and the other does not yet have the larger value will continue to use MIN(MSS from remote, local MTU). This assumes the order described for the increase of "physical", br-ex, veth of OVS and veth of pod. (Note: Dan's earlier comment about not changing the node at all, leaving it up to the user to do, still results in what I describe, but the "roll out" is longer and manual. Thus the time to converge is longer.) It also assumes the other node(s)' "physical", br-ex, veth of OVS and veth of pod are already larger; if not, the packet [3] will be dropped and ICMP generated. Since at least one node has to go first, the cluster will fill up with Route Cache Entries (RCE) due to the ICMP Type 3 Code 4 from all the nodes that haven't got the memo to use the larger value. This continues until the cluster converges on the new settings. It may black hole traffic. I'm still drinking on this.

I suggest that all ICMP Type 3 Code 4 messages be dropped (via iptables rules added as part of the enhancement) until we think we are done with the rollout. Just dropping DF=1 packets is enough. Also sending ICMP telling every node/pod that the MTU should be what we are trying to get it to be only wastes cycles (and executes code paths not often used.)

Decreasing MTU on the underlay is problematic. As before, existing connections will ignore the change. When a datagram from an existing connection meets a now lower MTU on an interface on the path, an ICMP Type 3 Code 4 will result [3]. It's possible that control traffic is black holed "long enough" to make e.g. the apiserver not ready. On the overlay, there will be some churn (as described above) but things should be OK eventually.

[1] The same is true for e.g. SCTP

[2] The MSS can change after the fact due to PMTU discovery as I describe.

[3] Assuming DF=1 which is typical or IP Fragmentation will be performed.
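A rough sketch of the kind of rule set suggested above, assuming the rollout tooling is allowed to add and later remove host-level iptables rules (chain placement is my assumption; IPv6 would need the analogous ICMPv6 "packet too big" rules):

```bash
# Drop ICMP "fragmentation needed" (Type 3 Code 4) while the MTU rollout is in
# progress, so too-big replies from not-yet-updated nodes don't seed stale PMTU
# route cache entries across the cluster.
iptables -I INPUT   -p icmp --icmp-type fragmentation-needed -j DROP
iptables -I FORWARD -p icmp --icmp-type fragmentation-needed -j DROP

# Once the cluster has converged on the new MTU, remove the same rules.
iptables -D INPUT   -p icmp --icmp-type fragmentation-needed -j DROP
iptables -D FORWARD -p icmp --icmp-type fragmentation-needed -j DROP
```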

node that will handle the actual change of the MTU. Wait for them to be
ready and running.
4. Cordon every node; we don't want pods created from this point on.
5. Ensure, via machine-config-operator, that upon reboot configure-ovs will
Contributor

@cybertron FYI just in case there is any interaction with keyfiles placed by MCO with #817

Member

I don't think this will be a problem, at least in terms of day 1 or MCO. The day 1 configuration will be baked into the image and not managed by MCO, so after deployment if any changes are made MCO won't overwrite them.

The place where it could be an issue is kubernetes-nmstate. Any keyfiles that nmstate writes are at risk of being overwritten if they're modified outside nmstate. If the MTU of an interface configured by kubernetes-nmstate needs to be modified, it probably needs to be done via nmstate.

I'm not sure that's actually a problem right now as I don't think we support nmstate modifying the configuration on the OVNK interface anyway, but I believe they were looking to add support for that upstream so it might be in the future.

- Progressing: false
- Degraded: false

The steps to change the MTU performed by pods of previous step 3 are:
Contributor

should this pod detect if there are MTU problems after migration is complete, and post an event or something to indicate if it was successful or not?

Contributor Author

What do you mean by detect?

If OpenShift has a verification procedure to health-check deployments, then we can probably suggest in the documentation running it after this procedure.

Contributor

I mean that in your step 5:

5. Once all the previous pods finish successfully, deploy other set of pods with
   `restartPolicy: Never` on every node that will handle the actual change of
   the MTU (explained in more detail below). Wait for them to be ready and
   running.

So you are going to deploy another set of pods that do the configuration. I'm wondering if these pods will remain for some time after configuration and if they can run a healthcheck until all of the nodes are finished updating MTU. The check could be pinging from this pod to other "configuration pods" on other nodes with max MTU. If it doesn't come up after some time, then maybe an event can be posted or something to indicate to the user that MTU change failed.
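Sketching what such a check from inside one of these configuration pods could look like; the peer IP and target MTU are placeholders rather than anything the proposal currently commits to:

```bash
#!/bin/bash
# Probe a peer "configuration pod" with a maximum-size, non-fragmentable ping.
# ICMP payload = MTU - 20 (IPv4 header) - 8 (ICMP header).
NEW_MTU=1400          # hypothetical target cluster MTU
PEER_IP=10.128.2.15   # hypothetical peer configuration pod IP

payload=$((NEW_MTU - 28))
if ping -c 3 -W 2 -M do -s "$payload" "$PEER_IP" > /dev/null; then
    echo "MTU ${NEW_MTU} verified towards ${PEER_IP}"
else
    echo "MTU ${NEW_MTU} not usable towards ${PEER_IP}" >&2
    exit 1
fi
```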

Contributor Author

Usually when you do a new deployment you run some verification to check that the deployment has been done correctly and that the cluster is healthy. If this exists for OpenShift we can suggest in the documentation running it again. Otherwise, it would probably be better to have a different set of pods with a liveness probe or the like, rather than adding this to these specific pods.

Contributor

we have the network check target pods, but I was thinking something specifically scoped to the MTU change to give the user a signal that the MTU update worked as part of the MTU update process itself. Like the pods that you launch for doing the MTU upgrade exit successfully and log some message like MTU upgrade complete, or if they check network connectivity and something is now broken, they either crash or post an event to their pod saying MTU upgrade problem. If you think it's not necessary then that's fine to ignore.

Contributor Author (@jcaamano, Nov 4, 2021)

I would probably then use the network check target pods and enhance that for any specific MTU verification we think we need to do. Do you know where I can check them out?

These MTU change pods only change the MTU of pods, which is an operation for which we should know definitively whether it succeeded or not, and it is only one step of a 3-step process which also includes changing the host SDN interfaces' MTU and the host external interfaces' MTU, so I feel that a final verification of the MTU in these pods could be out of place.

@jcaamano (Contributor, Author) commented Oct 13, 2021

[Quoting @mccv1r0's full comment above about non-atomic MTU rollout, MSS negotiation and PMTU discovery.]

I am working on a section to analyze this.

@jcaamano jcaamano force-pushed the change-mtu branch 2 times, most recently from 8512ed1 to ca6125d Compare October 13, 2021 12:25
@mccv1r0 commented Oct 13, 2021

The OVS side of any veth pair should have an MTU of 65000, similar to what we appear to do for geneve (and vxlan):

# ip addr show genev_sys_6081 
6: genev_sys_6081: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65000 qdisc noqueue master ovs-system state UNKNOWN group default qlen 1000
    link/ether e6:44:9b:7a:48:6b brd ff:ff:ff:ff:ff:ff
    inet6 fe80::e444:9bff:fe7a:486b/64 scope link 
       valid_lft forever preferred_lft forever
# 

Then we never have to change them. There is no reason to deal with MTU issues inside a bridge/switch. If the datagram is really too large, the other side of the veth pair (e.g. the pod) will have the L3 knowledge to know what to do.

It might make sense to do this first, as part of a z-stream, so that this enhancement doesn't have to worry about the lack of an atomic change of both the pod veth and the "OVS veth".

@jcaamano (Contributor, Author)

[Quoting the suggestion above to give the OVS side of every veth pair an MTU of 65000.]

I think this is generally a good idea so we don't have to worry about managing it, but it won't help with traffic itself.

veth_A (65000) <-----> veth_B (1500)

If we are sending >1500 from veth_A to veth_B it will be dropped anyway: veth treats MTU as MRU as well.
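The drop behaviour is easy to reproduce with a throwaway veth pair across two network namespaces; the names and addresses below are made up for the demo:

```bash
# Two namespaces joined by a veth pair with asymmetric MTUs.
ip netns add mtu-a
ip netns add mtu-b
ip link add veth-a type veth peer name veth-b
ip link set veth-a netns mtu-a
ip link set veth-b netns mtu-b
ip -n mtu-a link set veth-a mtu 65000 up
ip -n mtu-b link set veth-b mtu 1500 up
ip -n mtu-a addr add 192.0.2.1/24 dev veth-a
ip -n mtu-b addr add 192.0.2.2/24 dev veth-b

# A >1500 byte, non-fragmentable ping from the 65000 side does not get through:
# the receiving end's MTU effectively acts as an MRU.
ip netns exec mtu-a ping -c 1 -M do -s 2000 192.0.2.2

# Clean up.
ip netns del mtu-a
ip netns del mtu-b
```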

@jcaamano jcaamano force-pushed the change-mtu branch 2 times, most recently from 125c363 to 7e122f1 Compare October 13, 2021 17:05
openshift-ci bot commented Nov 4, 2021

@jcaamano: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/markdownlint | 4fcde8e | link | true | /test markdownlint |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.


### Non goals

* Change the MTU without service disruption.
Contributor

Suggested change
* Change the MTU without service disruption.
* Change the MTU with absolutely no service disruption.

Comment on lines +88 to +89
9. Set the new MTU value to the applied-cluster config map AND wait for pods of
step 3 to complete successfully.
Contributor

step 5

Comment on lines +90 to +91
10. If any of the previous steps (8,9) failed, reboot the node, wait for the
kubelet to be reporting as Ready again.
Contributor

If any of the previous steps (8,9) failed on any nodes, drain and reboot the failed nodes, one at a time, and wait for each one to be reporting as Ready again.

Comment on lines +96 to +98
11. Upon completion, set conditions to:
- Progressing: false
- Degraded: false
Contributor

- Upgradeable: true

- Progressing: false
- Degraded: false

The steps to change the MTU performed by pods of previous step 3 are:
Contributor

- Degraded: true
Update the operator configuration status with a description of the problem.
At this point the process is interrupted and we require manual intervention.
8. Force a rollout of the ovnkube-node daemonset. This will ensure
Contributor

So this means the ovnkube upgrade is totally out of sync with the pod-level upgrades, and every node needs to wait for every other node to finish its pod-level upgrades before any of them can do the ovnkube-level upgrade.

A better approach might be: instead of having CNO force a re-rollout of the DaemonSet, just have the step 5 pod kill the local ovnkube-node process, forcing it to be restarted.

And then it could even choose to do that step before or after the pod-level fixes, depending on which direction the MTU is changing in...

Contributor Author

The thing is that if we restart ovn-kube after the step 5 pod changes the MTUs, there is a window in between where new pods may be allocated with the old MTU.

That's why we restart ovn-kube first; we'll do the roll-out with max unavailability so that it is quick, and then we let the step 5 pod proceed with the MTU changes. Yes, it is out of sync, but hopefully quick enough.
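For reference, the forced rollout being discussed here amounts to roughly the following (assuming the stock openshift-ovn-kubernetes namespace; the exact mechanics would be internal to CNO):

```bash
# Restart all ovnkube-node pods; with maxUnavailable raised on the daemonset's
# update strategy the rollout completes quickly instead of node by node.
oc -n openshift-ovn-kubernetes rollout restart daemonset/ovnkube-node
oc -n openshift-ovn-kubernetes rollout status daemonset/ovnkube-node --timeout=10m
```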

An administrator should be able to change the cluster network MTU through
CNO configuration change. This would encompass the following tasks:

##### Implement a pod that changes the actual MTU on running pods
Contributor

So again, the parts talking about implementation details don't belong in "User Stories". And they're redundant with what you've already said, so you can just remove them.

Contributor Author

I will move them to a specific section. The thing is that there is no way to map (user) stories here to (non-user) stories in Jira.


### Risks and Mitigations

* If unexpected problems ocurr this procedure, the mitigation is an automated
Contributor

Suggested change
* If unexpected problems ocurr this procedure, the mitigation is an automated
* If unexpected problems occur during this procedure, the mitigation is an automated

Comment on lines +203 to +205
There are circumstances that prevent an endpoint from being aware of the actual
MTU to a destination, which depends on Path MTU discovery and specific ICMP
`FRAG_NEEDED` messages:
Contributor

which in general seem to not work over OVS


## Alternatives

### New ovn-k setting: `routable-mtu`
Contributor

So this sounds better than the proposed solution... why aren't we doing it this way?

Contributor Author

A double rolling reboot seemed unacceptable, but I don't know what the latest stance on it is. Perhaps @vpickard can comment on this.


Yes, I was concerned that 2 reboots would not be acceptable from a customer perspective. @mcurry-rh What are your thoughts on having to perform 2 reboots to change the MTU?

Contributor Author

Replicating the feedback we got from @mcurry-rh:
"not ideal... acceptable... MTU adjustment is a rare event, so 2 reboots, while painful, is not fatal and achieves the objective"
So the second alternative is based on already available node maintenance knowledge, is simpler to implement, and is a safer approach all around, while the main alternative is more efficient at the cost of that safety. We could prototype it as well.
@abhat @trozet @knobunc @dcbw we would need to make a call on this. Do you have any opinion?

Contributor

So, if the cluster is actually "broken" because of the bad MTU, then having to do two reboots isn't that bad since you're probably not running anything useful anyway.

And if it's not broken, then the MTU change probably isn't urgent, and the procedure doesn't actually require that the two reboots happen back-to-back; they could happen 24 hours apart or something. (Right? The cluster is stable/consistent in the inter-reboot phase?) So we could even just make it so that the CNO doesn't initiate any rolling reboots itself, it just does:

  • CNO makes the initial change to mtu/routable-mtu
  • CNO observes nodes until it sees that every node has rebooted (for whatever reason) and is using the changed configuration.
  • CNO makes the second change to mtu/routable-mtu
  • CNO observes nodes until it sees that every node has rebooted and is using the changed configuration.
  • CNO updates the operator status accordingly

So then the admin could schedule two sets of rolling reboots on consecutive nights, or even just make the config change and then forget about it, and the first change would complete the next time they did a z-stream update and the second change would complete after the next update after that.
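A back-of-the-envelope version of that observation step, written as something an admin could run by hand; the interface checked (br-ex) and the target value are assumptions for the sketch:

```bash
#!/bin/bash
# Hypothetical convergence check: report which nodes are already running with
# the target MTU on br-ex after their reboot.
TARGET_MTU=1400   # placeholder target value

for node in $(oc get nodes -o name); do
    mtu=$(oc debug "$node" -q -- chroot /host cat /sys/class/net/br-ex/mtu 2>/dev/null)
    if [ "$mtu" = "$TARGET_MTU" ]; then
        echo "$node: converged (mtu=$mtu)"
    else
        echo "$node: pending (mtu=${mtu:-unknown})"
    fi
done
```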

Contributor Author (@jcaamano, Nov 10, 2021)

(Right? The cluster is stable/consistent in the inter-reboot phase?)

Yes.

Contributor

"So, if the cluster is actually "broken" because of the bad MTU" -- That is not always a safe assumption. One case we had was where a customer had a large, running cluster and wanted to add new nodes. But the new nodes were on OpenShift and they needed to drop the MTU to allow for the VxLAN header in the OSP networking. I assume most cases will be like that, otherwise they could just reinstall...

Contributor

"So, if the cluster is actually "broken" because of the bad MTU" -- That is not always a safe assumption.

Hence the "if"

since the actual interfaces MTU did not change they will not drop traffic
coming from other nodes.
* Set in ovn-config a `mtu` equal to `routable-mtu` or replace `mtu` with the
`routing-mtu` value and remove the latter.
Contributor

I'm not sure what routing-mtu is here?

Contributor Author

That should be routable-mtu

traffic drop is expected.

Increase example:
* Set in ovn-config the actual `mtu` as `routable-mtu` and a new `mtu` setting
Contributor

This is confusing... I would suggest just using some numbers in your example.

Contributor Author (@jcaamano, Nov 17, 2021)

Prototyped it in ovn-kubernetes/ovn-kubernetes#2654, perhaps the description I gave there is easier to understand:

The routable-mtu setting is introduced to facilitate a procedure for changing
the MTU on a running cluster with minimum service disruption.

Given current and target mtu values:

1. Set mtu to the higher MTU value and routable-mtu to the lower
MTU value.
2. Do a rolling reboot. As a node restarts, routable-mtu is set on
all appropriate routes while interfaces have mtu configured.
The node will effectively use the lower routable-mtu for outgoing
traffic, but be able to handle incoming traffic up to the higher
mtu.
3. Change the MTU on all interfaces not handled by ovn-k to the target
MTU value. Since the MTU effectively used in the cluster is the lower
one, this has no impact on traffic.
4. Set mtu to the target MTU value and unset routable-mtu.
5. Do a rolling reboot. As a node restarts, the target MTU value is set
on the interfaces and the routes are reset to default MTU values.
Since the MTU effectively used on other nodes of the cluster is the
lower one, but they are able to handle the higher one, this has no impact on
traffic.

routable-mtu is set as the MTU for the following routes:

* pod default route
* non link scoped management port route
* services route
* link scoped node routes
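To make the route-level mechanism concrete, applying the lower routable-mtu to one of those routes while the interface keeps the higher MTU would look roughly like this (device, gateway and values are illustrative only):

```bash
# The interface keeps the higher MTU so the node can still receive big packets...
ip link set dev ovn-k8s-mp0 mtu 9000

# ...while the route carries the lower routable-mtu, capping what this node
# sends towards, for example, the services CIDR.
ip route replace 172.30.0.0/16 via 10.128.2.1 dev ovn-k8s-mp0 mtu 1400
```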

@jcaamano (Contributor, Author)

Closed in favor of #963 which expands on the alternative of performing the MTU change through rolling reboots.

@jcaamano jcaamano closed this Nov 23, 2021