Add ovs-hardware-offload enhancement #296
Conversation
> daemon and SR-IOV hardware NIC.
>
> - OpenvSwitch daemon is managed by Cluster Network Operator (CNO) and runs in
>   `ovs-node` DaemonSet. A boolean configurable field will be added in CNO
Is this boolean node-specific? AFAIK, it is not mandatory for all nodes to have offload, and only the OVS instances running on the specific nodes should have the offload config enabled.
It will be a cluster-wide configuration for ovn-kubernetes.
I think it's worth discussing whether we want per-node or per-role control of offload enablement.
One possible way of doing per-role control is to add a pool selector in spec.defaultNetwork.ovnKubernetesConfig
to specify which MachineConfigPool shall be enabled with hw-offload.
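For illustration, such a pool selector could look something like the sketch below; both field names are hypothetical and not part of the current CNO API.
```yaml
# Illustrative sketch only. hwOffload is the boolean field mentioned in the
# diff above; hwOffloadMachineConfigPool is the hypothetical pool selector
# discussed in this comment. Neither field exists in the current API.
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  defaultNetwork:
    type: OVNKubernetes
    ovnKubernetesConfig:
      hwOffload: true                          # hypothetical boolean, defaults to false
      hwOffloadMachineConfigPool: hw-offload   # hypothetical MachineConfigPool selector
```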
Is it possible to make it pod-specific? Can offload and non-offload interfaces work in parallel in one OVS instance?
OVS hw offload will be supported on the primary pod interface with an SR-IOV device. Having a second interface in one OVS instance implies that ovn-k8s-cni-overlay be used for an additional pod interface, which is not considered in this enhancement.
Traffic will only be offloaded when an SR-IOV VF (in switchdev mode) is attached to the pod primary network, so OVS hw offload can be requested on a per-pod basis by using a customized default net-attach-def.
Also, the configuration is global to OVS, so all datapath flows that the NIC vendor supports for hw offload will be offloaded.
@moshe010 For a non-switchdev pod that uses a veth pair as its primary interface, how would OVS decide not to offload the traffic? For example, does it check whether the hw-tc-offload feature is enabled on the attached device (veth or VF representor)?
Rethinking this, it might make sense to enable OVS hw offload for a specific MachineConfigPool instead of the whole cluster, both to isolate the potential impact of "the whole cluster is down due to a hw-offload failure" and because not all cluster nodes have hw-offload capable NICs.
@pliurh @krsacme @moshe010 @danwinship wdyt?
Yes, it should be configurable on a node group rather than the whole cluster, to allow mixed capability on the cluster.
> @moshe010 For a non-switchdev pod that uses a veth pair as its primary interface, how would OVS decide not to offload the traffic? For example, does it check whether the hw-tc-offload feature is enabled on the attached device (veth or VF representor)?

When OVS is configured with hardware offload, it uses TC as its default datapath. If the DP rule cannot be handled in TC, then the OVS kernel DP is used.
In TC, if the underlying device is capable of hardware offload, TC will try to offload the rule to the device (unless otherwise specified in tc-policy, i.e. skip-sw / skip-hw). For a veth device, as it is not capable of hardware offloads, the flow will not be offloaded (but would still be handled by TC).
+1 on a per-node-group configuration for this. I would not move the entire cluster to use OVS with the TC datapath until it is properly validated.
> When OVS is configured with hardware offload, it uses TC as its default datapath. If the DP rule cannot be handled in TC, then the OVS kernel DP is used.
> In TC, if the underlying device is capable of hardware offload, TC will try to offload the rule to the device (unless otherwise specified in tc-policy, i.e. skip-sw / skip-hw). For a veth device, as it is not capable of hardware offloads, the flow will not be offloaded (but would still be handled by TC).

I was thinking to have the default tc-policy configured to skip-sw, so that OVS will do either hw offload or the kernel datapath until the TC software datapath is properly validated, and to expose tc-policy as a configurable option in case people want to try the TC software datapath.

> +1 on a per-node-group configuration for this. I would not move the entire cluster to use OVS with the TC datapath until it is properly validated.

To identify a group of nodes, we have two options:
one is to use the existing MachineConfigPool, the other is to define customized labels for a node group.
The first option may not contain exactly the hw-offload capable nodes (some nodes in a pool may have a capable NIC, others may not, as the pool is not grouped according to hw-offload capability). The second would allow us to accurately define a hw-offload capable node pool. I think the second is what we should do.
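As a rough illustration of the second option, the node group could be identified by a customized label along these lines; the label key is made up for this sketch and is not defined by this enhancement.
```yaml
# Illustrative sketch only: a customized label used to group the
# hw-offload capable nodes. The label key is hypothetical.
apiVersion: v1
kind: Node
metadata:
  name: worker-0
  labels:
    network.openshift.io/hw-offload-capable: "true"
```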
> is `false`.
>
> - Setting `tc-policy` in CNO `ovs-node` DaemonSet is equivalent to executing
>   `ovs-vsctl set Open_vSwitch . other_config:tc-policy=skip_sw`. The default
The default value `none` uses both: it writes to software and, if available, to hardware. Any particular reason to use `skip-sw`?
No particular reason; this is an example of `tc-policy`. It will be configurable by the cluster admin.
Can we just set `hw-offload=true` and `tc-policy=` on all the nodes by default, and not expose the knob to users? Is there any drawback to using tc-flower on a node without hardware offloading?
OVS hw offload is experimental in the OVS project and is not GA-supported on RHEL; until then I think we may want to have it disabled by default or controlled per cluster.
If `tc-policy` is configured to allow both sw and hw, the traffic can fall back to the OVS software datapath if a flow cannot be offloaded to hw.
@krsacme Using skip-sw allows OVS to switch back to the OVS kernel datapath (the original datapath used on OpenShift) on hw offload failure; this excludes the TC software datapath. There are uncertainties in using the TC software datapath: is it mature enough? What does the performance look like compared with the OVS kernel datapath? What is the debuggability?
> There are uncertainties in using the TC software datapath

I don't know about uncertainties in using TC. But it would be recommended to use the configuration that offload is validated with. OVS goes with TC as the default option rather than the OVS kernel datapath. Mellanox is in a better position to answer this.

> What does the performance look like compared with the OVS kernel datapath?

There is no analysis that I know of between "none" and "skip_sw". It would be better to do that analysis in order to choose the mode.
> There are uncertainties in using the TC software datapath
>
> I don't know about uncertainties in using TC. But it would be recommended to use the configuration that offload is validated with. OVS goes with TC as the default option rather than the OVS kernel datapath. Mellanox is in a better position to answer this.

Would the kernel be the default datapath type if not specified? Or has it been changed to use TC as the default, and in which OVS version?
> But it would be recommended to use the configuration that offload is validated with. OVS goes with TC as the default option rather than the OVS kernel datapath. Mellanox is in a better position to answer this.

When `hw-offload` is enabled, the default datapath would be TC. We recommend using `tc-policy=none`, which is how this feature is validated when SR-IOV is used.
Is there any known issue when using tc-policy=skip-sw, or is it just not validated?
> Is there any known issue when using tc-policy=skip-sw, or is it just not validated?

It's not validated.
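For illustration only, if tc-policy were exposed as an admin knob in the CNO configuration as discussed in this thread, it might look roughly like the sketch below; the field name is an assumption, and the enhancement itself only commits to a setting equivalent to the quoted `ovs-vsctl` command.
```yaml
# Illustrative sketch only: exposing tc-policy as a cluster-admin knob.
# The field name is hypothetical and not part of the current CNO API;
# valid tc-policy values in OVS are none, skip_sw, and skip_hw.
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  defaultNetwork:
    type: OVNKubernetes
    ovnKubernetesConfig:
      hwOffloadTcPolicy: skip_sw   # hypothetical field, default under discussion
```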
> #### SR-IOV Operator
>
> - SR-IOV Operator will support new device type `switchdev` in Operator
Are there any udev rules created in order to retain the name of the PF device? After changing the mode to `switchdev`, the PF device's name will change. Since the VF representors are already renamed, there is no need for udev rules for VF representors, but PF naming has to be managed appropriately.
Yes, the PF will be renamed to its original name when provisioned to switchdev mode, using udev rules. Otherwise, the bond interface (ifcfg-bond) will not be able to find the right interface.
So in OpenStack we do have udev rules for the VF representors, but in ovn-kubernetes we rename the VF representor to the container ID, same as veth. It may be better to use udev rules and keep the VF representor names consistent, because it is easier to debug, and on driver restart the kernel puts back the old name and I am not sure what will end up in OVS. Also, we need udev to disable NetworkManager, as it can remove the tc qdisc, see [1].
[1] - https://bugzilla.redhat.com/show_bug.cgi?id=1815060
@moshe010 If using shared-gateway mode, do you think we still have this issue (NetworkManager restarting the interface, which causes qdisc rules to be removed)?
I think so. We solve it in OpenStack by adding a udev rule [1] to unmanage the VF representors.
I think so. In OpenStack we add a udev rule to disable NM on VF representors [1].
Added a sentence in the config daemon section: it will create/apply udev rules for both the PF device and the VF representors.
> - SR-IOV Operator will support new device type `switchdev` in Operator
>   [SriovNetworkNodePolicy API](https://github.com/openshift/sriov-network-operator/blob/master/pkg/apis/sriovnetwork/v1/sriovnetworknodepolicy_types.go#L33).
>
> - SR-IOV config daemon (a DaemonSet managed by SR-IOV Operator) will support
The `switchdev` mode configuration has to be applied again on node reboot (along with creating VFs). How is this achieved?
The SR-IOV config daemon will reconcile the existing policies when it's launched after rebooting.
It should also apply the configuration before the network service starts (when the bond interface will be created).
Just for reference, the Mellanox team has proposed a PR in SNO for this feature: openshift/sriov-network-operator#75
A systemd service will be used to apply the switchdev mode config as well as the number of VFs. This is to guarantee the order of: `numVfs` -> switchdev config -> bond config -> enslave bond in OVS.
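To make the proposed API addition concrete, a hedged sketch of a policy using the new device type is shown below; apart from the `switchdev` value, the fields follow the existing SriovNetworkNodePolicy API linked above, and the concrete names and values are examples only.
```yaml
# Hedged sketch of the proposed policy. deviceType: switchdev is the new
# value this enhancement adds; all names, selectors, and counts here are
# example values, not requirements of the design.
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: switchdev-policy
  namespace: openshift-sriov-network-operator
spec:
  resourceName: switchdev_network   # exposed as openshift.io/switchdev_network
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"   # example selector
  numVfs: 4
  nicSelector:
    pfNames: ["ens1f0"]             # example PF name
  deviceType: switchdev             # proposed new device type
```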
> - Multus supports overwriting cluster network definition by specifying a
>   customized default net-attach-def in pod annotation. Customized net-attach-def
>   will still use `ovn-k8s-cni-overlay` CNI plugin, but with an special annotation
>   that requests a SR-IOV device.
I think it may be better if we could have an admission controller to do the modification automatically, which would make it transparent to the users.
There needs to be a place where the user can request a hw-offload pod/interface. Currently the design relies on the user to specify the default cluster-network annotation; how would the admission controller know whether the user wants a hw-offload pod?
We can use a pod annotation to indicate whether a pod needs HW offloading.
The wording here is not clear about who is writing this annotation and why. (ie, that it needs to be specified on the pods that will have offload, and that any pod that doesn't have the annotation won't get hw offload). I agree with Peng that we shouldn't require the admin to write out a complete (and correct) net-attach-def. There should be some simple way to declare it.
Does this need any security / authorization associated with it?
Re the wording, I'll update it and add the owner of each action to make it clear who will do what.
The current pod creation flow is the same as how a normal SR-IOV pod is created: the user first defines an SR-IOV net-attach-def via the SR-IOV Operator API, then specifies the net-attach-def in the pod annotation. If we were to add another way to declare the use of an OVS hw offload pod, it would be inconsistent. The customized net-attach-def for OVS hw offload is, in my opinion, not complex, as shown below:
```yaml
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: switch
  annotations:
    k8s.v1.cni.cncf.io/resourceName: openshift.io/switchdev_network
spec:
  Config: '{"cniVersion":"0.3.1","name":"ovn-kubernetes","type":"ovn-k8s-cni-overlay","ipam":{},"dns":{}}'
```
It has the same `spec.Config` body as the default net-attach-def in the Multus CNI config; the only difference is the annotation part, which varies per resource name (arbitrarily defined by the cluster network admin). So it will not be a fixed net-attach-def that can be pre-defined. But I agree that, with the help of an admission controller (whether it's multus or sriov), this can be achieved by adding a different annotation to the pod to declare the use of OVS hw offload; the annotation should contain the desired SR-IOV resource name. The admission controller would then inspect each pod annotation and create and add the customized net-attach-def for the ovs-hw-offload pod.
Re security/authorization, it will have the same security/authorization as is applied to a normal net-attach-def, like namespace isolation etc. An extra security check may need to be added for the customized net-attach-def to make sure the body of the net-attach-def is not configured with any unexpected values in ipam/dns.
Updated to create the customized net-attach-def automatically in the sriov-operator, which knows the switchdev resource name to be used in the net-attach-def. With this, the user only needs to add the default network annotation to the pod and does not need to create the customized net-attach-def themselves.
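For illustration, with that change a pod would only need a default-network annotation along these lines; the annotation key is the one described in the Multus default-cluster-network document referenced in the diff context below, `switch` refers to the example net-attach-def above, and the pod name and image are placeholders.
```yaml
# Hedged sketch of a pod opting into the hw-offload default network.
# "switch" is the example net-attach-def shown earlier in this thread;
# the pod name and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: offloaded-app
  annotations:
    v1.multus-cni.io/default-network: switch
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest
```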
> - Multus will support specifying net-attach-def with customized annotations for
>   [default cluster network in pod annotations](https://github.com/intel/multus-cni/blob/master/doc/configuration.md#specify-default-cluster-network-in-pod-annotations).
>   Customized annotations will contain a resource name requesting SR-IOV device.
Same as above, maybe we can create an admission controller to mutate the pods automatically.
> - Mellanox SR-IOV VF-LAG provides hardware link-aggregation (LAG) for hardware
>   offloaded ports. It allows offloading of OVS rules to a linux `bond` network
>   device that combines two idential and hardware offload capable Physical
>   Functions (PFs). In order to use SR-IOV VF-LAG, a bond interface needs to be
It's not clear if you're describing a required implementation detail of OVS offload ("you have to create a `bond` device even if you aren't using bonding") or if you're describing how OVS offload and bonding interact with each other ("If you're using bonding and want OVS offload too, then you have to set things up like THIS, rather than like THAT.")
Updated, please take a second look at the update.
It now starts with: "To use OVS Hardware Offload with link redundancy, a bond interface needs to be created and used as the geneve endpoint of the OVS bridge."
> ### Test Plan
>
> - The changes in each relevant component must pass full e2e suite.
> - OVS hardware offload functional tests must pass on supported NICs.
What are these "OVS hardware offload functional tests"?
So QE will have hardware to test this with? Do they have this already?
We don't plan to do full e2e runs on clusters with hardware offload configured?
There are no defined functional tests in any of the OVS hw offload related repos; the basic functional tests I was thinking of include the following steps:
- set up a baremetal cluster with OVS hw offload enabled.
- create a pod with an SR-IOV device working in switchdev mode.
- check that the OVS flows for the pod are offloaded to the hardware NIC and that traffic between pods is not blocked.
QE has the same model of Mellanox CX-5 NIC as we had in baremetal CI; it will be used for running the OVS hw offload testing. The NIC is currently used for SR-IOV testing.
Regarding e2e with hardware offload enabled, it will be added to the current OVN CI once the relevant change is ready.
> ### Graduation Criteria
>
> Tech Preview:
is this for graduating to Tech Preview or from Tech Preview?
It's to Tech Preview. Updated.
> - Cluster Network Operator has API definition to enable OVS hardware offload
> - OVN-Kubernetes can configure VF and associated representor
> - Document how to use customized net-attach-def with Multus
> - SR-IOV Operator has API definition to configure `switchdev` device type
This is very low-level, and it doesn't actually include the requirement "... and it actually works" 🙂. I would just say that the criteria is that you can bring up a cluster using hardware offload, and you can do X, Y, and Z on it. (eg, you can bring up a mix of pods using offload and not using offload, on different nodes, and they can all communicate with each other, and you can apply NetworkPolicies to block some traffic, and that works too, ...)
Updated
> ### Goals
>
> Enable OpenFlow hardware offload for OVN datapath
Does this mean that the goal targets not only OvS but also other OpenFlow devices (i.e. an OpenFlow hardware switch, like ovs-db/OSP)? Or just OvS offloading?
I think this should be changed to "ovs hardware offload".
@s1061123 it will be just ovs offloading.
@moshe010 will update.
updated
> ### Non-Goals
>
> Using hardware offload for any part of the OpenShift control plane
Can you explain how the control plane is configured today? Is it all managed by OVS? If so, then I don't think you can control what will be offloaded or not. The hardware offload is a global configuration for OVS.
Yes, we cannot control what will be offloaded once it's enabled on an OVS instance, but what can be controlled is which nodes have OVS enabled with hw-offload and which pods can have a switchdev VF.
I think by saying we will not offload any part of the OpenShift control plane, it means the OVN infra pods (ovnkube-master, ovnkube-node, ovs-node) are not using a VF for flow offloading. @danwinship did I understand it correctly?
Yes, that's what I meant: all pods managed by core OCP operators will still use non-accelerated veths; only pods created by the end user will use accelerated VFs.
(In particular, if an admin tries to add an annotation to a "core OCP" pod to make it accelerated, then the operator that owns that pod will just revert the change. So if we wanted to support a use case like "make prometheus have an accelerated interface", then we'd need to add some explicit way to configure that. There's no reason we couldn't do that later if we wanted to, but it requires some additional stuff beyond what's already in this proposal.)
> - When enabling OVS Hardware Offload with SR-IOV VF LAG, MCO needs to apply
>   linux bond configuration. It's important that SR-IOV Operator (installed day 2)
>   makes sure `switchdev` configurations are applied before linux bond
We may want to add other FW configs while using switchdev mode to improve performance; can we add them to the SR-IOV Operator?
How would the customer get the firmware? I'm not sure if we should expose an interface for such configuration at the OpenShift level.
I think @moshe010 means that additional vendor-specific operations may be required (performed by a sriov-network-operator plugin) to improve performance.
Yes, we can add it via SR-IOV Operator plugins.
@moshe010 @adrianchiris what other FW configs shall be used to improve performance?
For now I'm not aware of anything specific we would want to do in OpenShift.
There were requests in OSP to address specific use-cases.
@adrianchiris do you have a reference to the OSP-specific use case? Just trying to understand the case.
> be created and used as geneve endpoint of OVS bridge. Mellanox SR-IOV VF-LAG
> provides hardware link-aggregation (LAG) for hardware offloaded ports. It
> allows offloading of OVS rules to a linux `bond` network device that combines
> two idential and hardware offload capable Physical Functions (PFs). With the
typo: idential -> identical
will fix in next update.
Reviewing the upstream documentation, I see that this has a complicated interaction with multus and device plugins. If a pod is scheduled to a node with ovs hardware offload enabled, and that pod does not have the NetAttachDefinition applied, will it still function?
@squeed Yes, it will still work with a regular veth interface attached to the pod.
I see. I'm not familiar with how all these pieces fit together. Assuming there will "always" be non-offload pods on the node, will they be able to reach each other? I'm wondering if we should add the requisite device plugin support into ovn-kubernetes directly.
@openshift-bot: Closed this PR.