Deploy Kubernetes-nmstate with openshift #161

Closed
wants to merge 7 commits into from
141 changes: 141 additions & 0 deletions enhancements/kubernetes-nmstate-integration.md
@@ -0,0 +1,141 @@
---
title: kubernetes-nmstate
authors:
- "@schseba"
reviewers:

approvers:
- TBD

creation-date: 2019-12-18
last-updated: 2019-12-18
status:
---

# kubernetes-nmstate

## Release Signoff Checklist

- [ ] Enhancement is `implementable`

I guess we can set the implementable box

- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

A proposal to deploy [kubernetes-nmstate](https://github.com/nmstate/kubernetes-nmstate/) on OpenShift.

Node-networking configuration driven by Kubernetes and executed by
[nmstate](https://nmstate.github.io/).

## Motivation

With hybrid clouds, node-networking setup is becoming even more challenging.
Different payloads have different networking requirements, and not everything
can be satisfied as overlays on top of the main interface of the node (e.g.
SR-IOV, L2, other L2).
The [Container Network Interface](https://github.com/containernetworking/cni)
(CNI) standard enables different
solutions for connecting networks on the node with pods. Some of them are
[part of the standard](https://github.com/containernetworking/plugins), and there are
others that extend support for [Open vSwitch bridges](https://github.com/kubevirt/ovs-cni),
[SR-IOV](https://github.com/hustcat/sriov-cni), and more...

However, in all of these cases, the node must have the networks set up before the
pod is scheduled. Setting up the networks in a dynamic and heterogeneous cluster,
with dynamic networking requirements, is a challenge in itself, and this is
what this project addresses.

### Goals

- Deploy kubernetes-nmstate as part of OpenShift
Contributor

This isn't a goal. This is a solution. What's the high-level goal?


The goal of this document is to deploy k-nmstate as part of OpenShift, with the motivation expressed above. The solution (use of operators, etc.) is described below.
I guess the goal could be stated as something like "Manage day 2 node networking in an OpenShift cluster", which summarizes the motivation; installing k-nmstate is then the solution and the rest is implementation.

IMO it is borderline and both seem ok to me.

Member

I agree that this defines a solution and not a goal. For those less familiar with nmstate, it would be good to explain why nmstate is the right solution to addressing a goal and if there are others evaluated.

Contributor

@bcrochet we could add a section that gives background about nmstate being a declarative interface to Enterprise Linux's network interface configuration tool of choice (NetworkManager).


### Non-Goals

- Replace the SR-IOV Operator
Contributor

Thanks for spelling out this as a non-goal :-)
When both the SR-IOV Operator and Kubernetes-nmstate are deployed, is there a mechanism to guarantee that a device managed by one of them is not misconfigured by the other?

Member

kubernetes-nmstate will touch interfaces only when you explicitly ask it to. AFAIK, the only SR-IOV related feature in nmstate is setting the number of VFs. So unless somebody creates a Policy changing that, there should be no conflict.

Contributor

IIUC, Kubernetes-nmstate will be able to take over control of VF devices even if the VFs are provisioned by the SR-IOV Operator.

Contributor Author

If the user wants to create a policy that changes the VF configuration, we don't block it right now.


## Proposal

A new kubernetes-nmstate handler DaemonSet is deployed in the cluster as part of the OpenShift installation.
Contributor

It is not clear to me how the kubernetes-nmstate is going to be deployed.
Is it an Operator deployed via Operator hub or during initial OpenShift installation?
If it runs as DaemonSet, which component will be responsible to create manifests such as CRDs?

Member

The latest version of the proposal outlines the operator that is being developed that is in charge of that. The plan is to have it installed during initial OpenShift installation.

This DaemonSet contains the nmstate package and interacts with NetworkManager
on the host by mounting the related D-Bus. The project contains two
Custom Resource Definitions, `NodeNetworkState` and `NodeNetworkConfigurationPolicy`.
A `NodeNetworkState` object is created for each node in the cluster and can be
used to report available interfaces and network configuration. These objects
are created by kubernetes-nmstate and must not be touched by a user.
`NodeNetworkConfigurationPolicy` objects can be used to specify the desired
networking state per node or set of nodes. They use an API similar to `NodeNetworkState`.

The kubernetes-nmstate DaemonSet creates a `NodeNetworkState` custom resource for each node and
updates it with the network topology of that node.

The user describes host network changes in a `NodeNetworkConfigurationPolicy` custom
resource and applies it. Network topology is configured via the `desiredState` section of the policy.
Multiple `NodeNetworkConfigurationPolicy` custom resources can be created.

Upon receiving a notification event of a `NodeNetworkState` update, the
kubernetes-nmstate daemon verifies the correctness of the `NodeNetworkState` custom resource and
applies the selected profile to the specific node.
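
As a rough illustration of the `desiredState` idea described above, the following Python sketch builds a `NodeNetworkConfigurationPolicy`-shaped manifest for a bond with a VLAN on top. Everything concrete in it is an assumption made for illustration: the `nmstate.io/v1alpha1` API version, the node selector label, the interface names, and the nmstate field names (`link-aggregation`, `slaves`, `base-iface`), which may differ between nmstate releases. The helper `bond_with_vlan_policy` is hypothetical and not part of kubernetes-nmstate.

```python
import json


def bond_with_vlan_policy(name, bond, members, vlan_id, node_selector):
    """Build a NodeNetworkConfigurationPolicy-style manifest as a plain dict.

    Hypothetical helper: the API version and field names are assumptions
    and should be checked against the nmstate release actually deployed.
    """
    vlan_name = f"{bond}.{vlan_id}"
    return {
        "apiVersion": "nmstate.io/v1alpha1",  # assumed group/version
        "kind": "NodeNetworkConfigurationPolicy",
        "metadata": {"name": name},
        "spec": {
            # Restrict the policy to a set of nodes (assumed field).
            "nodeSelector": dict(node_selector),
            "desiredState": {
                "interfaces": [
                    {
                        "name": bond,
                        "type": "bond",
                        "state": "up",
                        "link-aggregation": {
                            "mode": "active-backup",
                            "slaves": list(members),
                        },
                    },
                    {
                        "name": vlan_name,
                        "type": "vlan",
                        "state": "up",
                        "vlan": {"base-iface": bond, "id": vlan_id},
                    },
                ]
            },
        },
    }


policy = bond_with_vlan_policy(
    name="bond0-vlan100",
    bond="bond0",
    members=["eth1", "eth2"],  # example NIC names, not from the proposal
    vlan_id=100,
    node_selector={"node-role.kubernetes.io/worker": ""},
)
print(json.dumps(policy, indent=2))
```

Keeping the manifest as plain data makes it easy to generate one policy per node group, which matches the statement above that multiple policies can coexist.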

### User Stories
Contributor

There is a use case where the user doesn't want to assign SR-IOV VFs to an SR-IOV Pod; instead they'd like to use macvlan on top of VFs. Do you think kubernetes-nmstate can be used in such a case to manage this kind of VF once it is created by the SR-IOV Operator? By managing VFs, I mean configuring network attributes such as vlan, mtu, etc. on that VF device (which, in this case, can be considered a host device).

Member

@phoracek, Dec 19, 2019

AFAIK nmstate does not support configuring these attributes of a VF. @EdDev? Not sure whether there is support for configuring VF parameters through NetworkManager.


A VF will eventually become a regular interface, so you could define whatever you want on it.
In addition, nmstate/nmstate#648 extends PF and VF definition to expose other capabilities.

If the PF or VF is defined by a different tool (CNI?), nmstate will "see" it but will consider it "down". You can take control over them using nmstate and define whatever you want.

Contributor

@zshi-redhat, Dec 23, 2019

Thanks for the linked PR for supporting PF/VF capabilities via nmstate.

The current SR-IOV Operator is responsible for VF provisioning (creating a number of VFs on the host) and manages SR-IOV sub-components, such as the SR-IOV CNI and the SR-IOV Device Plugin. However, we are adding VF index support to the SR-IOV Operator so that it manages only a subset of the VF devices of a PF. Once merged, this will leave the rest of the VFs on the same PF unmanaged, which is where I see kubernetes-nmstate fitting in and taking over their management.

The SR-IOV CNI managed via the SR-IOV Operator can configure VF properties such as spoof check, trusted VF, MAC address, link state, tx_rate, etc., the same as Kubernetes-nmstate. But it is only used when a VF is requested and attached to a Pod. I think this is a clear dividing line that we should keep in mind going forward between the use of kubernetes-nmstate and the SR-IOV Operator for configuring VF properties. If it's for Pod VF configuration, then the SR-IOV CNI should be used. If it's for host-level VF configuration that serves some purpose other than direct use in an SR-IOV Pod, then Kubernetes-nmstate should be used.

Several questions regarding provisioning VFs (setting the number of VFs):

  1. Is Kubernetes-nmstate also responsible for driver binding such as vfio-pci? Or will it be?
  2. When provisioning VFs for a Mellanox card, does it support configuring NIC firmware to set the max number of total_numvfs? Further to this, do you see Kubernetes-nmstate being extended to do vendor-specific config, such as changing the link type (Ethernet or InfiniBand) for a Mellanox card?


> 1. Is Kubernetes-nmstate also responsible for driver binding such as vfio-pci? or will it be ?

DPDK and 3rd party interfaces have been discussed as part of OpenStack requirements.
The discussions and requirements have not materialized into any plan yet, mainly because there was not enough demand for it.
If you are interested in exposing and managing this through nmstate, I suggest pushing it through the regular channels to raise its priority.
I hope we will implement a pluggable provider interface in nmstate in the near future (https://nmstate.atlassian.net/browse/NMSTATE-262), then one can develop and expose custom data using plugins (which can later be embedded into the formal support of nmstate).

> 2. When provision VFs for Mellanox card, does it support configuring NIC firmware to set the max number of total_numvfs?

The current implementation uses NetworkManager as the provider to change these. If NetworkManager supports it, nmstate and knmstate support it.
If it does not, a BZ needs to be opened against NM.
(In special cases, nmstate can extend missing parts in NM as well. We do so already.)

> Further to this, do you see Kubernetes-nmstate can be extended to do vendor specific config, such as changing link type (ethernet or infiniband) for mellanox card?

This seems to fit the discussions we had with OpenStack on 3rd party interfaces.
If the vendor HW fits an existing type, like ethernet, then nmstate can expose an additional subtree for specific capabilities of the interface.
As with my DPDK answer, I hope the plugin option will allow vendor-specific config to be exposed easily by writing a plugin, which can later be considered for full nmstate support.

Contributor

@zshi-redhat, Dec 30, 2019

As nmstate is a library, you can use it from whatever operator that fits your need.
IMO, implementing the "how" again in another code base is wasteful, but I may not see the whole picture.

Thanks for the advice! The reason I'm asking is that the SR-IOV Operator has implemented VF provisioning and the support went GA in 4.3, and I didn't see an equivalent in nmstate at this point. But I see nmstate as a project on which VF configuration/provisioning can converge in future OpenShift releases. Until we get there, there might be issues that we need to pay attention to for the co-existence of VF provisioning functions in both nmstate and the SR-IOV Operator. One example I can think of: if the user configures a different number of VFs through each Operator, both of which work declaratively, then the node enters an infinite loop of being cordoned (the SR-IOV Operator cordons the node before re-provisioning VFs).

If the declarative nature of the nmstate api fits your needs, I would recommend investing the effort there to gain the shared support advantage.

Btw, What is the use case that user would use NMstate to provision VFs?

For example, one can define a RED network on a host and assign it to a pool of VF/s that connect to the same network. Then, a VM comes up and requests to connect to the RED network and the hypervisor will provide a VF from the relevant pool.

Ack, I think we use a different way to define and manage the VF resource pools on Kubernetes. But that doesn't prevent us from using the same nmstate library for host-level VF provisioning and configuration.

Applications may also take advantage of VF/s for data path acceleration, but these scenarios are less common.

Can we list this as one user story in favor of supporting customer case?
For example, Manage/Configure host SR-IOV network device, this may include configuration of vlan, mtu on VF device on the host.

Contributor Author

Done @zshi-redhat can you please take another look?

Contributor

Hi @SchSeba, the change looks good to me.
Nit: manage -> managed. vlans, mtu drive <- do you mean driver here?

Contributor Author

you are right my bad thanks!

Contributor

@SchSeba @zshi-redhat can you modify the PR to include the VF-related use case? The discussion above is quite long and involves plenty of implementation and calendar concerns, so I ended up not understanding the motivation. Do you want to create the VF? Only set its vlan/mac/mtu/ip? Why do you want to do that rather than use the PF?


Contributor

@squeed, May 19, 2020

Meta-criticism: these aren't user stories. User stories look like this:

"As a (person), I'd like to do (high level goal)." Start with a goal, then show how your proposed solution fulfills this story. "With (feature x, y, and z), I can accomplish this."

It's important that user stories are written without the proposed solution in mind. Ideally they come "first" in the design process. Their role is to identify real needs, rather than extant features. That way you can be sure you're designing a solution for problems, not the other way around.

An example:

I am a cluster administrator with a dedicated storage network, because performance is critical for me. I need to make sure that all storage traffic uses the dedicated network, and I need to be able to schedule my Gluster pods on nodes with 10GB interfaces.
I would configure my routing tables with k8s-nmstate and (filling in the details here). The existing functionality doesn't work for me because (of all these reasons).

Make sense?


The format is just a guideline, not a fixed spec of what a user story is.
The content is what matters and it needs indeed to express the need/goal.

The needs below are focused on network-specific points, like "Be able to create a bond". It is not the intent here to explain why bonds are needed or how an admin should use them. We are at a level below that, where the need for these things should be obvious.

Member

I strongly agree with @squeed .

The most important argument this enhancement needs to make is a clear argument for why this capability is required on all OpenShift clusters by default. Nothing in this enhancement in its current form explains why this is needed universally in OpenShift rather than as an advanced optional add-on component delivered via OLM.

#### Bond creation

* Be able to create bond interfaces on OpenShift nodes.
* Create a vlan interface on top of the bond interface.

#### Assign IP address

* Assign static and/or dynamic IP addresses to interfaces
Contributor

Why do I want to do this?


You are asking why we want to control the IP on day 2?
Would something like this answer it? "It may be necessary to add an IPv6 address to an existing IPv4 one", or "Secondary interfaces may be defined as part of day 2 configuration".

* Assign IPv4 and/or IPv6 addresses

#### Create/Update/Remove network routes

* Be able to create, update, and remove network routes for different interfaces (e.g. bond, Ethernet, SR-IOV VF, and SR-IOV PF)
Contributor

What goal does this accomplish?


Control routes for custom needs.


#### Manage/Configure host SR-IOV network device

* Be able to change the configuration of host virtual functions that are not managed by the sriov-operator (e.g. vlans, mtu, driver)
Contributor

Why would I want to do this?


Currently the reason is a bit vague because there is a CNI plugin in the picture.
Being able to control SRIOV in general is partially a node level resource, so it makes sense to provide the means for controlling HW at this level. One application could be to run other services on the node (daemonsets?) that need special network access or in case one wants to use VF/s as regular secondary interfaces. Another option would be to control a NIC level setting that the CNI does not.

Maybe there are more interesting scenarios.


#### Rollback

* Be able to roll back network configuration
Contributor

This would be part of a meta-user-story that applies to all of them. Something like "I want to be able to roll out changes in a staged, safe manner without reboots. I am concerned that nodes will go unreachable."

Member

This is strongly related to the "integrate with MCO" path - if nmstate does things like roll out network configuration to multiple nodes at once that makes it wayyy too easy to take down a cluster. Plus even if it doesn't, and e.g the MCO is starting to drain+reboot one master node while nmstate applies a broken networking change to a different master that's also very problematic.

Member

kubernetes-nmstate has an automatic rollback. If it detects that the API is no longer reachable, it will roll back to a good state. This is already implemented.

Member

@bcrochet where can we go to learn more about this rollback? how do we know it will not conflict with mco itself?

Member

Automatic rollbacks get tricky. Have you thought about scenarios where e.g. a new network configuration lands, and then something else causes the API server to become temporarily unreachable, and then the problem becomes compounded by a rollback to a previous network configuration?

Contributor

@derekwaynecarr we haven't documented rollback yet. The u/s documentation usually lives at https://github.com/nmstate/kubernetes-nmstate; I will open a PR there to add the rollback information.

About the rollback details: kubernetes-nmstate applies the configuration with a timeout, so if knmstate dies, NetworkManager on the node will roll it back. It then runs a pair of "probes": one pings the external world (we do that by pinging the default gw) and the other checks that the API server works fine (by accessing a mandatory namespace). If these checks do not pass for some time, a rollback to the old config is done.

@cgwalters since something "else" has broken the API server, I suppose you are equally badly off with the old or the new network config, although with the old one you are at least back at the starting point. But I am not sure; this is going to be quite different depending on what was being changed in the network and what happened to the API server.

if we lose connectivity to the OpenShift API server after applying a policy.
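
The rollback behaviour discussed in the thread above (apply with a timeout, probe the default gateway and the API server, roll back if the probes keep failing) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the kubernetes-nmstate implementation: `apply_with_rollback`, its callables, and the timeout values are all hypothetical, and the real handler delegates the checkpoint and rollback to NetworkManager.

```python
import time
from typing import Callable, Iterable


def apply_with_rollback(
    apply: Callable[[], None],
    commit: Callable[[], None],
    rollback: Callable[[], None],
    probes: Iterable[Callable[[], bool]],
    timeout_s: float = 120.0,   # assumed value, not from the proposal
    interval_s: float = 5.0,    # assumed value, not from the proposal
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Apply a network change and keep it only if all probes pass in time.

    Returns True if the change was committed, False if it was rolled back.
    """
    probes = list(probes)
    apply()
    deadline = clock() + timeout_s
    while clock() < deadline:
        # Example probes: ping the default gateway, read a namespace
        # from the API server. Both must succeed to keep the change.
        if all(probe() for probe in probes):
            commit()
            return True
        sleep(interval_s)
    # Connectivity never recovered within the window: restore old config.
    rollback()
    return False
```

Injecting `clock` and `sleep` keeps the sketch testable without real waiting. It also makes the concern raised by the reviewers visible: an unrelated API server outage inside the window would trigger a rollback even though the new configuration was fine.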

### Implementation Details

The proposal introduces kubernetes-nmstate as Tech Preview.
Member

all tech preview features must have opt-in. this includes opt-in to installation of any operator that provides the capability. how is opt-in handled in this enhancement if from what i can gather it makes the capability a universal capability of all openshift installs. having a feature dev/tech preview get delivered via opt-in through OLM is much simpler.

Contributor

I think it is reasonable to do OLM so that it would be opt in.


## Design Details
Contributor

You need to fill this section out :)


### Test Plan
Contributor

And write some tests :-)


- Functional tests will be implemented

### Graduation Criteria

Initial support for kubernetes-nmstate will be Tech Preview

#### Tech Preview

- kubernetes-nmstate can be installed via container image
- Host network topology can be configured via CRDs

#### Tech Preview -> GA
Contributor

Questions you need to answer when adding a rich feature such as this:

  • How will CEE be trained?
  • How will this be documented?
  • How will it be monitored? What data should be provided to Telemetry?


### Upgrade / Downgrade Strategy

### Version Skew Strategy
Contributor

This needs to be filled out as well. How will the configuration change? What are the components affected here? What are the APIs between them? Are these versioned? This goes far beyond defining a CRD.

How tightly is this coupled to the version of NetworkManager? To the kernel? Will this work on RHEL7? Will users need to care? What if NetworkManager deprecates a configuration directive? Adds a new one?

How tightly is this coupled to the advanced CNI plugins? SR-IOV, OVN-Kubernetes, etc? Remember, component upgrades are not strictly ordered.

Member

There is a single strong dependency, and that is between kubernetes-nmstate and the NetworkManager running on the host. The communication happens through D-Bus. The NetworkManager API is backwards compatible, which makes kubernetes-nmstate forward compatible. When we release, we support a certain version of NetworkManager and everything after it. The nmstate release cycle (not kubernetes-nmstate) is tightly bound to the RHEL and NetworkManager versions, so the only thing we need to ensure on the kubernetes-nmstate side is that we don't push a version that would be newer than the NetworkManager currently available on RHCOS.

There should be no coupling with kernel version apart from that done by NetworkManager.

This will not work on RHEL7 since RHEL7 has old versions of NetworkManager. Support for RHEL7 is not on nmstate roadmap.

Users don't need to care. We just need to make sure to release only nmstate that is compatible with RHCOS/RHEL supported by the current version of OpenShift (with the exception of RHEL7). Our operator will not deploy kubernetes-nmstate on RHEL7 nodes.

In case there is a breaking change on NetworkManager side, we can tackle it in nmstate or kubernetes-nmstate with a workaround. However, since there is guaranteed backward compatibility, it would be a NetworkManager bug.

It is not coupled with any CNI. kubernetes-nmstate is just controlling configuration making sure to reach desired state. It won't intervene into any CNI configuration unless it is asked to.
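
The "a certain version of NetworkManager and everything after it" rule from the reply above amounts to a minimum-version check. A minimal sketch, assuming plain dotted numeric version strings; the function names and the example version numbers are hypothetical and not taken from nmstate:

```python
from itertools import zip_longest
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted numeric version such as '1.22.8' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def nm_is_supported(host_nm: str, minimum_nm: str) -> bool:
    """Return True if the host NetworkManager is at least the minimum version.

    Missing components are treated as zero, so '1.20' equals '1.20.0'.
    """
    pairs = zip_longest(
        parse_version(host_nm), parse_version(minimum_nm), fillvalue=0
    )
    for have, need in pairs:
        if have != need:
            return have > need
    return True
```

A release gate could call `nm_is_supported(version_on_rhcos, version_required_by_nmstate)` before shipping a new handler image. How those two versions would be discovered is outside the scope of this sketch.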


kubernetes-nmstate runs as a DaemonSet.

## Implementation History

### Version 4.4

Tech Preview

## Infrastructure Needed
Contributor

What testing infrastructure will you need? Special hardware? Special kernels? Will this have CI coverage, or will it have to be entirely manual?


This requires a GitHub repo to be created under the openshift org to hold a clone of kubernetes-nmstate.