
Baremetal IPI Network Configuration for Day-1 #817

Merged Oct 27, 2021 (4 commits)

Conversation

cybertron (Member)

Describe user-facing API for day-1 network customizations in the IPI workflow,
with particular focus on baremetal where such configuration is a common
requirement.

Co-authored-by: Steve Hardy shardy@redhat.com

@openshift-ci openshift-ci bot requested review from squeed and trozet June 28, 2021 21:57
#### API Open Questions

* How do we map an individual machine's keyfiles to the machine? Hostname was previously suggested because it was already an available label in NNCP, but if we're not using NNCP then there might be a better way.

Contributor

I think the current plan is to initially enable this interface only for the baremetal platform, so there may be concerns about adding a new top-level install-config section.

In that case, we could instead add the config to the platform-specific hosts list (which already contains other host-specific data needed to automate deployment), and this also solves the mapping to a host (since we're already consuming this per-host data in the baremetal flow, e.g BMC credentials)

@kirankt (Jun 29, 2021)

I agree with Steve. This should belong in the hosts section of the baremetal platform. e.g.

```yaml
platform:
  baremetal:
    hosts:
      - name: openshift-master-0
        networkData:
          eth0:
            [connection]
            id=eth0
            uuid=18e0cec7-041c-4fb3-957c-a60c80dd9b85
            type=ethernet
            etc...
```

Please note networkData is a placeholder name I'm using as an example.

Member Author

Okay, this sounds reasonable. I'd be more concerned about using a baremetal-specific structure if we had the common nmstate API, but since this implementation isn't going to be useful to other groups like OVNK anyway I'm good with it.

@hardys (Contributor, Jun 30, 2021)

I guess that raises the question (already mentioned in the alternatives) of whether we want this to be nmstate format or just inlined keyfiles? I guess we need the interface to enable future nmstate support even if we start with keyfiles.

Also for those familiar with OpenStack networkData implies a different format - maybe we should call this something else e.g networkConfig or networkFiles (then adding e.g nmstateConfig later would be simple)?

Member Author

If we use nmstate yaml for install-config then we need nmstate available on the provisioning host so we can convert it to keyfiles. I'm fine with doing that (it's how we'd have to process NNCP records too), but it does add a dependency. I'm not sure how controversial that will be. It's also a bit more complex implementation-wise than straight keyfiles.

One option would be to replace the interface name with the filename, i.e. eth0 -> eth0.nmconnection. That's going to have to happen at some point anyway, and then we could potentially add support for eth0.yaml, which would indicate nmstate data. But maybe that's a little too magic?

+1 to not overloading networkData.
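
For illustration only, the filename-keyed form being floated here might look roughly like the sketch below. The field names are placeholders (the `networkConfig` name is not settled at this point in the thread), and the two entries simply show both suffixes side by side; a real host would use one or the other.

```yaml
platform:
  baremetal:
    hosts:
      - name: openshift-master-0
        networkConfig:                 # placeholder name, per the discussion above
          eth0.nmconnection: |         # .nmconnection suffix -> NetworkManager keyfile
            [connection]
            id=eth0
            type=ethernet
            interface-name=eth0
          eth0.yaml: |                 # .yaml suffix could signal nmstate data instead
            interfaces:
              - name: eth0
                type: ethernet
                state: up
```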

Contributor

Or to say it more simply another way: today openshift-install generates stub/pointer configs - I'm saying the user can provide a config which is included by that one, not that it replaces the generated pointer config.

@cgwalters this doesn't work for controlplane network configuration as previously discussed via coreos/ignition#979 - there is a chicken/egg problem e.g networking to be up before it can retrieve the merged config from the MCS.

So instead we're adopting the same process as UPI here, generate an ignition config for the coreos installer live-iso which then contains the network configs for the live-iso and coreos-installer --copy-network - the issue is there's no corresponding openshift-install interface that allows the user to provide network configs

Member

@cgwalters this doesn't work for controlplane network configuration as previously discussed via coreos/ignition#979 - there is a chicken/egg problem e.g networking to be up before it can retrieve the merged config from the MCS.

I'm talking about injecting into the pointer configuration, not the MCS config. But I think that's not relevant because:

generate an ignition config for the coreos installer live-iso which then contains the network configs for the live-iso and coreos-installer --copy-network

Right, OK! So I think a similar case applies though - this "inject arbitrary network configuration into live ISO" really generalizes into "inject arbitrary Ignition into live ISO". I know some people in the past have disagreed with exposing this "arbitrary Ignition config" functionality, but in practice we already expose completely arbitrary things via MachineConfig, so having this generic functionality to me does not cost much in terms of support. I think some prior objections were rooted in us not having an opinionated high level sugar for Ignition, which is partially being addressed by shipping butane.

Now, we don't need to gate this enhancement on having an official way to inject extra ignition into the installer's generated configs. I'm more arguing that that's the way to think of it - this is just "syntactic sugar" for an ignition config which writes the provided data to /etc/NetworkManager.
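
As a rough illustration of that "syntactic sugar" framing, a minimal Butane sketch (assuming the `fcos` variant; the interface name and contents are placeholders, not anything proposed here) that writes a single keyfile under /etc/NetworkManager could look like:

```yaml
# Hypothetical sketch: have Ignition lay down one NetworkManager keyfile.
variant: fcos
version: 1.4.0
storage:
  files:
    - path: /etc/NetworkManager/system-connections/eth0.nmconnection
      mode: 0600                # NM requires keyfiles to be root-readable only
      contents:
        inline: |
          [connection]
          id=eth0
          type=ethernet
          interface-name=eth0

          [ipv4]
          method=auto
```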

Contributor

I'm talking about injecting into the pointer configuration

As I said, this doesn't work, because Ignition tries to resolve the merge directive before applying any other config in the pointer ignition, so you can't use pointer customizations to influence controlplane network config

UPI works around this limitation by allowing users to embed network configuration into the live-iso, then apply that to the deployed OS via coreos-installer --copy-network (so ignition is not involved other than as a means to drop the files into the live-iso) - we're doing the same for baremetal IPI.

this "inject arbitrary network configuration into live ISO" really generalizes into "inject arbitrary Ignition into live ISO"

This is true, but we're trying to reduce the friction around a really common baremetal case here, not expose a way to arbitrarily customize the live-iso process.

Use of the live-iso here is arguably an implementation detail, particularly given that no other IPI platforms currently use it, so exposing a generic cross-platform installer interface would not be possible?

Member

(so ignition is not involved other than as a means to drop the files into the live-iso)

I think it's misleading to say it that way. Ignition is the mechanism to configure the Live ISO - including networking for the Live ISO. The whole design is intended to enable exactly this.

The way I'd say it instead is that configuring the Live ISO via Ignition allows handling effectively arbitrary network/install requirements.

not expose a way to arbitrarily customize the live-iso process.

We already do expose this to be clear; coreos-installer iso ignition embed is stable public API. I'm just arguing to make it more ergonomic to use for baremetal IPI.

Use of the live-iso here is arguably an implementation detail, particularly given that no other IPI platforms currently use it, so exposing a generic cross-platform installer interface would not be possible?

Well, I think it would boil down to having a "configure the pointer config" hook, and a "configure the live ISO config", where the latter would only apply on baremetal IPI.

Anyways...hmm, I guess my bottom line is I am not opposed to the proposal of a networkData type stanza in the install config only for baremetal. I'm just saying we should think of that as sugar for a generalized liveIgnitionConfig that we may want in the future.

Contributor

I'm just saying we should think of that as sugar for a generalized liveIgnitionConfig that we may want in the future.

Ack, thanks @cgwalters - I think we're basically in agreement - I don't think anything here precludes potentially adding a more generalized live-iso-ignition "escape hatch" in future.


#### Deployment of the controlplane hosts via terraform

TODO - we'll have to match the keyfiles to the master Machines, then generate image/images for deployment via terraform/ironic
Contributor

I think we can do this by consuming the install-config data, then generating some files to be consumed on the bootstrap VM by the ironic startup script - this can then generate the images for each controlplane host prior to starting any ironic services (terraform waits for ironic-api to start before attempting to deploy)

This raises the question of whether we can detect when the configuration is common to all hosts - it'll be simpler to just generate 3 images (with predictable filenames so we can pass the necessary URL to terraform/Ironic), but obviously that wastes some disk space in the bond+vlan case (where the same config will be applied to all 3 master hosts).

@kirankt I think you've been looking at this area, any thoughts?


Right now we consume one rhcos image for all control plane nodes from the rhcos-stream data. I don't think it will be necessary to send separate copies of the images for each control plane node. All we have to do is send separate network data to the ironic agent and the agent should perform the copy network action independently.

https://github.com/dtantsur/ironic-agent-image/blob/08c3803db844e5bd50d21017c7f0c81158dc6a1a/hardware_manager/ironic_coreos_install.py#L80

Although, I'm not sure why we're using metadata here rather than just passing networkdata directly. Perhaps @dtantsur can clarify.

Member Author

Sounds like a reasonable optimization, but if we don't get it for the initial implementation it will still work so I'd consider this a stretch goal if we manage to finish this feature early. :-)

Contributor

Ok, sounds like we may have a few possible optimizations here - maybe we could add them to the PR as possible future work so we don't lose track?


* How do we map an individual machine's keyfiles to the machine? Hostname was previously suggested because it was already an available label in NNCP, but if we're not using NNCP then there might be a better way.

* If we're using RHCOS functionality, do we still need the custom image build discussed below? Or can we just pass the keyfiles as part of the deployment process?
Member

We need it for the IPA image, where there's no other way to pass it (we can't rely on the network working without the network config).

### Goals

* Define API for day-1 network customizations
* Enable common on-premise network configurations (bond+vlan, static ips) via IPI
Member

Perhaps we should mention whether it's a goal to provide different bond/vlan config per host (not just static IPs, which obviously have to be per-host).

Contributor

Yeah I guess bond configuration could differ between hosts or groups of hosts (different nic names etc) - I think the API described here is basically always config per-host, but we can then perhaps optimize by detecting common config (to avoid duplicate images).

I also wonder if we have per-host config in the install-config.yaml if we can make yaml anchors/aliases work so that e.g in the case where you have a single bond/vlan config for every host you don't have to copy/paste it for each host.
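
If the installer's YAML parser preserves anchors and aliases, the shared-config case described here could in principle look something like the sketch below. This is illustrative only; whether install-config validation would accept anchors, and the `networkConfig` name itself, are assumptions at this point in the thread.

```yaml
platform:
  baremetal:
    hosts:
      - name: openshift-master-0
        networkConfig: &commonNet      # define the shared nmstate snippet once
          interfaces:
            - name: eth0
              type: ethernet
              state: up
              ipv4:
                enabled: true
                dhcp: true
      - name: openshift-master-1
        networkConfig: *commonNet      # reuse it via a YAML alias
      - name: openshift-master-2
        networkConfig: *commonNet
```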

Member Author

Yeah, this kind of relates to the de-duplication conversation below. I'll add something about that as a stretch goal.


We could expose an interface to the user which accepts raw nmstate data (not a CR wrapper). This would mean a common DSL to the NodeNetworkConfigurationPolicy resource, but not directly reusing that API, so there is risk of drift between the two APIs.

The user would have to pass the nmstate data via the install-config, and also as a Secret for use on scale-up, then also define a NNCP resource if day-2 management of the configuration is required (which could be risky if the configuration isn't exactly the same...)
Member

Couldn't the installer set up all of these?

Or the CAPBM could create the Secret to use for day-1 automatically from a NNCP resource that is to be used for day-2.

@hardys (Contributor, Jun 30, 2021)

The problem (ref #747) is that the NNCP CRD provided via kubernetes-nmstate doesn't exist at the point of install - it's currently only applied if/when you install the operator day-2 from OLM.

We could look at making CAPBM look for such a resource in future though, then fall back to using secrets (which might resolve some of the concerns around day1/day2 interface overlap, if we can make it clear to users which API they should interact with)

Member

Ah, I was missing that context.
I wonder if the objections to #747 could be overcome by saying we'll enable the operator at install time whenever the user specifically provides network configuration in the installer, rather than always on particular platforms.

Contributor

@zaneb perhaps - the impression I got from discussion with @dhellmann and @derekwaynecarr is that we don't want knmstate in the release payload at all, because it's not relevant to all platforms (although arguably it could be e.g to support secondary interfaces for OCS) (there are also some unresolved questions around support status and ownership to deal with).

But if we're going to block core-platform functions like network configuration from the release payload long term, it probably does mean we need some install-time way to enable optional dependencies from OLM or other sources.

Contributor

@hardys I would argue that we need network configuration on all platforms. OVN requires OVS network configuration on every single platform, which today is done by the MCO ovs-configuration bash script (creating network manager config) and would be much better served by nmstate or something more robust.

Contributor

@trozet I completely agree, but since we've not managed to achieve consensus on how to do that (ref #747) the scope of this PR is limited to using the interfaces which are currently supported (e.g keyfiles).


We could expose an interface to the user which accepts raw nmstate data (not a CR wrapper). This would mean a common DSL to the NodeNetworkConfigurationPolicy resource, but not directly reusing that API, so there is risk of drift between the two APIs.

The user would have to pass the nmstate data via the install-config, and also as a Secret for use on scale-up, then also define a NNCP resource if day-2 management of the configuration is required (which could be risky if the configuration isn't exactly the same...)
Contributor

I think that this problem, of having 3 different copies of the same thing at

  • install-config
  • secret
  • day-2 NNCP

exists also in the proposed solution; it's not a special downside of this alternative.

Contributor

The install-config data doesn't exist after installation, but yes I agree - we'll potentially have two copies of network configuration which isn't ideal.

Until we get a fully supported kubernetes-nmstate and an API derived from that included by default in OCP, I don't see how this can be avoided though. We had a previous draft proposal which aimed to use NNCP for both day-1 and day-2 - but unless that API is available on day-1 (which seems unlikely given #747 is stalled) that doesn't seem to be a viable option.

Contributor

We agree about this. But it is wrong to attach this to this alternative; this text should move up.

Member Author

Yeah, this entry was largely referring to a design we've abandoned for now. I'll update it to compare/contrast with the current proposal. I think we've already discussed the drawbacks in terms of duplicate configuration in the Risks and Mitigations section.


### Provide raw nmstate data

We could expose an interface to the user which accepts raw nmstate data (not a CR wrapper). This would mean a common DSL to the NodeNetworkConfigurationPolicy resource, but not directly reusing that API, so there is risk of drift between the two APIs.
Contributor

@tyll & @cathay4t may correct me, but I think that NetworkManager keyfiles are similarly susceptible to drift. For example, I'm not sure that the NM team would be happy to learn that secrets created for RHEL-8 NM are being used with RHEL-9 workers. NMState exists in order to provide a stable, human-readable representation of node network state, and in that it is more committed to backward and forward compatibility.

Member Author

This is very interesting. One of the objections to enabling NMState was that keyfiles provide the same functionality and the same backward compatibility. If that's not true then it eliminates one of the major arguments against NMState.

Contributor

Perhaps @cgwalters may have some comments here - we did discuss this recently and he mentioned that the keyfile interface supported for UPI has proven stable, but I suspect if/when any major-rhel-version rebase happens for RHCOS we may be faced with some larger incompatibilities.

I'm not clear what our strategy would be if/when such a major-version bump happens, would we anticipate a path for in-place upgrades or is it more likely to be a reinstall?


bias: I am developer of nmstate.

  • The keyfile format is not standardized like yaml/json. The only specification-like document I found is http://www.gtkbook.com/gtkbook/keyfile.html . Using keyfiles directly means your project needs to share the same understanding of the keyfile spec as NetworkManager.
  • Besides creating NM connection files, nmstate has done a lot of workarounds for complex network layouts: activation order, link dependency, preserving links in the up state on a best-effort basis, rollback on verification failure. Eventually, you will find yourself facing the old problems nmstate had.
  • IMHO, the keyfile format of NetworkManager has never been treated as an API.


@thom311 As NetworkManager developer, any suggestions?

Contributor

@zaneb ++

Member

@phoracek I don't think Python was ever the objection, but I'll keep that in mind as we continue the discussions.

@dankenigsberg I'm not sure whether adding a dependency to the installer is necessarily trivial

Just to add some context on the constraints here, the installer is written in golang and it is delivered as a single binary with no dependencies. So anything that requires a Python, or even Rust, program at install time is a non-starter.

@dankenigsberg (Contributor, Jul 26, 2021)

the installer is written in golang and it is delivered as a single binary with no dependencies.

Thanks @zaneb, I did not know. This constraint defies the ancient Unix principle of multiple applications interacting with each other over well-defined interfaces. It is sad that we need to re-learn it.

Member

@dankenigsberg well, note that the installer isn't delivered as an RPM, nor is it restricted to platforms we control, nor even to Unix-like platforms. If potential customers wanting to try out OpenShift in their cloud account are spending their time wrestling with getting the right version of Python on their Windows laptop then we are losing a good proportion of them as customers.
That said, I didn't make the rules :)

Member

I wrote up this change and proposed it as a pull request to this pull request (cybertron#1).

@cybertron (Member Author) left a comment

I'll push my updated local copy of the doc later when we've had a chance to wrap up some of the existing conversations here.


@cybertron (Member Author)

Okay, I've pushed an updated version of this based on the comments so far. The two remaining points I'm aware of that need to be addressed are:

  • How do we encode the user-provided config into a secret? Base64 encode in the format discussed in the user API section? Maybe as JSON instead of YAML?
  • Any more details about the master deployment process? I gave a (very) high level overview, but someone with more experience in that process can probably do better.

@cybertron force-pushed the day-one-network branch 3 times, most recently from 0dc3d3e to 004d6a2 on July 7, 2021 21:05

However, we have been unable to get agreement on a method to deliver the operator on day-1, which makes this problematic.
It's possible we could implement some way to pull just the NNCP CRD from the operator payload and install it on day-1, but at this point we likely don't have time to do that and it's not clear we even want to if the operator can't be installed on day-1 too.
This would essentially orphan the CRs after the initial deployment completes.
Contributor

What we're doing in the PAO is having its render mode create the CR needed to have OLM install the operator when the cluster comes up. Does that not work in this case?

Member Author

Because a lot of baremetal deployments are done disconnected, we can't rely on the operator being available to OLM. In many environments the deployer will need to take additional steps to mirror the content into their environment before it can be deployed. In addition, because of the sheer amount of content that needs to be mirrored for OLM there may be deployers who don't want to do that at all. I know there are discussions happening that may improve the mirroring situation, but nothing that will happen soon enough to help us here.

Also, part of the lack of agreement on NMState for day-1 was whether we should use it at all. I hope we can eventually get past that since it's already in use for day-2, but we just ran out of time to have that discussion if we wanted to deliver this in 4.9.

Contributor

OK, it would be good to capture that level of detail here in this document.

I'm not sure I buy in to the "mirroring is hard" argument. We're certainly expecting them to mirror the PAO to get those features if they want it. There's also work going on to make the process for pruning the catalog easier.

Contributor

I think the main point is that OpenShift today lacks any API for day-1 host network configuration (other than the keyfiles passthrough provided by coreos-installer).

IMO that is a core function of the platform, and it's not ideal to ask users to enable optional add-on components to achieve it, but since there isn't consensus on that (ref #747) we've decided to go with the interface that does exist in the meantime e.g keyfiles - that doesn't preclude adding some higher-level API in future, but it also means we don't have to block on it.

Member Author

In the interest of not reiterating the entire discussion in #747 here, can we just link to that PR, explain that we did not have time to resolve it, and note that we want to revisit it in the future once we have the base keyfile functionality implemented?

Member Author

Okay, I've done what I said in my previous comment. Let me know what you think.

@cybertron (Member Author)

Added the missing sections and tried to address the comments so far.

BareMetalHost resources for workers will be created with the Secret containing the network data referenced in the `preprovisioningNetworkData` field defined in the Metal³ [image builder integration design](https://github.com/metal3-io/metal3-docs/blob/master/design/baremetal-operator/image-builder-integration.md#custom-agent-image-controller).
This will cause the baremetal-operator to create a PreprovisioningImage CRD and wait for it to become available before booting the IPA image.

An OpenShift-specific PreprovisioningImage controller will use the provided network data to build a CoreOS IPA image with the correct network data in place, and the baremetal-operator will then use this to boot the Host.
Member

nit: s/network data/ignition configuration/ (for clarity)

Member Author

Ack, done.


### Upgrade / Downgrade Strategy

There should be little impact on upgrades and downgrades. Nodes are deployed with network configuration baked into the image, which means it will remain over upgrades or downgrades. NetworkManager keyfiles are considered a stable interface so any version of NetworkManager should be able to parse them equally.
Member

Should we talk about the whole CoreOS IPA business here? It does have upgrade impact.

Member

I think that needs another enhancement proposal.

Member

Proposed as #879.

#### API Open Questions

* How do we map an individual machine's keyfiles to the machine? Hostname was previously suggested because it was already an available label in NNCP, but if we're not using NNCP then there might be a better way.


We can probably simplify the config a bit and also use the suggestion to include file names like so:

```yaml
platform:
  baremetal:
    hosts:
      - name: openshift-master-0
        networkConfig:
        - eth0.nmconnection: |
            [connection]
            id=eth0
            uuid=18e0cec7-041c-4fb3-957c-a60c80dd9b85
            type=ethernet
            etc...
```


Currently in the IPI flow, there is no way to provide day-1 network configuration
which is a common requirement, particularly for baremetal users. We can build
on the [UPI static networking enhancements](rhcos/static-networking-enhancements.md)

Member Author

Oops, I missed updating this. Will fix shortly.

#### API Open Questions

* How do we map an individual machine's keyfiles to the machine? Hostname was previously suggested because it was already an available label in NNCP, but if we're not using NNCP then there might be a better way.


I also wanted to clarify which format we're all leaning towards: NM keyfiles or nmstate config. Either way, I hope we're just consuming a blob of network config data. I was just looking at the nmstate schema and it's pretty extensive. I don't think we should include it in the install-config, at least in this proposal.
https://github.com/nmstate/nmstate/blob/base/libnmstate/schema.py

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
Contributor

We could probably mark the status as implementable and check both these I think, given the pretty thorough discussion so far?

Member Author

Done

@hardys (Contributor) commented Oct 12, 2021

/lgtm

Not approving since I contributed to an earlier draft of this, but overall this lgtm and I think outlines an actionable plan which won't impact any platforms outside IPI baremetal (long term OpenShift should gain a platform-agnostic API for network config IMO, but that's outside the scope of this iteration)

@trozet (Contributor) left a comment

overall looks good. Some comments inline. @fedepaol can you PTAL?

In install-config this would look like:
```yaml
platform:
  baremetal:
```
Contributor

should we put it under baremetal if there might be plans in the future to use it on other platforms? I'm thinking here about replacing ovs-configuration with this.

Member Author

The problem with doing anything cross-platform for this release is that we need image customization functionality that is currently only provided by baremetal IPI. As I understand it there is a similar feature in development that will work on all platforms, but it wasn't going to be available soon enough to satisfy our baremetal use cases. I suppose we could put the configuration somewhere more generic, but since it won't work on anything but baremetal right now that could be confusing.

Contributor

Per our conversation it seems like we could use this for baremetal to replace configure-ovs, and configure multiple NICs to be on OVS bridges. configure-ovs can still exist (we need it anyway for the other platforms), and today it just skips doing anything if br-ex is already up and configured. However, this will allow us to do some more complex types of network deployments on baremetal, so it seems like a solid first step to me.


## Drawbacks

Using NetworkManager keyfiles as the user interface for network configuration makes day-2 configuration more problematic. The only existing mechanism for modifying keyfiles after initial deployment is machine-configs, which will trigger a reboot of all affected nodes when changes are made. It also provides no validation or rollback functionality.
Contributor

If the keyfiles are laid down as machine-configs, and we modify them in ovs-configuration, won't that be an issue?

Member Author

Yeah, that's a good point. Although I suppose every time MCO rewrites them the script would run again to redo the changes? I know we have deployers doing this today and it works, but I don't know for sure that they're using OVNK.

```yaml
baremetal:
  hosts:
    - name: openshift-master-0
      networkConfig:
```
Contributor

this will support all types of nmstate configuration right?

Member Author

I think so? Under the covers we'll be using the nmstatectl cli to convert this to nmconnection files that will be baked into the image. I'm not aware of any nmstate configurations that couldn't be handled that way, but that doesn't mean there aren't any.

```yaml
openshift-master-0:
  interfaces:
    - name: eth0
```
Contributor

so this works if you know the NIC name. What about doing something like TripleO did with introspection... identifying NICs dynamically and mapping them to networks. I get this might not be applicable to the scope of this enhancement, just thinking longer term how that would work with this API, or if that would need a new API?

An example of this is I might have a scenario where I want my kapi traffic to go over the ctlplane network, but I want to define another network for all my default egress OVN traffic.

Member Author

I think that would need a higher level API. We're using raw nmstate configuration data here, and it's not going to know anything about OCP networks. A higher level api might be something to consider for the cross-platform implementation of this?

The one drawback is that we were trying to stay as close as possible to the kubernetes-nmstate CRD so if someday we have kubernetes-nmstate available on day 1 we could use that CRD and avoid the custom config. A higher level interface could just intelligently populate the CRD though, so those might not be mutually exclusive goals.

Member

This config has to be applied before introspection, because introspection happens... over the network.

In the long term I think we should just think of this as a way to configure enough of the network to be able to communicate with ironic. In the short term people will inevitably use it to put their whole network config in because we're not providing them with another way of setting up the network config for the running cluster nodes at install time, but we shouldn't focus too much on that.

This is a good point that the end solution for cluster networking configuration may not look like just sticking this same nmstate file into k8s-nmstate, but may involve some other new operator that takes in higher-level data about mapping NICs to networks, inspection data from hosts, and the initial nmstate, then combines them to produce the network config for the cluster node that it then passes to k8s-nmstate.

Perhaps this also answers your other question about whether this field should be baremetal-specific: yes, because no other platforms have this pre-boot configuration problem.

Member

Just to mention, I know that the nmstate folks (@qinqon in particular) are working on a dynamic way to generate nmstate configurations depending on the current network layout (i.e. defining the bridge on top of the default gateway, instead of sticking the interface name).

Not sure of the state, because we discussed it some time ago, but I think that would solve this issue.

@qinqon (Contributor, Oct 13, 2021)

Hi, yes we are working on a tool named nmpolicy that will solve something at least similar to that.

We have a design document with examples; here's the example you are interested in:

```yaml
capture:
  default-gw: routes.running.destination=="0.0.0.0/0"
  base-iface-routes: routes.running.next-hop-interface==capture.default-gw.routes.running[0].next-hop-interface
  base-iface:  interfaces.name==capture.default-gw.routes.running[0].next-hop-interface
  bridge-routes: capture.base-iface-routes | routes.running.next-hop-interface:="br1"
  delete-base-iface-routes: capture.base-iface-route | routes.running.state:="absent"
  bridge-routes-takeover: capture.delete-base-iface-routes.routes.running + capture.bridge-routes.routes.running
desiredState:
  interfaces:
    - name: br1
      description: Linux bridge with base interface as a port
      type: linux-bridge
      state: up
      ipv4: {{ capture.base-iface.interfaces[0].ipv4 }}
      bridge:
        options:
          stp:
            enabled: false
        port:
        - name: {{ capture.base-iface.interfaces[0].name }}
  routes:
    config: {{ capture.bridge-routes-takeover.running }}
```

With DHCP activated it's simpler.

Member

I don't think that will help; nmstatectl cannot run on the host, it must run on the cluster and produce keyfiles that are incorporated into the ISO before it even boots on the host. We don't currently have a way of building e.g. a container image into the ISO, so everything but the Ignition (which contains the keyfiles) must come over the network and therefore cannot be required to set up the network.

Contributor

The nmstate team is reimplementing nmstatectl so it's just a binary without Python dependencies; will that help with that? They are also going to implement a flag to configure directly in the kernel with netlink, bypassing NetworkManager.

@zaneb (Member, Oct 14, 2021)

Unless it ships installed by default in the CoreOS live ISO, no.

@romfreiman (Oct 18, 2021)

Perhaps this also answers your other question about whether this field should be baremetal-specific: yes, because no other platforms have this pre-boot configuration problem.

@zaneb also UPI can gain from having static ips, right?
https://docs.openshift.com/container-platform/4.8/installing/installing_bare_metal/installing-bare-metal.html#installation-user-infra-machines-iso_installing-bare-metal

This is basically what is already happening in the assisted installer.

Contributor

@romfreiman UPI can already configure static ips, just not via nmstate (or automated image building), if we want UPI to support nmstate syntax in future that should be covered via another proposal, IMO it's outside of the scope of this enhancement.


The data provided by the user will need to have the following structure:
```yaml
<hostname>: <nmstate configuration>
```
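
A hedged illustration of that per-host structure, using placeholder interface names and documentation addresses; the exact nmstate fields a deployment needs will differ:

```yaml
openshift-master-0:
  interfaces:
    - name: eth0
      type: ethernet
      state: up
      ipv4:
        enabled: true
        dhcp: false
        address:
          - ip: 192.0.2.10
            prefix-length: 24
  routes:
    config:
      - destination: 0.0.0.0/0
        next-hop-address: 192.0.2.1
        next-hop-interface: eth0
  dns-resolver:
    config:
      server:
        - 192.0.2.1
```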
Contributor

I know you mentioned at first it will be per host, but I think that will be really annoying for some users. Assume in my scenario all my hosts are homogeneous and I have 120 of them. That's a giant config. Could we just support applying the same config to all nodes at least initially as well?

Member Author

That sounds reasonable. I'll have to defer to the hardware folks who are doing the implementation, but I'd be in favor of something like that. Maybe we could even provide a default configuration that would automatically be applied to all nodes, unless a given node had a specific configuration provided?
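
Purely as a sketch of the default-plus-override idea being floated here (neither `defaultNetworkConfig` nor the exact placement is part of the proposal as written):

```yaml
platform:
  baremetal:
    defaultNetworkConfig:          # hypothetical cluster-wide default
      interfaces:
        - name: eth0
          type: ethernet
          state: up
          ipv4:
            enabled: true
            dhcp: true
    hosts:
      - name: openshift-worker-0   # would inherit the default above
      - name: openshift-worker-1
        networkConfig:             # an explicit per-host config would win
          interfaces:
            - name: eth1
              type: ethernet
              state: up
              ipv4:
                enabled: true
                dhcp: true
```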

Member

For the implementation we won't care either way. It's just a question of whether we can give it an interface that makes sense.

Member

+1 for default config / templated configs and some sort of node selector mechanism

@fedepaol (Member)

overall looks good. Some comments inline. @fedepaol can you PTAL?

Left a couple of comments, looks good!

@romfreiman

Did you take into account the existing implementation of the assisted installer and ZTP? Do those collide, or can they work together?

@hardys (Contributor) commented Oct 18, 2021

Did you take into account the existing implementation of the assisted installer and ZTP? Do those collide, or can they work together?

The ZTP flow currently uses a different flow from IPI (live-iso boot), so it won't be impacted by these changes. It would be good in future to converge the flows, and we have an epic tracking that, but it's outside the scope of this iteration.

Also note the existing implementation has been considered and is being reused in this cluster-integrated flow.

@romfreiman

@hardys I'm mostly asking about the nmstate. What will happen if networkConfig is set in the install-config as proposed here, but the ZTP flow is used? Can they work together, or will the networkConfig be ignored?

@hardys (Contributor) commented Oct 18, 2021

@hardys I'm mostly asking about the nmstate. What will happen if networkConfig is set in the install-config as proposed here, but the ZTP flow is used? Can they work together, or will the networkConfig be ignored?

@romfreiman until/unless we converge the flows the networkConfig (and new install-config interface) won't be used by the ZTP flow - the ZTP flow builds an ISO with the network configuration embedded already - that process should continue to work regardless of this new flow, but hopefully we can look at ways to converge in future.

means of providing the configuration to the OS is required.

In the UPI flow this is achieved by consuming user-provided NetworkManager
keyfiles, as an input to `coreos-install --copy-network`, but there is


As a reference, assisted installer also uses this mechanism


### User-facing API

RHCOS already provides a mechanism to specify NetworkManager keyfiles during deployment of a new node. We need to expose that functionality during the IPI install process, but preferably using [nmstate](https://nmstate.io) files as the interface for a more user-friendly experience. There are a couple of options on how to do that:


I'd really recommend using nmstate files. It's simpler, and easy-enough to integrate. As a bonus point, it will make the experience between UPI, IPI, and assisted more consistent. Furthermore, this would make the dream of stronger integration between assisted and IPI more realistic.

Comment on lines +263 to +279
We probably want to avoid a proliferation of different CR wrappers for nmstate
data, but one option would be to convert that (or something similar) into a common
OpenShift API, could such an API be a superset of NNCP e.g also used for day-2?


I mentioned this in an earlier comment. There are some advantages to this wrapper, whether it uses its own CR or just reads a secret and runs validations/actions on the data.

I'd love for assisted and IPI to have a common interface here, it would help with the integration of both projects and it will make it easier for users as well.

data, but one option would be to convert that (or something similar) into a common
OpenShift API, could such an API be a superset of NNCP e.g also used for day-2?

This would mean we could at least use a common configuration format based on nmstate with minimal changes (or none if we make it _the_ way OpenShift users interact with nmstate), but unless the new API replaces NNCP there is still the risk of configuration drift between the day-1 and day-2 APIs. And we still need a Secret to generate the PreprovisioningImage.


but unless the new API replaces NNCP there is still the risk of configuration drift between the day-1 and day-2

Agreed, maybe this is a good opportunity to bump this conversation? Try to find some alignment here?

Contributor

@flaper87 This PR has been up for nearly 4 months, and designing a new API for network config is explicitly stated as a non-goal, so I'd suggest we defer that discussion ;)

Context is - we wanted to use NNCP, but because kubernetes-nmstate (and thus that CRD) isn't deployed via the release payload, we can't consume that API at install-time - discussion on that stalled ref #747

The same problem exists for the new AI nmstate CRD, it's only available via AI or when using the CIM layered-product stack.

IMO we definitely need to provide a core-OCP API for network configuration, but given the historical lack of progress on that discussion, we decided instead to focus on the existing lower level interfaces instead (e.g Secrets), any future CRD API can obviously be layered on top if/when we manage to achieve alignment :)

Member Author

I'd rather not try to boil the ocean and solve all the networking problems at once, particularly because, as noted in this section, "...we still need a Secret to generate the PreprovisioningImage." Any higher level general purpose networking API is going to have to build on this functionality anyway, not replace it. This is step one in a longer journey that has been delayed for years already.


This PR has been up for nearly 4 months, and designing a new API for network config is explicitly stated as a non-goal, so I'd suggest we defer that discussion ;)

FWIW, I was not proposing to design a new config. That said, I didn't realize this had been up for so long, 😅

#### API Open Questions

* How do we map an individual machine's keyfiles to the machine? Hostname was previously suggested because it was already an available label in NNCP, but if we're not using NNCP then there might be a better way.

Contributor

Let's be careful here about what the long-term plans are. Once the fields are put in the baremetal platform section, they must stay there, even if we add implementations for other platforms. So if we truly plan to offer this feature for other platforms, we should give serious consideration to whether or not this is the appropriate place for these fields. Can we see an alternative configuration that is not platform dependent?

Also, I am concerned about "consuming a blob of network config data" in the install-config.yaml. The install config should provide concise configuration options, that are well-defined and well-tested. It should not be a way to support any and all configurations. Can we support specific fields and give the user the ability to modify generated manifests with more customizations as needed?

* A new section in install-config.
* A secret that contains base64-encoded content for the keyfiles.

These are not mutually exclusive. If we implement the install-config option, we will still need to persist the configuration in a secret so it can be used for day-2. In fact, the initial implementation should likely be based on secrets, and then install-config integration can be added on top of that to improve the user experience. This way the install-config work won't block implementation of the feature.
Contributor

I would prefer to keep the install-config integration as a separate enhancement. I have serious concerns over the appropriateness of embedding nmstate files into the install config. I would like to start with secrets in manifests that the user can modify. I do not want to derail the viability of the core of this enhancement with deliberation over whether to include the configuration in the install config.

Contributor

Providing the data via manifests is probably a viable alternative, but can you expand on the concerns just so we all have relevant context if/when any subsequent proposal gets created?

/cc @cybertron @zaneb @sadasu what are your thoughts here, is it reasonable to fall back to manifests containing the underlying secret data given this latest feedback?

Contributor

One problem with only exposing the Secret API to users, is they will also need to modify all of the installer-generated BareMetalHost manifests (so they reference the desired secret(s)).

That's likely to be error prone (particularly for environments with a lot of worker hosts and per-host configuration e.g static IPs), obviously it's scriptable but this is why we wanted to enable input of this data via the install-config.
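
To illustrate what that manual wiring would involve, here is a rough sketch of a per-host Secret plus the BareMetalHost reference to it. The `nmstate` key name is an assumption for illustration, and the BMH field follows the Metal³ image-builder integration design linked earlier rather than a confirmed installer output.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: openshift-worker-0-network-config
  namespace: openshift-machine-api
stringData:
  nmstate: |                 # assumed key name for the nmstate payload
    interfaces:
      - name: eth0
        type: ethernet
        state: up
        ipv4:
          enabled: true
          dhcp: true
---
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: openshift-worker-0
  namespace: openshift-machine-api
spec:
  # reference the Secret above; other fields (BMC credentials, bootMACAddress, ...) omitted
  preprovisioningNetworkDataName: openshift-worker-0-network-config
```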

Member

I would prefer to keep the install-config integration as a separate enhancement.

For all intents and purposes, this is a separate enhancement for the install-config integration.

I have serious concerns over the appropriateness of embedding nmstate files into the install config. I would like to start with secrets in manifests that the user can modify.

It's just the same data with the added downside that you'd have to specify the Secret wrapper (TBH users should never need to know that it's stored in a Secret) and get the name right so that the installer can guess which BaremetalHost to link it to. I assume you're suggesting this as a temporary workaround until we can agree on the right interface for the installer?

Contributor

so that the installer can guess which BaremetalHost to link it to

I'm not sure the installer can do this - after create manifests the installer has already created the BMH manifests, so it will be up to the user to create the Secrets, then modify every single BMH manifest generated by the installer, then run create cluster

Then at runtime, we'll have to read those BMH/Secret manifests on the bootstrap VM (hackily e.g reading manifests in startironic.sh so we can build custom RHCOS images for the controlplane hosts with the right data).

This seems pretty far from ideal in terms of user experience and implementation IMO

Member

Good point Steve.
I agree that might work (and perhaps this was @staebler's intention) as a temporary measure to enable us to test the rest of the flow while we agree on what the interface to the installer should be (although it will require a bunch of extra rework), but if editing the manifests is the only interface to it then we haven't really shipped anything usable to customers.


I think for this enhancement we keep the install-config option and defer the alternative of implementing the manifest creation until a future release. We will need to flesh out the details of reading/interacting with the user-created network config manifests. For 4.10 we can let users specify the networkConfig as part of the install-config only.

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Oct 27, 2021
@cybertron (Member Author)

Okay, I've updated this with the latest plan based on the discussions this week. Let me know if you find anything else that needs to be changed. Thanks!

cybertron and others added 3 commits October 27, 2021 15:15
Describe user-facing API for day-1 network customizations in the IPI workflow,
with particular focus on baremetal where such configuration is a common
requirement.

Co-authored-by: Steve Hardy <shardy@redhat.com>
Co-authored-by: Zane Bitter <zbitter@redhat.com>
Co-authored-by: Dan Kenigsberg <danken@gmail.com>
Use nmstate instead of NetworkManager keyfiles as the user interface for
configuring pre-deploy baremetal host networking.
@staebler (Contributor)

I approve of the general approach here. @kirankt will give a final approval on behalf of the installer team once all of the details that have already been discussed out-of-band are captured in this PR.

/approve

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 27, 2021
@trozet (Contributor) commented Oct 27, 2021

/approve

@openshift-ci bot commented Oct 27, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: staebler, trozet

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kirankt left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 27, 2021
@openshift-merge-robot openshift-merge-robot merged commit 3db41c0 into openshift:master Oct 27, 2021