
Move PAO to OCP enhancement #867

Merged
merged 1 commit into openshift:master from the enh-pao-in-nto branch
Nov 30, 2021

Conversation

@MarSik (Contributor) commented Aug 12, 2021

This is a preliminary plan describing how to make
the Performance Addon Operator part of OpenShift core.

@MarSik (Contributor Author) commented Aug 12, 2021

/assign mrunalp
/assign jmencak

@MarSik (Contributor Author) commented Aug 12, 2021

/assign @mrunalp
/assign @jmencak


The operator contains the high level orchestration logic + a [tool to simplify generation of Performance Profiles](https://docs.openshift.com/container-platform/4.8/scalability_and_performance/cnf-create-performance-profiles.html). This tool can be executed locally on the admin's laptop using `podman run --entrypoint performance-profile-creator performance-addon-operator:ver`

- [performance-addon-must-gather](https://catalog.redhat.com/software/containers/openshift4/performance-addon-operator-must-gather-rhel8/5ee0bfa4d70cc56cea83827f?container-tabs=overview&gti-tabs=registry-tokens)


probably we should just add the native support for the PAO under the must-gather


We can add that under the design goals then. in this section we are going over the existing deliverables.


done

The must gather image [enhances](https://github.com/openshift-kni/performance-addon-operators/tree/master/must-gather) the main OCP one and [collects](https://docs.openshift.com/container-platform/4.8/scalability_and_performance/cnf-performance-addon-operator-for-low-latency-nodes.html#cnf-about-gathering-data_cnf-master) some latency tuning configuration and data that are needed for debugging.

- and the functional tests are also vendored into [cnf-tests](https://catalog.redhat.com/software/containers/openshift4/cnf-tests-rhel8/5e9d81f65a134668769d560b?container-tabs=overview&gti-tabs=registry-tokens) that can be used to [validate customer deployments](https://docs.openshift.com/container-platform/4.8/scalability_and_performance/cnf-performance-addon-operator-for-low-latency-nodes.html#cnf-performing-end-to-end-tests-for-platform-verification_cnf-master) (this includes other aspects like SCTP, PTP, SR-IOV as well).


we are missing the performance-profile-creator section


it's mentioned in line 45:

The operator contains the high level orchestration logic + a tool to simplify generation of Performance Profiles

Are you missing some more info about it?


nope, if it is mentioned it is good enough for me.


- PAO has been installed via OLM since 4.4. Once it moves to core we will have to figure out the upgrade procedure, because OLM will not remove the optional operator by itself.

One of the options is to keep using the same leadership election method and make sure the newest version always wins. The OLM operator will then stay on the cluster (can be release noted) as an inactive component.


I do not think the leader election supports such conditioning. One of the options is to remove the relevant OLM stuff via the performance-addon controller and leave the CRDs unchanged.


What is the relevant OLM stuff here? If we remove the catalog source/CSVs/subscriptions it will probably remove the pao-olm-operator and consequently its owned CRs, no?
This brings me to another thought: what will happen to the previous version's MachineConfig/Tuned generated artifacts?


CSV, subscription, operator group. It will not remove the CRD, so all CRs should be untouched.


So are we aiming for a single-version finalizer here?


Added the OLM artifacts removal as an option to this doc.

@jmencak (Contributor) left a comment

Looks reasonable, spotted a typo.


#### CI

PAO contains both unit and functional tests implemented using Gingko.
Contributor

s/Gingko/Ginkgo/


done


PAO code would be moved into the NTO repository as a separate controller. The Telco team would keep ownership as it has the domain knowledge needed to implement the low-latency tuning.

The code currently uses Operator SDK libraries in some of the tests and this might need to be changed.
Contributor

Many of the core operators either use client-go directly or have moved to https://github.com/openshift/library-go . You may want to investigate and look into that. My long-term plan for NTO is to evaluate a transition from using client-go directly to using https://github.com/openshift/library-go instead.


Integration with NTO is currently TBD, please advise if anything special is needed.

One of the test suites (4_latency) implements an opinionated latency testing suite to get comparable results from customers and to prevent configuration / command line invocation mistakes. Keeping it in PAO/NTO or moving it into cnf-tests is TBD.
Contributor

Please make sure that all of the open questions like this are listed in the Open Questions section (see line 155) so we can ensure we answer them all.


### Releases

PAO should be shipped as part of core OCP. ART again. PAO already obeys the OCP GitHub PR rules and uses the same PR bot, so nothing changes in the lifecycle.
Contributor

If it's going to be part of the node-tuning-operator, wouldn't this be moot? The code would be moving to the other repo, right?

Contributor Author

There are two options:

  • Maintaining a separate copy of the code in the NTO repository
  • Vendoring the current PAO code as a "library" until we can stop supporting the standalone mode (4.6+)

But from the OCP perspective it makes little difference, as it would just be part of the NTO release process.

Contributor

Use as a library in which binary?

Contributor Author

Both will end up in the main NTO binary. The only difference is in the location of the source code.


adding the two maintainability options to the open questions section.


One option is to remove the relevant OLM artifacts (CSV, Subscription, OperatorGroup) via the performance-addon controller and leave the PAO CRDs unchanged.
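
For illustration, a minimal sketch of what that cleanup could look like using the dynamic client; the namespace argument, the name filtering, and the idea of running this from the NTO-embedded controller are assumptions for this sketch, not the agreed design:

```go
package olmcleanup

import (
	"context"
	"fmt"
	"strings"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// cleanupStandaloneOLMArtifacts removes the Subscription, CSV and OperatorGroup
// that OLM created for the standalone PAO, while leaving the PerformanceProfile
// CRD and all existing CRs untouched.
func cleanupStandaloneOLMArtifacts(ctx context.Context, client dynamic.Interface, ns string) error {
	olmResources := []schema.GroupVersionResource{
		{Group: "operators.coreos.com", Version: "v1alpha1", Resource: "subscriptions"},
		{Group: "operators.coreos.com", Version: "v1alpha1", Resource: "clusterserviceversions"},
		{Group: "operators.coreos.com", Version: "v1", Resource: "operatorgroups"},
	}

	for _, gvr := range olmResources {
		list, err := client.Resource(gvr).Namespace(ns).List(ctx, metav1.ListOptions{})
		if err != nil {
			return fmt.Errorf("listing %s: %w", gvr.Resource, err)
		}
		for _, item := range list.Items {
			// Only touch objects that clearly belong to the standalone PAO;
			// matching on this name substring is an assumption of the sketch.
			if !strings.Contains(item.GetName(), "performance-addon-operator") {
				continue
			}
			err := client.Resource(gvr).Namespace(ns).Delete(ctx, item.GetName(), metav1.DeleteOptions{})
			if err != nil && !apierrors.IsNotFound(err) {
				return fmt.Errorf("deleting %s %q: %w", gvr.Resource, item.GetName(), err)
			}
		}
	}
	return nil
}
```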

Another option, in case leader election supports such conditioning, is to keep using the same leadership election method and make sure the newest version always wins. The OLM operator will then stay on the cluster (can be release noted) as an inactive component.
Contributor

Another option would be to have the copy in NTO disable itself when the standalone version is installed, updating the ClusterOperator resource with that reason. Then the user can remove the old standalone version from their cluster. This would be easier to implement reliably, since we would only need to look for the subscription or other indicator that there is a standalone copy installed. We also would not need to deliver any changes in the standalone package, so it would not matter which version is installed in the cluster at the start of the upgrade.
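
A rough sketch of that detection, assuming the standalone operator can be identified purely by an OLM Subscription for the `performance-addon-operator` package (the package name and doing this check at startup are illustrative assumptions):

```go
package paodetect

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// standalonePAOInstalled reports whether an OLM Subscription for the standalone
// Performance Addon Operator exists in any namespace. If it does, the copy
// embedded in NTO could decline to start its controllers and surface the reason
// via its ClusterOperator conditions, leaving removal of the old operator to
// the cluster admin.
func standalonePAOInstalled(ctx context.Context, client dynamic.Interface) (bool, error) {
	subGVR := schema.GroupVersionResource{
		Group:    "operators.coreos.com",
		Version:  "v1alpha1",
		Resource: "subscriptions",
	}
	subs, err := client.Resource(subGVR).List(ctx, metav1.ListOptions{})
	if err != nil {
		return false, err
	}
	for _, sub := range subs.Items {
		// spec.name on a Subscription is the OLM package name.
		pkg, found, err := unstructured.NestedString(sub.Object, "spec", "name")
		if err != nil || !found {
			continue
		}
		if pkg == "performance-addon-operator" {
			return true, nil
		}
	}
	return false, nil
}
```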


@cynepco3hahue after we check the POC let's see what direction we can move towards here.

Contributor

Please don't do a lot of development work assuming that this enhancement will be approved. At this point, I think it's better to continue delivering PAO via OLM.

@yanirq Aug 30, 2021

adding the option above.


### Upgrade

- PAO has been installed via OLM since 4.4. Once it moves to core we will have to figure out the upgrade procedure, because OLM will not remove the optional operator by itself.
Contributor

This is a critical question for this design document to answer.

@mrunalp (Member) Aug 20, 2021

Agree! Also, can we capture adding e2e tests that will exercise the upgrade scenarios?


We are still examining this; once decided, we will add the conclusions here.


### Code

PAO code would be moved into the NTO repository as a separate controller. The Telco team would keep ownership as it has the domain knowledge needed to implement the low-latency tuning.
Contributor

I think the API definitions would need to go into openshift/api. @sttts or @deads2k can you confirm that?

Contributor Author

I second this question. Where would the CRD be defined? NTO's own Tuned is defined in https://github.com/openshift/cluster-node-tuning-operator/blob/master/manifests/20-crd-tuned.yaml

@sttts (Contributor) Aug 26, 2021

Definitely into openshift/api. It must be visible there in order to be subject to API docs generation and general API compliance work.

Contributor Author

Oh, interesting.. @jmencak should NTO move the CRD there as well maybe? Or is there a reason for the separation we need to consider for PAO as well?

Contributor

Oh, interesting.. @jmencak should NTO move the CRD there as well maybe? Or is there a reason for the separation we need to consider for PAO as well?

Yes, moving into openshift/api is definitely on my list of TODO tasks for the future. Whether it is a good idea to move at the time of PAO/NTO merge, possibly, not sure about that.


adding the move at the time of PAO/NTO merge under open questions.


### Goals

We (the telco team) would like to make PAO part of OpenShift. There are multiple paths we can take to achieve this, I am going to describe the one we prefer: **PAO becoming a sub-controller in NTO.**
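
In controller-runtime terms, "a sub-controller in NTO" would roughly mean registering one more reconciler on the manager NTO already runs, so both share the same process, caches, and leader election. The reconciler name and the watched type below are placeholders for illustration, not the real NTO/PAO code:

```go
package main

import (
	"context"
	"os"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	ctrl "sigs.k8s.io/controller-runtime"
)

// PerformanceProfileReconciler stands in for the PAO logic that would live
// next to the existing Tuned reconciler inside the NTO binary.
type PerformanceProfileReconciler struct{}

func (r *PerformanceProfileReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// The real controller would fetch the PerformanceProfile and render the
	// MachineConfig, KubeletConfig and Tuned objects it owns.
	return ctrl.Result{}, nil
}

func (r *PerformanceProfileReconciler) SetupWithManager(mgr ctrl.Manager) error {
	pp := &unstructured.Unstructured{}
	pp.SetGroupVersionKind(schema.GroupVersionKind{
		Group: "performance.openshift.io", Version: "v2", Kind: "PerformanceProfile",
	})
	return ctrl.NewControllerManagedBy(mgr).For(pp).Complete(r)
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		os.Exit(1)
	}

	// ... the existing NTO (Tuned) controller would be registered here ...

	// The PAO logic becomes just one more controller on the same manager.
	if err := (&PerformanceProfileReconciler{}).SetupWithManager(mgr); err != nil {
		os.Exit(1)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```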
Contributor

I know we focus a lot on bare metal for PAO work. Do the APIs implemented by the PAO make sense on all platforms and deployment topologies? Are there any conditions under which we would not want those APIs available?

Member

Does it make sense to add a configuration knob to disable it?


Is there an issue with the PAO controller being deployed without creating its CRs? Meaning the controller will exist but will not perform any operations (instead of using a disable knob/feature gate).

Contributor

If it's going to be disabled or not in use in a large percentage of clusters, that seems like it could be another reason to deliver it via OLM.

Contributor Author

Hugepages, as well as kernel arguments, can be configured nicely via PAO; both are currently done via low-level MCO (and documented like that in the OCP docs).

https://docs.openshift.com/container-platform/4.7/scalability_and_performance/what-huge-pages-do-and-how-they-are-consumed-by-apps.html#configuring-huge-pages_huge-pages

If nothing else, PAO would improve the user experience and bring NUMA awareness.
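
For illustration, the hugepages and kernel-argument cases above expressed as a single PerformanceProfile, built here as an unstructured object so the sketch does not depend on the PAO Go types. The field names follow the performance.openshift.io/v2 API as I understand it; the concrete values and node selector label are placeholders:

```go
package paoexample

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// examplePerformanceProfile shows the kind of tuning (hugepages, extra kernel
// arguments, CPU partitioning, NUMA policy) that would otherwise be spread
// across hand-written MachineConfigs.
func examplePerformanceProfile() *unstructured.Unstructured {
	return &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "performance.openshift.io/v2",
		"kind":       "PerformanceProfile",
		"metadata": map[string]interface{}{
			"name": "example-profile",
		},
		"spec": map[string]interface{}{
			// Reserved CPUs host infrastructure pods; isolated CPUs are left
			// for latency-sensitive workloads.
			"cpu": map[string]interface{}{
				"reserved": "0-1",
				"isolated": "2-15",
			},
			// Hugepages, today typically configured via raw MachineConfigs.
			"hugepages": map[string]interface{}{
				"defaultHugepagesSize": "1G",
				"pages": []interface{}{
					map[string]interface{}{"size": "1G", "count": int64(16), "node": int64(0)},
				},
			},
			// Extra kernel arguments, another task currently done via MCO.
			"additionalKernelArgs": []interface{}{"nmi_watchdog=0"},
			// The NUMA awareness mentioned above.
			"numa": map[string]interface{}{
				"topologyPolicy": "single-numa-node",
			},
			"nodeSelector": map[string]interface{}{
				"node-role.kubernetes.io/worker-cnf": "",
			},
		},
	}}
}
```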


The proposed change adds more code to the NTO component increasing the probability of bugs and possibly a higher resource usage.

However, using a Performance Profile is optional, so users that do not use this way of tuning should not be affected at all.
Contributor

This change means introducing the PerformanceProfile into all clusters, meaning the API is visible to all users. Is there any danger that a cluster admin, experimenting with the APIs, could mis-configure their cluster and break something?


For experimentation, it is advised to use the performance profile creator to protect the user from misconfigurations.
Applying a performance profile directly will not be straightforward, and there are spec-related validations.

Contributor Author

@dhellmann Indeed, an admin that is not careful could tune the node improperly; however, they have more than enough ways to do that with MachineConfigs themselves.

Incorrect kubelet configuration can prevent the node from starting (wrong cpu specification).

Contributor

OK, my point is that it's a risk. It might be small, but it's there, so let's document it.


adding the info under the risks section.

- OCP is already heavy on infrastructure resource consumption and an additional operator would make that worse
- implementing a conditional start is possible though, similar to what machine-operator uses, but it negates the advantage of simpler coding

### PAO as a sub-controller of Machine Config Operator
Member

We can say that currently a new MCO design is under consideration, so it makes more sense to go with NTO for now.

- The new [Workload partitioning](https://github.com/openshift/enhancements/blob/master/enhancements/workload-partitioning/management-workload-partitioning.md)
feature seeks to use PAO as the configuration tool.
However, Workload partitioning needs to be enabled at the install phase and PAO CRDs are not yet available at that time.
[A "hacky" procedure](https://github.com/openshift/enhancements/pull/753) is needed to overcome this.
Contributor

There's a strong desire to reduce the number of components delivered as part of the OCP release payload, and have more of them installed via OLM. There are several benefits to that approach, including decoupling release schedules, shrinking the payload, and making it possible to reduce the API surface area for alternative topologies. The PAO seems like a strong candidate for OLM delivery, especially since that's how it already works.

If we do add it to the payload, NTO seems like a reasonable "home" (although I could also see moving NTO to OLM delivery). But I'm not yet convinced it's necessary to make this change, based on the motivation described here.

Regarding integration with the installer, it would be useful to have additional requirements added to https://issues.redhat.com/browse/OLM-2232, if the plans there don't already cover the use case for PAO.

Contributor Author

There is a strong coupling to NTO features and sometimes to specific kernel bug fixes delivered in RHCOS.

One of the biggest gaps in OLM for such infrastructure operators (apart from the first-day install described by OLM-2232) is the lack of support for atomic upgrades and a proper dependency mechanism that would support multiple Z-streams.

The merge with NTO was indeed proposed to not increase the number of installed and running processes.

Moving NTO itself to OLM sounds tricky to me without a way to set up at least the openshift-node tuned profile in RHCOS; this is something I consider important for the platform as a whole (@jmencak do you have the original design for NTO?).

Contributor

@MarSik, the original design document/proposal is a google doc. I believe the audience for NTO is a lot wider than for PAO, but PAO definitely helps where low-latency tuning is needed. The main aim behind NTO was to concentrate node tuning into one place and take advantage of years of experience transferred into workload-specific profiles. TuneD has been an integral part of RHEL for many years and customers have come to expect it there.

([Cluster Node Tuning Operator](https://github.com/openshift/cluster-node-tuning-operator))
and depends on the OCP version. So we release PAO for every OCP minor version (since 4.4) and together with some of the OCP Z-streams.
The upgrade procedure is tricky though: first OCP needs to be upgraded, and then PAO via an OLM channel change.
This is needed to make sure the necessary functionality is present in the running NTO.
Contributor

Can you provide more detail about why the PAO is so tightly coupled to the OCP version?

Does the new work to make it possible for OLM operators to specify the versions of OCP they're compatible with help the upgrade story?


PAO is dependent on multiple components and releases: k8s features (cpu manager, memory manager), the RHCOS version, the kernel version, tuned and MCO, all of which are packaged and released in an OCP version that PAO is aligned to.
We did not have a version that could match a different release up until now (and probably will not in the future).

Contributor

PAO is going to have to be able to handle version skew between the control plane and worker nodes as part of the EUS upgrade strategy. Including the operator in the release payload will not remove that requirement.

Ping me on slack if you need more details about that.

Contributor Author

IIRC PAO does not depend on anything version specific wrt worker nodes. NTO takes care of the few pieces that are needed wrt tuning and MCO installs the proper RT kernel.

PAO is tightly coupled to the NTO/tuned version though, and that coupling is tighter than just OCP minor versions. Some tuning changes and bug fixes in NTO/tuned as well as in the RHCOS kernel are critical for the advertised functionality.

Contributor

It would be good to capture the detail from this thread in the body of the document.


adding the additional info to this section.

@derekwaynecarr (Member)

To help evaluate, can you please provide the following:

  • list of pertinent topology options (control plane and infrastructure) where this operator is pertinent.

see:
https://github.com/openshift/api/blob/a6156965faae5ce117e3cd3735981a3fc0e27e27/config/v1/types_infrastructure.go#L92
https://github.com/openshift/api/blob/a6156965faae5ce117e3cd3735981a3fc0e27e27/config/v1/types_infrastructure.go#L103

my assumption is this operator is not installed at all when control plane topology is "external" and the infrastructure topology is not pertinent to the function. this is because the resources leveraged by the PAO are all managed externally when in this mode (e.g. node configuration, kubelet configuration).

  • existing api compatibility practices

for the present state of PAO, would you advertise the API under which compatibility tier
https://docs.openshift.com/container-platform/4.8/rest_api/understanding-api-support-tiers.html
https://docs.openshift.com/container-platform/4.8/rest_api/understanding-api-support-tiers.html#mapping-support-tiers-to-openshift-api-groups_understanding-api-tiers

  • enumerate list of performance profile owned objects

just for clarity, if user edits a performance profile, what related resources are modified on their behalf.

  • enumerate list of webhooks

is it just a validation webhook or is there something else?

  • updates to openshift conformance test suite if included by default

default options should be tested under our conformance suite, what testing would you recommend including if we proceeded on this. is this testing only possible on particular infrastructure platforms, if so, is this an argument for only enabling this function on those platforms in the node tuning operator itself?

Migrating the use of these libraries to [client-go](https://github.com/openshift/library-go)
would be one of the options.

PAO also implements a migration webhook to migrate its custom resources during upgrade. The hook is currently installed via OLM mechanisms and this might need to be reimplemented or changed to use whatever CVO or ART supports.
Contributor

what is a migration webhook?

Contributor Author


It is expected that the image will be kept up-to-date as part of regular OCP flows.

API changes are handled by a migration webhook that PAO already contains to make upgrades transparent to users.
Contributor

what does this mean? What does it do?

Contributor Author

Conversion webhook; it makes sure that CRs created with an older apiVersion are treated properly, keeping the old behavior intact or using proper defaults.
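
To make that concrete, a rough controller-runtime style sketch of the v1 side of such a conversion. The v2 import path, the field names, and the defaulting rule are illustrative assumptions, not the actual PAO code; the real v1 and v2 types live in their own API packages:

```go
package v1

import (
	"sigs.k8s.io/controller-runtime/pkg/conversion"

	v2 "github.com/example/performance-addon-operators/api/v2" // hypothetical import path
)

// v2 is assumed to be the storage ("hub") version; these methods are called by
// the conversion webhook on every request that asks for a different version,
// not only during upgrades.

// ConvertTo converts this (hypothetical) v1 PerformanceProfile to the v2 hub version.
func (src *PerformanceProfile) ConvertTo(dstRaw conversion.Hub) error {
	dst := dstRaw.(*v2.PerformanceProfile)
	dst.ObjectMeta = src.ObjectMeta
	dst.Spec.CPU = v2.CPU(src.Spec.CPU) // assumed 1:1 field mapping

	// Keep the old behavior intact with an explicit default: pretend v1
	// implied an enabled real-time kernel (illustrative only).
	if dst.Spec.RealTimeKernel == nil {
		enabled := true
		dst.Spec.RealTimeKernel = &v2.RealTimeKernel{Enabled: &enabled}
	}
	return nil
}

// ConvertFrom converts from the v2 hub back to v1; while v1 is served,
// v2 objects must round-trip through it without data loss.
func (dst *PerformanceProfile) ConvertFrom(srcRaw conversion.Hub) error {
	src := srcRaw.(*v2.PerformanceProfile)
	dst.ObjectMeta = src.ObjectMeta
	dst.Spec.CPU = CPU(src.Spec.CPU) // assumed 1:1 field mapping
	return nil
}
```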

Contributor

Do you have plans to deprecate v1 and eventually remove it? Or do you plan to roundtrip through v1 forever?

Contributor Author

Probably yes, we introduced v2 in 4.7 and there have been no upgrade-incompatible changes to it since. Deprecating v1 is something we can consider indeed, we just haven't had the time to do so yet.

Contributor

"migration" sounded very much a like a one-time thing, but it is not if v1 stays. Are you aware that v2 objects must roundtrip through v1 without data loss? Doesn't this stop you from new features in the future? You have to add new features to v1 as well.

Contributor

A different default value between apiVersions is fine, and actually a common reason for a new apiVersion.


@MarSik should we consider v1 deprecation as a design goal here?

Contributor

Let's either do the v1 deprecation work before moving or after, but not at the same time.


v1 deprecation work will happen after moving to NTO.
In the meantime, openshift-kni/performance-addon-operators#727 should cover the back-and-forth conversions between v1 and v2 as mentioned above.

Contributor

s/migration webhook/conversion webhook/

@MarSik (Contributor Author) commented Aug 25, 2021

To help evaluate, can you please provide the following:

* list of pertinent topology options (control plane and infrastructure) where this operator is pertinent.

see:
https://github.com/openshift/api/blob/a6156965faae5ce117e3cd3735981a3fc0e27e27/config/v1/types_infrastructure.go#L92
https://github.com/openshift/api/blob/a6156965faae5ce117e3cd3735981a3fc0e27e27/config/v1/types_infrastructure.go#L103

my assumption is this operator is not installed at all when control plane topology is "external" and the infrastructure topology is not pertinent to the function. this is because the resources leveraged by the PAO are all managed externally when in this mode (e.g. node configuration, kubelet configuration).

I cannot really answer this, but the assumption you made seems to be reasonable. PAO is a control plane component that modifies the worker nodes (unless in single node mode where it must affect masters of course).

* existing api compatibility practices

for the present state of PAO, would you advertise the API under which compatibility tier
https://docs.openshift.com/container-platform/4.8/rest_api/understanding-api-support-tiers.html
https://docs.openshift.com/container-platform/4.8/rest_api/understanding-api-support-tiers.html#mapping-support-tiers-to-openshift-api-groups_understanding-api-tiers

The API has been treated as Tier 1 for multiple releases already (since 4.4).

* enumerate list of performance profile owned objects

MachineConfig, KubeletConfiguration, Tuned (see the diagram linked from the description)

https://github.com/openshift-kni/performance-addon-operators/blob/master/docs/interactions/diagram.png

just for clarity, if user edits a performance profile, what related resources are modified on their behalf.

Potentially all the owned objects mentioned above. Nothing else, all activity happening on the cluster or nodes is indirect via other components (MCO, NTO) reacting to changes in objects owned by PAO.

* enumerate list of webhooks

is it just a validation webhook or is there something else?

Validation and conversion webhook to transparently upgrade PerformanceProfile on apiVersion changes.

https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definition-versioning/#overview

* updates to openshift conformance test suite if included by default

default options should be tested under our conformance suite, what testing would you recommend including if we proceeded on this. is this testing only possible on particular infrastructure platforms, if so, is this an argument for only enabling this function on those platforms in the node tuning operator itself?

All basic checks are doable on all Linux platforms except when the real-time kernel is needed. That case used to be only supported on GCP (and bare metal).

Measuring real latency requires bare metal, but is not needed for conformance.

@sttts (Contributor) commented Aug 26, 2021

Validation and conversion webhook to transparently upgrade PerformanceProfile on apiVersion changes.

nitpicking (hopefully): sounds like a misconception of conversion. "upgrade" and "apiVersion changes" are not a thing in API versioning and conversion.

- PAO is closely tied to some OCP components
([Cluster Node Tuning Operator](https://github.com/openshift/cluster-node-tuning-operator))
and depends on the OCP version. So we release PAO for every OCP minor version (since 4.4) and together with some of the OCP Z-streams.
The upgrade procedure is tricky though: first OCP needs to be upgraded, and then PAO via an OLM channel change.
Contributor Author

@yanirq Please add another reason:

Pre-GA availability and testing is tricky in the current way of things. The 4.9 prod OLM index (for example) will not be populated until the operator is shipped, at the GA date. All OCP operators are available earlier. PAO has its own separate quay.io bundle/index, but the separation is complicated for testing.

@dhellmann (Contributor)

/priority important-soon

@openshift-ci openshift-ci bot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Sep 2, 2021
([Cluster Node Tuning Operator](https://github.com/openshift/cluster-node-tuning-operator))
and depends on the OCP version. So we release PAO for every OCP minor version (since 4.4) and together with some of the OCP Z-streams.
The upgrade procedure is tricky though: first OCP needs to be upgraded, and then PAO via an OLM channel change.
This is needed to make sure the necessary functionality is present in the running NTO.
Contributor

It would be good to capture the detail from this thread in the body of the document.


PAO bugs are currently tracked in The OpenShift Bugzilla component [Performance Addon Operator](https://bugzilla.redhat.com/buglist.cgi?component=Performance%20Addon%20Operator&list_id=12053428&product=OpenShift%20Container%20Platform)

This should be kept as-is; the Telco team will own all the bugs. We might need to move the component under NTO or rename it.
Contributor

I do think we're going to want to consider the bugzilla component question carefully. Moving to a sub-component of NTO seems like it makes sense, since that's the component users will see installed, but it would be good to have the teams decide what to do together.


@MarSik @jmencak Do we want to move this to the open questions section or have a decision on that?

Moving the NTO API definitions into openshift/api should be considered at the time of the PAO/NTO merge.

- List of pertinent topology options (control plane and infrastructure) where this operator is pertinent:
The assumption is that PAO is not installed at all when control plane topology is "external" and the infrastructure topology is not pertinent to the function
Contributor

If the PAO becomes part of the NTO, do we achieve this by not installing the NTO? Is that already how NTO is handled for external control planes?


@jmencak Can you affirm that this is the way NTO is handled for external control planes?

Contributor

My understanding is that currently, IBM Cloud is the only cloud with external control plane topology. NTO is being installed on IBM Cloud by using a deployment manifest (annotation). Unless we merge PAO into NTO in a way that we can make use of the above deployment manifest (not what I believe the plan is), similar manifest for including PAO in IBM Cloud would be necessary. Therefore, as things are planned now, PAO would not run on external control planes.

Currently, it is unclear to me what (if anything) needs to be done to include NTO (and PAO for that matter) on other external control plane topologies wrt. efforts such as HyperShift which will make it possible to run external control planes on platforms other than IBM Cloud.
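
A sketch of what such gating could look like if PAO lives inside NTO and simply checks the cluster Infrastructure object before starting its controllers; the function and wiring are illustrative assumptions, not what NTO does today:

```go
package topologygate

import (
	"context"

	configv1 "github.com/openshift/api/config/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// performanceControllerEnabled decides whether the PAO sub-controller should
// run on this cluster. On an external control plane, node and kubelet
// configuration is managed outside the cluster, so the controller would stay
// dormant. The client's scheme is assumed to have config.openshift.io/v1
// types registered.
func performanceControllerEnabled(ctx context.Context, c client.Client) (bool, error) {
	infra := &configv1.Infrastructure{}
	if err := c.Get(ctx, client.ObjectKey{Name: "cluster"}, infra); err != nil {
		return false, err
	}
	return infra.Status.ControlPlaneTopology != configv1.ExternalTopologyMode, nil
}
```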


- PAO has been installed via OLM since 4.4. Once it moves to core we will have to figure out the upgrade procedure, because OLM will not remove the optional operator by itself.

- Implementation options:
Contributor

We should pick one of these to describe here and document the others in the alternatives section below.


It is expected that the image will be kept up-to-date as part of regular OCP flows.

API changes are handled by a migration webhook that PAO already contains to make upgrades transparent to users.
Contributor

Let's either do the v1 deprecation work before moving or after, but not at the same time.


The proposed change adds more code to the NTO component increasing the probability of bugs and possibly a higher resource usage.

However, using a Performance Profile is optional, so users that do not use this way of tuning should not be affected at all.
Contributor

OK, my point is that it's a risk. It might be small, but it's there, so let's document it.

@yanirq commented Sep 13, 2021

@dhellmann who can we tag from the API team to review the proposal?

PAO code would be moved into the NTO repository as a separate controller. The Telco team would keep ownership as it has the domain knowledge needed to implement the low-latency tuning.

The code currently uses Operator SDK libraries in some of the tests and this might need to be changed.
Migrating the use of these libraries to [client-go](https://github.com/openshift/library-go)
Contributor

library-go

would be one of the options.

PAO also implements a [conversion webhook](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definition-versioning/#webhook-conversion)
to migrate its custom resources during upgrade.
Contributor

what does this mean? Why during upgrade?


Added a more elaborate explanation.

@sttts (Contributor) commented Sep 22, 2021

@yanirq some wording around conversion is still odd in the text. But overall the migration plan with a conversion webhook plus openshift-kni/performance-addon-operators#727 looks good now.

would be one of the options.

PAO also implements a [conversion webhook](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definition-versioning/#webhook-conversion)
to migrate its custom resources during upgrade.
to convert its custom resources during upgrade to versions that are supported by the openshift version targeted.
@sttts (Contributor) Sep 29, 2021

my point is that this is not during upgrade; conversions are done in real time on every request, even long after upgrade.


@sttts ack, since this is a section about code in general it would make sense to drop the specific mention about upgrade for that paragraph then (following your comment)

@yanirq commented Oct 4, 2021

@sttts Thanks for the feedback. Comments were addressed.
Any additional reviews required here?

@sttts (Contributor) commented Oct 4, 2021

@yanirq lgtm from my side.

@sttts (Contributor) commented Oct 4, 2021

approvers:
  - TBD

looks wrong. Who owns this topic from the staff engineers?

@yanirq commented Oct 5, 2021

approvers:
  - TBD

looks wrong. Who owns this topic from the staff engineers?

Added @dhellmann to approvers list and @sttts to reviewers list.

PAO code would be moved into the NTO repository as a separate controller. The Telco team would keep ownership as it has the domain knowledge needed to implement the low-latency tuning.

The code currently uses Operator SDK libraries in some of the tests and this might need to be changed.
Migrating the use of these libaries to [library-go](https://github.com/openshift/library-go) or stripping them to [controller-runtime](https://github.com/kubernetes-sigs/controller-runtime)
Contributor

s/libaries/libraries/


done

@yanirq commented Nov 1, 2021

@fromanirh @jmencak @mrunalp since you are listed as reviewers on this PR in addition to @dhellmann, can you please have a final look at this proposal? (We are waiting for final acks from all peers; we got an ack only from @sttts so far.)
Thanks.

@jmencak (Contributor) commented Nov 3, 2021

@fromanirh @jmencak @mrunalp since you are listed as reviewers on this PR in addition to @dhellmann, can you please have a final look at this proposal? (We are waiting for final acks from all peers; we got an ack only from @sttts so far.) Thanks.

lgtm from my side. Nit: a missing new line in front of ### Test Plan

@ffromani commented Nov 9, 2021

@fromanirh @jmencak @mrunalp since you are listed as reviewers on this PR in addition to @dhellmann, can you please have a final look at this proposal? (We are waiting for final acks from all peers; we got an ack only from @sttts so far.) Thanks.

lgtm from my side for the general direction and most notably for the idea. There are some details in the upgrade/downgrade flow that I'd like to explore in a bit more detail, but that can be postponed I believe.

@yanirq yanirq force-pushed the enh-pao-in-nto branch 3 times, most recently from 049d107 to 20f3ccd on November 10, 2021 08:59
@yanirq commented Nov 11, 2021

@dhellmann I believe we have all the acks (just not in the github label format)


### Risks and Mitigations

- Since the performance profile API will be visible to users on all clusters, there is a risk that a cluster admin might tune nodes improperly.
Member

nit: As a mitigation we can say that we clearly document that admins should proceed with caution.


done

@yanirq yanirq force-pushed the enh-pao-in-nto branch 2 times, most recently from e9037b3 to e057e06 on November 16, 2021 16:42
From the OCP perspective it makes little difference, as it would just be part of the NTO release process.

- API definitions: They would need to go into openshift/api, but at this moment NTO does not include its own API definitions under openshift/api.
Moving the NTO API definitions into openshift/api should be considered at the time of the PAO/NTO merge.
Member

Do we have a migration plan for this captured?


I rephrased it a bit.
There is no specific migration plan captured at the moment; this is/was considered a task to be done in NTO, which would probably need to move a bit faster now.
This is also why I captured it under the open questions section.

How will the component handle version skew with other components?
What are the guarantees? Make sure this is in the test plan.

Consider the following in developing a version skew strategy for this
Member

This section needs to be cleaned up.


done

status: provisional
---

# PAO as part of core OCP
Member

nit: Retitle this so it is clear that we are moving under NTO and not moving as a separate operator to the payload.


done

This is a preliminary plan describing how to make
the [Performance Addon Operator](https://github.com/openshift-kni/performance-addon-operators) part of OpenShift core.
@mrunalp (Member) commented Nov 30, 2021

/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 30, 2021
@openshift-ci bot commented Nov 30, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mrunalp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 30, 2021
@openshift-merge-robot openshift-merge-robot merged commit fc2f2e9 into openshift:master Nov 30, 2021