
wrap NTO controller with controller runtime lib #316

Merged: 1 commit merged into openshift:master on Feb 15, 2022

Conversation

@yanirq (Contributor) commented Feb 8, 2022

Refactor the cluster node tuning operator (NTO) to use the controller-runtime library (release 0.11).

The functionality is internal only and replaces the direct invocation of the controller with the controller-runtime scheme.
This is a refactor of the main invocation of the cluster node tuning controller.
The controller and the metrics server are wrapped and started by the controller-runtime library; a sketch of this wiring follows below.

This is also preliminary work to set the stage for moving the Performance Addon Operator (PAO) under NTO, as documented in openshift/enhancements#867.
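
For orientation, here is a minimal sketch of what "wrapped and started by the controller-runtime library" means in practice. This is not the actual NTO code: the metrics address, the leader-election ID, and the way the legacy controller is plugged in as a Runnable are illustrative assumptions.

```go
// Minimal sketch (assumptions, not the actual NTO code) of wrapping an
// existing client-go based controller with a controller-runtime manager.
package main

import (
	"context"
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

func main() {
	// The manager owns the client cache, the metrics server and leader election.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		MetricsBindAddress: ":60000",                    // hypothetical metrics address
		LeaderElection:     true,
		LeaderElectionID:   "node-tuning-operator-lock", // hypothetical lock name
	})
	if err != nil {
		os.Exit(1)
	}

	// The existing controller is added as a Runnable; the manager starts it
	// together with the metrics server once leader election is won.
	if err := mgr.Add(manager.RunnableFunc(func(ctx context.Context) error {
		// Start the legacy Tuned controller here and block until ctx is
		// cancelled (details of the real controller are omitted).
		<-ctx.Done()
		return nil
	})); err != nil {
		os.Exit(1)
	}

	// Start blocks until the signal handler's context is cancelled.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```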

@yanirq (Contributor, author) commented Feb 8, 2022

@jmencak @dagrayvid
This PR is an alternative to the full refactor of NTO's controller presented in #302.
It keeps most of the original code and, if it proves to be a valid solution, will avoid a lot of regressions.
Performance should be tested as well, but since the main functionality is kept, the results should not differ from the current implementation of NTO.

@dagrayvid (Contributor) commented

As expected, this PR seems to perform about the same as the current implementation:

1. Fully idle:
Old implementation:                 Controller-runtime wrapper:
2022-04-02 17:39:25: 101 18         2022-08-02 16:52:44: 54 12
2022-04-02 17:40:25: 103 19         2022-08-02 16:53:44: 56 13
2022-04-02 17:41:25: 105 20         2022-08-02 16:54:44: 57 13
2022-04-02 17:42:25: 105 20         2022-08-02 16:55:44: 60 15
delta:               4   2                               6  3
2. Creating pods with no matching label used:
Old implementation:                 Controller-runtime wrapper:
2022-04-02 17:46:33: 112 22         2022-08-02 17:00:22: 69 18
2022-04-02 17:47:33: 113 22         2022-08-02 17:01:22: 71 18
2022-04-02 17:48:33: 115 23         2022-08-02 17:02:22: 73 19
2022-04-02 17:49:33: 117 23         2022-08-02 17:03:22: 75 19
delta:               5   1                               6  1
3. Creating pods with a profile that matches the pods (3 master / 3 worker cluster):
Old implementation:                 Controller-runtime wrapper:
2022-04-02 17:53:18: 169  33        2022-08-02 17:09:37: 129 27
2022-04-02 17:54:18: 539  56        2022-08-02 17:10:37: 587 48
2022-04-02 17:55:18: 969  72        2022-08-02 17:11:37: 1101 65
2022-04-02 17:56:18: 1488 86        2022-08-02 17:12:37: 1696 80
delta:               1319 53                             1567 53

Review thread on the leader-election cleanup code (diff context):
// collection based locking scheme and we cannot continue until the ConfigMap
// is GC or deleted manually. If the owner references do not exist, just go
// ahead with the new client-go leader-election code.
loop:
@yanirq (Contributor, author):
@jmencak In case the move to the controller-runtime wrapper does not handle this GC, will we still need to keep this cleanup for the legacy NTO lock?
What is the upgrade path for NTO? Can we skip major versions and miss this cleanup?

@jmencak (Contributor):
So this cleanup code was added in 4.8. In my opinion we no longer need to keep it, since any of the 4.8, 4.9, 4.10 versions will take care of this cleanup. In other words, we don't support direct upgrades from, say, 4.7 to 4.11.
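
For readers unfamiliar with the cleanup being discussed, the quoted diff context roughly describes logic like the following sketch: wait for the old configmaps-based leader-election lock to be garbage collected (or deleted manually) before proceeding with the new client-go leader election. The function and variable names are assumptions for illustration, not the actual NTO code.

```go
// Sketch only (assumed names, not the actual NTO implementation).
// Assumed imports: context, fmt, time,
// apierrors "k8s.io/apimachinery/pkg/api/errors",
// metav1 "k8s.io/apimachinery/pkg/apis/meta/v1",
// "k8s.io/client-go/kubernetes".
func waitForLegacyLockRemoval(ctx context.Context, cs kubernetes.Interface, ns, lockName string) error {
	for {
		cm, err := cs.CoreV1().ConfigMaps(ns).Get(ctx, lockName, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return nil // legacy lock already gone; proceed with new leader election
		}
		if err != nil {
			return fmt.Errorf("failed to get legacy lock ConfigMap: %w", err)
		}
		// Without owner references the ConfigMap will never be garbage
		// collected on its own, so do not wait for it; just go ahead.
		if len(cm.GetOwnerReferences()) == 0 {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(10 * time.Second): // poll until the ConfigMap is GC'd or deleted manually
		}
	}
}
```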

@jmencak (Contributor) left a comment

Thank you for the PR, @yanirq! I had a quick look at this and it looks mostly good to me. Only nits so far, but I will take another look. Also, @dagrayvid, can you take a look?

manifests/40-rbac.yaml (review comment outdated, resolved)
pkg/util/leaderelection.go (review comment outdated, resolved)
@dagrayvid (Contributor) commented

The changes look good to me after my first look through.

@yanirq @jmencak is the long term plan to maintain the NTO controller as a client-go based controller like this? Or do we intend to rewrite it using the controller-runtime eventually, assuming we can find a way to do so without a significant performance regression?

@yanirq (Contributor, author) commented Feb 10, 2022

> The changes look good to me after my first look through.
>
> @yanirq @jmencak is the long term plan to maintain the NTO controller as a client-go based controller like this? Or do we intend to rewrite it using the controller-runtime eventually, assuming we can find a way to do so without a significant performance regression?

The main goal here is to have NTO wrapped by controller-runtime so that we can add PAO, which is already written entirely with controller-runtime, under this repository (see openshift/enhancements#867).
When we first approached this, we thought about having a single main reconcile loop (implemented with controller-runtime) that would take care of both NTO and PAO.
In retrospect, #302 showed that a full conversion of the NTO controller to controller-runtime would need a deeper design change in order to be efficient and written according to best practices.

With the changes introduced in this PR, the next step would be to add the PAO controller under the controller-runtime manager:

mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
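
The snippet above is truncated in this view. A possible completion, purely for illustration (scheme, operatorNamespace, klog, and the option values are assumptions, not what the PR actually uses), would be:

```go
// Illustrative completion of the truncated ctrl.NewManager call above.
// scheme, operatorNamespace and klog are assumed to be in scope.
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
	Scheme:             scheme,            // runtime.Scheme with the NTO and PAO API types registered
	MetricsBindAddress: ":60000",          // hypothetical metrics address
	Namespace:          operatorNamespace, // hypothetical: limit the cache to the operator namespace
})
if err != nil {
	klog.Fatalf("unable to create manager: %v", err)
}
```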

@jmencak (Contributor) commented Feb 10, 2022

So now that we have two options, I wonder how they'd compare performance-wise with PAO already merged in as a controller. Have you tried that, @yanirq? What's the expectation here? At the moment, this PR is the "winner", but will it still be a "winner" when running idle with PAO merged in?

@yanirq (Contributor, author) commented Feb 10, 2022

> So now that we have two options, I wonder how they'd compare performance-wise with PAO already merged in as a controller. Have you tried that, @yanirq? What's the expectation here? At the moment, this PR is the "winner", but will it still be a "winner" when running idle with PAO merged in?

The approach to adding PAO would be similar in both PRs: we will add it as a separate controller (with its own reconciler) under the manager.
The reconciler should ideally handle only the related CRs (PerformanceProfiles in our case) and watch the artifacts it creates, as sketched below.
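
As a rough illustration of that scoping, the reconciler registration might look like the sketch below. The package aliases and the specific owned types (Tuned, MachineConfig) are assumptions for illustration; the exact watches were not specified in this PR.

```go
// Sketch of a narrowly scoped PAO reconciler registration.
// Assumed imports: ctrl "sigs.k8s.io/controller-runtime", performancev2 for
// the PerformanceProfile API, tunedv1 and machineconfigv1 for owned artifacts.
func (r *PerformanceProfileReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&performancev2.PerformanceProfile{}). // reconcile only PerformanceProfile CRs
		Owns(&tunedv1.Tuned{}).                   // watch artifacts the reconciler creates,
		Owns(&machineconfigv1.MachineConfig{}).   // e.g. Tuned and MachineConfig objects
		Complete(r)
}
```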

@dagrayvid (Contributor) commented

> So now that we have two options, I wonder how they'd compare performance-wise with PAO already merged in as a controller. Have you tried that, @yanirq? What's the expectation here? At the moment, this PR is the "winner", but will it still be a "winner" when running idle with PAO merged in?

I had the same thought. I can probably test this using #314.

@yanirq (Contributor, author) commented Feb 10, 2022

>> So now that we have two options, I wonder how they'd compare performance-wise with PAO already merged in as a controller. Have you tried that, @yanirq? What's the expectation here? At the moment, this PR is the "winner", but will it still be a "winner" when running idle with PAO merged in?
>
> I had the same thought. I can probably test this using #314.

@cynepco3hahue should have a more updated PR for that (at least a WIP one)

@yanirq (Contributor, author) commented Feb 10, 2022

/retest

@dagrayvid (Contributor) commented

>> I had the same thought. I can probably test this using #314.
>
> @cynepco3hahue should have a more updated PR for that (at least a WIP one)

I will wait for an update to PR #314 before testing. Although I don't expect any surprises, I would like to do this test before merging this PR.

Commit message: "This is a refactor to the main invocation of cluster node tuning controller. The controller and the metrics server are wrapped and started by the controller runtime library."
@yanirq (Contributor, author) commented Feb 14, 2022

/retest

@openshift-ci bot commented Feb 14, 2022

@yanirq: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@jmencak (Contributor) commented Feb 15, 2022

Thank you for the changes, @yanirq. This looks good to me now. Can we squash the commits? @dagrayvid, unless you have objections from your side, I think this is ready to be merged.

@yanirq (Contributor, author) commented Feb 15, 2022

> Thank you for the changes, @yanirq. This looks good to me now. Can we squash the commits? @dagrayvid, unless you have objections from your side, I think this is ready to be merged.

@jmencak the commits are already squashed.

@jmencak (Contributor) commented Feb 15, 2022

/lgtm
/approve
/hold
David, please cancel the hold if you have no objections.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 15, 2022
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 15, 2022
@openshift-ci bot commented Feb 15, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jmencak, yanirq

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 15, 2022
@dagrayvid (Contributor) commented Feb 15, 2022

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 15, 2022
@openshift-merge-robot openshift-merge-robot merged commit 11f88d1 into openshift:master Feb 15, 2022
IlyaTyomkin pushed a commit to IlyaTyomkin/cluster-node-tuning-operator that referenced this pull request on May 23, 2023:
"This is a refactor to the main invocation of cluster node tuning controller. The controller and the metrics server are wrapped and started by the controller runtime library."
IlyaTyomkin pushed a commit to IlyaTyomkin/cluster-node-tuning-operator that referenced this pull request on Jun 13, 2023:
"This is a refactor to the main invocation of cluster node tuning controller. The controller and the metrics server are wrapped and started by the controller runtime library."
Labels: approved (Indicates a PR has been approved by an approver from all required OWNERS files), lgtm (Indicates that a PR is ready to be merged)
4 participants