Available CRDs check feature #2712
base: main
Conversation
/cc @nb-ohad
Force-pushed from 84f0cbe to ddf5c5f
main.go
Outdated
}
if len(crds) > 0 {
	if err = (&crd.CustomResourceDefinitionReconciler{
		Client: mgr.GetClient(),
Should we use an uncached Client instead of cached for this controller?
hack/crdavail.sh
Outdated
Let's rename this here as well as in the Dockerfile, maybe to ocs-operator-entrypoint? Or do you have any other suggestions?
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: raaizik. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/test ocs-operator-bundle-e2e-aws
Could you please separate this commit into two commits?
- The first commit should include the controller changes.
- The second commit should include the script changes.
I will help test the first commit without the script. I believe we should be sorted with just the first commit.
This feels like a complex hack to cause a panic without actually using panic(). How did we finalize on this approach?
@umangachapagain We do not need this approach at all. Rewant and I tested the solution without any script and fixed the bugs we had, and it works fine. They will soon update the PR.
- At the start each reconcile iteration, revalidate which CRD are now
available. If a CRD of interest is now avail, panic the op
Can you please rephrase this statement in the commit message? It is not correct. Also, please update the commit title.
Force-pushed from 1c82a37 to 2137468
/test ocs-operator-bundle-e2e-aws
controllers/util/k8sutil.go
Outdated
@@ -44,6 +45,8 @@ const (
	OwnerUIDIndexName = "ownerUID"
)

var CRDList []string
It is not the same code we tested. Why is this list empty? @rewantsoni @raaizik Did you test these new changes?
Note: this PR introduces only the structure of the feature without any specific CRDs, which will be added in their respective PRs (e.g., ClusterClaim, VolumeGroupSnapshotClass and DRClusterConfig).
We've tested it with the script. It basically doesn't even create the CRD reconciler, since the CRDs list is empty.
We also need to test it with an empty list. We should not assume that it will work.
Tested with and without entries in CRDList and it works.
controllers/util/k8sutil.go
Outdated
@@ -44,6 +45,8 @@ const (
	OwnerUIDIndexName = "ownerUID"
)

var CRDList []string
Can we rename this variable, and please add a comment on what it holds.
controllers/crd/crd_controller.go
Outdated
// Reconcile compares available CRDs maps following either a Create or Delete event
func (r *CustomResourceDefinitionReconciler) Reconcile(ctx context.Context, request reconcile.Request) (reconcile.Result, error) {
	r.ctx = ctx
	r.Log.Info("Reconciling CustomResourceDefinition.", "CRD", klog.KRef(request.Namespace, request.Name))
I think only the name is enough here.
Maybe I am missing something, but the current code will panic the entire pod!
Why have we removed the code that recycles the process without failing the pod?
controllers/crd/crd_controller.go
Outdated
if err != nil {
	return reconcile.Result{}, err
}
if !reflect.DeepEqual(availableCrds, r.AvailableCrds) {
I do not believe a map guarantees the order of keys if the 2 maps contain different keys. DeepEqual might fail even if both maps contain the exact same keys.
Please move to a "len check + loop on items" for checking equality.
I tried to create a map, added keys in a different order, and compared them using DeepEqual, and it works:
https://go.dev/play/p/uFkmp57xEIp
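For reference, the "len check + loop on items" comparison suggested above can be sketched as below. This is a minimal standalone example; equalCrdMaps and the CRD names are hypothetical illustrations, not code from this PR.

```go
package main

import "fmt"

// equalCrdMaps is a hypothetical helper implementing the suggested
// "len check + loop" equality test for the availability maps.
func equalCrdMaps(a, b map[string]bool) bool {
	if len(a) != len(b) {
		return false
	}
	for name, available := range a {
		other, ok := b[name]
		if !ok || other != available {
			return false
		}
	}
	return true
}

func main() {
	prev := map[string]bool{"foos.example.io": true, "bars.example.io": false}
	curr := map[string]bool{"foos.example.io": false, "bars.example.io": false}
	fmt.Println(equalCrdMaps(prev, prev), equalCrdMaps(prev, curr)) // prints: true false
}
```

Either form works; the explicit loop simply makes the intent unambiguous to readers who do not trust DeepEqual's map semantics.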
controllers/crd/crd_controller.go
Outdated
}
if !reflect.DeepEqual(availableCrds, r.AvailableCrds) {
	r.Log.Info("CustomResourceDefinitions created/deleted. Restarting process.")
	panic("CustomResourceDefinitions created/deleted. Restarting process.")
I don't believe panic will allow us to detect the reason for the exit from outside the operator. Use os.Exit with a specific exit code.
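The suggestion above (a named exit-code constant plus os.Exit) could look like the following sketch. The name restartExitCode and the surrounding wiring are assumptions for illustration, not the PR's actual code.

```go
package main

import (
	"fmt"
	"os"
)

// restartExitCode is a hypothetical name for the agreed-upon code: the
// wrapper treats this exit code as "restart me" and anything else as a
// real failure.
const restartExitCode = 42

func main() {
	crdsChanged := false // in the controller this would come from the map comparison
	if crdsChanged {
		fmt.Println("CustomResourceDefinitions created/deleted. Restarting process.")
		// os.Exit skips deferred functions, so log before exiting.
		os.Exit(restartExitCode)
	}
	fmt.Println("no CRD changes; continuing")
}
```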
controllers/util/k8sutil.go
Outdated
@@ -44,6 +45,8 @@ const (
	OwnerUIDIndexName = "ownerUID"
)

var CRDList []string
Global singleton, non-const, non-atomic state inside a utils file is a red flag. Why do we need this?
@iamniting @umangachapagain
/hold
Force-pushed from 19fd2ec to 48e6dab
Reasons for this enhancement:
- A controller cannot set up a watch for a CRD that is not installed on the cluster; trying to set up the watch will panic the operator.
- There is no known way, that we are aware of, to add a watch later without client cache issues.

How the enhancement works around the issue:
- A new controller watches creation/deletion of the CRDs of interest, to prevent unnecessary reconciles.
- On start of the operator (main), detect which CRDs are available (out of a fixed list).
- At the start of each reconcile of the new controller, fetch the available CRDs again and compare them with the CRDs fetched in the previous step; if there is any change, panic the operator.

Signed-off-by: raaizik <132667934+raaizik@users.noreply.github.com>
Co-Authored-By: Rewant Soni <rewant.soni@gmail.com>
Adds a script that bypasses pod restarts.

Signed-off-by: raaizik <132667934+raaizik@users.noreply.github.com>
Co-Authored-By: Rewant Soni <rewant.soni@gmail.com>
if !reflect.DeepEqual(availableCrds, r.AvailableCrds) {
	r.Log.Info("CustomResourceDefinitions created/deleted. Restarting process.")
	os.Exit(42)
}
AvailableCrds is a map. I am not sure there is an order guarantee on keys, which means that DeepEqual might fail even if all the keys and values are the same. I would suggest changing the check to maps.Equal.
I tried to create a map, added keys in a different order, and compared them using DeepEqual, and it worked. maps.Equal produces the same results.
https://go.dev/play/p/uFkmp57xEIp
Where is the conditional watch that checks the CRD exists before adding the watch?
Added it in #2745
}
if !reflect.DeepEqual(availableCrds, r.AvailableCrds) {
	r.Log.Info("CustomResourceDefinitions created/deleted. Restarting process.")
	os.Exit(42)
Can we please have the exit code in a constant with a name that explains the usage? In addition, a comment on this line would help.
Added it with #2745
if [ $EXIT_CODE -ne $RESTART_EXIT_CODE ]; then
	exit $EXIT_CODE
fi
done
There is a missing newline here and git is complaining
Done
I believe it might've been discussed; just out of curiosity, why can't the operation be offloaded to odf-operator? odf-operator could set an env var and ocs-operator would correspondingly set up the watch. I haven't thought through all the possibilities, though.
RESTART_EXIT_CODE=42

while true; do
	./usr/local/bin/ocs-operator --enable-leader-election --health-probe-bind-address=:8081
We are effectively making this a sub-process rather than PID 1, and I'm not confident in the shell handling signals versus controller-runtime.
I see a good amount of comments on this PR; in brief, may I know why we backed off from letting the kubelet restart us?
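For comparison, a rough Go equivalent of the shell wrapper is sketched below. This is illustrative only: the supervise helper, binary path, and flags are assumptions, and a real entrypoint would also need signal forwarding to behave properly as PID 1.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

const restartExitCode = 42 // must match the operator's os.Exit value

// supervise runs the child in a loop, restarting it only when it exits
// with restartExitCode; any other exit code is propagated to the caller.
func supervise(path string, args ...string) int {
	for {
		cmd := exec.Command(path, args...)
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		err := cmd.Run()
		code := 0
		if exitErr, ok := err.(*exec.ExitError); ok {
			code = exitErr.ExitCode()
		} else if err != nil {
			return 1 // child failed to start at all
		}
		if code != restartExitCode {
			return code
		}
		// child asked for a restart: loop again
	}
}

func main() {
	// Demo with a portable command; in the real image this would be the
	// operator binary invoked by the Dockerfile entrypoint.
	code := supervise("/bin/sh", "-c", "exit 7")
	fmt.Println("child exited with", code)
}
```

The same PID 1 and signal-propagation concerns raised above apply to this sketch too; it only restates what the shell loop does.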
@@ -44,6 +45,10 @@ const (
	OwnerUIDIndexName = "ownerUID"
)

func GetCrds() []string {
	return []string{}
Is my understanding correct that in upcoming PRs we register the CRD names in this function and conditionally watch them in SetupWithManager?
Added it with #2745
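If that understanding is right, consumption of the fixed list could be sketched roughly as below. seedAvailableCrds and its behavior are hypothetical, for illustration only.

```go
package main

import "fmt"

// GetCrds mirrors the empty fixed list from the diff above; follow-up PRs
// would return the real CRD names here.
func GetCrds() []string {
	return []string{}
}

// seedAvailableCrds is a hypothetical helper: for each CRD of interest it
// records whether the CRD currently exists on the cluster.
func seedAvailableCrds(present map[string]bool) map[string]bool {
	available := map[string]bool{}
	for _, name := range GetCrds() {
		available[name] = present[name] // false when the CRD is absent
	}
	return available
}

func main() {
	// With the empty list nothing is tracked, so the CRD reconciler has
	// nothing to watch.
	fmt.Println("tracked CRDs:", len(seedAvailableCrds(nil)))
}
```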
DeleteFunc: func(e event.TypedDeleteEvent[client.Object]) bool {
	crdAvailable, keyExist := r.AvailableCrds[e.Object.GetName()]
	if keyExist && crdAvailable {
		r.Log.Info("CustomResourceDefinition was deleted.", "CRD", e.Object.GetName())
		return true
	}
	return false
},
Is there a possibility that a CRD is deleted within our operator's lifetime?
Although it is very unlikely, it is possible, and if it happens it will lead to a pod restart.
@@ -228,6 +254,7 @@ func (r *StorageClusterReconciler) SetupWithManager(mgr ctrl.Manager) error {
	Owns(&corev1.Secret{}, builder.WithPredicates(predicate.GenerationChangedPredicate{})).
	Owns(&routev1.Route{}).
	Owns(&templatev1.Template{}).
	Watches(&extv1.CustomResourceDefinition{}, enqueueStorageClusterRequest, builder.WithPredicates(crdPredicate)).
We already made a mistake (with Virt) in watching all CRDs, increasing the memory footprint; my mistake in not pushing hard on this fix #2539. Seeing this PR, we don't want to watch all CRDs until and unless necessary.
Clubbed with #2539, we could do something like below, as we are only interested in the metadata of CRDs while mapping them.
crd := &metav1.PartialObjectMetadata{}
crd.SetGroupVersionKind(extv1.SchemeGroupVersion.WithKind("CustomResourceDefinition"))
crd.Name = <CRDName>
client.Get(<ctx>, <ns-name>, crd)
I looked at this and couldn't find a way to specify multiple names in the cache using a field selector, but I have updated it to use PartialObjectMetadata here: #2745
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Changes: introduces only the structure of the feature; specific CRDs (e.g., ClusterClaim, VolumeGroupSnapshotClass and DRClusterConfig) will be added in their respective PRs.