Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent creation of KSVC if its namespace is not in the Mesh #181

Merged
merged 1 commit into from
Mar 21, 2024

Conversation

israel-hdez
Copy link
Contributor

@israel-hdez israel-hdez commented Mar 13, 2024

Description

When a KServe InferenceService is created, odh-model-controller enrolls the ISVC namespace to the mesh, if needed. It was found that sometimes the namespace enrollment process may take a little time to be processed.

Since KServe and odh-model-controllers are (to certain extent) independent of each other, sometimes KServe controller and Knative serving controller are faster and the pod from the KSVC may be created faster than the mesh enrollment process. This would lead to the KSVC pod not having an Istio sidecar.

Since the Istio authorization rules are evaluated on the sidecar of target service, the missing sidecar on the model/ksvc pod would mean that the traffic would bypass the authorization rules and this impacts ODH KServe authorization (ref: https://istio.io/latest/docs/ops/best-practices/security/#server-first-tcp-protocols-are-not-supported).

In an effort to prevent that situation, this is adding a new validating webhook that would block creation of the Knative Service until the namespace is acknowledged to be a member of the service mesh.

Fixes https://issues.redhat.com/browse/RHOAIENG-2542

How Has This Been Tested?

Simple sanity check: ensuring a model can be deployed.

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

Copy link
Contributor

openshift-ci bot commented Mar 13, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Copy link
Contributor

@bartoszmajsak bartoszmajsak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With rejecting creation you kind of follow eventual consistency, so I think that is ok per se, especially considering the fact that those two controllers are independent of each other.

I have one thing for you to validate, and a few nits. Other than that the code looks good. Big kudos for spending time on writing tests!

Important

Keep in mind that at some point we will switch from creating SMM (or directly manipulating SMMR which we shouldn't be doing in the first place) to label selectors and at this point we will need to make sure we can handle this case as well. I do not know if SMMR is internally updated if that's the case, but I will figure it out.

Makefile Show resolved Hide resolved
config/webhook/kustomization.yaml Outdated Show resolved Hide resolved
controllers/webhook/ksvc_validator.go Outdated Show resolved Hide resolved
main.go Outdated Show resolved Hide resolved
main.go Outdated
os.Exit(1)
}
} else {
setupLog.Info("Skipping setup of Knative Service validating Webhook.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you really have to log such info perhaps it's good to add "why" it is skipped.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I'll add a reason.

Copy link
Contributor Author

@israel-hdez israel-hdez Mar 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed... That said, I'm no longer sure how much we can skip this.
The conditional is in prevention of RawDeployment setup, where Knative is missing. However, AFAIK, there is no way to introduce equivalent conditionals around the ValidatingWebhookConfiguration resource (i.e. the operator manifests).

So, if the user has Knative for other uses, the ValidatingWebhookConfiguration would take effect and would break the cluster, given the controller would have the logic turned off.

I'll tag @VaishnaviHire and @zdtsw for their advice.

controllers/webhook/ksvc_validator.go Outdated Show resolved Hide resolved
Comment on lines +101 to +98
log.Info("Rejecting Knative service because its namespace is not a member of the service mesh")
return fmt.Errorf("rejecting creation of Knative service %s on namespace %s because the namespace is not a configured member of the mesh", ksvc.Name, ksvc.Namespace)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure you really need a log here - the error you are returning seems to be sufficient and I assume it will be logged somehow?

In addition the info message is misleading. It can be pending or failing if I'm reading this right.

Copy link
Contributor Author

@israel-hdez israel-hdez Mar 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error is not written to the controller logs. The error is returned to the client that tried to create the resource. So, the client chooses what to do with the message. Because of this, I chose to add the log message to make it clear what is happening in odh-model-controller side.

About pending or failing, I'm not sure if it is worth to go that deep here :-) Stating it is not configured is enough for me to make the user go and diagnose it, in case it stays on one of those states for longer than expected.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I thought this is handled by operator-sdk, but if that is not the case logging is helpful here.

About pending or failing, I'm not sure if it is worth to go that deep here :-) Stating it is not configured is enough for me to make the user go and diagnose it, in case it stays on one of those states for longer than expected.

I was coming from the angle that "not configured" is a bit ambiguous here. Does it mean that someone forgot to put the proper label on the namespace so it is not really seen by the mesh OR it cannot be configured by the mesh because... other errors? But I agree it might be going to deep.

Makefile Show resolved Hide resolved
Makefile Show resolved Hide resolved
@openshift-ci openshift-ci bot removed the lgtm label Mar 19, 2024
When a KServe InferenceService is created, odh-model-controller enrolls the ISVC namespace to the mesh, if needed. It was found that sometimes the namespace enrollment process may take a little time to be processed.

Since KServe and odh-model-controllers are (to certain extent) independent of each other, sometimes KServe controller and Knative serving controller are faster and the pod from the KSVC may be created faster than the mesh enrollment process. This would lead to the KSVC pod not having an Istio sidecar.

Since the Istio authorization rules are evaluated on the sidecar of target service, the missing sidecar on the model/ksvc pod would mean that the traffic would bypass the authorization rules and this impacts ODH KServe authorization (ref: https://istio.io/latest/docs/ops/best-practices/security/#server-first-tcp-protocols-are-not-supported).

In an effort to prevent that situation, this is adding a new validating webhook that would block creation of the Knative Service until the namespace is acknowledged to be a member of the service mesh.

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>
@israel-hdez
Copy link
Contributor Author

I just squashed the commits.
@spolti @VedantMahabaleshwarkar @bartoszmajsak Please, give it a final pass.

Copy link
Contributor

@bartoszmajsak bartoszmajsak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@openshift-ci openshift-ci bot added the lgtm label Mar 21, 2024
Copy link
Contributor

openshift-ci bot commented Mar 21, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bartoszmajsak, israel-hdez, VedantMahabaleshwarkar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [VedantMahabaleshwarkar,israel-hdez]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 3904c1d into opendatahub-io:main Mar 21, 2024
5 checks passed
@israel-hdez israel-hdez deleted the jira-2542-fix branch March 21, 2024 23:04
bartoszmajsak added a commit to bartoszmajsak/odh-model-controller that referenced this pull request Apr 2, 2024
This change hanleds "soft opt-in" for authorization capability handled by Authorino.

The logic consists of following steps:

 * during the controller bootstrap it will check if GroupVersion `authorino.kuadrant.io/v1beta2` is present in the cluster and conditionally add it to scheme.
 * when defining watches (using builder instance for OpenshiftReconciler) it will conditionally set a watch for `AuthConfig` according to the rules defined in opendatahub-io#181 but additionally if Authorino GroupVersion is present in the schema
 * it guards AuthConfig reconciler from executing based on the "schema rule". This prevents from constant failures due to AuthConfig not being defined for the cluster.

NOTE: As it is the case with original logic brought in opendatahub-io#181, it will require restart of the controller pod to set watch on AuthConfig resource when Authorino is installed in the cluster.
bartoszmajsak added a commit to bartoszmajsak/odh-model-controller that referenced this pull request Apr 2, 2024
This change handles "soft opt-in" for authorization capability handled by Authorino.

The logic consists of following steps:

 * during the controller bootstrap it will check if GroupVersion `authorino.kuadrant.io/v1beta2` is present in the cluster and conditionally add it to scheme.
 * when defining watches (using builder instance for OpenshiftReconciler) it will conditionally set a watch for `AuthConfig` according to the rules defined in opendatahub-io#181 but additionally if Authorino GroupVersion is present in the schema
 * it guards AuthConfig reconciler from executing based on the "schema rule". This prevents from constant failures due to AuthConfig not being defined for the cluster.

NOTE: As it is the case with original logic brought in opendatahub-io#181, it will require restart of the controller pod to set watch on AuthConfig resource when Authorino is installed in the cluster.

Custom image: quay.io/bmajsak/odh-model-controller:latest
bartoszmajsak added a commit to bartoszmajsak/odh-model-controller that referenced this pull request Apr 2, 2024
This change handles "soft opt-in" for authorization capability handled by Authorino.

The logic consists of following steps:

 * during the controller bootstrap it will check if GroupVersion `authorino.kuadrant.io/v1beta2` is present in the cluster and conditionally add it to scheme.
 * when defining watches (using builder instance for OpenshiftReconciler) it will conditionally set a watch for `AuthConfig` according to the rules defined in opendatahub-io#181 but additionally if Authorino GroupVersion is present in the schema
 * it guards AuthConfig reconciler from executing based on the "schema rule". This prevents from constant failures due to AuthConfig not being defined for the cluster.

NOTE: As it is the case with original logic brought in opendatahub-io#181, it will require restart of the controller pod to set watch on AuthConfig resource when Authorino is installed in the cluster.

Custom image: quay.io/bmajsak/odh-model-controller:latest

Signed-off-by: bartoszmajsak <bartosz.majsak@gmail.com>
bartoszmajsak added a commit to bartoszmajsak/odh-model-controller that referenced this pull request Apr 2, 2024
This change handles "soft opt-in" for authorization capability handled by Authorino.

The logic consists of following steps:

 * during the controller bootstrap it will check if GroupVersion `authorino.kuadrant.io/v1beta2` is present in the cluster and conditionally add it to scheme.
 * when defining watches (using builder instance for OpenshiftReconciler) it will conditionally set a watch for `AuthConfig` according to the rules defined in opendatahub-io#181 but additionally if Authorino GroupVersion is present in the schema
 * it guards AuthConfig reconciler from executing based on the "schema rule". This prevents from constant failures due to AuthConfig not being defined for the cluster.

NOTE: As it is the case with original logic brought in opendatahub-io#181, it will require restart of the controller pod to set watch on AuthConfig resource when Authorino is installed in the cluster.

Custom image: quay.io/bmajsak/odh-model-controller:latest

Signed-off-by: bartoszmajsak <bartosz.majsak@gmail.com>
bartoszmajsak added a commit to bartoszmajsak/odh-model-controller that referenced this pull request Apr 2, 2024
This change handles "soft opt-in" for authorization capability handled by Authorino.

The logic consists of following steps:

 * during the controller bootstrap it will check if GroupVersion `authorino.kuadrant.io/v1beta2` is present in the cluster and conditionally add it to scheme.
 * when defining watches (using builder instance for OpenshiftReconciler) it will conditionally set a watch for `AuthConfig` according to the rules defined in opendatahub-io#181 but additionally if Authorino GroupVersion is present in the schema
 * it guards AuthConfig reconciler from executing based on the "schema rule". This prevents from constant failures due to AuthConfig not being defined for the cluster.

NOTE: As it is the case with original logic brought in opendatahub-io#181, it will require restart of the controller pod to set watch on AuthConfig resource when Authorino is installed in the cluster.

Custom image: quay.io/bmajsak/odh-model-controller:latest

Signed-off-by: bartoszmajsak <bartosz.majsak@gmail.com>
bartoszmajsak added a commit to bartoszmajsak/odh-model-controller that referenced this pull request Apr 2, 2024
This change handles "soft opt-in" for authorization capability handled by Authorino.

The logic consists of following steps:

 * during the controller bootstrap it will check if GroupVersion `authorino.kuadrant.io/v1beta2` is present in the cluster and conditionally add it to scheme.
 * when defining watches (using builder instance for OpenshiftReconciler) it will conditionally set a watch for `AuthConfig` according to the rules defined in opendatahub-io#181 but additionally if Authorino GroupVersion is present in the schema
 * it guards AuthConfig reconciler from executing based on the "schema rule". This prevents from constant failures due to AuthConfig not being defined for the cluster.

NOTE: As it is the case with original logic brought in opendatahub-io#181, it will require restart of the controller pod to set watch on AuthConfig resource when Authorino is installed in the cluster.

Custom image: quay.io/bmajsak/odh-model-controller:latest

Signed-off-by: bartoszmajsak <bartosz.majsak@gmail.com>
bartoszmajsak added a commit to bartoszmajsak/odh-model-controller that referenced this pull request Apr 5, 2024
This change handles "soft opt-in" for authorization capability handled by Authorino.

The solution is based on Conditions exposed in DSCInitialization resource, as provided in opendatahub-io/opendatahub-operator#949.

When the condition of type `CapabilityServiceMeshAuthorization` has a reason `MissingOperator` it will assume authorino/authorization module is not present and therefore does not handle AuthConfigs - leaving models not secured as it was for the previous versions. This approach allows to softly opt-in for Authorization without blocking the upgrade. Any other reason means that ODH Operator is in not happy state and will eventually reconcile so that Authorino will start working.

The logic consists of the following steps:

 * during the controller bootstrap it will check separately if ServiceMesh is enabled and Authorino is present (latter by inspecting the status.condition)
 * when defining watches (using builder instance for OpenshiftReconciler) it will conditionally set a watch for `AuthConfig`
 * it guards AuthConfig reconciler from executing based on the "authorization capability". This prevents constant failures due to `AuthConfig` not being defined for the cluster.

> [!NOTE]
> As it is the case with original logic brought in opendatahub-io#181, it will require restart of the controller pod to set a watch on AuthConfig resource when Authorino is installed in the cluster.

Signed-off-by: bartoszmajsak <bartosz.majsak@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants