Rework & Simplify Kubeflow Auth #2864
Conversation
@juliusvonkohout @kimwnasptd here is my proposed solution for the JWT issue.
/assign @juliusvonkohout @kimwnasptd I want to get these changes in for 1.9.1, so I would greatly appreciate your review as soon as possible. Otherwise, we will keep getting reports of things not working on very common Kubernetes distributions like EKS.
Hey @thesuperzapper,
"From inside the cluster, machines should be able to access some Kubeflow APIs (e.g. KFP) with a Kubernetes ServiceAccount token." Should be Because also from inside you can talk to the ingressgateway and do everything that you can do via the UI/API I adjusted your text slightly there. |
Upstreaming of apps/pipeline/upstream/base/installs/multi-user/istio-authorization-config.yaml is tracked in #2804 and we welcome PRs.
I think we should move away from this special KFP way and always go through the ingressgateway in the future, no matter whether you are inside or outside of the cluster. Here is a verbose draft:

```bash
export KF_PIPELINES_TOKEN_PATH=$(pwd)/kf_pipelines_token.yml
echo "${{ secrets.KUBECONFIG }}" > $(pwd)/kubeconfig.yaml
export KUBECONFIG=$(pwd)/kubeconfig.yaml
TOKEN=$(kubectl create token default-editor -n $KF_PROJECT_NAMESPACE --duration $KF_K8S_API_SERVER_TOKEN_EXPIRATION --audience istio-ingressgateway.istio-system.svc.cluster.local)
echo -n $TOKEN > $(pwd)/kf_pipelines_token.yml
python submit_run.py
```

And in `submit_run.py`:

```python
import os
import pathlib
import kfp
auth_token = pathlib.Path(os.environ["KF_PIPELINES_TOKEN_PATH"]).read_text()
KF_ISTIO_INGRESSGATEWAY_URL = os.environ.get('KF_ISTIO_INGRESSGATEWAY_URL')
kfp_client = kfp.Client(host=KF_ISTIO_INGRESSGATEWAY_URL, existing_token=auth_token)
```

This was tested on 1.8.1, and for 1.9 we can probably even drop the audience when creating the token. So we could just use the default token from e.g. a JupyterLab, and we would not need the PodDefaults for this special KFP token anymore, as described here: https://www.kubeflow.org/docs/components/pipelines/user-guides/core-functions/connect-api/#serviceaccount-token-volume.
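If the audience requirement can indeed be dropped in 1.9 as suggested above, a minimal sketch of the simplified token creation (hypothetical, not tested here) would be:

```bash
# Hypothetical simplification for 1.9 (untested): rely on the default token
# audience instead of targeting the ingressgateway explicitly.
TOKEN=$(kubectl create token default-editor -n $KF_PROJECT_NAMESPACE)
```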
@b4sus thanks for reminding me, I found a similar issue when I implemented oauth2-proxy in deployKF. I have added 939010b, which will enable the "sign in" screen for oauth2-proxy, meaning that users have to explicitly "start" the auth flow, rather than it just being redirected in the background (and accumulating many CSRF cookies for each request if the page is already open).
Can we document, or even test in a GitHub Actions workflow, "From outside the cluster, machines should be able to access Kubeflow APIs with a JWT issued by Dex (via token exchange)" somehow?
common/oauth2-proxy/components/istio-m2m/requestauthentication.yaml
@juliusvonkohout this would be a breaking change, and so can not be included in 1.9.1. Also, this would need to be a decision for the KFP maintainers, because it's a long-standing feature and most/all KFP users are currently depending on it. Personally, I see no benefit to removing this functionality: it only makes things more complex for users who don't want to allow Kubernetes JWTs to be used from outside the cluster, while also making it harder for people to migrate from 'Standalone' KFP to the 'Kubeflow Platform' one.
This feature is part of KFP for a reason; removing this check would be a security risk, as it reduces the "isolation" of the JWTs, both in terms of not having these tokens be useful outside the cluster, and also not allowing random SA tokens to be used. Also, there is already a
common/oauth2-proxy/overlays/m2m-dex-and-eks/kustomization.yaml
I went through the 50 files on my phone, so I might have missed things, but in general it looks like a good compromise and it significantly improves the documentation. Maybe you should extend Kimonas' proposal with some details from your initial post.
```yaml
# Dex default cookie expiration is 24h.
# If set to 168h (default oauth2-proxy), Istio will not be able to use the JWT after 24h,
# but oauth2-proxy will still consider the cookie valid.
# It's possible to configure the JWT Refresh Token to enable longer login session.
```
More information about where to change the JWT refresh token would be useful. Setting the expiration to 7 days or so for all components involved (Istio, Dex, oauth2-proxy) is often requested by users.
Please do this in a separate PR, it's not related to this change.
Switching for the rest of the week to also help with the review of the PR. I also took a look at the issue. @thesuperzapper this looks amazing! With a quick look, I really like the approach of distinguishing the scenarios, and for the cloud/vendor-specific ones only giving some basic guidance. I will provide some more comments in the following hours/days.
@juliusvonkohout I have reverted your change because, as I was saying in #2864 (comment), we need to support in-cluster access to some of these special API endpoints via their Kubernetes Service, not the ingress-gateway.
@thesuperzapper regarding #2864 (comment): it seems that committing your suggestion by clicking on the button there breaks the DCO because it does not properly sign the commit. Feel free to just overwrite it and commit yourself.
@juliusvonkohout in general, please don't commit to others' PRs unless they are unresponsive; it's akin to "stealing" the PR, as GitHub will attribute the commit to you as well. I have removed your commit and removed your access to push to my branch.
The missing sign-off looks more like a GitHub bug. If you explicitly create a button (based on my input to just remove the word "to") for others to commit, and explicitly enable "edit from maintainers" in your PR, this is expected behaviour and a formal invitation. These settings, buttons and suggestions are all in your control in the end. The same would apply if I created a PR in Kubeflow/kubeflow and did the same things; then it would be my responsibility if I enabled it and made an explicit suggestion for you to commit. So for me it really does not matter, and I welcome commits from others to my branches in general, as allowed by my own PR settings. I even invite some people to my forks sometimes to collaborate. But I also have no emotional ego investment here and just want to get this reviewed and merged. Since it is your PR, it is your decision and I am totally fine with it. But back to the important things: I think with a few more hours of addressing the remaining comments, and maybe some additional tests and documentation additions, we can get this over the finish line within September.
@juliusvonkohout can you give a specific list of things you need me to change, or do you just need time to review? I am waiting on @kimwnasptd to give a review so we can get this merged. I am pretty confident (especially given the passing tests) that this PR works, so I am happy to merge it (so that master works for people) and then update docs in a follow-up PR, if there are any remaining things to document.
Yes, let me provide a list. Some comments/conversations have been answered/resolved, but some are still open:
Probably all of this can be addressed in an hour or so. Nobody is expecting perfection there, just rough guidance.
I need to do a second run through all the files, and to focus more on the architecture than on 50+ individual files. I plan to get this done in the next week. By then you might also get feedback from Kimonas. But if the stuff above is addressed and Kimonas is fine with the architectural details, that might also be enough.
Yes, we now have so many, and more extensive, tests that they are a strong indicator. Given the 1-2 hour effort mentioned above, it probably makes sense to address the small things in this PR. This would increase the probability of getting this merged next week already.
Istio should do the magic here; it's managing this route out of the box with
As mentioned, please mind that its intention was only for self-signed issuers on kind, vcluster and so on. Including this mechanism for deployments like EKS, Azure and others, with an OIDC Issuer served behind publicly trusted certificates, was not only not needed but also the real source of issues. I had a look at this PR and from my perspective it looks fine, just this one doubt. I asked @MaxKavun from my team to verify this PR in an EKS cluster. @MaxKavun, can you also check if the setup works without the oauth2-proxy VirtualService?
Hi, quick update regarding testing: I have tested this PR by connecting to KFP from a notebook in a Rancher cluster (RKE, k8s v1.28.12). I also tested it in my local cluster (kind v1.31.0).
Result: in my other cluster, which contains more experiments, it also worked and returned the experiments metadata; earlier I was getting a 403 error. Thanks
Are m2m token tests such as https://github.com/kubeflow/manifests/blob/master/.github/workflows/notebook_controller_m2m_test.yaml, https://github.com/kubeflow/manifests/blob/master/.github/workflows/kserve_m2m_test.yaml and https://github.com/kubeflow/manifests/blob/master/.github/workflows/pipeline_test.yaml also working on Rancher?
@juliusvonkohout @kromanow94 don't bother testing without the oauth2-proxy VirtualService, because even if Istio was doing something strange with regard to the /oauth2/ route, the sign-in page is needed to ensure that background requests don't all initiate a login flow and flood the browser with CSRF cookies when the current auth cookie becomes invalid (while still having a window open to the dashboard).
So I've done a first pass on the structural side, and things look good, but I still would like to take a bit of a deeper look before approving.
Can you get this done by Saturday? Because the plan is that I cut 1.9.1 RC1 on Saturday/Sunday. If not, I will have to add this PR to 1.9.1 RC2.
I've done some testing and it all looks good; I can confirm that the oauth2-proxy VirtualService is indeed needed.
But even after the update, logout works but it doesn't redirect to the login page (you have to refresh the page manually).
@juliusvonkohout @kimwnasptd I have applied most of the requested changes in 547d2c4 and rebased, because other things were merged to master in the meantime (e.g. a KFP update). I added notes in the ... We should be ready to merge now, as long as we do an RC2 for 1.9.1 (RC1 was cut without this PR). This is really the most critical fix for 1.9.1, so it must be merged.
@thesuperzapper I will do the 1.9.1 RC2 in two weeks after my vacation. Yes, we need to include this PR for 1.9.1. So I will merge this if nothing serious comes up in the next week. |
@thesuperzapper can you resolve the merge conflict? Afterwards I can merge it.
Signed-off-by: Mathew Wicks <5735406+thesuperzapper@users.noreply.github.com>
Alright, as discussed on Slack as well: /lgtm. And we can continue in follow-up PRs.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: juliusvonkohout. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing `/approve` in a comment.
Resolves #2850
## Background

### Goals of Kubeflow Auth

See the 20240606-jwt-handling.md doc for more context.

But in short, the goals of Kubeflow Auth are:

- From outside the cluster, users should be able to access the Kubeflow UIs with a browser, by logging in through Dex.
- From outside the cluster, machines should be able to access Kubeflow APIs with a JWT issued by Dex (via token exchange).
- From inside the cluster, machines should be able to access some Kubeflow APIs (e.g. KFP) with a Kubernetes ServiceAccount token.
- Backend applications identify the user of a request from the `kubeflow-userid` header.
- In the future, group membership will also be passed via the `kubeflow-groups` header.

### Issues with Auth in 1.9.0
In Kubeflow 1.9.0 there were a significant number of changes to the way auth was implemented.
The main change was the migration from `oidc-authservice` to `oauth2-proxy` for handling authentication.

The solution we implemented in 1.9.0 had a number of issues:

1. It did not work properly with nearly all major Kubernetes distributions (e.g. EKS, GKE, AKS, K3s, etc.):
   - A CronJob was used to populate the JWKS into the Istio `RequestAuthentication`:
     - It assumed the `jwks_uri` returned by the `/.well-known/openid-configuration` was correct (which is not true on EKS, see aws/containers-roadmap#2234).
2. We were needlessly verifying JWTs in multiple places:
   - In Istio (`RequestAuthentication`).
   - In the `envoyExtAuthzHttp` check with `oauth2-proxy`.
   - Again in Istio (a second `RequestAuthentication`).
   - On top of that, `oauth2-proxy` can not retrieve the JWKS from Kubernetes on clusters that don't allow anonymous access to the API server (which is disabled on common distributions like K3s).
3. The use of cluster-wide `RequestAuthentication` resources:
   - `RequestAuthentication` that applied to all Pods in the cluster rather than being scoped to the `istio-ingressgateway`:
     - `RequestAuthentication/m2m-token-issuer` (for Kubernetes JWTs)
     - `RequestAuthentication/dex-jwt` (for Dex JWTs)
4. Somehow, the `VirtualService` for `oauth2-proxy` was omitted:
   - The `/oauth2/` HTTP path was answered by central-dashboard, not oauth2-proxy.

## What does this PR change?
### High-Level Overview

Here is a high-level overview of how the new auth flows work:

**User Authentication:**

- Users access the `istio-ingressgateway` pods via something like `kubectl port-forward`.
- `AuthorizationPolicy/istio-ingressgateway-oauth2-proxy`:
  - Uses `envoyExtAuthzHttp` to verify the request with `oauth2-proxy`.
  - If the user has a valid session, `oauth2-proxy` will verify it and set the `Authorization` header with a Dex JWT.
  - Otherwise, `oauth2-proxy` will redirect the user to the Dex login page.
- `RequestAuthentication/dex-jwt`:
  - Verifies Dex JWTs at the `istio-ingressgateway` Pods.
  - Extracts the user claims into the `kubeflow-userid` and `kubeflow-groups` headers.
- `AuthorizationPolicy/istio-ingressgateway-require-jwt`:
  - If a `RequestAuthentication` populated a `requestPrincipals` metadata, the request is allowed to pass.
- Backends (and their own `AuthorizationPolicies`) trust the `kubeflow-userid` and `kubeflow-groups` headers implicitly.
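As a quick illustration of this flow, the login path can be exercised locally with a simple port-forward (a sketch; the Service name, namespace and ports are assumed to match the common defaults used elsewhere in this PR):

```bash
# Forward the Istio ingress gateway to localhost (assumed Service/namespace).
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80

# Then open http://localhost:8080 in a browser:
# - oauth2-proxy redirects unauthenticated users to the Dex login page,
# - after login, requests carry a Dex JWT and the gateway sets kubeflow-userid.
```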
**Off cluster, Machine Authentication:**

- Machines access the `istio-ingressgateway` pods via something like `kubectl port-forward`.
- Requests must already carry an `Authorization` header (either a Dex JWT or a Kubernetes JWT).
- `AuthorizationPolicy/istio-ingressgateway-oauth2-proxy`:
  - Requests that already have an `Authorization` header are not sent to `oauth2-proxy`.
- `RequestAuthentication/dex-jwt` or `RequestAuthentication/m2m-token-issuer`:
  - Verify the JWT at the `istio-ingressgateway` Pods.
  - Extract the user claims into the `kubeflow-userid` and `kubeflow-groups` headers.
- `AuthorizationPolicy/istio-ingressgateway-require-jwt`:
  - If a `RequestAuthentication` populated a `requestPrincipals` metadata, the request is allowed to pass.
- Backends (and their own `AuthorizationPolicies`) trust the `kubeflow-userid` and `kubeflow-groups` headers implicitly.
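For illustration, a minimal off-cluster sketch based on the token-creation snippet earlier in this thread (the namespace `my-profile`, the local port, and the exact KFP API path are assumptions to adapt):

```bash
# Reach the gateway from outside the cluster (here via a port-forward).
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80 &

# Create a Kubernetes JWT scoped to the ingress gateway audience.
TOKEN=$(kubectl create token default-editor -n my-profile \
  --audience istio-ingressgateway.istio-system.svc.cluster.local)

# Because the request already has an Authorization header, it bypasses
# oauth2-proxy and is validated by RequestAuthentication/m2m-token-issuer.
curl -H "Authorization: Bearer ${TOKEN}" \
  "http://localhost:8080/pipeline/apis/v2beta1/experiments?namespace=my-profile"
```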
**On-cluster, Machine Authentication:**

- Machines access KFP directly at `http://ml-pipeline-ui.kubeflow.svc.cluster.local`.
- Requests carry a Kubernetes ServiceAccount token in the `Authorization` header.
- This traffic bypasses the `istio-ingressgateway` and the related `RequestAuthentication` / `AuthorizationPolicy` resources.
- `AuthorizationPolicy/ml-pipeline-ui` (in the `kubeflow` namespace):
  - If the request has an `Authorization` header (but not a `kubeflow-userid` header), it is allowed to pass directly to the `ml-pipeline-ui` Pods.
  - KFP then verifies the token itself with a `TokenReview` call to the Kubernetes API server.
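A rough in-cluster sketch of this path, run from a Pod in a profile namespace (the token path follows the KFP connection docs linked earlier in this thread, and the API path is illustrative only):

```bash
# Projected ServiceAccount token, as described in the KFP docs; adjust the
# path if your Pod mounts the token somewhere else.
TOKEN=$(cat /var/run/secrets/kubeflow/pipelines/token)

# The request goes straight to the KFP UI Service (never touching the
# ingress gateway); KFP verifies the token via a TokenReview.
curl -H "Authorization: Bearer ${TOKEN}" \
  "http://ml-pipeline-ui.kubeflow.svc.cluster.local/apis/v2beta1/experiments?namespace=my-profile"
```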
### New Kustomize Overlays

To enable the above flows there are three new overlays (only one of which should be applied at a time):

- `./common/oauth2-proxy/overlays/m2m-dex-and-kind`:
  - For clusters whose Kubernetes API serves valid JWKS keys at `/openid/v1/jwks` (e.g. not EKS).
  - This is the overlay used by the `example/kustomization.yaml`.
- `./common/oauth2-proxy/overlays/m2m-dex-only`:
  - Only trusts JWTs issued by Dex.
- `./common/oauth2-proxy/overlays/m2m-dex-and-eks`:
  - For EKS clusters; users must set the `issuer` for their EKS cluster in the `kustomization.yaml`.
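For example, selecting a different overlay when building the manifests yourself could look like this (a sketch following the repo's usual kustomize workflow; pick exactly one overlay):

```bash
# Default, single-command install (the example kustomization already selects
# the m2m-dex-and-kind overlay):
kustomize build example | kubectl apply -f -

# Or apply a specific overlay, e.g. for EKS (remember to set the issuer in
# its kustomization.yaml first):
kustomize build ./common/oauth2-proxy/overlays/m2m-dex-and-eks | kubectl apply -f -
```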
### About the `m2m-dex-and-kind` Overlay

As discussed in #2850, rather than using a `CronJob/kubeflow-m2m-oidc-configurator` to populate the JWKS into the Istio `RequestAuthentication`, we now allow Istio to directly access the `/openid/v1/jwks` endpoint on the Kubernetes API server.

Because this endpoint may not be accessible anonymously on all Kubernetes distributions, we have created a `Deployment/cluster-jwks-proxy` which uses `kubectl proxy` to make the JWKS keys available without authentication at `http://cluster-jwks-proxy.istio-system.svc.cluster.local/openid/v1/jwks`.

This is only necessary for Kubernetes distributions that do not allow anonymous access to the `https://kubernetes.default.svc.cluster.local/openid/v1/jwks` endpoint. However, we include it by default to maximise the number of clusters the default manifests work on.

### Removal of unneeded resources

The following resources have been removed because they were no longer needed:

- `EnvoyFilter/x-forwarded-host`
- `CronJob/kubeflow-m2m-oidc-configurator`:
  - Replaced by `Deployment/cluster-jwks-proxy`, which provides anonymous access to the cluster JWKS (for kind clusters).
- There were also some `params.yaml` files in the Kustomize bases; these have been removed.

### Test Fixes

The following changes to the tests have been made:

- The `test_dex_login.py` script (used in the `/.github/workflows/dex_test.yaml` workflow) has been updated with a new version of the script.

## Other Notes

### Update Notes

Users updating from 1.9.0 to 1.9.1 will need to manually remove the resources that are no longer needed:

- `CronJob/kubeflow-m2m-oidc-configurator` (namespace: `istio-system`):
  - `ServiceAccount/kubeflow-m2m-oidc-configurator`
  - `Role/kubeflow-m2m-oidc-configurator`
  - `RoleBinding/kubeflow-m2m-oidc-configurator`
- `EnvoyFilter/x-forwarded-host`

### Custom Patches

We have made a few custom patches to the `upstream` folder, in `apps/pipeline/upstream/base/installs/multi-user/istio-authorization-config.yaml`. We should try to upstream these changes, so we don't need to maintain them in the future.

### Off-Cluster Access with Kubernetes JWTs

I am not a fan of allowing Kubernetes JWTs to be used from outside the cluster, as this requires exfiltrating the cluster JWTs. However, I understand that some users may want to do this, so the default `m2m-dex-and-kind` overlay allows this.

The `m2m-dex-and-kind` overlay will only work on K8s distributions whose cluster API serves valid JWKS keys at `/openid/v1/jwks` (e.g. not EKS). However, if someone applies the `m2m-dex-and-kind` overlay on an incompatible cluster, everything should still work, except for the ability to use K8s JWTs from outside the cluster.
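As a follow-up to the Update Notes section above, here is a sketch of the manual cleanup when updating from 1.9.0 to 1.9.1 (the resource names come from that list; the EnvoyFilter namespace is an assumption, so verify it with `kubectl get envoyfilter -A` first):

```bash
kubectl delete cronjob kubeflow-m2m-oidc-configurator -n istio-system
kubectl delete serviceaccount kubeflow-m2m-oidc-configurator -n istio-system
kubectl delete role kubeflow-m2m-oidc-configurator -n istio-system
kubectl delete rolebinding kubeflow-m2m-oidc-configurator -n istio-system
kubectl delete envoyfilter x-forwarded-host -n istio-system
```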