[backend] Cannot get MLMD objects from Metadata store when running v2 pipeline #8733
Comments
Hi @fstetic!
Hi @gkcalat! Thanks for the quick response. The run doesn't complete; the error happens at the start of the run. I tried a tutorial pipeline with the v1 YAML spec, and that one behaves as expected. I inspected the MinIO bucket and found out that v1 pipelines make a dir named. I also noticed in the network requests that, when a run is opened in the UI, there is a POST request to. I also raised this issue in Slack, and someone responded that it might be related to a namespace/profile instantiation issue, so I'll look into that next.
Hello @fstetic, I am also having the same problem, testing on 2.0.0-beta.1 for both the API server and UI, with kfp==2.0.0beta14.
Hi @tleewongjaro-agoda. Unfortunately no, I gave up and downgraded to v1 pipelines.
/cc @chensun |
Have you figured it out yet? Or any ideas?
Is the use of v1 pipelines still viable? I have the same problem reported above, but the proxy-agent pod is in CrashLoopBackOff. I searched the pod logs and the result is below. In the UI, I keep coming across this error without being able to use it:

Error: failed to retrieve list of pipelines. Click Details for more information.

```
+++ dirname /opt/proxy/attempt-register-vm-on-proxy.sh
++ cd /opt/proxy
++ pwd
DIR=/opt/proxy
++ jq -r '.data.Hostname // empty'
++ kubectl get configmap inverse-proxy-config -o json
HOSTNAME=
[[ -n '' ]]
[[ ! -z '' ]]
++ curl http://metadata.google.internal/computeMetadata/v1/instance/zone -H 'Metadata-Flavor: Google'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (6) Could not resolve host: metadata.google.internal
INSTANCE_ZONE=/
```
Sorry, I'm still using Pipelines v2, so I can't help you.
Tagging @Linchin for a bit more visibility. This was mentioned a couple of days ago in the 1.8 tracking issue, and one of our customers is also running into exactly this (they are using 2.0-alpha.7):
Could you please confirm this is an issue? Also, do you think this potentially blocks 1.8?
I don't think this would be a blocker, as we have tested pipelines like this in the KFP 2.0 standalone deployment. I do recall seeing similar error messages at times, but they shouldn't fail the pipeline execution.
Facing the same issue: I get this error message and the run doesn't start. I see the issue is closed, but I don't see a solution other than downgrading. Is there any workable solution without downgrading?
Hi, we faced the same problem on Kubeflow 1.8. What is the solution? We could not solve this problem. @venkatesh-chinni did you find something? @chensun can you explain a little bit more?
Still trying to figure it out; no resolution yet.
I found a solution by downgrading to Kubeflow 1.7 with Kubernetes 1.24, using kfp 2.0.1. I hope it helps!
I had this issue and it stopped happening after creating a volume. The runs start working even if I create a volume and then delete it, but it keeps happening if I create a pipeline run on a newly installed Kubeflow 1.8 from the manifests.
Thanks to the investigation done by @orfeas-k in canonical/bundle-kubeflow#966, it seems like this issue might be some kind of irrecoverable race condition in the metadata-envoy-deployment.

As a temporary workaround, it seems like you can simply restart that deployment and it should fix it:

```
kubectl rollout restart deployment/metadata-envoy-deployment --namespace kubeflow
```

As a longer-term solution (assuming a restart is all that is required), we can add a livenessProbe so the container is restarted automatically:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: metadata-envoy-deployment
spec:
  template:
    spec:
      containers:
        - name: container
          livenessProbe:
            failureThreshold: 3
            initialDelaySeconds: 5
            periodSeconds: 15
            successThreshold: 1
            timeoutSeconds: 5
            httpGet:
              path: "/"
              port: md-envoy
              httpHeaders:
                - name: Content-Type
                  value: application/grpc-web-text
```
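For anyone managing manifests with kustomize, the probe could be applied as an overlay patch. A minimal sketch, assuming the probe above is saved as metadata-envoy-livenessprobe-patch.yaml next to the kustomization (the file name and the pinned ref are illustrative, not part of the original workaround):

```yaml
# kustomization.yaml: pull the upstream KFP manifests and overlay the
# livenessProbe patch onto the metadata-envoy deployment.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - github.com/kubeflow/pipelines//manifests/kustomize/env/platform-agnostic?ref=2.0.0
patches:
  - path: metadata-envoy-livenessprobe-patch.yaml
    target:
      kind: Deployment
      name: metadata-envoy-deployment
```

Applying this with kubectl apply -k . then lets the kubelet restart the envoy container automatically whenever the probe fails, instead of requiring a manual rollout restart.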
As commented in deployKF/deployKF#191 (comment), note that the above livenessProbe has been tested with Pipelines 2.0; we ran into issues sending the same request when we upgraded to 2.2.0 (as described in canonical/envoy-operator#106).
@thesuperzapper Thanks for your information. I put this in the "metadata-envoy-deployment" YAML and ran a simple example pipeline, but it still shows the error.
In my case the issue was related to a hard-coded image name [gcr.io/ml-pipeline/kfp-driver@sha256:8e60086b04d92b657898a310ca9757631d58547e76bbbb8bfc376d654bef1707] somewhere in the Kubeflow code that is not mentioned anywhere in the Kubeflow manifests, which kept the pod in Init:ImagePullBackOff state. Loading this image with the same name on all worker nodes solved the issue. I am using Kubeflow 1.8 on an on-prem k8s cluster.
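If the nodes cannot pull from gcr.io at all, another option that may help is pointing the backend at a mirrored driver image instead of loading it onto every node. A sketch, assuming your KFP backend version supports the V2_DRIVER_IMAGE environment variable override and that registry.example.com hosts a mirror (both the variable's availability and the mirror path are assumptions to verify for your version):

```yaml
# Hypothetical strategic-merge patch for the ml-pipeline API server:
# override the driver image that the backend injects into pipeline runs.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-pipeline
  namespace: kubeflow
spec:
  template:
    spec:
      containers:
        - name: ml-pipeline-api-server
          env:
            - name: V2_DRIVER_IMAGE  # assumed env var; check your backend version
              value: registry.example.com/ml-pipeline/kfp-driver:2.0.5
```

This would avoid having to re-load the image on every node each time the cluster scales out.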
/reopen since the fix for metadata-envoy-deployment does not seem to be upstream yet
@juliusvonkohout: Reopened this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Hi Team,
I got a pipeline to finish, but there are still errors with ml-metadata.
CC @kubeflow/release-team since it has 35 upvotes.
Cross-posting in #11086 (comment) since it is related.
@hbelmiro to review this.
@juliusvonkohout it's unlikely to be resolved for KFP 2.4. It was added to the 2.5 milestone.
I am not good at English, so I used a translator; please excuse any unnatural sentences. I encountered the same error when running a pipeline through the Kubeflow UI. After analyzing the workflow-controller logs, I identified that the error was caused by missing ConfigMap and ServiceAccount configurations. To resolve this, I:
After implementing these fixes, both the container driver and DAG driver pods were created successfully. The pipeline pod was also created (though it had a separate component-code error unrelated to this issue). If anyone needs the detailed configuration steps, feel free to ask. Hope this helps others facing similar issues!
Environment
Do you mean that the artifact-repositories ConfigMap is really mandatory and missing from https://github.com/kubeflow/pipelines/blob/master/manifests/kustomize/base/installs/multi-user/pipelines-profile-controller/sync.py? Regarding "Set up the default-editor serviceaccount with appropriate roles and permissions": which roles and permissions are missing? I can also run the pipeline, but I get errors in the UI. Do they not appear for you? Please use 1.9.1 instead of 1.9 for superior authentication: https://github.com/kubeflow/manifests/releases/tag/v1.9.1
I checked the logs of the workflow pod in the kubeflow namespace and found an error about a missing "artifact-repositories" ConfigMap.
After applying both the ConfigMap and these permissions, the pipeline worked correctly with no UI errors.
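For reference, a rough sketch of what those two pieces can look like on a stock MinIO-backed multi-user install. The namespace, Role name, and rule list below are assumptions to adapt to your cluster, not the exact manifests from this comment:

```yaml
# Hypothetical artifact-repositories ConfigMap in the profile namespace;
# Argo's workflow-controller resolves the default artifact repository from it.
apiVersion: v1
kind: ConfigMap
metadata:
  name: artifact-repositories
  namespace: my-profile   # your profile namespace (assumption)
  annotations:
    workflows.argoproj.io/default-artifact-repository: default-v1
data:
  default-v1: |
    archiveLogs: true
    s3:
      endpoint: minio-service.kubeflow:9000
      bucket: mlpipeline
      insecure: true
      accessKeySecret:
        name: mlpipeline-minio-artifact
        key: accesskey
      secretKeySecret:
        name: mlpipeline-minio-artifact
        key: secretkey
---
# Hypothetical Role/RoleBinding granting default-editor the verbs the
# driver and executor pods typically need inside the profile namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: default-editor-workflows   # name is an assumption
  namespace: my-profile
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log", "configmaps", "secrets"]
    verbs: ["get", "list", "watch", "create", "patch"]
  - apiGroups: ["argoproj.io"]
    resources: ["workflows", "workflowtaskresults"]
    verbs: ["get", "list", "watch", "create", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: default-editor-workflows
  namespace: my-profile
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: default-editor-workflows
subjects:
  - kind: ServiceAccount
    name: default-editor
    namespace: my-profile
```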
Can someone post the manifest/version and install steps they are using to produce this error? The link in the OP unfortunately gives a 404 now (this guide). I'm trying to reproduce it. I saw it show up temporarily on my first pipeline run on KFP 2.0.0 (SDK version 2.0.1, and k8s 1.25.16 on kind), but then it went away and the pipeline eventually completed successfully. All other pipeline runs I've done have succeeded. I used the platform-agnostic manifest:
Using KF 1.9.1 I was able to reproduce this on one deployment, but not on others. That is, using the same install from manifests I got the error, but when I redeployed I couldn't get the error again (using the same pipeline YAML). This tells me it's either an issue with the way KF is sometimes deployed, or some special state KF was in. This may be related to issue #11403, which was fixed recently. If no one sees this with 1.9.2+, that fix may have addressed this issue as well.
It seems to be fixed now. It can happen with very large Azure OIDC info that you exceed the limit of the gRPC server, but most things reported here seem to be from unclean installations or are probably fixed in KFP 2.4.0 or the master branch. If you upgrade to KF 1.9.1, clean up your istio-system namespace. Also, some people here had problems with the launcher and driver images, which had no proper versioning until 2.4.0 (kubeflow/manifests#2953). So please create separate issues focused on a single problem. The master branch is probably also affected by kubeflow/manifests#2970, and we hope to resolve that soon.
/close
@juliusvonkohout: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Environment
Local Canonical Kubeflow using this guide
Bottom of the KFP UI left sidenav says build version dev_local, and the guide states 1.6.
kfp 2.0.0b10
kfp-pipeline-spec 0.1.17
kfp-server-api 2.0.0a6
Steps to reproduce
Install Kubeflow using the aforementioned guide. Copy the addition pipeline and compile it, then either run it after uploading through the UI or run it from code. Neither works.
Expected result
Pipeline shouldn't fail.
Materials and Reference
In the details it says:

```
Cannot find context with {"typeName":"system.PipelineRun","contextName":"a5e7085e-ef10-48b2-a0a5-1ced3b93e2e5"}: Unknown Content-type received.
```
Addition pipeline from documentation
Impacted by this bug? Give it a 👍.