
[backend] Cannot get MLMD objects from Metadata store when running v2 pipeline #8733

Closed
fstetic opened this issue Jan 19, 2023 · 39 comments

@fstetic

fstetic commented Jan 19, 2023

Environment

  • How did you deploy Kubeflow Pipelines (KFP)?
    Local Canonical Kubeflow using this guide
  • KFP version:
    The bottom of the KFP UI left sidenav says build version dev_local, and the guide states 1.6
  • KFP SDK version:
    kfp 2.0.0b10
    kfp-pipeline-spec 0.1.17
    kfp-server-api 2.0.0a6

Steps to reproduce

Install Kubeflow using the aforementioned guide. Copy the addition pipeline below, compile it, and either run it after uploading through the UI or run it from code. Neither works.

Expected result

The pipeline should complete successfully.

Materials and Reference

In the error details it says Cannot find context with {"typeName":"system.PipelineRun","contextName":"a5e7085e-ef10-48b2-a0a5-1ced3b93e2e5"}: Unknown Content-type received.

Addition pipeline from the documentation:

from kfp import compiler
from kfp import dsl


@dsl.component
def addition_component(num1: int, num2: int) -> int:
    return num1 + num2


@dsl.pipeline(name="addition-pipeline")
def my_pipeline(a: int, b: int, c: int):
    add_task_1 = addition_component(num1=a, num2=b)
    add_task_2 = addition_component(num1=add_task_1.output, num2=c)


cmplr = compiler.Compiler()
cmplr.compile(my_pipeline, package_path="my_pipeline.yaml")
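
For reference, one way to trigger the "run it from code" path is via the KFP SDK client. This is a minimal sketch, assuming a recent kfp 2.x SDK; the host URL and the argument values are placeholders, not part of the original report:

from kfp import Client

# Hypothetical endpoint; point this at your ml-pipeline API server or ingress.
client = Client(host="http://localhost:8080")

# Submit the compiled package with example argument values.
run = client.create_run_from_pipeline_package(
    "my_pipeline.yaml",
    arguments={"a": 1, "b": 2, "c": 3},
)
print(run.run_id)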

Impacted by this bug? Give it a 👍.

@gkcalat
Member

gkcalat commented Jan 19, 2023

Hi @fstetic!
Thank you for reporting this. Could you confirm whether the problem is persistent or if it goes away after the run completes?

@gkcalat gkcalat self-assigned this Jan 19, 2023
@fstetic
Author

fstetic commented Jan 20, 2023

Hi @gkcalat! Thanks for the quick response.

The run doesn't complete. That error happens at the start of the run.

I tried a tutorial pipeline with the v1 YAML spec and that one behaves as expected. I inspected the MinIO bucket and found that v1 pipelines create a directory named <workflow name> in mlpipelines/artifacts, but v2 pipelines don't. The "contextName" in the error message corresponds to the RunID of the pipeline, not the workflow name.

I also noticed in the network requests, when a run is opened in the UI, a POST request to /ml_metadata.MetadataStoreService/GetContextByTypeAndName where v1 and v2 pipelines differ: v1 pipelines send pipeline_run in the request body, while v2 pipelines send system.PipelineRun. I don't know if that means anything, because in both cases the request fails with a 400 error and the message Cannot POST /ml_metadata.MetadataStoreService/GetContextByTypeAndName

I also raised this issue in Slack and a person responded that it might be related to a namespace/profile instantiation issue so I'll look into that next.

@tleewongjaro-agoda

Hello @fstetic

I am also having the same problem.
Have you figured out what is wrong?

Testing on 2.0.0-beta.1 for both API Server and UI, and kfp==2.0.0beta14

@fstetic
Author

fstetic commented Apr 20, 2023

Hi @tleewongjaro-agoda. Unfortunately no, I gave up and downgraded to v1 pipelines.

@gkcalat
Member

gkcalat commented Apr 20, 2023

/cc @chensun

@gkcalat gkcalat assigned chensun and jlyaoyuli and unassigned gkcalat May 4, 2023
@Enochlove

Hello @fstetic

I am also having the same problem. Have you figured out what is wrong?

Testing on 2.0.0-beta.1 for both API Server and UI, and kfp==2.0.0beta14

Have you figured it out by now? Or any ideas?


@LordWaif

LordWaif commented Sep 5, 2023

Is the use of v1 pipelines still viable? I have the same problem reported above.

But the proxy-agent pod is in CrashLoopBackOff. I checked the pod logs and the result is below.

In the UI, I keep coming across this error without being able to use it:
Error: failed to retrieve list of pipelines. Click Details for more information.

+++ dirname /opt/proxy/attempt-register-vm-on-proxy.sh
++ cd /opt/proxy
++ pwd

+ DIR=/opt/proxy
++ jq -r '.data.Hostname // empty'
++ kubectl get configmap inverse-proxy-config -o json
+ HOSTNAME=
+ [[ -n '' ]]
+ [[ ! -z '' ]]
++ curl http://metadata.google.internal/computeMetadata/v1/instance/zone -H 'Metadata-Flavor: Google'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (6) Could not resolve host: metadata.google.internal
+ INSTANCE_ZONE=/

@Enochlove

Enochlove commented Sep 6, 2023 via email

@DnPlas

DnPlas commented Oct 5, 2023

Tagging @Linchin for a bit more visibility. This was mentioned a couple of days ago in the 1.8 tracking issue, and one of our customers is also running exactly into this (they are using 2.0-alpha.7):

"Cannot get MLMD objects from Metadata store." and when clicking the "details" button on the error I get this:
Cannot find context with {"typeName":"system.PipelineRun","contextName":"496bc83e-d8be-491b-988f-5ff3b98736c5"}: Unknown Content-type received.

Could you please confirm this is an issue? Also, do you think this is potentially blocking 1.8?

@chensun
Member

chensun commented Oct 5, 2023

Tagging @Linchin for a bit more visibility. This was mentioned a couple of days ago in the 1.8 tracking issue, and one of our customers is also running exactly into this (they are using 2.0-alpha.7):

"Cannot get MLMD objects from Metadata store." and when clicking the "details" button on the error I get this:
Cannot find context with {"typeName":"system.PipelineRun","contextName":"496bc83e-d8be-491b-988f-5ff3b98736c5"}: Unknown Content-type received.

Could you please confirm this is an issue? Also, do you think this is potentially blocking 1.8?

I don't think this would be a blocker, as we have tested pipelines like this in the KFP 2.0 standalone deployment. While I do recall seeing similar error messages at times, they shouldn't fail the pipeline execution.
That being said, I will test this again with the Kubeflow 1.8 RC shortly.

@chensun
Member

chensun commented Oct 9, 2023

Confirming this doesn't reproduce on Kubeflow 1.8.0-rc.1
[screenshot: run completing successfully on Kubeflow 1.8.0-rc.1]

The error message about not being able to get the MLMD context is sometimes shown in the UI. This is expected before a run starts (we should consider a UI improvement so it is less confusing), but it should be gone once the run starts (the root driver pod will create the MLMD context).
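
For anyone double-checking this on their own cluster, a quick way to see whether the root driver pod has come up is to look for driver pods in the namespace the run executes in. This is only a sketch; the kubeflow namespace and the driver pod name are assumptions, so adjust them to your deployment (use your profile namespace in multi-user mode):

kubectl get pods -n kubeflow | grep -i driver
kubectl logs -n kubeflow <driver-pod-name>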

@venkatesh-chinni

Facing the same issue. I get this error message and the run doesn't start. I see the issue is closed, but I don't see a solution other than downgrading. Is there any workable solution without downgrading?

@ZeynepRuveyda

ZeynepRuveyda commented May 16, 2024

Hi, we faced the same problem on Kubeflow 1.8. What is the solution? We could not solve this problem. @venkatesh-chinni did you find something? @chensun can you explain a little bit more?

@venkatesh-chinni

venkatesh-chinni commented May 17, 2024

Hi, we faced the same problem on Kubeflow 1.8. What is the solution? We could not solve this problem. @venkatesh-chinni did you find something? @chensun can you explain a little bit more?

Still trying to figure out, no resolution yet

@ZeynepRuveyda

Hi, we faced the same problem on Kubeflow 1.8. What is the solution? We could not solve this problem. @venkatesh-chinni did you find something? @chensun can you explain a little bit more?

Still trying to figure out, no resolution yet

I found a solution by downgrading to Kubeflow 1.7 with Kubernetes 1.24, and using kfp version 2.0.1.

I hope it helps you!

@photonbit

I had this issue and it stopped happening after creating a volume. The runs start working even if I create a volume and then delete it, but it keeps happening if I create a pipeline run on a newly installed Kubeflow 1.8 from the manifests.

@thesuperzapper
Member

Thanks to the investigation done by @orfeas-k in canonical/bundle-kubeflow#966, it seems like this issue might be some kind of irrecoverable race condition in the Deployment/metadata-envoy-deployment Pods.

As a temporary workaround, it seems like you can simply restart that deployment and it should fix it:

kubectl rollout restart deployment/metadata-envoy-deployment --namespace kubeflow
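
To confirm the restart has completed (assuming the default kubeflow namespace), you can wait on the rollout:

kubectl rollout status deployment/metadata-envoy-deployment --namespace kubeflow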

As a longer-term solution (assuming a restart is all that is required), we can add a livenessProbe on the PodSpec of the manifests. For example, this Kustomize patch may work:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: metadata-envoy-deployment
spec:
  template:
    spec:
      containers:
        - name: container
          livenessProbe:
            failureThreshold: 3
            initialDelaySeconds: 5
            periodSeconds: 15
            successThreshold: 1
            timeoutSeconds: 5
            httpGet:
              path: "/"
              port: md-envoy
              httpHeaders:
                - name: Content-Type
                  value: application/grpc-web-text
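
If you manage the manifests with Kustomize, one way to wire the patch in is a small overlay. This is only a sketch; the base resource and the patch file name (metadata-envoy-livenessprobe-patch.yaml) are assumptions to be adapted to your setup:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # Assumption: whatever base you already deploy KFP from.
  - github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic?ref=2.0.0
patches:
  - path: metadata-envoy-livenessprobe-patch.yaml
    target:
      kind: Deployment
      name: metadata-envoy-deployment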

@orfeas-k

orfeas-k commented Aug 7, 2024

As commented in deployKF/deployKF#191 (comment), note that the above livenessProbe has been tested for pipelines 2.0, and we ran into issues sending the same request when we upgraded to 2.2.0 (as described in canonical/envoy-operator#106).

@ZDowney926

ZDowney926 commented Aug 9, 2024

@thesuperzapper Thanks for the information. I put this in the "metadata-envoy-deployment" YAML and ran a simple example pipeline, but it still shows "Cannot get MLMD objects from Metadata store. Cannot find context with {"typeName":"system.PipelineRun","contextName":"2c9d40f1-1c09-4de3-a2e0-725eb4a4f0fb"}: Cannot find specified context".
P.S. I tried on both Kubeflow v1.8.0 and v1.10.0; both show the same message.

@vishnujp12

In my case the issue was related to a hard-coded image name [gcr.io/ml-pipeline/kfp-driver@sha256:8e60086b04d92b657898a310ca9757631d58547e76bbbb8bfc376d654bef1707] somewhere in the Kubeflow code that is not mentioned anywhere in the Kubeflow manifests, which kept the pod in the Init:ImagePullBackOff state. Loading this image with the same name on all worker nodes solved the issue. I am using Kubeflow version 1.8 on an on-prem Kubernetes cluster.
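
If you suspect the same image-pull problem, a quick way to confirm it is to find the stuck pod and look at its pull events (the pod name and namespace below are placeholders; adjust them to where your driver pods run):

kubectl get pods -A | grep ImagePullBackOff
kubectl describe pod <stuck-pod-name> -n <namespace> | grep -i -A3 pull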

@juliusvonkohout
Member

juliusvonkohout commented Dec 10, 2024

/reopen

since the fix for metadata-envoy-deployment does not seem to be upstream yet


@juliusvonkohout: Reopened this issue.

In response to this:

/reopen

since the fix for metadata-envoy-deployment does not seem to be upstream yet

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@google-oss-prow google-oss-prow bot reopened this Dec 10, 2024
@github-project-automation github-project-automation bot moved this from Closed to Needs triage in KFP Runtime Triage Dec 10, 2024
@shivanibhargove

Hi Team,
Is there any update or any workaround for this? We are blocked on KFP v2.

@juliusvonkohout
Member

I got a pipeline to finish, but there are still errors with the ml-metadata image.
Also, there are no clear errors in the ml-metadata deployments. I think I have to examine the database directly. Maybe I will know more in a few weeks @rimolive

@juliusvonkohout
Member

juliusvonkohout commented Jan 6, 2025

CC @kubeflow/release-team since it has 35 upvotes

@juliusvonkohout
Member

cross posting in #11086 (comment) since it is related

@varodrig
Contributor

varodrig commented Jan 6, 2025

@hbelmiro to review this.

@HumairAK HumairAK added this to the KFP 2.5.0 milestone Jan 6, 2025
@hbelmiro
Contributor

hbelmiro commented Jan 6, 2025

@juliusvonkohout it's unlikely to be resolved for KFP 2.4. It was added to the 2.5 milestone.

@honeydanji

I am not good at English. So I used a translator. So the sentence is unnatural. Please understand.

I encountered the same error when running a pipeline through the Kubeflow UI. After analyzing the workflow-controller logs, I identified that the error was caused by missing configmap and serviceaccount configurations.

To resolve this, I:

  1. Created the required artifact-repositories configmap
  2. Set up the default-editor serviceaccount with appropriate roles and permissions

After implementing these fixes, both the container driver and DAG driver pods were created successfully. The pipeline pod was also created (though it had a separate component code error unrelated to this issue).

If anyone needs the detailed configuration steps, feel free to ask. Hope this helps others facing similar issues!

Environment

  • kubeflow manifests v1.9
  • kfp version == 2.2.0

@juliusvonkohout
Member

I am not good at English. So I used a translator. So the sentence is unnatural. Please understand.

I encountered the same error when running a pipeline through the Kubeflow UI. After analyzing the workflow-controller logs, I identified that the error was caused by missing configmap and serviceaccount configurations.

To resolve this, I:

1. Created the required artifact-repositories configmap

2. Set up the default-editor serviceaccount with appropriate roles and permissions

After implementing these fixes, both the container driver and DAG driver pods were created successfully. The pipeline pod was also created (though it had a separate component code error unrelated to this issue).

If anyone needs the detailed configuration steps, feel free to ask. Hope this helps others facing similar issues!

Environment

* kubeflow manifests v1.9

* kfp version == 2.2.0

Do you mean that the artifact-repositories configmap is really mandatory and missing from https://github.com/kubeflow/pipelines/blob/master/manifests/kustomize/base/installs/multi-user/pipelines-profile-controller/sync.py ?

"Set up the default-editor serviceaccount with appropriate roles and permissions" Which roles and permissions are missing?

I can also run the Pipeline, but I get errors in the UI. Do they not appear for you?

Please use 1.9.1 instead of 1.9 for superior authentication. https://github.com/kubeflow/manifests/releases/tag/v1.9.1

@honeydanji

I am not good at English. So I used a translator. So the sentence is unnatural. Please understand.
I encountered the same error when running a pipeline through the Kubeflow UI. After analyzing the workflow-controller logs, I identified that the error was caused by missing configmap and serviceaccount configurations.
To resolve this, I:

1. Created the required artifact-repositories configmap

2. Set up the default-editor serviceaccount with appropriate roles and permissions

After implementing these fixes, both the container driver and DAG driver pods were created successfully. The pipeline pod was also created (though it had a separate component code error unrelated to this issue).
If anyone needs the detailed configuration steps, feel free to ask. Hope this helps others facing similar issues!
Environment

* kubeflow manifests v1.9

* kfp version == 2.2.0

Do you mean that the artifact-repositories configmap is really mandatory and missing from https://github.com/kubeflow/pipelines/blob/master/manifests/kustomize/base/installs/multi-user/pipelines-profile-controller/sync.py ?

"Set up the default-editor serviceaccount with appropriate roles and permissions" Which roles and permissions are missing?

I can also run the Pipeline, but I get errors in the UI. Do they not appear for you?

Please use 1.9.1 instead of 1.9 for superior authentication. https://github.com/kubeflow/manifests/releases/tag/v1.9.1

I checked the logs of the "workflow" pod in the kubeflow namespace and found an error about a missing "artifact-repositories" ConfigMap.
This ConfigMap is essential because it is required for storing artifacts generated by the pipeline.
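
For reference, here is a minimal sketch of what such an artifact-repositories ConfigMap can look like, modelled on the default MinIO setup that ships with KFP; the namespace, endpoint, bucket, and secret names are assumptions and must match your own installation:

apiVersion: v1
kind: ConfigMap
metadata:
  name: artifact-repositories
  namespace: <profile-namespace>   # assumption: the namespace the run executes in
  annotations:
    workflows.argoproj.io/default-artifact-repository: default-v1
data:
  default-v1: |
    archiveLogs: true
    s3:
      endpoint: minio-service.kubeflow:9000
      bucket: mlpipeline
      insecure: true
      accessKeySecret:
        name: mlpipeline-minio-artifact
        key: accesskey
      secretKeySecret:
        name: mlpipeline-minio-artifact
        key: secretkey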
After creating the ConfigMap, I encountered a service account permission error.
To resolve this, I added the following minimal permissions for the default-editor account:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: default-editor-role
  namespace: kubeflow
rules:
- apiGroups: [""]
  resources: ["pods", "configmaps", "secrets"]
  verbs: ["create", "get", "list", "watch", "patch", "delete"]
- apiGroups: ["argoproj.io"]
  resources: ["workflowtaskresults"]
  verbs: ["create", "get", "list", "watch"]

After applying both the ConfigMap and these permissions, the pipeline worked correctly with no UI errors.
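
Note that a Role on its own grants nothing until it is bound to the service account. Here is a minimal RoleBinding sketch to go with the Role above (the binding name is an assumption; the Role and ServiceAccount names mirror the snippet):

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: default-editor-role-binding
  namespace: kubeflow
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: default-editor-role
subjects:
  - kind: ServiceAccount
    name: default-editor
    namespace: kubeflow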

@dandawg
Contributor

dandawg commented Jan 21, 2025

Can someone post the manifest/version and install steps they are using to produce this error? The link in the OP unfortunately gives a 404 now (this guide). I'm trying to reproduce it.

I saw it show up temporarily on my first pipeline run on kfp 2.0.0 (sdk version 2.0.1, and k8s 1.25.16 on kind), but then it went away and the pipeline eventually completed successfully. All other pipeline runs I've done have succeeded.

I used the platform-agnostic manifest:
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=2.0.0"
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic?ref=2.0.0"

@HumairAK HumairAK moved this to In Progress in KFP 2.x Release Jan 29, 2025
@dandawg
Contributor

dandawg commented Jan 30, 2025

Using KF 1.9.1 I was able to reproduce this on one deployment, but not on others. That is, using the same install from manifest I got the error, but then when I redeployed I couldn't get the error again (using the same pipeline YAML). This tells me it's either an issue with the way KF is deployed sometimes, or some special state that KF was in.

This may be related to issue #11403, which was fixed recently. If no one sees this with 1.9.2+, that fix may have addressed this issue as well.

@juliusvonkohout
Member

juliusvonkohout commented Jan 31, 2025

It seems to be fixed now. With very large Azure OIDC info you can exceed the limit of the gRPC server, but most things reported here seem to come from unclean installations or are probably fixed in KFP 2.4.0 or the master branch. If you upgrade to KF 1.9.1, clean up your istio-system namespace. Also, some here had problems with the launcher and driver images, which had no proper versioning until 2.4.0 (kubeflow/manifests#2953). So please create separate issues focused on a single problem. The master branch is probably also affected by kubeflow/manifests#2970, and we hope to resolve it soon.

/close


@juliusvonkohout: Closing this issue.

In response to this:

It seems to be fixed now. It can happen with very large azure oidc info that you exceed the limit of the grpc server. But most things reported here seem to be from unclean installations or are probably fixed in KFP 2.4.0. If you upgrade from KF 1.9.1 cleanup your istio-system namespace. Also some here had problems with the launcher and driver image which had no proper versioning until 2.4.0 kubeflow/manifests#2953. So please create separate issues focused on a single problem.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
