
[backend] Cannot get MLMD objects from Metadata store when running v2 pipeline #8733

Closed
fstetic opened this issue Jan 19, 2023 · 39 comments

@fstetic

fstetic commented Jan 19, 2023

Environment

  • How did you deploy Kubeflow Pipelines (KFP)?
    Local Canonical Kubeflow using this guide
  • KFP version:
    The bottom of the KFP UI left sidenav says build version dev_local, and the guide states 1.6
  • KFP SDK version:
    kfp 2.0.0b10
    kfp-pipeline-spec 0.1.17
    kfp-server-api 2.0.0a6

Steps to reproduce

Install Kubeflow using the aforementioned guide. Copy the addition pipeline below, compile it, and either run it after uploading through the UI or run it from code. Neither works.

Expected result

The pipeline should complete successfully.

Materials and Reference

In the error details it says Cannot find context with {"typeName":"system.PipelineRun","contextName":"a5e7085e-ef10-48b2-a0a5-1ced3b93e2e5"}: Unknown Content-type received.

Addition pipeline from the documentation:

from kfp import compiler
from kfp import dsl


@dsl.component
def addition_component(num1: int, num2: int) -> int:
    return num1 + num2


@dsl.pipeline(name="addition-pipeline")
def my_pipeline(a: int, b: int, c: int):
    add_task_1 = addition_component(num1=a, num2=b)
    add_task_2 = addition_component(num1=add_task_1.output, num2=c)


cmplr = compiler.Compiler()
cmplr.compile(my_pipeline, package_path="my_pipeline.yaml")
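
For reference, one way to trigger the "run it from code" path is via the KFP SDK client. This is a minimal sketch, assuming a recent kfp 2.x SDK; the host URL and the argument values are placeholders, not part of the original report:

from kfp import Client

# Hypothetical endpoint; point this at your ml-pipeline API server or ingress.
client = Client(host="http://localhost:8080")

# Submit the compiled package with example argument values.
run = client.create_run_from_pipeline_package(
    "my_pipeline.yaml",
    arguments={"a": 1, "b": 2, "c": 3},
)
print(run.run_id)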

Impacted by this bug? Give it a 👍.

@gkcalat
Member

gkcalat commented Jan 19, 2023

Hi @fstetic!
Thank you for reporting this. Could you confirm whether the problem is persistent or if it goes away after the run completes?

@gkcalat gkcalat self-assigned this Jan 19, 2023
@fstetic
Author

fstetic commented Jan 20, 2023

Hi @gkcalat! Thanks for the quick response.

The run doesn't complete. That error happens at the start of the run.

I tried a tutorial pipeline with the v1 YAML spec and that one behaves as expected. I inspected the MinIO bucket and found that v1 pipelines create a directory named <workflow name> in mlpipelines/artifacts, but v2 pipelines don't. The "contextName" in the error message corresponds to the RunID of the pipeline, not the workflow name.

I also noticed in the network requests, when a run is opened in the UI, a POST request to /ml_metadata.MetadataStoreService/GetContextByTypeAndName where v1 and v2 pipelines differ: v1 pipelines send pipeline_run in the request body, while v2 pipelines send system.PipelineRun. I don't know if that means anything, because in both cases the request fails with a 400 error and the message Cannot POST /ml_metadata.MetadataStoreService/GetContextByTypeAndName

I also raised this issue in Slack and a person responded that it might be related to a namespace/profile instantiation issue so I'll look into that next.

@tleewongjaro-agoda

Hello @fstetic

I am also having the same problem.
Have you figured out what is wrong?

Testing on 2.0.0-beta.1 for both API Server and UI, and kfp==2.0.0beta14

@fstetic
Author

fstetic commented Apr 20, 2023

Hi @tleewongjaro-agoda. Unfortunately no, I gave up and downgraded to v1 pipelines.

@gkcalat
Member

gkcalat commented Apr 20, 2023

/cc @chensun

@gkcalat gkcalat assigned chensun and jlyaoyuli and unassigned gkcalat May 4, 2023
@Enochlove

Hello @fstetic

I am also having the same problem. Have you figured out what is wrong?

Testing on 2.0.0-beta.1 for both API Server and UI, and kfp==2.0.0beta14

Have you figured it out by now? Or any ideas?


@LordWaif

LordWaif commented Sep 5, 2023

Is the use of v1 pipelines still viable? I have the same problem reported above.

But the proxy-agent pod is in CrashLoopBackOff. I checked the pod logs and the result is below.

In the UI, I keep coming across this error without being able to use it:
Error: failed to retrieve list of pipelines. Click Details for more information.

+++ dirname /opt/proxy/attempt-register-vm-on-proxy.sh
++ cd /opt/proxy
++ pwd

+ DIR=/opt/proxy
++ jq -r '.data.Hostname // empty'
++ kubectl get configmap inverse-proxy-config -o json
+ HOSTNAME=
+ [[ -n '' ]]
+ [[ ! -z '' ]]
++ curl http://metadata.google.internal/computeMetadata/v1/instance/zone -H 'Metadata-Flavor: Google'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (6) Could not resolve host: metadata.google.internal
+ INSTANCE_ZONE=/

@Enochlove

Enochlove commented Sep 6, 2023 via email

@DnPlas

DnPlas commented Oct 5, 2023

Tagging @Linchin for a bit more visibility. This was mentioned a couple of days ago in the 1.8 tracking issue, and one of our customers is also running exactly into this (they are using 2.0-alpha.7):

"Cannot get MLMD objects from Metadata store." and when clicking the "details" button on the error I get this:
Cannot find context with {"typeName":"system.PipelineRun","contextName":"496bc83e-d8be-491b-988f-5ff3b98736c5"}: Unknown Content-type received.

Could you please confirm this is an issue? Also, do you think this is potentially blocking 1.8?

@chensun
Member

chensun commented Oct 5, 2023

Tagging @Linchin for a bit more visibility. This was mentioned a couple of days ago in the 1.8 tracking issue, and one of our customers is also running exactly into this (they are using 2.0-alpha.7):

"Cannot get MLMD objects from Metadata store." and when clicking the "details" button on the error I get this:
Cannot find context with {"typeName":"system.PipelineRun","contextName":"496bc83e-d8be-491b-988f-5ff3b98736c5"}: Unknown Content-type received.

Could you please confirm this is an issue? Also, do you think this is potentially blocking 1.8?

I don't think this would be a blocker, as we have tested pipelines like this in the KFP 2.0 standalone deployment. While I do recall seeing similar error messages at times, they shouldn't fail the pipeline execution.
That being said, I will test this again with the Kubeflow 1.8 RC shortly.

@chensun
Member

chensun commented Oct 9, 2023

Confirming this doesn't reproduce on Kubeflow 1.8.0-rc.1
[screenshot: run completing successfully on Kubeflow 1.8.0-rc.1]

The error message about not being able to get the MLMD context is sometimes shown in the UI. This is expected before a run starts (we should consider a UI improvement so it is less confusing), but it should be gone once the run starts (the root driver pod will create the MLMD context).
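
For anyone double-checking this on their own cluster, a quick way to see whether the root driver pod has come up is to look for driver pods in the namespace the run executes in. This is only a sketch; the kubeflow namespace and the driver pod name are assumptions, so adjust them to your deployment (use your profile namespace in multi-user mode):

kubectl get pods -n kubeflow | grep -i driver
kubectl logs -n kubeflow <driver-pod-name>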

@venkatesh-chinni

Facing the same issue. I get this error message and the run doesn't start. I see the issue is closed, but I don't see a solution other than downgrading. Is there any workable solution without downgrading?

@ZeynepRuveyda

ZeynepRuveyda commented May 16, 2024

Hi, we faced the same problem on Kubeflow 1.8. What is the solution? We could not solve this problem. @venkatesh-chinni did you find something? @chensun can you explain a little bit more?

@venkatesh-chinni

venkatesh-chinni commented May 17, 2024

Hi, we faced the same problem on Kubeflow 1.8. What is the solution? We could not solve this problem. @venkatesh-chinni did you find something? @chensun can you explain a little bit more?

Still trying to figure out, no resolution yet

@ZeynepRuveyda

Hi, we faced the same problem on Kubeflow 1.8. What is the solution? We could not solve this problem. @venkatesh-chinni did you find something? @chensun can you explain a little bit more?

Still trying to figure out, no resolution yet

I found a solution by downgrading to Kubeflow 1.7 with Kubernetes 1.24, and using kfp version 2.0.1.

I hope it helps you!

@photonbit

I had this issue and it stopped happening after creating a volume. The runs start working even if I create a volume and then delete it, but it keeps happening if I create a pipeline run on a newly installed Kubeflow 1.8 from the manifests.

@thesuperzapper
Member

Thanks to the investigation done by @orfeas-k in canonical/bundle-kubeflow#966, it seems like this issue might be some kind of irrecoverable race condition in the Deployment/metadata-envoy-deployment Pods.

As a temporary workaround, it seems like you can simply restart that deployment and it should fix it:

kubectl rollout restart deployment/metadata-envoy-deployment --namespace kubeflow
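
To confirm the restart has completed (assuming the default kubeflow namespace), you can wait on the rollout:

kubectl rollout status deployment/metadata-envoy-deployment --namespace kubeflow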

As a longer-term solution (assuming a restart is all that is required), we can add a livenessProbe on the PodSpec of the manifests. For example, this Kustomize patch may work:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: metadata-envoy-deployment
spec:
  template:
    spec:
      containers:
        - name: container
          livenessProbe:
            failureThreshold: 3
            initialDelaySeconds: 5
            periodSeconds: 15
            successThreshold: 1
            timeoutSeconds: 5
            httpGet:
              path: "/"
              port: md-envoy
              httpHeaders:
                - name: Content-Type
                  value: application/grpc-web-text
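
If you manage the manifests with Kustomize, one way to wire the patch in is a small overlay. This is only a sketch; the base resource and the patch file name (metadata-envoy-livenessprobe-patch.yaml) are assumptions to be adapted to your setup:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # Assumption: whatever base you already deploy KFP from.
  - github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic?ref=2.0.0
patches:
  - path: metadata-envoy-livenessprobe-patch.yaml
    target:
      kind: Deployment
      name: metadata-envoy-deployment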

@orfeas-k

orfeas-k commented Aug 7, 2024

As commented in deployKF/deployKF#191 (comment), note that the above livenessProbe has been tested for pipelines 2.0, and we ran into issues sending the same request when we upgraded to 2.2.0 (as described in canonical/envoy-operator#106).

@ZDowney926

ZDowney926 commented Aug 9, 2024

@thesuperzapper Thanks for the information. I put this in the "metadata-envoy-deployment" YAML and ran a simple example pipeline, but it still shows "Cannot get MLMD objects from Metadata store. Cannot find context with {"typeName":"system.PipelineRun","contextName":"2c9d40f1-1c09-4de3-a2e0-725eb4a4f0fb"}: Cannot find specified context".
P.S. I tried on both Kubeflow v1.8.0 and v1.10.0; both show the same message.

@vishnujp12

In my case the issue was related to a hard-coded image name [gcr.io/ml-pipeline/kfp-driver@sha256:8e60086b04d92b657898a310ca9757631d58547e76bbbb8bfc376d654bef1707] somewhere in the Kubeflow code that is not mentioned anywhere in the Kubeflow manifests, which kept the pod in the Init:ImagePullBackOff state. Loading this image with the same name on all worker nodes solved the issue. I am using Kubeflow version 1.8 on an on-prem Kubernetes cluster.
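
If you suspect the same image-pull problem, a quick way to confirm it is to find the stuck pod and look at its pull events (the pod name and namespace below are placeholders; adjust them to where your driver pods run):

kubectl get pods -A | grep ImagePullBackOff
kubectl describe pod <stuck-pod-name> -n <namespace> | grep -i -A3 pull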

@juliusvonkohout
Member

juliusvonkohout commented Dec 10, 2024

/reopen

since the fix for metadata-envoy-deployment does not seem to be upstream yet


@juliusvonkohout: Reopened this issue.

In response to this:

/reopen

since the fix for metadata-envoy-deployment does not seem to be upstream yet

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@google-oss-prow google-oss-prow bot reopened this Dec 10, 2024
@github-project-automation github-project-automation bot moved this from Closed to Needs triage in KFP Runtime Triage Dec 10, 2024
@shivanibhargove

Hi Team,
Is there any update or any workaround for this? We are blocked on KFP v2.

@juliusvonkohout
Member

I got a pipeline to finish, but there are still errors with the ml-metadata image.
Also, there are no clear errors in the ml-metadata deployments. I think I have to examine the database directly. Maybe I will know more in a few weeks @rimolive

@juliusvonkohout
Member

juliusvonkohout commented Jan 6, 2025

CC @kubeflow/release-team since it has 35 upvotes

@juliusvonkohout
Member

cross posting in #11086 (comment) since it is related

@varodrig
Contributor

varodrig commented Jan 6, 2025

@hbelmiro to review this.

@HumairAK HumairAK added this to the KFP 2.5.0 milestone Jan 6, 2025
@hbelmiro
Contributor

hbelmiro commented Jan 6, 2025

@juliusvonkohout it's unlikely to be resolved for KFP 2.4. It was added to the 2.5 milestone.

@honeydanji

I am not good at English. So I used a translator. So the sentence is unnatural. Please understand.

I encountered the same error when running a pipeline through the Kubeflow UI. After analyzing the workflow-controller logs, I identified that the error was caused by missing configmap and serviceaccount configurations.

To resolve this, I:

  1. Created the required artifact-repositories configmap
  2. Set up the default-editor serviceaccount with appropriate roles and permissions

After implementing these fixes, both the container driver and DAG driver pods were created successfully. The pipeline pod was also created (though it had a separate component code error unrelated to this issue).

If anyone needs the detailed configuration steps, feel free to ask. Hope this helps others facing similar issues!

Environment

  • kubeflow manifests v1.9
  • kfp version == 2.2.0

@juliusvonkohout
Member

I am not good at English. So I used a translator. So the sentence is unnatural. Please understand.

I encountered the same error when running a pipeline through the Kubeflow UI. After analyzing the workflow-controller logs, I identified that the error was caused by missing configmap and serviceaccount configurations.

To resolve this, I:

1. Created the required artifact-repositories configmap

2. Set up the default-editor serviceaccount with appropriate roles and permissions

After implementing these fixes, both the container driver and DAG driver pods were created successfully. The pipeline pod was also created (though it had a separate component code error unrelated to this issue).

If anyone needs the detailed configuration steps, feel free to ask. Hope this helps others facing similar issues!

Environment

* kubeflow manifests v1.9

* kfp version == 2.2.0

Do you mean that the artifact-repositories configmap is really mandatory and missing from https://github.com/kubeflow/pipelines/blob/master/manifests/kustomize/base/installs/multi-user/pipelines-profile-controller/sync.py ?

"Set up the default-editor serviceaccount with appropriate roles and permissions" Which roles and permissions are missing?

I can also run the Pipeline, but I get errors in the UI. Do they not appear for you?

Please use 1.9.1 instead of 1.9 for superior authentication. https://github.com/kubeflow/manifests/releases/tag/v1.9.1

@honeydanji

I am not good at English. So I used a translator. So the sentence is unnatural. Please understand.
I encountered the same error when running a pipeline through the Kubeflow UI. After analyzing the workflow-controller logs, I identified that the error was caused by missing configmap and serviceaccount configurations.
To resolve this, I:

1. Created the required artifact-repositories configmap

2. Set up the default-editor serviceaccount with appropriate roles and permissions

After implementing these fixes, both the container driver and DAG driver pods were created successfully. The pipeline pod was also created (though it had a separate component code error unrelated to this issue).
If anyone needs the detailed configuration steps, feel free to ask. Hope this helps others facing similar issues!
Environment

* kubeflow manifests v1.9

* kfp version == 2.2.0

Do you mean that the artifact-repositories configmap is really mandatory and missing from https://github.com/kubeflow/pipelines/blob/master/manifests/kustomize/base/installs/multi-user/pipelines-profile-controller/sync.py ?

"Set up the default-editor serviceaccount with appropriate roles and permissions" Which roles and permissions are missing?

I can also run the Pipeline, but I get errors in the UI. Do they not appear for you?

Please use 1.9.1 instead of 1.9 for superior authentication. https://github.com/kubeflow/manifests/releases/tag/v1.9.1

I checked the logs of the "workflow" pod in the kubeflow namespace and found an error about a missing "artifact-repositories" ConfigMap.
This ConfigMap is essential because it is required for storing artifacts generated by the pipeline.
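
For reference, here is a minimal sketch of what such an artifact-repositories ConfigMap can look like, modelled on the default MinIO setup that ships with KFP; the namespace, endpoint, bucket, and secret names are assumptions and must match your own installation:

apiVersion: v1
kind: ConfigMap
metadata:
  name: artifact-repositories
  namespace: <profile-namespace>   # assumption: the namespace the run executes in
  annotations:
    workflows.argoproj.io/default-artifact-repository: default-v1
data:
  default-v1: |
    archiveLogs: true
    s3:
      endpoint: minio-service.kubeflow:9000
      bucket: mlpipeline
      insecure: true
      accessKeySecret:
        name: mlpipeline-minio-artifact
        key: accesskey
      secretKeySecret:
        name: mlpipeline-minio-artifact
        key: secretkey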
After creating the ConfigMap, I encountered a service account permission error.
To resolve this, I added the following minimal permissions for the default-editor account:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: default-editor-role
  namespace: kubeflow
rules:
- apiGroups: [""]
  resources: ["pods", "configmaps", "secrets"]
  verbs: ["create", "get", "list", "watch", "patch", "delete"]
- apiGroups: ["argoproj.io"]
  resources: ["workflowtaskresults"]
  verbs: ["create", "get", "list", "watch"]

After applying both the ConfigMap and these permissions, the pipeline worked correctly with no UI errors.
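
Note that a Role on its own grants nothing until it is bound to the service account. Here is a minimal RoleBinding sketch to go with the Role above (the binding name is an assumption; the Role and ServiceAccount names mirror the snippet):

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: default-editor-role-binding
  namespace: kubeflow
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: default-editor-role
subjects:
  - kind: ServiceAccount
    name: default-editor
    namespace: kubeflow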

@dandawg
Contributor

dandawg commented Jan 21, 2025

Can someone post the manifest/version and install steps they are using to produce this error? The link in the OP unfortunately gives a 404 now (this guide). I'm trying to reproduce it.

I saw it show up temporarily on my first pipeline run on kfp 2.0.0 (sdk version 2.0.1, and k8s 1.25.16 on kind), but then it went away and the pipeline eventually completed successfully. All other pipeline runs I've done have succeeded.

I used the platform-agnostic manifest:
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=2.0.0"
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic?ref=2.0.0"

@HumairAK HumairAK moved this to In Progress in KFP 2.x Release Jan 29, 2025
@dandawg
Contributor

dandawg commented Jan 30, 2025

Using KF 1.9.1 I was able to reproduce this on one deployment, but not on others. That is, using the same install from manifest I got the error, but then when I redeployed I couldn't get the error again (using the same pipeline YAML). This tells me it's either an issue with the way KF is deployed sometimes, or some special state that KF was in.

This may be related to issue #11403, which was fixed recently. If no one sees this with 1.9.2+, that fix may have addressed this issue as well.

@juliusvonkohout
Member

juliusvonkohout commented Jan 31, 2025

It seems to be fixed now. With very large Azure OIDC info you can exceed the limit of the gRPC server, but most things reported here seem to come from unclean installations or are probably fixed in KFP 2.4.0 or the master branch. If you upgrade to KF 1.9.1, clean up your istio-system namespace. Also, some here had problems with the launcher and driver images, which had no proper versioning until 2.4.0 (kubeflow/manifests#2953). So please create separate issues focused on a single problem. The master branch is probably also affected by kubeflow/manifests#2970, and we hope to resolve it soon.

/close


@juliusvonkohout: Closing this issue.

In response to this:

It seems to be fixed now. It can happen with very large azure oidc info that you exceed the limit of the grpc server. But most things reported here seem to be from unclean installations or are probably fixed in KFP 2.4.0. If you upgrade from KF 1.9.1 cleanup your istio-system namespace. Also some here had problems with the launcher and driver image which had no proper versioning until 2.4.0 kubeflow/manifests#2953. So please create separate issues focused on a single problem.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
