This repository has been archived by the owner on Jan 31, 2022. It is now read-only.

Label microservice workers show 500s contacting the metadata server when they first start #88

Open
jlewi opened this issue Dec 31, 2019 · 3 comments
jlewi commented Dec 31, 2019

See attached logs. Some of the worker pods for the label microservice are returning 500s when they
first try to contact the metadata server.

google.auth.exceptions.TransportError: ("Failed to retrieve http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/?recursive=true from the Google Compute Engine metadata service. Status: 500 Response:\nb'Could not recursively fetch uri\\n'", <google.auth.transport.requests._Response object at 0x7f8b8b9c2ac8>)

It does appear to get credentials eventually, though, since it is able to verify the Pub/Sub subscription exists; that takes about 4 minutes.
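For context, a probe along these lines (a hypothetical diagnostic, not the service's actual code; the helper name, timeout, and interval are mine) reproduces what the worker seems to be doing: hit the same metadata URI from the traceback and retry until it stops returning 500s.

```python
# Hypothetical diagnostic sketch, not the worker's code: poll the metadata
# URI from the traceback until it stops returning 500s.
import time

import requests

METADATA_URI = ("http://metadata.google.internal/computeMetadata/v1/"
                "instance/service-accounts/default/?recursive=true")

def wait_for_metadata_server(timeout=600, interval=10):
    deadline = time.time() + timeout
    while time.time() < deadline:
        # The GCE metadata server requires the Metadata-Flavor header.
        resp = requests.get(METADATA_URI, headers={"Metadata-Flavor": "Google"})
        if resp.status_code == 200:
            return resp.json()
        print(f"metadata server returned {resp.status_code}; retrying in {interval}s")
        time.sleep(interval)
    raise TimeoutError("metadata server never became ready")
```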

label-bot-worker-5c8967dc7c-rgv9b.pod.logs.txt

I'm not seeing the same errors reported in kubeflow/kubeflow#4607 in the metadata server logs.

gke-metadata-server.logs.txt

Note: I think a lot of the K8s errors in the logs are because the master was temporarily unavailable while it was upgrading.

My cluster is 1.14.9-gke.2

@issue-label-bot

Issue-Label Bot is automatically applying the label kind/bug to this issue, with a confidence of 0.89. Please mark this comment with 👍 or 👎 to give our bot feedback!

Links: app homepage, dashboard, and code for this bot.


jlewi commented Jan 3, 2020

I also observed kaniko jobs launched by skaffold getting stuck. The symptom was that the kaniko container started but emitted no logs.

I kicked the node metadata servers:

kubectl -n kube-system delete pods -l k8s-app=gke-metadata-server

That appears to have gotten things working again.

I'm running 1.14.9-gke.2

jlewi pushed a commit to jlewi/code-intelligence that referenced this issue Jan 3, 2020
…tart.

  * Attempt to fix kubeflow#88
  * We see some predictions succeed but then subsequent ones fail.

* Try to deal with workload identity issues by testing for a service account
  on startup.
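A minimal sketch of what that startup test might look like, assuming the google-auth library (the helper name and exit behavior are illustrative; the commit's actual implementation may differ):

```python
# Sketch only: fail fast if application default credentials are unavailable
# so Kubernetes restarts the pod instead of it limping along serving errors.
import logging
import sys

import google.auth
import google.auth.exceptions
from google.auth.transport.requests import Request

def check_credentials_or_exit():
    try:
        credentials, project = google.auth.default()
        # refresh() forces a real round trip to the metadata server;
        # default() alone may not exercise it.
        credentials.refresh(Request())
    except google.auth.exceptions.GoogleAuthError as e:
        logging.error("No application default credentials: %s", e)
        sys.exit(1)
    logging.info("Obtained credentials for project %s", project)
```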
jlewi pushed a commit to jlewi/code-intelligence that referenced this issue Jan 4, 2020
* worker.py should format logs as JSON entries. This will make it easier
  to query the data in BigQuery and Stackdriver to measure performance.

  * Related to kubeflow#79

* To deal with workload identity flakiness (kubeflow#88) test that we can get
  application default credentials on startup and if not exit.

* As a hack to deal with multi-threading issues with Keras models (kubeflow#89)
  have the predict function load a new model on each call

  * It looks like, the way Pub/Sub works, there is actually a thread pool,
    so predict calls won't be handled on the same thread even though
    we throttle it to handle one item at a time.
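The JSON-logging bullet needs nothing beyond the standard library; a minimal sketch (the field names are illustrative, chosen because Stackdriver recognizes "severity" and "message") would be:

```python
# Minimal stdlib JSON log formatter; field names are illustrative.
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "severity": record.levelname,
            "message": record.getMessage(),
        }
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
root = logging.getLogger()
root.addHandler(handler)
root.setLevel(logging.INFO)
```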
k8s-ci-robot pushed a commit that referenced this issue Jan 4, 2020
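On the thread-pool point in the commit message above: the google-cloud-pubsub subscriber dispatches callbacks on an executor, so even with flow control set to one outstanding message, successive predict calls can land on different threads. A sketch of the reload-per-call hack (the model path, subscription path, and callback wiring are illustrative, not the actual worker.py):

```python
# Illustrative sketch of the per-call reload hack, not the actual worker.py.
from google.cloud import pubsub_v1
from tensorflow import keras

MODEL_PATH = "model.h5"  # hypothetical path

def predict(features):
    # Load a fresh model on every call so no Keras state is shared across
    # the subscriber's worker threads. Slow, but avoids the threading bug.
    model = keras.models.load_model(MODEL_PATH)
    return model.predict(features)

def callback(message):
    ...  # decode the message, call predict(), publish the result
    message.ack()

subscriber = pubsub_v1.SubscriberClient()
# Throttle to one outstanding message; note this does not pin callbacks
# to a single thread.
flow_control = pubsub_v1.types.FlowControl(max_messages=1)
future = subscriber.subscribe("projects/p/subscriptions/s",
                              callback=callback, flow_control=flow_control)
```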
@yantriks-edi-bice

I'm new to Kubeflow, and this has caused me all kinds of grief. On top of the other issues with kfctl delete and reapply, this really makes things seem unusable. Glad to see it's a Google problem though ;-)

jlewi pushed a commit to jlewi/code-intelligence that referenced this issue Jan 18, 2020
jlewi pushed a commit to jlewi/code-intelligence that referenced this issue Jan 18, 2020