Cannot run pipeline samples in GCP IAP Deployment #2773

bruce3557 · 2019-12-25T10:47:07Z

What happened:
We cannot run the pipeline samples.
It seems that gcloud-related commands cannot obtain the workload identity correctly.
The error message is:

ERROR: (gsutil) timed out
This may be due to network connectivity issues. Please check your network settings, and the status of the service you are trying to reach.

What did you expect to happen:
The pipeline samples should run without errors.

What steps did you take:
Created a run and an experiment.

Anything else you would like to add:
I tried this sample implementation and still cannot get the correct result:
https://github.com/kubeflow/pipelines/blob/master/samples/core/secret/secret.py

bruce3557 · 2019-12-25T17:27:13Z

When I retried the deployment, the message changed to:

AccessDeniedException: 403 Primary: /namespaces/dcard-data.svc.id.goog with additional claims does not have storage.objects.list access to dcard--bruce.

bruce3557 · 2019-12-26T10:32:02Z

Not sure whether this is related, but when I run the secret sample, I get the following messages.
It seems that the Cloud SDK cannot connect to metadata.google.internal.

List of buckets:
Traceback (most recent call last):
  File "<string>", line 6, in <module>
  File "/usr/local/lib/python2.7/dist-packages/google/api_core/page_iterator.py", line 212, in _items_iter
    for page in self._page_iter(increment=False):
  File "/usr/local/lib/python2.7/dist-packages/google/api_core/page_iterator.py", line 243, in _page_iter
    page = self._next_page()
  File "/usr/local/lib/python2.7/dist-packages/google/api_core/page_iterator.py", line 369, in _next_page
    response = self._get_next_page_response()
  File "/usr/local/lib/python2.7/dist-packages/google/api_core/page_iterator.py", line 419, in _get_next_page_response
    method=self._HTTP_METHOD, path=self.path, query_params=params
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/_http.py", line 417, in api_request
    timeout=timeout,
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/_http.py", line 275, in _make_request
    method, url, headers, data, target_object, timeout=timeout
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/_http.py", line 313, in _do_request
    url=url, method=method, headers=headers, data=data, timeout=timeout
  File "/usr/local/lib/python2.7/dist-packages/google/auth/transport/requests.py", line 277, in request
    self.credentials.before_request(auth_request, method, url, request_headers)
  File "/usr/local/lib/python2.7/dist-packages/google/auth/credentials.py", line 124, in before_request
    self.refresh(request)
  File "/usr/local/lib/python2.7/dist-packages/google/auth/compute_engine/credentials.py", line 102, in refresh
    six.raise_from(new_exc, caught_exc)
  File "/usr/lib/python2.7/dist-packages/six.py", line 737, in raise_from
    raise value
google.auth.exceptions.RefreshError: HTTPConnectionPool(host='metadata.google.internal', port=80): Read timed out. (read timeout=120)
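
To tell a metadata-server problem apart from a permissions problem, it may help to probe the token endpoint directly from inside the pod with a short timeout. The following is a minimal sketch using only the Python 3 standard library; it assumes the pod uses the default service account path of the metadata server, and the function names and the `opener` parameter are hypothetical helpers added for illustration, not part of any Google library.

```python
import socket
import urllib.error
import urllib.request

# Metadata endpoint that serves an access token for the default service account.
TOKEN_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/token"
)


def build_token_request(url=TOKEN_URL):
    """Build the request; the metadata server requires the Metadata-Flavor header."""
    return urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})


def metadata_server_reachable(timeout=5.0, opener=urllib.request.urlopen):
    """Return True if the token endpoint answers HTTP 200 within `timeout` seconds.

    `opener` is a hypothetical injection point so the check can be exercised
    without network access; by default it is urllib.request.urlopen.
    """
    try:
        with opener(build_token_request(), timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, socket.timeout, OSError):
        # Timeouts, DNS failures, and HTTP errors all count as "not reachable".
        return False
```

Calling `metadata_server_reachable()` in a failing step should return `False` quickly when the metadata server hangs, instead of waiting for the 120-second read timeout shown in the traceback.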

bruce3557 · 2019-12-26T10:45:27Z

I also found this issue: googleapis/google-auth-library-python#211

bruce3557 · 2019-12-26T14:27:43Z

After binding the workload identity to pipeline-runner in the kubeflow namespace,
I can read data via the gcloud command, but it still times out after a few minutes.

parthmishra · 2019-12-27T19:53:36Z

> I can read data via gcloud command but still timeout after few minutes.

Maybe this has to do with how gcloud obtains or refreshes credentials? Even when using the old secret method (e.g. .apply(gcp.use_gcp_secret("user-gcp-sa"))), I still get the timeouts and have to rely on setting the retry attempts for the component.
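
The retry idea can be sketched generically. This is not the KFP SDK's own retry setting, just a plain Python helper that a component's code could wrap around a flaky call; `retry_with_backoff` and all of its parameters are hypothetical names chosen for illustration.

```python
import time


def retry_with_backoff(func, attempts=3, base_delay=1.0, sleep=time.sleep,
                       retry_on=(Exception,)):
    """Call `func` until it succeeds or `attempts` calls have failed.

    The wait doubles after each failure (base_delay, 2 * base_delay, ...).
    If every attempt fails, the last exception is re-raised.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * (2 ** attempt))
```

Retrying only hides the symptom: if the credential refresh keeps timing out, every attempt pays the full timeout before failing.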

bruce3557 · 2019-12-28T08:16:21Z

Regarding the timeout problem, I think it is a GKE problem. It uses the default credential client, and the credential expires after around 1 hour.
But I think binding the workload identity to pipeline-runner is a workable approach for Kubeflow.

@parthmishra I tried that, but it didn't work because of the gcloud SDK implementation.

wronk · 2020-01-02T23:34:08Z

@bruce3557, also running into this on some training experiments (using Katib outside pipelines). I end up with that same error when trying to download training data:

google.auth.exceptions.TransportError: HTTPConnectionPool(host='metadata.google.internal', port=80): Read timed out. (read timeout=120)

Please post back if you find a fix

bruce3557 · 2020-01-03T02:32:24Z

@wronk I found a workaround for this problem in kubeflow/kubeflow#4607.
You can restart the metadata server pods regularly (about every half hour).
The command is:
kubectl delete pods -n kube-system --selector=k8s-app=gke-metadata-server

Until GCP fixes the issue, I don't think we can do anything else.
The related GCP issue is here:
https://issuetracker.google.com/issues/146622472
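
The "restart regularly" workaround could be automated with a small loop around the kubectl command quoted above. This is only a sketch: it assumes `kubectl` is on the PATH and already configured for the cluster, and the function names, the `rounds` limit, and the injectable `run`/`sleep` parameters are hypothetical additions for illustration.

```python
import subprocess
import time

# The kubectl command from the comment above, split into arguments.
RESTART_COMMAND = [
    "kubectl", "delete", "pods",
    "-n", "kube-system",
    "--selector=k8s-app=gke-metadata-server",
]


def restart_metadata_pods(run=subprocess.run):
    """Delete the gke-metadata-server pods and return kubectl's exit code."""
    result = run(RESTART_COMMAND, check=False)
    return result.returncode


def restart_periodically(interval_seconds=1800, rounds=None,
                         run=subprocess.run, sleep=time.sleep):
    """Restart the pods every `interval_seconds`; loop forever when `rounds` is None."""
    completed = 0
    while rounds is None or completed < rounds:
        code = restart_metadata_pods(run=run)
        if code != 0:
            print("kubectl exited with code", code)
        completed += 1
        # Skip the final wait when a fixed number of rounds was requested.
        if rounds is None or completed < rounds:
            sleep(interval_seconds)
    return completed
```

The default interval of 1800 seconds matches the "around half hour" suggestion; running it as a long-lived script is a stopgap, not a fix.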

Bobgy · 2020-01-20T04:16:43Z

Did you try the workarounds mentioned in the GCP issue?

2 workarounds:

1) Disable workload identity
2) Downgrade GKE to a version that uses 0.2.13 of GKE Metadata server (1.14.8-gke.18)

The downgrade has been working well for me, using the following command:
gcloud container clusters upgrade <cluster-name> --master --cluster-version 1.14.8-gke.17

yantriks-edi-bice · 2020-01-27T22:00:39Z

@Bobgy I get the following error when trying to downgrade

Master of cluster [xxxxx] will be upgraded from version [1.14.9-gke.2] to version [1.14.8-gke.17]. This operation is long-running and will block other operations on the cluster (including
delete) until it has run to completion.
Do you want to continue (Y/n)?
ERROR: (gcloud.container.clusters.upgrade) ResponseError: code=400, message=Master version "1.14.8-gke.17" is unsupported.

yantriks-edi-bice · 2020-01-27T22:14:24Z

> But I think binding workload identity to pipeline-runner is workable for kubeflow ~

I don't yet understand how all of Kubeflow is set up, but I am wondering about the effect such a change would have on the other components. Would they continue to work, assuming pipelines work?

numerology · 2020-01-27T23:51:24Z

AFAIK there is an ongoing issue related to a recent GKE release. Will keep this thread updated.

Bobgy · 2020-01-28T03:05:27Z

> ERROR: (gcloud.container.clusters.upgrade) ResponseError: code=400, message=Master version "1.14.8-gke.17" is unsupported.

It means a new patch version has been released. The new 1.14.8-gke.x probably already has the fix.
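
Picking the newest patch in a series can be sketched as below. The version strings are only the ones mentioned earlier in this thread, not a list of what GKE currently supports (that list would have to come from GKE itself, for example from `gcloud container get-server-config`), and the function names are hypothetical.

```python
import re

# Matches strings such as "1.14.8-gke.17".
_GKE_VERSION = re.compile(r"^(\d+)\.(\d+)\.(\d+)-gke\.(\d+)$")


def parse_gke_version(version):
    """Split '1.14.8-gke.17' into the tuple (1, 14, 8, 17)."""
    match = _GKE_VERSION.match(version)
    if match is None:
        raise ValueError("not a GKE version string: %r" % version)
    return tuple(int(part) for part in match.groups())


def latest_patch(versions, series):
    """Return the highest '-gke.N' release in `versions` for a series like '1.14.8'."""
    candidates = [v for v in versions if v.startswith(series + "-gke.")]
    if not candidates:
        return None
    # Compare numerically so that gke.10 sorts after gke.9.
    return max(candidates, key=parse_gke_version)


# Versions mentioned in this thread so far; illustrative input only.
mentioned_versions = ["1.14.8-gke.17", "1.14.8-gke.18", "1.14.9-gke.2"]
print(latest_patch(mentioned_versions, "1.14.8"))
```

Comparing the parsed tuples rather than the raw strings avoids the usual pitfall where "gke.9" would sort above "gke.10".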

yantriks-edi-bice · 2020-01-28T04:02:40Z

@Bobgy thanks. I found that the latest in the 1.14.8 series is 1.14.8-gke.33 and used your command to upgrade from the earlier Kubeflow 0.7 default version. I am still getting this error though, and cluster-user has the Storage Admin role:

  File "kfp_component/google/dataflow/_launch_python.py", line 58, in launch_python
    job_id, location = read_job_id_and_location(storage_client, staging_location)
  File "kfp_component/google/dataflow/_common_ops.py", line 99, in read_job_id_and_location
    if job_blob.exists():
  File "/usr/local/lib/python2.7/site-packages/google/cloud/storage/blob.py", line 404, in exists
    _target_object=None,
  File "/usr/local/lib/python2.7/site-packages/google/cloud/_http.py", line 319, in api_request
    raise exceptions.from_http_response(response)
google.api_core.exceptions.Forbidden: 403 GET https://www.googleapis.com/storage/v1/b/edi_bice/o/kubeflow%2Fpipelines%2F378a9083ca79da0fc8b315b96dd965d8%2Fkfp%2Fdataflow%2Flaunch_python%2Fjob.txt?fields=name: Primary: /namespaces/xxx-xx-xxx.svc.id.goog with additional claims does not have storage.objects.get access to edi_bice/kubeflow/pipelines/378a9083ca79da0fc8b315b96dd965d8/kfp/dataflow/launch_python/job.txt.

Bobgy · 2020-03-05T08:04:54Z

@yantriks-edi-bice Sorry for the late notice. You probably also need to upgrade your google/cloud-sdk client version, as mentioned in #3069 (comment).

Bobgy · 2020-03-05T08:06:04Z

It seems the original issue is a GKE workload identity problem, closing now.
/close

k8s-ci-robot · 2020-03-05T08:06:05Z

@Bobgy: Closing this issue.

In response to this:

> It seems the original issue is a GKE workload identity problem, closing now.
> /close

Ark-kun assigned IronPan, rmgogogo and numerology Dec 28, 2019

bruce3557 mentioned this issue Dec 30, 2019

GKE Kubeflow google cloud SDK "ERROR: timed out" kubeflow/kubeflow#4607

Closed

k8s-ci-robot closed this as completed Mar 5, 2020

nihil0 mentioned this issue Mar 31, 2020

GCS is not accessible from pipeline components #3402

Closed
