Enhancement Request: Add AWS S3 Support for TensorBoard in KFP #4364

Closed
lucinvitae opened this issue Aug 12, 2020 · 10 comments
Labels: platform/aws, status/triaged

Comments

@lucinvitae

Overview

My team is interested in integrating TensorBoard into Kubeflow Pipelines (v1.0.0, standalone installation), but we find ourselves unable to do so because we depend on AWS instead of GCP. Someone in the Kubeflow Slack recommended that I open this Enhancement Request for adding S3 support for TensorBoard in Kubeflow Pipelines.

Proposal

A great improvement to the KFP UI would be to see the Start TensorBoard button in the output page of a pipeline run as described in the KFP docs (https://www.kubeflow.org/docs/pipelines/sdk/output-viewer/#tensorboard) even if the TensorBoard log directory has been uploaded to an AWS S3 bucket.

End user requirements to leverage this feature:

  1. Pipeline configuration that creates the /mlpipeline-ui-metadata.json file successfully in the container.
  2. Pipeline step outputs a TensorBoard logdir that is valid (i.e. tensorboard --inspect --logdir /app/logs/fit/… succeeds).
  3. Pipeline step uploads logdir to AWS S3 successfully.
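
The upload in requirement 3 could look roughly like the following. This is a minimal sketch under stated assumptions: `upload_logdir` and its parameters are hypothetical names, and the actual upload call is injected as `upload_file` so the example runs without AWS access. With boto3, `boto3.client("s3").upload_file` accepts the same `(Filename, Bucket, Key)` positional arguments and could be passed in directly.

```python
import os
from typing import Callable, List, Tuple


def upload_logdir(
    logdir: str,
    bucket: str,
    prefix: str,
    upload_file: Callable[[str, str, str], None],
) -> List[Tuple[str, str]]:
    """Upload every file under `logdir` and return (local_path, key) pairs.

    Hypothetical helper: `upload_file(local_path, bucket, key)` is supplied
    by the caller, so this function has no AWS dependency of its own.
    """
    uploaded = []
    for root, _dirs, files in os.walk(logdir):
        for name in sorted(files):
            local_path = os.path.join(root, name)
            # Keep the directory layout, using "/" as the S3 key separator.
            rel = os.path.relpath(local_path, logdir).replace(os.sep, "/")
            key = "/".join(part for part in (prefix.strip("/"), rel) if part)
            upload_file(local_path, bucket, key)
            uploaded.append((local_path, key))
    return uploaded
```

Keeping the relative layout under the prefix matters because TensorBoard treats subdirectories of the logdir as separate runs.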

Here’s the content of an example metadata file for an S3 logdir:

{
  "outputs": [
    {
      "type": "tensorboard",
      "source": "s3://my-team-bucket/kubeflow/pipeline-x/run-123/app/logs/fit/20200723-124231"
    }
  ]
}
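
A pipeline step could produce a file like this with a few lines of Python. The following is a minimal sketch, not KFP's own tooling: the function name `write_tensorboard_metadata` and its parameters are illustrative assumptions, and only the JSON layout comes from the example above.

```python
import json
from pathlib import Path


def write_tensorboard_metadata(logdir_uri: str, path: str) -> dict:
    """Write a UI metadata file containing a single TensorBoard output.

    `logdir_uri` is the location of the uploaded log directory, for example
    an s3:// URI. The name and signature of this helper are hypothetical.
    """
    metadata = {
        "outputs": [
            {"type": "tensorboard", "source": logdir_uri},
        ]
    }
    Path(path).write_text(json.dumps(metadata, indent=2))
    return metadata
```

A step would call it after the upload finishes, for example `write_tensorboard_metadata("s3://my-team-bucket/kubeflow/pipeline-x/run-123/app/logs/fit/20200723-124231", "/mlpipeline-ui-metadata.json")`.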

@Bobgy pointed me to #4208 as a potential future workaround using a mount path (note: not yet merged as of creation of this GitHub issue), but it would be great to have AWS S3 support for this as well.

I'm happy to chip in where I can with design discussions/implementation here, especially with regard to AWS integration in general, since my team is exclusively (and mostly successfully) using AWS instead of GCP for our KFP cluster.

Original Slack channel thread:
https://kubeflow.slack.com/archives/CE10KS9M4/p1595512605179800

@Bobgy
Contributor

Bobgy commented Aug 13, 2020

/assign @Jeffwan @PatrickXYS
Who work for AWS.

@Bobgy added the status/triaged and platform/aws labels on Aug 13, 2020
@PatrickXYS
Member

AWS standalone KFP support has been added recently; check the PR for details.

All the requirements have been addressed. Follow the README for instructions.

Feel free to submit any issues you find during standalone KFP installation.

@lucinvitae
Author

Hello @PatrickXYS, does that include TensorBoard support? I didn't see anything in that PR related to TensorBoard at first glance. In our last discussion @Bobgy informed me that no such support exists.

@PatrickXYS
Member

If you check https://github.com/kubeflow/pipelines/blob/master/manifests/kustomize/env/aws/viewer-pod-template.json, this is for ml-pipeline-ui-configmap, which is rendered when you click Start Tensorboard.

The point here is you need to configure those AWS parameters correctly, such as s3 bucket, aws public id, and aws secret key, etc.

@lucinvitae
Author

@PatrickXYS thanks, that looks promising. We'll try that out.

@nlarusstone

@PatrickXYS are there instructions on how to use IAM roles with Tensorboard? We don't have long-lived AWS secrets, so we'll need an IAM-based solution.

@lucinvitae
Author

Bump ^ we also use IAM roles instead of long-lived AWS secrets.

@PatrickXYS do you know if this is supported? And if it's undocumented but supported, happy to help supply documentation provided we can get it working.

@PatrickXYS
Member

We don't support IAM roles yet. I'm also not sure whether that's feasible for now; we will add it to our roadmap.

@lucinvitae
Author

@PatrickXYS we were finally able to get tensorboard working with IAM.

I'll document it below so that other people can leverage the setup, although I'm happy to add the docs somewhere else if there's a better location.

Similar to https://github.com/kubeflow/pipelines/blob/master/manifests/kustomize/env/aws/README.md, we have a configmap with the content necessary for the tensorboard launcher.

We're using kustomize to override the VIEWER_TENSORBOARD_POD_TEMPLATE_SPEC_PATH variable in an overlay file:

- op: replace
  path: /spec/template/spec/containers/0/env
  value:
    ...
    - name: VIEWER_TENSORBOARD_POD_TEMPLATE_SPEC_PATH
      value: /etc/config/viewer-tensorboard-template.json
    ...

Then we've modified ml-pipeline-ui-configmap like this:

  viewer-tensorboard-template.json: |-
    {
      "metadata": {
        "annotations": {
          "iam.amazonaws.com/role": "ai-rancher/rancher_ai_training_shared"
        }
      },
      "spec": {
          "serviceAccountName": "kubeflow-pipelines-viewer"
      }
    }
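
For anyone generating this template rather than hand-editing it, here is a minimal Python sketch. The helper name `build_viewer_pod_template` and its parameters are hypothetical; the keys it emits are exactly those in the template above, and building it through `json.dumps` guarantees the configmap value is well-formed JSON.

```python
import json


def build_viewer_pod_template(iam_role: str, service_account: str) -> str:
    """Return a viewer pod template as a JSON string.

    Hypothetical helper: it only reproduces the two fields shown above,
    the IAM role annotation and the service account name.
    """
    template = {
        "metadata": {"annotations": {"iam.amazonaws.com/role": iam_role}},
        "spec": {"serviceAccountName": service_account},
    }
    return json.dumps(template, indent=2)


if __name__ == "__main__":
    print(
        build_viewer_pod_template(
            "ai-rancher/rancher_ai_training_shared",
            "kubeflow-pipelines-viewer",
        )
    )
```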

After that, the tensorboard viewer pod gets started with the above IAM role and can access our S3 buckets:

$ kubectl -n kubeflow get pods viewer-f4bd94b05ac8e177e75eec0da3bfca29d298289a-deploymentsj6hd -o yaml | grep -C3 iam
kind: Pod
metadata:
  annotations:
    iam.amazonaws.com/role: ai-rancher/rancher_ai_training_shared
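
The same check can be scripted instead of grepping. The sketch below assumes the pod manifest was fetched with `kubectl ... get pod <name> -o json`; the function name `pod_iam_role` is hypothetical, and the annotation key is the one used in the template above.

```python
import json
from typing import Optional


def pod_iam_role(
    pod_json: str, annotation: str = "iam.amazonaws.com/role"
) -> Optional[str]:
    """Return the IAM role annotation of a pod manifest, or None if absent."""
    pod = json.loads(pod_json)
    # `annotations` may be missing or null on pods without any annotations.
    annotations = pod.get("metadata", {}).get("annotations") or {}
    return annotations.get(annotation)
```

A `None` result for a viewer pod would point at the template not being picked up, which matches the AccessDenied symptom described below.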

I should also note that in our kustomize configs we specify a commonAnnotation:

commonAnnotations:
  iam.amazonaws.com/role: ai-rancher/rancher_ai_training_shared

But we still needed the above viewer-tensorboard-template.json changes to give the tensorboard viewer pod access to our S3 bucket for downloading the log dir. Without them, the viewer pod showed AccessDenied errors and did not render anything.

All of this is to say that we've been able to create TensorBoard artifacts as follows with S3 paths from within our pipelines, and the viewer pod is able to use IAM roles to download the S3 log dir. Example metadata.json:

{
  "outputs": [
    {
      "type": "tensorboard",
      "source": "s3://invitae-ai-training-shared/kubeflow/experiments/ccccfe49-a23e-4a5b-9684-a3e7c5e26095/runs/17400f2b-11dc-4729-b7dc-f2336a81aadd/logs"
    }
  ]
}
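
Since the viewer pod's IAM role has to cover the bucket named in `source`, it can help to validate and split that URI before writing the metadata file. This is a small sketch using only the standard library; `split_s3_uri` is a hypothetical helper name.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split an s3://bucket/key URI into (bucket, key).

    Raises ValueError if the URI does not use the s3 scheme or has no bucket.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")
```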

Also, the KFP docs say that "The pipeline component must write a JSON file specifying metadata for the output viewer(s) that you want to use for visualizing the results. The file name must be /mlpipeline-ui-metadata.json". In practice, we've found that for python components the file name can be anything, but there needs to be an output artifact named mlpipeline-ui-metadata, or else outputs do not work, tensorboard outputs included. Example working output for a python pipeline component:

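    from kfp import dsl

    # The output must be named "mlpipeline-ui-metadata"; the file it points to can have any name.
    op = dsl.ContainerOp(
        name="Write tensorboard metadata",
        image=docker_image,
        command=["sh", "-c"],
        arguments=[ ... some command that produces metadata.json ... ],
        file_outputs={"mlpipeline-ui-metadata": "metadata.json"},
    )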

It would be great to update the KFP docs for the above metadata.json issue, since debugging it cost me a few hours and I felt a bit misled by the existing documentation.

cc @nlarusstone since you were asking in the #kubeflow-pipelines slack channel.

@lucinvitae
Author

I should note, none of this worked until we bumped to KFP standalone version: v1.0.1
(https://github.com/kubeflow/pipelines/releases/tag/1.0.1)
