Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor Examples folder structure #1691

Merged

Conversation

andreyvelich
Copy link
Member

As we discussed at the AutoML community meeting before, we should refactor our example structure to be more user-friendly.

This PR proposed the following folder structure for our examples:

  • argo - Katib with Argo integration.
  • early-stopping - Early Stopping Examples
  • fpga - Katib with FPGA integration.
  • hp-tuning - HP Tuning examples, what do you think about the naming (<algorithm-name>.yaml) ?
  • kind-cluster - Get Started example to run Katib from the local laptop. This was a popular request from the users.
  • kubeflow-pipelines - Katib with KFP.
  • kubeflow-training-operator - Katib with Training Operators.
  • metrics-collector - Different metrics collector examples
  • nas - NAS examples.
  • resume-experiment - Examples with resume experiment feature.
  • sdk - Katib Python SDK examples.
  • tekton - Katib with Tekton integration.

We should have discussion for the following directories (they might be not clear for the user):

  • trial-settings - Examples with the various Trial template specification
  • trial-training-containers - Training container examples.

I updated the docs/links with the new examples.

Please give your feedback on this PR since it is a very significant change and we have to be clear for our users.

/cc @kubeflow/wg-training-leads @kimwnasptd @jbottum @shannonbradshaw @anencore94 @tenzen-y @c-bata @g-votte @eliaskoromilas @jstamel @knkski

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@google-oss-robot
Copy link

@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: g-votte.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

As we discussed at the AutoML community meeting before, we should refactor our example structure to be more user-friendly.

This PR proposed the following folder structure for our examples:

  • argo - Katib with Argo integration.
  • early-stopping - Early Stopping Examples
  • fpga - Katib with FPGA integration.
  • hp-tuning - HP Tuning examples, what do you think about the naming (<algorithm-name>.yaml) ?
  • kind-cluster - Get Started example to run Katib from the local laptop. This was a popular request from the users.
  • kubeflow-pipelines - Katib with KFP.
  • kubeflow-training-operator - Katib with Training Operators.
  • metrics-collector - Different metrics collector examples
  • nas - NAS examples.
  • resume-experiment - Examples with resume experiment feature.
  • sdk - Katib Python SDK examples.
  • tekton - Katib with Tekton integration.

We should have discussion for the following directories (they might be not clear for the user):

  • trial-settings - Examples with the various Trial template specification
  • trial-training-containers - Training container examples.

I updated the docs/links with the new examples.

Please give your feedback on this PR since it is a very significant change and we have to be clear for our users.

/cc @kubeflow/wg-training-leads @kimwnasptd @jbottum @shannonbradshaw @anencore94 @tenzen-y @c-bata @g-votte @eliaskoromilas @jstamel @knkski

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@google-oss-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work!

Comment on lines 109 to 115
Katib has out of the box support for the [Kubeflow Training Operators](https://github.com/kubeflow/tf-operator) to
perform Trial's [Worker job](https://www.kubeflow.org/docs/components/katib/overview/#trial).
Check the following examples for the various distributed operators:

This comment was marked as resolved.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we also supported the previous versions for some of them.
For example, TFJob and PyTorchJob was integrated in 2018 in Katib.

Copy link
Member

@tenzen-y tenzen-y Oct 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we also supported the previous versions for some of them.

It makes sense.

I think It needs to describe the training-operator version since I guess it only works with training-operator >= 1.3 in the following examples. WDYT @andreyvelich ?

trialSpec:
apiVersion: kubeflow.org/v1
kind: MPIJob

trialSpec:
apiVersion: kubeflow.org/v1
kind: PyTorchJob

trialSpec:
apiVersion: kubeflow.org/v1
kind: TFJob

trialSpec:
apiVersion: kubeflow.org/v1
kind: XGBoostJob

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For TFJob and PyTorchJob it should work for 1.1 and 1.2 release also, but we can keep support only for >= 1.3 version.
@kubeflow/wg-training-leads Do we want to keep support for previous versions for Training Operators in Katib ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, @andreyvelich.
I was able to verify that the above examples work with pytorch-operator. I withdraw the following comment.

#1691 (comment)

I think It needs to describe the training-operator version since I guess it only works with train-operator >= 1.3 in the following examples. WDYT @andreyvelich ?

examples/v1beta1/kind-cluster/README.md Outdated Show resolved Hide resolved
examples/v1beta1/kind-cluster/README.md Outdated Show resolved Hide resolved
examples/v1beta1/kind-cluster/deploy.sh Outdated Show resolved Hide resolved
examples/v1beta1/kind-cluster/deploy.sh Outdated Show resolved Hide resolved
examples/v1beta1/kind-cluster/deploy.sh Outdated Show resolved Hide resolved
@google-cla
Copy link

google-cla bot commented Oct 4, 2021

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

2 similar comments
@google-cla
Copy link

google-cla bot commented Oct 4, 2021

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

@google-cla
Copy link

google-cla bot commented Oct 4, 2021

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

@tenzen-y
Copy link
Member

tenzen-y commented Oct 4, 2021

@googlebot I fixed it.

@google-cla
Copy link

google-cla bot commented Oct 4, 2021

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

@tenzen-y
Copy link
Member

tenzen-y commented Oct 4, 2021

@googlebot I fixed it.

@tenzen-y
Copy link
Member

tenzen-y commented Oct 4, 2021

lgtm


- [Metrics Collection Strategy](./metrics-collector/metrics-collection-strategy.yaml)

## TODO (andreyvelich) Discuss about this name. Trial Settings
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about just Trial Template ? or Setup Trial Template ?
I think including the word template would be better.

Copy link
Member Author

@andreyvelich andreyvelich Oct 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, does Trial Training Containers folder where we add training code sounds good ?

WDYT @johnugeorge about the name trial-template instead of trial-settings ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that sounds better

examples/v1beta1/README.md Outdated Show resolved Hide resolved
Comment on lines +138 to +145
## FPGA Support in Katib Experiments

You can run Katib Experiments on [FPGA](https://en.wikipedia.org/wiki/Field-programmable_gate_array)
based instances. For more information check [these examples](./fpga).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this parts should have one more hierarchy deeper.
Such as FPGA Support under Use cases in Katib Experiments.
If there came more use cases, like asic placement example, computer vision example, nlp example, ..., current structure leaves all examples in this README.md

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's good idea, what do you think @johnugeorge @gaocegege @eliaskoromilas ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would preserve the directory structure, keeping all the use-cases directly under examples/v1beta1. Grouping them in this README sounds good.

Copy link
Member Author

@andreyvelich andreyvelich Oct 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should decide what is use-cases, does Argo and Tekton Experiment also relates to the use-cases ?
Maybe we should decide about it in the following issue and keep the structure as it is for now ?

WDYT @johnugeorge @eliaskoromilas @anencore94 ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I think the word use-case would be for domain dependency.
Something like Katib Experiment example for FPGA, Katib Experiment for BERT, Katib Experiment for Deepfake, and so on.

Argo and Tekton Experiments belong to represent the variety of trialSpecs

Comment on lines 1 to 2
# TODO (andreyvelich) This metrics collector image (kubeflowkatib/custom-metrics-collector) doesn't work in v1beta1.
# It is currently using api.v1.alpha3.Manager instead of api.v1.beta1.Manager to report metrics.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this issue resolved ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not yet, this is the tracking issue: #1263.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks :) I got it.

@eliaskoromilas
Copy link
Member

@andreyvelich
Copy link
Member Author

@andreyvelich I just noticed a typo in fpga/README.md (Serice Service).

https://github.com/andreyvelich/katib/tree/refactor-example-folder/examples/v1beta1/fpga#simplifying-fpga-management-in-eks-elastic-kubernetes-serice

Feel free to fix this.

I think I fixed this here: #1688.

@andreyvelich andreyvelich changed the title [WIP] Refactor Examples folder structure Refactor Examples folder structure Oct 6, 2021
@anencore94
Copy link
Member

Based on the feedback and discussion, I followed this way:

trial-template: Examples with Trial template modifications.
trial-images: Trial images which are located in Katib repository (I moved NAS Trials there also).

It looks better, Thanks!

@tenzen-y
Copy link
Member

tenzen-y commented Oct 7, 2021

@googlebot I consent.

@google-cla
Copy link

google-cla bot commented Oct 7, 2021

All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter.

We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only @googlebot I consent. in this pull request.

Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s), and set the cla label to yes (if enabled on your project).

ℹ️ Googlers: Go here for more info.

@tenzen-y
Copy link
Member

tenzen-y commented Oct 7, 2021

Based on the feedback and discussion, I followed this way:

  • trial-template: Examples with Trial template modifications.
  • trial-images: Trial images which are located in Katib repository (I moved NAS Trials there also).

@johnugeorge @tenzen-y @eliaskoromilas @anencore94 Please let me know what do you think about it ? If it sounds good, we can merge this initial examples change.

/hold for the review

Sounds good.
/retest

@eliaskoromilas
Copy link
Member

@johnugeorge @tenzen-y @eliaskoromilas @anencore94 Please let me know what do you think about it ? If it sounds good, we can merge this initial examples change.

/lgtm

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the Paths for mxnet and pytorch are wrong.

INFO[0003] Unpacking rootfs as cmd ADD examples/v1beta1/trial-training-containers/pytorch-mnist /opt/pytorch-mnist requires it.
error building image: error building stage: failed to get files used from context: failed to get fileinfo for /mnt/test-data-volume/kubeflow-katib-presubmit-e2e-v1beta1-1691-ce8
da88-8960-c513/src/github.com/kubeflow/katib/examples/v1beta1/trial-training-containers/pytorch-mnist: lstat /mnt/test-data-volume/kubeflow-katib-presubmit-e2e-v1beta1-1691-ce8d
a88-8960-c513/src/github.com/kubeflow/katib/examples/v1beta1/trial-training-containers/pytorch-mnist: no such file or directory
INFO[0003] Unpacking rootfs as cmd ADD examples/v1beta1/trial-training-containers/mxnet-mnist /opt/mxnet-mnist requires it.
error building image: error building stage: failed to get files used from context: failed to get fileinfo for /mnt/test-data-volume/kubeflow-katib-presubmit-e2e-v1beta1-1691-ce8
da88-8960-c513/src/github.com/kubeflow/katib/examples/v1beta1/trial-training-containers/mxnet-mnist: lstat /mnt/test-data-volume/kubeflow-katib-presubmit-e2e-v1beta1-1691-ce8da8
8-8960-c513/src/github.com/kubeflow/katib/examples/v1beta1/trial-training-containers/mxnet-mnist: no such file or directory

examples/v1beta1/trial-images/mxnet-mnist/Dockerfile Outdated Show resolved Hide resolved
examples/v1beta1/trial-images/pytorch-mnist/Dockerfile Outdated Show resolved Hide resolved
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
@andreyvelich
Copy link
Member Author

Nice catch @tenzen-y!

@google-cla
Copy link

google-cla bot commented Oct 7, 2021

All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter.

We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only @googlebot I consent. in this pull request.

Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s), and set the cla label to yes (if enabled on your project).

ℹ️ Googlers: Go here for more info.

1 similar comment
@google-cla
Copy link

google-cla bot commented Oct 7, 2021

All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter.

We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only @googlebot I consent. in this pull request.

Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s), and set the cla label to yes (if enabled on your project).

ℹ️ Googlers: Go here for more info.

@tenzen-y
Copy link
Member

tenzen-y commented Oct 7, 2021

@googlebot I consent.

@google-cla
Copy link

google-cla bot commented Oct 7, 2021

All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter.

We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only @googlebot I consent. in this pull request.

Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s), and set the cla label to yes (if enabled on your project).

ℹ️ Googlers: Go here for more info.

@andreyvelich
Copy link
Member Author

@googlebot I consent.

@google-cla
Copy link

google-cla bot commented Oct 7, 2021

All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter.

We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only @googlebot I consent. in this pull request.

Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s), and set the cla label to yes (if enabled on your project).

ℹ️ Googlers: Go here for more info.

@andreyvelich
Copy link
Member Author

@eliaskoromilas Please can you comment @googlebot I consent.

@google-cla
Copy link

google-cla bot commented Oct 7, 2021

All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter.

We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only @googlebot I consent. in this pull request.

Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s), and set the cla label to yes (if enabled on your project).

ℹ️ Googlers: Go here for more info.

@tenzen-y
Copy link
Member

tenzen-y commented Oct 7, 2021

/lgtm

@eliaskoromilas
Copy link
Member

@googlebot I consent.

@andreyvelich
Copy link
Member Author

Thanks everyone for help on this PR!
/hold cancel

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants