E2E for TensorFlow Integration #381

Merged 2 commits on Aug 20, 2019

Conversation

thandayuthapani
Contributor

This PR involves E2E for TensorFlow Integration with volcano

@volcano-sh-bot volcano-sh-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 23, 2019
@TravisBuddy

Hey @thandayuthapani,
Something went wrong with the build.

TravisCI finished with status errored, which means the build failed because of something unrelated to the tests, such as a problem with a dependency or the build process itself.

View build log

TravisBuddy Request Identifier: cab9b0e0-ad65-11e9-aa77-bf7f12915990

@TravisBuddy

Hey @thandayuthapani,
Something went wrong with the build.

TravisCI finished with status errored, which means the build failed because of something unrelated to the tests, such as a problem with a dependency or the build process itself.

View build log

TravisBuddy Request Identifier: 41c164d0-ad6b-11e9-aa77-bf7f12915990

@TommyLike
Contributor

@thandayuthapani Downloading every image and running different samples for every application that runs on Volcano would slow down our development process. Maybe we should set up a cron job for this kind of e2e test?

cc @k82cn @hzxuzhonghu

test/e2e/util.go Outdated
@@ -645,7 +646,7 @@ func waitJobStateAborted(ctx *context, job *vkv1.Job) error {

 func waitJobPhaseExpect(ctx *context, job *vkv1.Job, state vkv1.JobPhase) error {
 	var additionalError error
-	err := wait.Poll(100*time.Millisecond, oneMinute, func() (bool, error) {
+	err := wait.Poll(100*time.Millisecond, twoMinute, func() (bool, error) {
Collaborator

This may increase the total test time. I would rather add a timeout parameter to waitJobPhaseExpect.

Contributor Author

Updated
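
The review suggestion above, passing the timeout in as a parameter instead of raising the shared constant for every caller, could look roughly like the following. This is a minimal stdlib-only sketch and not the code merged in this PR: `pollUntil` and `waitPhaseWithTimeout` are hypothetical names, `pollUntil` only approximates `wait.Poll` (it checks the condition immediately rather than after the first interval), and the job phase is modeled as a plain string instead of `vkv1.JobPhase`.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTimedOut is returned when the condition is not met before the timeout.
var errTimedOut = errors.New("timed out waiting for the condition")

// pollUntil calls condition every interval until it reports done, returns an
// error, or the timeout elapses. Hypothetical stand-in for wait.Poll.
func pollUntil(interval, timeout time.Duration, condition func() (bool, error)) error {
	deadline := time.Now().Add(timeout)
	for {
		done, err := condition()
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		// Give up if the next check would land after the deadline.
		if time.Now().Add(interval).After(deadline) {
			return errTimedOut
		}
		time.Sleep(interval)
	}
}

// waitPhaseWithTimeout waits until getPhase reports the wanted phase.
// The caller chooses the timeout, so only slow cases pay for a longer wait.
func waitPhaseWithTimeout(getPhase func() (string, error), want string, timeout time.Duration) error {
	return pollUntil(10*time.Millisecond, timeout, func() (bool, error) {
		phase, err := getPhase()
		if err != nil {
			return false, err
		}
		return phase == want, nil
	})
}

func main() {
	calls := 0
	// Fake phase source: reports "Running" twice, then "Completed".
	quickJob := func() (string, error) {
		calls++
		if calls < 3 {
			return "Running", nil
		}
		return "Completed", nil
	}
	fmt.Println("quick job:", waitPhaseWithTimeout(quickJob, "Completed", time.Second))

	// A job that never completes hits the caller-supplied timeout.
	stuckJob := func() (string, error) { return "Pending", nil }
	fmt.Println("stuck job:", waitPhaseWithTimeout(stuckJob, "Completed", 50*time.Millisecond))
}
```

With this shape, existing tests can keep passing a one-minute timeout while the TensorFlow case passes a longer one, so the overall suite does not slow down when a short test fails.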

@hzxuzhonghu
Collaborator

I think this is good; we should have a TF job case. A cron job is not a good fit, because it cannot prevent a PR from breaking cases like the TF job.

As for speed, the Docker images should be reusable.

@thandayuthapani
Contributor Author

> @thandayuthapani Downloading every image and running different samples for every application that runs on Volcano would slow down our development process. Maybe we should set up a cron job for this kind of e2e test?
>
> cc @k82cn @hzxuzhonghu

@TommyLike To my knowledge, downloading images will not affect test time much, since the CI environment has a fast connection of around 35MB/s. Downloading the images might add about 10 seconds to the E2E run.

@TravisBuddy

Hey @thandayuthapani,
Something went wrong with the build.

TravisCI finished with status errored, which means the build failed because of something unrelated to the tests, such as a problem with a dependency or the build process itself.

View build log

TravisBuddy Request Identifier: c4fb3320-addf-11e9-aa77-bf7f12915990

@TommyLike
Contributor

TommyLike commented Jul 24, 2019

> > @thandayuthapani Downloading every image and running different samples for every application that runs on Volcano would slow down our development process. Maybe we should set up a cron job for this kind of e2e test?
> > cc @k82cn @hzxuzhonghu
>
> @TommyLike To my knowledge, downloading images will not affect test time much, since the CI environment has a fast connection of around 35MB/s. Downloading the images might add about 10 seconds to the E2E run.

@thandayuthapani We need to consider how much time it takes to download the image and how much time it takes to load the images into kind and to complete the e2e tests.

⚡ root@husheng-test  ~  docker image ls | grep ed417b28e09c
thanda/tf-operator-example    1.0                                        ed417b28e09c        42 hours ago        1.25GB
 ⚡ root@husheng-test  ~  kind load docker-image thanda/tf-operator-example:1.0  --name integration
Execution time: 0h:02m:57s sec


@thandayuthapani
Contributor Author

> We need to consider how much time it takes to download the image and how much time it takes to load the images into kind and to complete the e2e tests.
>
> ⚡ root@husheng-test  ~  docker image ls | grep ed417b28e09c
> thanda/tf-operator-example    1.0                                        ed417b28e09c        42 hours ago        1.25GB
>  ⚡ root@husheng-test  ~  kind load docker-image thanda/tf-operator-example:1.0  --name integration
> Execution time: 0h:02m:57s sec

The 1.25GB image posted above is the old one. I have replaced it with a lighter image:

docker images | grep thanda/tf-operator-example
thanda/tf-operator-example    1.0                 1bdc930499c4        4 hours ago         461MB

The time to load the Docker image into kind in my local setup is:

time kind load docker-image thanda/tf-operator-example:1.0 --name integration

real	0m22.613s

The E2E test itself takes around 3 minutes to complete, but that is the simplest TensorFlow training example I could find.
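
As a rough sanity check on the figures quoted in this thread, download time can be estimated as image size divided by throughput. The sketch below assumes decimal megabytes and a sustained 35MB/s, which is the bandwidth figure mentioned earlier and not a measured value. It covers the download only, not the `kind load` step or the test run itself, and `transferSeconds` is a hypothetical helper.

```go
package main

import "fmt"

// transferSeconds estimates how long it takes to move sizeMB megabytes
// at a sustained rate of rateMBps megabytes per second.
func transferSeconds(sizeMB, rateMBps float64) float64 {
	return sizeMB / rateMBps
}

func main() {
	const rate = 35.0 // MB/s, the CI bandwidth figure quoted in the thread

	images := []struct {
		name   string
		sizeMB float64
	}{
		{"original image (1.25GB)", 1250},
		{"lighter image (461MB)", 461},
	}
	for _, img := range images {
		fmt.Printf("%s: about %.0f s to download\n", img.name, transferSeconds(img.sizeMB, rate))
	}
}
```

Under these assumptions the lighter image stays close to the "about 10 seconds" estimate, while the load into kind measured above (22.6s locally) is the larger share of the image-handling cost.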

@TommyLike
Contributor

@thandayuthapani Thanks for the update. That is why I am wondering whether we should move this into a separate job.

@thandayuthapani
Contributor Author

@TommyLike I have reduced the number of training steps, and the test now runs in CI in 78 seconds. If that is acceptable we can keep it; otherwise we can reduce the steps further.
Screenshot from 2019-07-24 16-43-27

@TommyLike
Contributor

/lgtm

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Jul 24, 2019
@hzxuzhonghu
Collaborator

@thandayuthapani One point: the image used should be hosted under volcanosh instead of your own repo.

@thandayuthapani
Contributor Author

> @thandayuthapani One point: the image used should be hosted under volcanosh instead of your own repo.

Sure, I will move it to the volcanosh repo.

@thandayuthapani
Contributor Author

> @thandayuthapani One point: the image used should be hosted under volcanosh instead of your own repo.

I have updated the image accordingly.

WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
export TF_CONFIG={\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"ps\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"};
python /var/tf_dist_mnist/dist_mnist.py
image: thanda/tf-operator-example:1.0
Contributor

volcanosh/dist-mnist-tf-example:0.0.1

Contributor Author

updated

WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
export TF_CONFIG={\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"worker\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"};
python /var/tf_dist_mnist/dist_mnist.py
image: thanda/tf-operator-example:1.0
Contributor

ditto

Contributor Author

updated
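
The shell snippets under review assemble `TF_CONFIG` by string concatenation, which depends on every quote and comma being escaped correctly. For illustration only, the same JSON shape can be produced with a JSON encoder. This is a hedged Go sketch, not part of the PR: the type and function names are hypothetical, the host names are made up, and port 2222 is taken from the worker snippet above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// These types mirror the TF_CONFIG layout used in the job spec above.
type tfCluster struct {
	PS     []string `json:"ps"`
	Worker []string `json:"worker"`
}

type tfTask struct {
	Type  string `json:"type"`
	Index int    `json:"index"`
}

type tfConfig struct {
	Cluster     tfCluster `json:"cluster"`
	Task        tfTask    `json:"task"`
	Environment string    `json:"environment"`
}

// withPort appends ":port" to every host, like the sed step in the snippet.
func withPort(hosts []string, port int) []string {
	out := make([]string, 0, len(hosts))
	for _, h := range hosts {
		out = append(out, fmt.Sprintf("%s:%d", h, port))
	}
	return out
}

// buildTFConfig returns the TF_CONFIG JSON for one task (hypothetical helper).
func buildTFConfig(psHosts, workerHosts []string, taskType string, index int) (string, error) {
	cfg := tfConfig{
		Cluster: tfCluster{
			PS:     withPort(psHosts, 2222),
			Worker: withPort(workerHosts, 2222),
		},
		Task:        tfTask{Type: taskType, Index: index},
		Environment: "cloud",
	}
	b, err := json.Marshal(cfg)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// Made-up host names; in the job they come from files under /etc/volcano.
	cfg, err := buildTFConfig([]string{"ps-0"}, []string{"worker-0", "worker-1"}, "worker", 0)
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg)
}
```

Encoding from a struct also avoids the trailing comma that joining lines with `tr "\n" ","` can leave behind when the host file ends with a newline.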

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Jul 25, 2019
@TravisBuddy

Hey @thandayuthapani,
Something went wrong with the build.

TravisCI finished with status errored, which means the build failed because of something unrelated to the tests, such as a problem with a dependency or the build process itself.

View build log

TravisBuddy Request Identifier: a05e1c20-aed5-11e9-a97b-d704c485e7e4

@thandayuthapani
Contributor Author

@k82cn Could you please retrigger this build?

@volcano-sh-bot volcano-sh-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 26, 2019
@volcano-sh-bot volcano-sh-bot removed lgtm Indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jul 26, 2019
@TommyLike
Contributor

/lgtm

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Jul 26, 2019
@TravisBuddy

Hey @thandayuthapani,
Something went wrong with the build.

TravisCI finished with status errored, which means the build failed because of something unrelated to the tests, such as a problem with a dependency or the build process itself.

View build log

TravisBuddy Request Identifier: 29ceca70-afaa-11e9-83c1-65c58ed3b4ac

@volcano-sh-bot volcano-sh-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 27, 2019
@volcano-sh-bot volcano-sh-bot removed lgtm Indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jul 27, 2019
@TravisBuddy

Hey @thandayuthapani,
Something went wrong with the build.

TravisCI finished with status errored, which means the build failed because of something unrelated to the tests, such as a problem with a dependency or the build process itself.

View build log

TravisBuddy Request Identifier: 04909210-b04b-11e9-83c1-65c58ed3b4ac

@thandayuthapani
Contributor Author

@k82cn @TommyLike Please have a look

@TommyLike
Contributor

/lgtm

@TommyLike
Contributor

/approve

@TommyLike
Contributor

/assign @kevin-wangzefeng

@k82cn
Member

k82cn commented Aug 20, 2019

/lgtm
/approve

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Aug 20, 2019
@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: k82cn, thandayuthapani, TommyLike

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 20, 2019
@volcano-sh-bot volcano-sh-bot merged commit e2faed8 into volcano-sh:master Aug 20, 2019
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.