Delete Suggestion deployment after Experiment is finished #1150

Merged

Member

@sperlingxx sperlingxx commented Apr 16, 2020

This PR addresses issue #1061.
I'd like to implement Suggestion eviction first, then work on how to persist the Suggestion status (#1062), perhaps through a PV.
@andreyvelich @johnugeorge @richardsliu @hougangliu @gaocegege

@sperlingxx sperlingxx changed the title from "[solve #1061] Delete Suggestion deployment after Experiment is finished" to "Delete Suggestion deployment after Experiment is finished" Apr 16, 2020
Member

@andreyvelich andreyvelich left a comment

Thank you for doing this!
I left a few comments.

@@ -49,9 +49,11 @@ type ExperimentSpec struct {

NasConfig *NasConfig `json:"nasConfig,omitempty"`

// If false/true, which means delete/resume Suggestion after experiment is finished
ResumeExperiment bool `json:"resumeExperiment,omitempty"`
Member

Can we use this parameter in the future PRs also for #1062?
For example:

ResumeExperiment: "never" - Always delete Suggestion deployment after Experiment is finished and never restart it.
ResumeExperiment: "alwaysRunning" - Suggestion deployment is in always running mode.
ResumeExperiment: "fromPersistentVolume" - Suggestion deployment is deleted after Experiment is finished and we can restore it from PV.

This is just a thought; we can think about better naming.
/cc @gaocegege @johnugeorge @richardsliu

Member Author

@andreyvelich Roughly, this design sounds good to me. There are a few things I want to double-check:

  1. According to the comment below, when ResumeExperiment: "never", the Suggestion instance itself will stay alive, with a terminated status, after the Experiment that owns it has finished. We only evict the Deployment and Service.
  2. If ResumeExperiment: "fromPersistentVolume", we will try to retrieve the Suggestion status from the PV when initializing the Suggestion instance, and we will persist the Suggestion instance to the PV after the Experiment terminates.

Member

@sperlingxx Yes, that is correct.
As I mentioned below, should the Suggestion CR always be running? What do you think?
Even if we want to delete the deployment after the Experiment is finished, the Suggestion CR still has useful information for the user.

Member Author

@andreyvelich I agree on keeping the Suggestion CR. It may be useful, and it will be deleted when the Experiment CR that owns it is removed.
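
The outcome of this thread (evict only the Deployment and Service, keep the Suggestion CR) can be sketched roughly as follows. This is an illustrative sketch, not the PR's code: the `ResumePolicy` type, the three value names (taken from the proposal above, whose naming was still open), and the `resourcesToEvict` helper are all hypothetical, and treating "fromPersistentVolume" like "never" for eviction purposes is an assumption.

```go
package main

import "fmt"

// ResumePolicy mirrors the string values floated in the review thread.
// Hypothetical type: the thread explicitly leaves the naming open.
type ResumePolicy string

const (
	Never                ResumePolicy = "never"
	AlwaysRunning        ResumePolicy = "alwaysRunning"
	FromPersistentVolume ResumePolicy = "fromPersistentVolume"
)

// resourcesToEvict returns the kinds of Suggestion-owned resources that would
// be deleted once the Experiment has finished. The Suggestion CR itself is
// never listed: per the discussion, it stays until the owning Experiment CR
// is removed.
func resourcesToEvict(policy ResumePolicy, experimentFinished bool) []string {
	if !experimentFinished {
		return nil
	}
	switch policy {
	case Never, FromPersistentVolume:
		// Assumption: both policies free the same resources; only the
		// restore path would differ for FromPersistentVolume.
		return []string{"Deployment", "Service"}
	default:
		// AlwaysRunning (or an unknown value) keeps everything in place.
		return nil
	}
}

func main() {
	for _, p := range []ResumePolicy{Never, AlwaysRunning, FromPersistentVolume} {
		fmt.Printf("%s -> evict %v\n", p, resourcesToEvict(p, true))
	}
}
```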

@@ -49,9 +49,11 @@ type ExperimentSpec struct {

NasConfig *NasConfig `json:"nasConfig,omitempty"`

// If false/true, which means delete/resume Suggestion after experiment is finished
ResumeExperiment bool `json:"resumeExperiment,omitempty"`
Member

Can we use this parameter in the future PRs also for #1062?
For example:

ResumeExperiment: "never" - Always delete Suggestion deployment after Experiment is finished and never restart it.
ResumeExperiment: "alwaysRunning" - Suggestion deployment is in always running mode.
ResumeExperiment: "fromPersistentVolume" - Suggestion deployment is deleted after Experiment is finished and we can restore it from PV.

This is just a thought; we can think about better naming.
What do you think, @sperlingxx?
/cc @gaocegege @johnugeorge @richardsliu

Member

I think it is already updated.

pkg/controller.v1alpha3/experiment/experiment_util.go (outdated, resolved)
@gaocegege
Member

@andreyvelich Requesting another review, thanks.

Member

@andreyvelich andreyvelich left a comment

Thank you for this change. I left a few comments.


const (
NeverResume ResumePolicyType = "Never"
LongRunning ResumePolicyType = "LongRunning"
Member

I like the name ResumePolicy.
What do you think about the name LongRunning? Is it better than AlwaysRunning?
/cc @johnugeorge @gaocegege @richardsliu

Member

@gaocegege @johnugeorge @richardsliu Do you have any ideas about the naming?

Member

To make it more consistent, how about keeping the options to "Never" and "Always"?

Member Author

@johnugeorge As proposed in #1062, there will be a third policy type representing "resuming the experiment from a persisted format".

Member

I agree with @sperlingxx.
In the case where the Experiment is not deleted and is always running, ResumeExperiment: Always does not sound very obvious.

Member

SGTM. I think we can use LongRunning
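
For reference, the naming settled on here could be used along these lines. The two constants are the ones quoted from the diff in this review; the `effectivePolicy` and `validatePolicy` helpers are hypothetical names introduced only for illustration, and falling back to `LongRunning` when the field is empty is an assumption based on the later remark in this conversation that `LongRunning` is the default.

```go
package main

import "fmt"

// ResumePolicyType and its two values are copied from the diff under review.
type ResumePolicyType string

const (
	NeverResume ResumePolicyType = "Never"
	LongRunning ResumePolicyType = "LongRunning"
)

// effectivePolicy is a hypothetical defaulting helper: an unset policy is
// treated as LongRunning (assumed default).
func effectivePolicy(p ResumePolicyType) ResumePolicyType {
	if p == "" {
		return LongRunning
	}
	return p
}

// validatePolicy is a hypothetical validator that rejects any value other
// than the two constants above (after defaulting).
func validatePolicy(p ResumePolicyType) error {
	switch effectivePolicy(p) {
	case NeverResume, LongRunning:
		return nil
	default:
		return fmt.Errorf("unsupported resume policy %q", p)
	}
}

func main() {
	for _, p := range []ResumePolicyType{"", NeverResume, LongRunning, "Always"} {
		fmt.Printf("%q -> effective %q, err: %v\n", p, effectivePolicy(p), validatePolicy(p))
	}
}
```

A third value for restoring from persisted state (#1062) would slot into the same `switch` without changing the existing two.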

@k8s-ci-robot k8s-ci-robot requested a review from gaocegege April 17, 2020 10:50
@sperlingxx sperlingxx force-pushed the delete_suggestion_after_finished branch from 4d35ba6 to 590deaf Compare April 18, 2020 14:30
@gaocegege
Member

@johnugeorge @andreyvelich Any other suggestions?

Member

@gaocegege gaocegege left a comment

/lgtm

@johnugeorge
Member

Sorry for being late.
One minor change.
@sperlingxx Can you set the resume policy to NeverResume in one or more of our examples so that it is exercised by our presubmit tests? Since the resume policy is not specified in our tests, only the default, LongRunning, is currently tested.
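
To illustrate why an example that omits the policy only exercises the default path, here is a small sketch. It assumes a trimmed-down, hypothetical `exampleSpec` struct and a `resumePolicy` JSON key with `omitempty`; neither is confirmed by this conversation, which only shows the earlier `resumeExperiment` boolean field.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ResumePolicyType and its values are copied from the diff under review.
type ResumePolicyType string

const (
	NeverResume ResumePolicyType = "Never"
	LongRunning ResumePolicyType = "LongRunning"
)

// exampleSpec is a hypothetical stand-in for the relevant part of
// ExperimentSpec. The field name and JSON key are assumptions.
type exampleSpec struct {
	ResumePolicy ResumePolicyType `json:"resumePolicy,omitempty"`
}

// render serializes a spec the way an example manifest would carry it.
func render(spec exampleSpec) string {
	out, err := json.Marshal(spec)
	if err != nil {
		panic(err)
	}
	return string(out)
}

func main() {
	// Policy omitted: the field disappears, so only the default is exercised.
	fmt.Println(render(exampleSpec{}))
	// Policy set explicitly: the non-default path is exercised.
	fmt.Println(render(exampleSpec{ResumePolicy: NeverResume}))
}
```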

@sperlingxx
Member Author

sperlingxx commented Apr 29, 2020

> Sorry for being late.
> One minor change.
> @sperlingxx Can you set the resume policy to NeverResume in one or more of our examples so that it is exercised by our presubmit tests? Since the resume policy is not specified in our tests, only the default, LongRunning, is currently tested.

Good idea! But I'm busy with other work these days; I think I can provide the test case in early (maybe mid) May.

@andreyvelich
Member

I think we can merge this PR and add the e2e test in a future PR.
What do you think @gaocegege @johnugeorge ?

@gaocegege
Member

Of course. I think we can.

/cc @johnugeorge

@johnugeorge
Member

Sounds good
/approve

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnugeorge


@k8s-ci-robot k8s-ci-robot merged commit 58e0764 into kubeflow:master Apr 30, 2020
sperlingxx added a commit to sperlingxx/katib that referenced this pull request Jul 9, 2020

* Delete Suggestion deployment after Experiment is finished

* fix

* update

* update openapi

* fix validator test

* refine

* add debug info

* try to fix

* try to fix

* try to fix

* keep API style consistent

* rebase master