Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resume experiment with extra trials from last checkpoint #952

Merged
merged 6 commits into from
Dec 9, 2019

Conversation

johnugeorge
Copy link
Member

@johnugeorge johnugeorge commented Dec 6, 2019

Fixes: #891
Resuming experiment feature is added. If user wants to try the same experiment with more number of trials, max trials can be reconfigured to restart the experiment from the last checkpoint.

Experiment is restarted only when experiment is in succeeded state with max trials reached previously.


This change is Reviewable

@johnugeorge
Copy link
Member Author

/hold
Holding till reviews are completed.
/cc @gaocegege
/cc @hougangliu

@johnugeorge
Copy link
Member Author

/cc @richardsliu

Copy link
Member

@gaocegege gaocegege left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Should we add a test case for it?

@k8s-ci-robot k8s-ci-robot added size/L and removed size/M labels Dec 8, 2019
@johnugeorge
Copy link
Member Author

@gaocegege added a test to reconfigure max trails and parallel trial count

@@ -123,6 +124,14 @@ func (k *KatibClient) CreateExperiment(experiment *experimentsv1alpha3.Experimen
return nil
}

func (k *KatibClient) UpdateExperiment(experiment *experimentsv1alpha3.Experiment, namespace ...string) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this method used for UI? and it seems namespace param isn't used

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about it. Shall we do this separately?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this method is not useful in your PR topic, I think we'd better remove it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. this method is currently used in resume_e2e_experiment.go script in this PR to update the experiment.

// Experiment is restartable only if it is in succeeded state by reaching max trials
if util.IsCompletedExperimentRestartable(instance) {
// Check if max trials is reconfigured
if (instance.Spec.MaxTrialCount != nil) &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if a user changes MaxTrialCount to nil (infinity), here we should also allow to restart it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

(*instance.Spec.MaxTrialCount != instance.Status.Trials) {
if (instance.Spec.MaxTrialCount != nil &&
*instance.Spec.MaxTrialCount != instance.Status.Trials) ||
(instance.Spec.MaxTrialCount == nil && instance.Status.Trials != 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we consider the case: instance.Spec.MaxTrialCount == 0 and the experiment will be marked completed with instance.Status.Trials == 0; then update instance.Spec.MaxTrialCount to nil?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think, this is a invalid case. It doesn't make sense to set MaxTrialCount to be zero if set. We should add a validation for this separately.
Related: #768

@hougangliu
Copy link
Member

/lgtm

@johnugeorge
Copy link
Member Author

/hold cancel
as reviews are completed

@johnugeorge
Copy link
Member Author

/approve

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnugeorge

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 4a97e21 into kubeflow:master Dec 9, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[feature] Add new trials to a succeeded Experiment
4 participants