Resume experiment with extra trials from last checkpoint #952

johnugeorge · 2019-12-06T14:32:35Z

Fixes: #891
This adds the ability to resume an experiment. If a user wants to run the same experiment with more trials, max trials can be reconfigured to restart the experiment from the last checkpoint.

The experiment is restarted only if it is in the succeeded state and had previously reached max trials.

johnugeorge · 2019-12-06T14:33:10Z

/hold
Holding till reviews are completed.
/cc @gaocegege
/cc @hougangliu

johnugeorge · 2019-12-06T15:45:42Z

/cc @richardsliu

gaocegege

lgtm

Should we add a test case for it?

johnugeorge · 2019-12-08T11:54:01Z

@gaocegege added a test that reconfigures max trials and the parallel trial count.

hougangliu · 2019-12-09T01:12:49Z

pkg/util/v1alpha3/katibclient/katib_client.go

@@ -123,6 +124,14 @@ func (k *KatibClient) CreateExperiment(experiment *experimentsv1alpha3.Experimen
 	return nil
 }

+func (k *KatibClient) UpdateExperiment(experiment *experimentsv1alpha3.Experiment, namespace ...string) error {


Is this method used for the UI? Also, the namespace param doesn't seem to be used.

I am not sure about it. Shall we do this separately?

If this method is not needed for this PR, I think we'd better remove it.

No, this method is currently used by the resume_e2e_experiment.go script in this PR to update the experiment.
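
For illustration only, here is a minimal sketch of how a variadic `namespace` parameter like the one in the `UpdateExperiment` signature could be honored. The helper `resolveNamespace` and the default namespace value are hypothetical and are not taken from this PR or from the Katib client.

```go
package main

import "fmt"

// resolveNamespace is a hypothetical helper (not part of this PR).
// It returns the first non-empty optional namespace, or the default
// when the caller passed none.
func resolveNamespace(defaultNS string, namespace ...string) string {
	if len(namespace) > 0 && namespace[0] != "" {
		return namespace[0]
	}
	return defaultNS
}

func main() {
	// No namespace argument: fall back to the assumed default.
	fmt.Println(resolveNamespace("kubeflow"))
	// Explicit namespace argument: use it instead of the default.
	fmt.Println(resolveNamespace("kubeflow", "team-a"))
}
```

A method that accepts `namespace ...string` but never reads it behaves the same as one without the parameter, which is the concern raised above.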

hougangliu · 2019-12-09T01:17:25Z

pkg/controller.v1alpha3/experiment/experiment_controller.go

+		// Experiment is restartable only if it is in succeeded state by reaching max trials
+		if util.IsCompletedExperimentRestartable(instance) {
+			// Check if max trials is reconfigured
+			if (instance.Spec.MaxTrialCount != nil) &&


I think if a user changes MaxTrialCount to nil (infinity), we should also allow the experiment to restart here.

hougangliu · 2019-12-09T04:41:22Z

pkg/controller.v1alpha3/experiment/experiment_controller.go

-				(*instance.Spec.MaxTrialCount != instance.Status.Trials) {
+			if (instance.Spec.MaxTrialCount != nil &&
+				*instance.Spec.MaxTrialCount != instance.Status.Trials) ||
+				(instance.Spec.MaxTrialCount == nil && instance.Status.Trials != 0) {


Should we consider this case: instance.Spec.MaxTrialCount == 0, so the experiment is marked completed with instance.Status.Trials == 0, and then instance.Spec.MaxTrialCount is updated to nil?

I think this is an invalid case. It doesn't make sense to set MaxTrialCount to zero if it is set. We should add validation for this separately.
Related: #768
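
As a rough sketch of the condition discussed in this thread, the check from the diff can be exercised in isolation. The `experiment` struct below is a hypothetical stand-in for the two fields the condition reads (`Spec.MaxTrialCount` and `Status.Trials`), with assumed `int32` types; it is not the real Katib API type, and `maxTrialsReconfigured` is a name chosen only for this example.

```go
package main

import "fmt"

// experiment is a hypothetical stand-in for the fields the restart
// condition reads; it is not the real Katib Experiment type.
type experiment struct {
	maxTrialCount *int32 // mirrors Spec.MaxTrialCount; nil means no limit
	trials        int32  // mirrors Status.Trials
}

// maxTrialsReconfigured mirrors the boolean condition from the diff:
// a set max that differs from the trials run so far, or an unset max
// on an experiment that has already run at least one trial.
func maxTrialsReconfigured(e experiment) bool {
	if e.maxTrialCount != nil {
		return *e.maxTrialCount != e.trials
	}
	return e.trials != 0
}

// int32Ptr is a small helper for building pointer fields in examples.
func int32Ptr(v int32) *int32 { return &v }

func main() {
	// Max trials unchanged after completion: no restart.
	fmt.Println(maxTrialsReconfigured(experiment{maxTrialCount: int32Ptr(12), trials: 12}))
	// Max trials raised from 12 to 15: restart.
	fmt.Println(maxTrialsReconfigured(experiment{maxTrialCount: int32Ptr(15), trials: 12}))
	// Max trials removed (nil) after 12 trials: restart.
	fmt.Println(maxTrialsReconfigured(experiment{maxTrialCount: nil, trials: 12}))
	// The edge case raised above: max was 0, then set to nil, with 0 trials.
	fmt.Println(maxTrialsReconfigured(experiment{maxTrialCount: nil, trials: 0}))
}
```

Under these assumptions the last case evaluates to false, so an experiment that completed with zero trials is not restarted when MaxTrialCount is later cleared, which matches the edge case the reviewers agreed to handle through separate validation.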

hougangliu · 2019-12-09T05:06:55Z

/lgtm

johnugeorge · 2019-12-09T05:11:59Z

/hold cancel
as reviews are completed

johnugeorge · 2019-12-09T05:12:04Z

/approve

k8s-ci-robot · 2019-12-09T05:12:09Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnugeorge

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [johnugeorge]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

johnugeorge added 2 commits December 6, 2019 19:57

Resuming experiment with extra trials

0382ae6

Resuming experiment with extra trials

eaebf42

k8s-ci-robot requested review from Akado2009 and garganubhav December 6, 2019 14:32

k8s-ci-robot added the size/M label Dec 6, 2019

k8s-ci-robot requested review from gaocegege and hougangliu December 6, 2019 14:33

k8s-ci-robot added the do-not-merge/hold label Dec 6, 2019

johnugeorge mentioned this pull request Dec 6, 2019

[feature] Add new trials to a succeeded Experiment #891

Closed

k8s-ci-robot requested a review from richardsliu December 6, 2019 15:45

gaocegege reviewed Dec 7, 2019

Adding test script

93d3f47

k8s-ci-robot added size/L and removed size/M labels Dec 8, 2019

johnugeorge added 2 commits December 8, 2019 15:37

relative path

0252d2b

Verify if experiment is running again

421ad52

hougangliu reviewed Dec 9, 2019

Adding case when maxtrials is not set

96f7f23

hougangliu reviewed Dec 9, 2019

k8s-ci-robot assigned hougangliu Dec 9, 2019

k8s-ci-robot added the lgtm label Dec 9, 2019

k8s-ci-robot removed the do-not-merge/hold label Dec 9, 2019

k8s-ci-robot added the approved label Dec 9, 2019

k8s-ci-robot merged commit 4a97e21 into kubeflow:master Dec 9, 2019
