Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Support running #894

Merged
merged 5 commits into from
Dec 6, 2019
Merged

feat: Support running #894

merged 5 commits into from
Dec 6, 2019

Conversation

gaocegege
Copy link
Member

@gaocegege gaocegege commented Oct 24, 2019

Signed-off-by: Ce Gao gaoce@caicloud.io

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

  1. Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

Release note:



This change is Reviewable

@gaocegege
Copy link
Member Author

/hold

Until 0.7 is released

Copy link
Member Author

@gaocegege gaocegege left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the past, we mark the trial running only when the jobs is created. If this reconcile return error, then we will not update the running status again since we cannot go to the code block again. This PR is to fix the problem.

/assign @hougangliu @johnugeorge

@gaocegege
Copy link
Member Author

/hold cancel

@hougangliu
Copy link
Member

/retest

1 similar comment
@gaocegege
Copy link
Member Author

/retest

Signed-off-by: Ce Gao <gaoce@caicloud.io>
Signed-off-by: Ce Gao <gaoce@caicloud.io>
@johnugeorge
Copy link
Member

@gaocegege Can you rebase? I will review it today

"err", "Status is not found in job")
return &jobCondition, nil
// Job does not have the running condition in status, thus we think
// the job is running when it is created.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain this comment? This is little confusing. When there is no status field, it means that job is just created.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, The Job does not have a running condition, thus we cannot use the same logic as TFJob

return &jobCondition, nil
// Job does not have the running condition in status, thus we think
// the job is running when it is created.
log.Info("NestedFieldCopy", "err", "status cannot be found in job")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, why is this logic only added for batch job and not for other types?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant, kubeflow provider

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The kubeflow provider already implements it before.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gaocegege
Copy link
Member Author

/hold

Found some problems after rebase.

Signed-off-by: Ce Gao <gaoce@caicloud.io>
@k8s-ci-robot k8s-ci-robot added size/M and removed size/S labels Dec 5, 2019
@gaocegege
Copy link
Member Author

Verified with job:

(base) ➜  katib git:(running) ✗ kubectl -n kubeflow get trials -o json | jq ".items[] | {conditions: .status.conditions}"
{
  "conditions": [
    {
      "lastTransitionTime": "2019-12-05T02:26:59Z",
      "lastUpdateTime": "2019-12-05T02:26:59Z",
      "message": "Trial is created",
      "reason": "TrialCreated",
      "status": "True",
      "type": "Created"
    },
    {
      "lastTransitionTime": "2019-12-05T02:29:27Z",
      "lastUpdateTime": "2019-12-05T02:29:27Z",
      "message": "Trial is running",
      "reason": "TrialRunning",
      "status": "False",
      "type": "Running"
    },
    {
      "lastTransitionTime": "2019-12-05T02:29:27Z",
      "lastUpdateTime": "2019-12-05T02:29:27Z",
      "message": "Trial has succeeded",
      "reason": "TrialSucceeded",
      "status": "True",
      "type": "Succeeded"
    }
  ]
}
{
  "conditions": [
    {
      "lastTransitionTime": "2019-12-05T02:26:42Z",
      "lastUpdateTime": "2019-12-05T02:26:42Z",
      "message": "Trial is created",
      "reason": "TrialCreated",
      "status": "True",
      "type": "Created"
    },
    {
      "lastTransitionTime": "2019-12-05T02:26:43Z",
      "lastUpdateTime": "2019-12-05T02:26:43Z",
      "message": "Trial is running",
      "reason": "TrialRunning",
      "status": "True",
      "type": "Running"
    }
  ]
}
{
  "conditions": [
    {
      "lastTransitionTime": "2019-12-05T02:26:42Z",
      "lastUpdateTime": "2019-12-05T02:26:42Z",
      "message": "Trial is created",
      "reason": "TrialCreated",
      "status": "True",
      "type": "Created"
    },
    {
      "lastTransitionTime": "2019-12-05T02:26:59Z",
      "lastUpdateTime": "2019-12-05T02:26:59Z",
      "message": "Trial is running",
      "reason": "TrialRunning",
      "status": "True",
      "type": "Running"
    }
  ]
}

Signed-off-by: Ce Gao <gaoce@caicloud.io>
@hougangliu
Copy link
Member

/lgtm

@gaocegege
Copy link
Member Author

/retest

@k8s-ci-robot k8s-ci-robot removed the lgtm label Dec 5, 2019
@johnugeorge
Copy link
Member

/retest

@gaocegege
Copy link
Member Author

/hold cancel

@gaocegege
Copy link
Member Author

/hold

Forgot to test on TFJob/PyTorchJob

@gaocegege
Copy link
Member Author

/hold cancel

Signed-off-by: Ce Gao <gaoce@caicloud.io>
@@ -118,6 +130,9 @@ func (r *ReconcileTrial) updateFinalizers(instance *trialsv1alpha3.Trial, finali
}

func isTrialObservationAvailable(instance *trialsv1alpha3.Trial) bool {
if instance == nil {
return false
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just wondering when instance becomes nil ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never, but prefer to keep to avoid potential crashes.

@johnugeorge
Copy link
Member

/retest

@johnugeorge
Copy link
Member

/lgtm

@johnugeorge
Copy link
Member

/approve

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnugeorge

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 5d46799 into kubeflow:master Dec 6, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants