Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Emit events when fails to reconcile all trials #1706

Merged

Conversation

henrysecond1
Copy link
Contributor

What this PR does / why we need it:

The experiment controller is not showing any events when fails to reconcile all trials.
Thus, users not authorized to access the controller logs can not find the reason that why their trials are not created since no events are emitted by the experiment controller.

Please refer #1663 for detail.

Which issue(s) this PR fixes:

Fixes #1663

@aws-kf-ci-bot
Copy link
Contributor

Hi @henrysecond1. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

continue
}
trialNames = append(trialNames, trial.Name)
}
// Failed to reconcile all trials
if len(trialNames) == 0 && len(errStrings) != 0 {
return fmt.Errorf(strings.Join(errStrings, "\n"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will we retry to create the trial after the error?

/cc @andreyvelich

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, eventually Experiment controller will try to create Trials until we reach the maxFailedTrialCount: https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/experiment/util/status_util.go#L195-L202.

@henrysecond1 I think it is a great enhancement to record this error to the events, but do we want to check errors and returned Trials ? We should have clear requirement "When Trial is Failed".

For example, in case of this error we don't want to stop reconcile. For example, Trial is already exist because of unsynchronised policy of Kubernetes controllers.

I can see the following solution:

  • rename createTrialInstance to getTrialInstance.
  • move Trial Creation out of this function.
  • if getTrialInstance fails, return error for createTrials function.
  • if r.Create fails continue the loop.

Any other suggestion @henrysecond1 @gaocegege @johnugeorge ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreyvelich Sorry for the late response and thank you for your advice.

I agree with the suggested solution. It seems reasonable to me.
I modified the code according to your suggestion, could you check it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreyvelich PTAL

Thanks for your time 😄

@gaocegege
Copy link
Member

Hi @henrysecond1 Is it ready for another review?

@henrysecond1
Copy link
Contributor Author

Hi @henrysecond1 Is it ready for another review?

Oh yes, please review it when you are available :)

@coveralls
Copy link

Coverage Status

Coverage increased (+1.009%) to 73.677% when pulling e3398ce on henrysecond1:feature/experiment-events into 30e47df on kubeflow:master.

@andreyvelich
Copy link
Member

Thank you for the updates @henrysecond1!
/lgtm
/assign @gaocegege @johnugeorge

Copy link
Member

@gaocegege gaocegege left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

Thanks for your contribution! 🎉 👍

@google-oss-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gaocegege, henrysecond1

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-robot google-oss-robot merged commit 3ed331e into kubeflow:master Oct 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

experiment controller is not showing any events when fails to reconcile all trials
7 participants