Enhancement for Custom CRD #1333
Add retry on empty observation
I figured out that we send all Trials to Suggestion. I think we don't need to send Trials with unavailable metrics to the Suggestion service, since we keep reconciling the controller until metrics are reported.
SGTM, we do not need to send these trials.
/lgtm
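To illustrate the point agreed on above, here is a minimal Go sketch of filtering out Trials whose metrics are unavailable before they are handed to Suggestion. The `Trial` struct, its `MetricsAvailable` field, and `filterTrialsWithMetrics` are hypothetical names for illustration only, not the actual Katib types.

```go
package main

// Trial is a simplified, hypothetical stand-in for the Trial resource.
type Trial struct {
	Name             string
	MetricsAvailable bool // assumed flag: true once metrics were reported
}

// filterTrialsWithMetrics returns only the trials whose metrics were
// reported, preserving their original order.
func filterTrialsWithMetrics(trials []Trial) []Trial {
	kept := make([]Trial, 0, len(trials))
	for _, t := range trials {
		if t.MetricsAvailable {
			kept = append(kept, t)
		}
	}
	return kept
}
```

Only the slice returned by `filterTrialsWithMetrics` would then be included in the request to the Suggestion service.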
If objective metric value is not reported, metrics collector reports unavailable value to the DB.
Controller reconciles Trial until DB is empty.
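One possible reading of the commit notes above, sketched in Go: the controller re-queues while no observation exists in the DB, and only decides on a final status once the metrics collector has written either a real value or an "unavailable" marker. The `Metric` type, the `unavailableValue` sentinel, and the `decide` function are assumptions made for this sketch.

```go
package main

// Metric is a hypothetical, simplified observation entry read from the DB.
type Metric struct {
	Name  string
	Value string
}

// unavailableValue is an assumed sentinel written by the metrics collector
// when the objective metric was not reported.
const unavailableValue = "unavailable"

// Decision is the outcome of one reconcile pass in this sketch.
type Decision int

const (
	Requeue            Decision = iota // nothing in the DB yet, try again
	MetricsUnavailable                 // collector reported no objective value
	Succeeded                          // objective metric value is present
)

// decide inspects the observation for the objective metric.
func decide(objective string, observation []Metric) Decision {
	if len(observation) == 0 {
		return Requeue
	}
	for _, m := range observation {
		if m.Name == objective {
			if m.Value == unavailableValue {
				return MetricsUnavailable
			}
			return Succeeded
		}
	}
	// Assumption: other metrics without the objective count as unavailable.
	return MetricsUnavailable
}
```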
@gaocegege I discussed my first solution, re-queuing the controller and saving data in a map, with @johnugeorge.
I propose a slightly different solution:
What do you think about this?
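For context, the first solution mentioned above (re-queuing the controller and keeping counts in a map) could look roughly like the following Go sketch. The name `maxRequeueCount` comes from the pull request description; its value of 5, the `requeueTracker` type, and its methods are assumptions for illustration.

```go
package main

import "sync"

// maxRequeueCount mirrors the name used in the pull request description;
// the value 5 is an assumption for this sketch.
const maxRequeueCount = 5

// requeueTracker counts re-queues per Trial name (hypothetical helper).
type requeueTracker struct {
	mu     sync.Mutex
	counts map[string]int
}

func newRequeueTracker() *requeueTracker {
	return &requeueTracker{counts: make(map[string]int)}
}

// shouldRequeue reports whether the Trial should be re-queued because its
// metrics are not available yet. It returns false once metrics arrive or
// the re-queue budget is spent, and clears the map entry in both cases.
func (r *requeueTracker) shouldRequeue(trialName string, metricsAvailable bool) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if metricsAvailable || r.counts[trialName] >= maxRequeueCount {
		delete(r.counts, trialName)
		return false
	}
	r.counts[trialName]++
	return true
}
```

In a controller-runtime reconciler, a `true` result would typically translate into returning `reconcile.Result{Requeue: true}`, while `false` with missing metrics would let the Trial move to the metrics-unavailable status.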
/retest
/retest
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: johnugeorge
The full list of commands accepted by this bot can be found here. The pull request process is described here.
I was testing Custom CRD. I was able to run a Tekton `TaskRun` as the Trial CR. I will submit a PR with a few examples soon.

I found these problems while running:

1. I disabled `MutateJob` in the controller. We will track it in this issue: Refactoring Supported Job List #1320.
2. I disabled validation for Jobs that are not in `SupportedJobList`.
3. We should also verify `PrimaryContainerName` in `MutateVolume`.
4. I changed the `INSERT` in the `RegisterObservationLog` function. It is better to insert all lines in one SQL query instead of running a separate query for each line. This helps to avoid unnecessary Trial updates.
5. I added a reconcile re-queue if metrics are not available. This use case happens with a Tekton job. A Tekton task succeeds once the Training container finishes, but the metrics may not have been collected yet. Because of that, the controller reconciles the Trial without an observation and sets the Trial status to `Metrics Unavailable`. To avoid this, I re-queue the controller up to `maxRequeueCount` times so the Trial can move to the `Succeeded` status.

@gaocegege @johnugeorge What do you think about this approach? Can we do it another way?

/assign @gaocegege @johnugeorge