Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Fix] add early stopped trials in converter #2004

Merged

Conversation

shaowei-su
Copy link
Contributor

Python suggestion services will convert the trials from request payload before passing in actual search algos. However, early stopped trials are filtered out in this process and thus no updates are provided to search algo and there are duplicated trials created.

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

#2002

Checklist:

  • Docs included if any changes are user facing

@coveralls
Copy link

coveralls commented Nov 8, 2022

Coverage Status

Coverage increased (+0.03%) to 73.453% when pulling a05ff34 on shaowei-su:shaowei--fix-es-trials-parse into d97c8ae on kubeflow:master.

Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @shaowei-su!
/lgtm
/assign @johnugeorge @tenzen-y

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shaowei-su Thanks for fixing this!
/lgtm

@kubeflow/wg-automl-leads Can you restart failure CI?

@google-oss-prow google-oss-prow bot removed the lgtm label Nov 9, 2022
@shaowei-su
Copy link
Contributor Author

Due to the aysnc nature of the trials' updates, the suggestion service actually might be invoked without all the observed metrics available. I updated this logic too so that suggestion service return empty list if no new updates been made.
PTAL! @andreyvelich @tenzen-y Thanks!

@tenzen-y
Copy link
Member

@shaowei-su Thanks for updating!
/lgtm

/assign @andreyvelich @johnugeorge

@google-oss-prow google-oss-prow bot added the lgtm label Nov 10, 2022
logger.info("Succeeded Trials didn't change: {}\n".format(self.succeeded_trials))
logger.error("Succeeded Trials didn't change: {}\n".format(self.succeeded_trials))
logger.error("No new suggestions could be generated, return early..\n")
return []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shaowei-su What if Trial was failed for some reason (e.g. networking) and we ask for the new Trials ?
In that case succeeded_trials list will stay the same.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a great question @andreyvelich , I can think of two possible routes:

  1. Add more conditional branches in the suggestions service, e.g is the service invoked as a result of failed trials? if so, suggestion service will return duplicated suggestions as before this PR change;

  2. Retry for trials should be kept as a logic within trial templates themself i.e Job, TFJob, MPIJob etc.. HPO search experiment only make progress when there are succeeded (including early stopped ones) trials complete. In the case of multiple concurrent trials, suggestion service will not return new trials until next succeeded trial's ready.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add more conditional branches in the suggestions service, e.g is the service invoked as a result of failed trials?

But do we need conditional branch to the suggestion service ?
If succeeded_trials = Succeeded Trials + Early Stopped Trials, then if Trial was failed the Suggestion service will just ask for the new trials (e.g. with duplicated HPs) using current_request_number.

Retry for trials should be kept as a logic within trial templates themself i.e Job, TFJob, MPIJob etc..

That's valid assumption. In that case, what should be the nature of MaxFailedTrialCount API ?

cc @johnugeorge @tenzen-y

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But do we need conditional branch to the suggestion service ?

The reason is that suggestion service can't tell whether the current_request_number is requested as a result of failed trails (in this case service should return duplicated HPs), or as a result of delay in trial status/observations updates (in this case service should wait until trials are fully updated).

The latter can be reproduced by this example and change the algo to bayesianoptimization.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MaxFailedTrialCount still serve its purpose as a way to manage lifecycle of experiments. The difference would be whether we add up duplicated HP trials for counting. If suggestion services are to generate those duplicated HP trials, then a single failed HP trial with N failures will cause an experiment to fail.

Copy link
Member

@andreyvelich andreyvelich Nov 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when trial early stopped without observation updated in time

Should we avoid sending Early Stopped Trials without observation to the Suggestion service ? e.g. we don't send Trials with Metrics Unavailable condition. I think, Trials logically shouldn't be called "EarlyStopped" until Observation is ready for them. WDYT @shaowei-su @johnugeorge @tenzen-y ?
@shaowei-su you are right with your assumption about current_request_number.

I know it is hacky that we set EarlyStopping trial condition from the Early Stopping service and only after that we add Observation Trial results. We might want to reconsider some early stopping design when we introduce push-based Metrics Collector: #577

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To fix this, we could try to add the early stopped but non-observation ready trials into the active trials count as well. wdyt?

Since Suggestion and Trial controller works in async manner, I think we should introduce another check to Suggestion Client as I mentioned above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually those trials are already filtered out on the suggestion service side.

What's missing here is the correct currentRequestNum, which should not be incremented before the early stopping metrics are fully updated.

Copy link
Member

@andreyvelich andreyvelich Nov 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shaowei-su But is that correct approach to filter them on the Suggestion side instead of Katib Controller side ?

Some of the suggestion services might not use internal implementation and just forget about it.
I think the Katib controller should just invoke suggestion service with the correct statuses for the Trials.
E.g. these Trial statuses.
Which means if EarlyStopped Trial doesn't have observation, Controller should not send it to Suggestion Service.
The main purpose of the Suggestion service is just produce new HPs based on Trial results.

I doubt that Suggestion service should do any additional work with Trials from the Katib controller.

What do other think about it ? @johnugeorge @gaocegege @tenzen-y

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreyvelich I made a quick updates on the controller side filters and suggestion request count logic, ptal! thanks

Comment on lines 346 to 348
if !t.IsObservationAvailable() {
continue
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also remove this check for the Trial observation since we don't send Trials without observation to the Suggestion ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest to keep this filtering logic since others might run different version of controller vs suggestion service docker images.

Copy link
Member

@andreyvelich andreyvelich Nov 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually, we ask our users to run the same version of Katib Control Plane to avoid incompatibility issues.
Should we add TODO in this section to remove it in the future versions ?
What do others think @gaocegege @johnugeorge @tenzen-y ?

Copy link
Contributor Author

@shaowei-su shaowei-su Nov 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, changed the filter logic to early stopped trials only and let's keep the service side check as it is.

@@ -343,6 +343,9 @@ func (g *General) ConvertTrials(ts []trialsv1beta1.Trial) []*suggestionapi.Trial
if t.IsMetricsUnavailable() {
continue
}
if !t.IsObservationAvailable() {
continue
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, what about Failed Trials (I guess the observation is empty for them, right)? Do we want to send them to the Suggestion service ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it make more sense to filter the failed trails out as those won't provide any updates to the suggestion service.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some Suggestion services record failed Trials on its own DataBase (e.g. Goptuna Suggestion)
@shaowei-su Do you have any concerns with sending Failed Trials to the Suggestion service ?
We add failed Trials to completed, so it won't ask Suggestion service to generate new Trials.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, I didn't realize that failed trials are tracked by other suggestion services. Updated in the latest commit to filter on early stopped & without observations trials only, PTAL.

@shaowei-su shaowei-su force-pushed the shaowei--fix-es-trials-parse branch 2 times, most recently from a05ff34 to 8190f9c Compare November 26, 2022 21:51
@andreyvelich
Copy link
Member

Thank you for these updates @shaowei-su!
/lgtm
/assign @johnugeorge @gaocegege @tenzen-y

Comment on lines +346 to +348
if !t.IsObservationAvailable() && t.IsEarlyStopped() {
continue
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shaowei-su Can we add a test for this condition to the TestSyncAssignments func or to a new test function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tenzen-y ! I added few more trails in the mock so this branching logic is also validated.

Btw, could help restart the failed unit tests? Looks like a flaky test failed.

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shaowei-su Thanks for the updates! I appreciate your work in fixing this bug.
/lgtm

/cc @johnugeorge

@tenzen-y
Copy link
Member

tenzen-y commented Dec 2, 2022

/assign @johnugeorge

Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contribution @shaowei-su!
/approve

@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, shaowei-su

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 1e4df8d into kubeflow:master Dec 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants