Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sync up worker status all the time #299

Merged
merged 1 commit into from
Dec 20, 2018
Merged

sync up worker status all the time #299

merged 1 commit into from
Dec 20, 2018

Conversation

hougangliu
Copy link
Member

@hougangliu hougangliu commented Dec 19, 2018

Now, when metricsCollector for a worker failed to be created, the worker job status in studyjob.status.trials will not be updated (blocked in Running), and the corresponding job will not be deleted even completed.

Fixes: #298


This change is Reviewable

@hougangliu
Copy link
Member Author

/assign @YujiOshima @richardsliu

}
}
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indentation looks misaligned.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it works well (I had verified it). else match with line 347 (not 348)

}
}
} else {
// for some reason, metricsCollector for this worker cannot be found (deleted by anyone accidentally or even failed to be created)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the job finishes early but metric collector is not spawned yet?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this case, metrics including objectiveMetrics will not be collected, studyJob will have no idea which group of hyperParameter have better performance. That is, studyJob will study nothing

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a worker job changes to "Running", metric collector cronjob is spawned rightly. That is, when a worker job changes to "Complete", but its corresponding metric collector cronjob cannot be found. In this case, I think it is deleted by anyone accidentally or even failed to be created.

@richardsliu
Copy link
Contributor

Thanks for your explanations.
/lgtm
/approve

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: richardsliu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit b11b81d into kubeflow:master Dec 20, 2018
@hougangliu hougangliu deleted the fix-worker-status branch December 20, 2018 05:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants