[Enhancement Request] Metrics Collector Push-based Implementation #577
Comments
I am glad to discuss it after v1alpha2 is released.
Yes. This is a long-standing need :) Let's take this up during the next API design phase.
This is fixed with the new metric collector design in v1alpha3.
Closing the issue
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/lifecycle frozen
/area gsoc |
Hi everyone! I'm Electronic-Waste, a.k.a. Shao Wang, a senior student from Shanghai Jiao Tong University. My major interests lie in cloud infrastructure related to Kubernetes, and also AI infrastructure. I have two professional experiences involving Kubernetes and Go:
I have also previously made some contributions to tensorchord/envd, an open-source tool designed to solve AI/ML development environment setup. I noticed that Kubeflow has been chosen as one of the orgs participating in GSoC 2024, and I'm interested in this issue. I wonder how I can get started on it. Could you please offer me some guidance?
@Electronic-Waste Hi, Shao. Thank you for your interest in the Kubeflow GSoC project. Please join the Kubeflow Slack workspace to receive some information about GSoC.
@tenzen-y Okay, thanks for telling me about this.
cc 👀 @johnugeorge @gaocegege @andreyvelich @tenzen-y I have a question about the push-based metrics collector. Could you please take a quick look at my questions? The paper introducing Katib mentions that Katib supports two kinds of metric collection: push-based and pull-based. Since Katib has already implemented the push-based way to collect metrics, why do we still need to implement it again in this enhancement request (or in the GSoC project)? Do I misunderstand the content of the paper? Or has Katib implemented it via YAML configuration but now needs a Python SDK version?
@Electronic-Waste IIUC, the current Katib provides only pull-based metrics collectors, like the file metrics collector and the TF Event metrics collector. So we need to implement a push-based metrics collector. Regarding the paper above, I'm not sure whether it is correct, since I'm not one of its authors.
@tenzen-y Thank you for your clarification! I'll try to look into the source code to get the details and write my proposal for implementing this enhancement request.
@Electronic-Waste I think the main idea was that users can still use the Katib DB Manager gRPC API to push metrics to the Katib DB. In that case, the user has to disable sidecar injection and make sure that metrics have been collected.
@andreyvelich Okay, thanks for your clarification too! If I don't misunderstand you, do you mean that we want to add a new interface (such as katib_client.push_metrics()) for users to push metrics to the Katib DB directly, without changing the current logic?
As an option, yes, but we need to have a design proposal to discuss various options and, potentially, Experiment API changes.
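To make the idea above concrete, here is a minimal sketch of what a push-style metrics payload might look like. The function name `build_push_payload` and the payload shape are assumptions for illustration only, not the actual Katib SDK or DB Manager gRPC schema; the point is that the trial itself reports metrics directly instead of a sidecar scraping them.

```python
# Hypothetical sketch of a push-based metrics report (names and shape
# are illustrative, not the real Katib API).
import datetime


def build_push_payload(trial_name, metrics):
    """Package trial metrics into a payload a client could push to a
    metrics backend such as the Katib DB Manager (shape assumed)."""
    now = datetime.datetime.utcnow().isoformat() + "Z"
    return {
        "trial_name": trial_name,
        "metric_logs": [
            {"time_stamp": now, "metric": {"name": k, "value": str(v)}}
            for k, v in metrics.items()
        ],
    }


payload = build_push_payload("my-trial-abc123", {"accuracy": 0.93, "loss": 0.21})
print(payload["trial_name"])
print(len(payload["metric_logs"]))
```

A real `katib_client.push_metrics()` would presumably wrap such a payload in a gRPC call and handle retries; this sketch only shows the data flow being discussed.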
@andreyvelich Okay, I'll raise my design proposal in the next few days. Btw, how do you want me to send it to you?
@Electronic-Waste Please can you join the AutoML and Training WG call this Wed at 2pm UTC: https://bit.ly/2PWVCkV so we can discuss details?
@andreyvelich Yeah, of course!
Hi @andreyvelich and @rareddy, out of curiosity for the proposal, what are some use cases that prevent the metrics-collector sidecar from working?
/assign @Electronic-Waste |
Sorry for the late reply @YelenaYY. It might have some performance drawbacks, since we need to parse the entire container stdout to get the required metrics. To redirect output to the file, we wrap the container entrypoint, which will always use
This feature has been implemented as part of this project: #2340.
@kubeflow/wg-automl-leads Thanks! It's been an unforgettable journey collaborating with you this summer.
/kind feature
Describe the solution you'd like
Currently, the design of the metrics collector is pull-based. We have a metrics collector cron job per trial. It collects logs from the pod's log, then parses them and persists the metrics in MySQL.
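The pull-based flow described above can be sketched as a collector that scans a trial's log text for metric values. The `name=value` line format and the regex are simplifying assumptions for illustration; the real Katib metrics collectors support configurable metric formats.

```python
# Simplified sketch of pull-based metric collection: scan log output
# for "name=value" pairs (format is an assumption for illustration).
import re

METRIC_RE = re.compile(r"(\w+)\s*=\s*([-+]?\d*\.?\d+)")


def parse_metrics(log_text, metric_names):
    """Return the last observed value for each requested metric name."""
    found = {}
    for line in log_text.splitlines():
        for name, value in METRIC_RE.findall(line):
            if name in metric_names:
                found[name] = float(value)
    return found


logs = "epoch 1: loss=0.52 accuracy=0.81\nepoch 2: loss=0.31 accuracy=0.90\n"
print(parse_metrics(logs, {"loss", "accuracy"}))
```

This also illustrates the performance concern raised later in the thread: the collector has to parse the entire stdout of the container to recover a handful of values.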
This design has some problems (kubeflow/training-operator#722 (comment)). @johnugeorge proposed a push-based model to avoid the problems caused by the current design, and I also have some ideas about it.
In my design, we need a push-based implementation to push the metrics to Prometheus. Then we can use a custom metrics server to expose the trial- or job-level metrics, so that Katib can get all periodic metrics from the Kubernetes API server. The early stopping service can use that API to determine whether we should kill a trial, and the UI can use it to show the metrics.
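For the Prometheus step above, the trial would expose its metrics in the Prometheus text exposition format so they can be scraped (or pushed to a gateway). The metric and label names below are illustrative assumptions; a real implementation would use the official Prometheus client library and serve this text on a `/metrics` HTTP endpoint.

```python
# Minimal sketch of rendering trial metrics in the Prometheus text
# exposition format (metric/label names are assumptions).
def render_prometheus_metrics(trial_name, metrics):
    """Render each metric as a gauge labeled with the trial name."""
    lines = []
    for name, value in sorted(metrics.items()):
        lines.append(f"# TYPE katib_trial_{name} gauge")
        lines.append(f'katib_trial_{name}{{trial="{trial_name}"}} {value}')
    return "\n".join(lines) + "\n"


text = render_prometheus_metrics("my-trial", {"accuracy": 0.9, "loss": 0.2})
print(text)
```

Once the metrics are in Prometheus, a custom metrics adapter can surface them through the Kubernetes custom metrics API, which is where early stopping and the UI would read them from.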
TFJob and PyTorchJob can also benefit from the metrics collector, because we can use it to collect periodic metrics for them, too. And the metrics will be exposed in a Kubernetes-native way: the K8s metrics API.
Anything else you would like to add: