[tune][autoscaler][observability] One-stop shop for monitoring large-scale DL model training using ray autoscaler #18322
Comments
Would it be possible to leverage an existing Tune callback API for this? https://docs.ray.io/en/latest/tune/user-guide.html?highlight=callbacks#callbacks
IIUC, all Tune APIs can be leveraged and the resulting metrics will be captured as part of the experiment progress. The gap here is the lack of a mechanism to auto-monitor this progress and trigger associated actions (e.g. terminate the experiment and tear down the cluster), as mentioned in requirement #2 above. While callbacks and other trial stats can be emitted to disk and/or custom remote storage, to get insights on "which task on which GPU became the bottleneck of my experiment at what time", one has to manually retrieve cluster metrics emitted by Prometheus (persisted at storage1) and check trial stats (persisted at storage2) side by side. Requirements #1 & #3 remove the need for this manual (and potentially time-consuming) process and let us answer this question faster and more automatically.
Hmm, I actually think the callbacks are sufficient for this purpose.
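For illustration, a minimal sketch of what such a callback could look like (the metric name `mean_loss` is just an assumption here, not something prescribed in this thread):

```python
# Minimal sketch of the Tune Callback API; "mean_loss" is an assumed metric name.
from ray.tune import Callback


class MetricWatcherCallback(Callback):
    """Hypothetical callback that inspects every reported trial result."""

    def on_trial_result(self, iteration, trials, trial, result, **info):
        # `result` is the dict the trainable reported via tune.report(...)
        loss = result.get("mean_loss")
        print(f"trial {trial.trial_id} reported mean_loss={loss}")
```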
Hmm, you should be able to create a callback that emits directly to your Prometheus cluster. Here is an example of a callback in Tune: https://docs.ray.io/en/latest/_modules/ray/tune/logger.html#LoggerCallback
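As a rough, unofficial sketch of that idea, a LoggerCallback subclass could push scalar results to a Prometheus Pushgateway. The gateway address, job name, and metric naming below are illustrative assumptions:

```python
# Hedged sketch (not an official Ray API) of a LoggerCallback that pushes
# scalar trial results to a Prometheus Pushgateway.
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway
from ray.tune.logger import LoggerCallback


class PrometheusLoggerCallback(LoggerCallback):
    def __init__(self, gateway="localhost:9091", job="ray_tune"):
        self._gateway = gateway
        self._job = job
        self._registry = CollectorRegistry()
        self._gauges = {}

    def log_trial_result(self, iteration, trial, result):
        for key, value in result.items():
            if not isinstance(value, (int, float)):
                continue  # only export scalar metrics
            if key not in self._gauges:
                self._gauges[key] = Gauge(
                    f"tune_{key}", f"Tune metric {key}",
                    ["trial_id"], registry=self._registry)
            self._gauges[key].labels(trial_id=trial.trial_id).set(value)
        # push the current snapshot of all gauges to the gateway
        push_to_gateway(self._gateway, job=self._job, registry=self._registry)
```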
You should be able to do this too, via the callback API. For example, you can do:
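(A hedged sketch of the wiring, reusing the hypothetical `PrometheusLoggerCallback` from above; `my_trainable` and the search space are placeholders:)

```python
# Wiring the hypothetical callback into an experiment via tune.run.
from ray import tune


def my_trainable(config):
    # placeholder training loop reporting an assumed "mean_loss" metric
    for step in range(10):
        tune.report(mean_loss=1.0 / (step + config["lr"]))


analysis = tune.run(
    my_trainable,
    config={"lr": tune.loguniform(1e-4, 1e-1)},
    # PrometheusLoggerCallback is the sketch class defined above
    callbacks=[PrometheusLoggerCallback(gateway="localhost:9091")],
)
```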
You can also use a Stopper to terminate the experiment: https://docs.ray.io/en/latest/_modules/ray/tune/stopper.html#Stopper Anyway, I would strongly recommend trying out these APIs, as I believe they should be sufficient. If anything turns out to be missing, we will prioritize patches to make sure you can do what you need to do.
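For instance, a minimal custom Stopper could look like the sketch below (the metric name and threshold are illustrative assumptions); it would then be passed via `tune.run(..., stop=LossThresholdStopper())`:

```python
# Hedged sketch of a custom Stopper that halts the whole experiment
# once any trial's loss falls below an assumed threshold.
from ray.tune.stopper import Stopper


class LossThresholdStopper(Stopper):
    def __init__(self, metric="mean_loss", threshold=0.05):
        self._metric = metric
        self._threshold = threshold
        self._should_stop = False

    def __call__(self, trial_id, result):
        if result.get(self._metric, float("inf")) < self._threshold:
            self._should_stop = True
        return self._should_stop  # stop this trial

    def stop_all(self):
        return self._should_stop  # tear down the entire experiment
```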
Thanks for the prompt reply. I've previously used these APIs and agree that they provide decent control over the experiment status. Let me clarify a bit here:
I think this is aligned with the use-case description. Having a callback integrated with Prometheus + CloudWatch Agent should satisfy a large portion of the Ops requirements. Has the Ray team considered adding a flag that emits all Tune metrics to Prometheus by default? While callback metrics are accessible on the Prometheus port, the handy TensorBoard resides on a separate port and the event file gets persisted somewhere else. As a result, when Prometheus detects anomaly signals, it still takes manual effort to locate the corresponding event file and dive deep into the experiment. Having a TensorBoard-Prometheus sort of integration would ease this process. It's not a must-have feature for us at this stage, but it is raised as an RFC.
Indeed. Actually it's very handy for most trials that we've run. These are captured as "algorithm-level configuration" in the issue description. Essentially, we can monitor system-level metrics via Prometheus and model-specific metrics via the Tune API. Therefore, if all Tune metrics were available in Prometheus, one would have a single go-to place to make a decision like "tear down the cluster if …"
@annaluo676 can you clarify what the actual request is here? Is it just a Prometheus callback for Tune?
@richardliaw Yes, a Prometheus callback for all metrics Tune emits would be a great feature for cluster and experiment control purposes.
Hi, I'm a bot from the Ray team :) To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months. If there is no further activity in the next 14 days, the issue will be closed!
You can always ask for help on our discussion forum or Ray's public Slack channel.
Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message. Please feel free to reopen or open a new issue if you'd still like it to be addressed. Again, you can always ask for help on our discussion forum or Ray's public Slack channel. Thanks again for opening the issue!
Describe your feature request
This issue captures some requirements for monitoring large-scale DL model training using the Ray autoscaler.
When using Ray Tune to run distributed model training, users can specify a local/S3 path to store experiment progress, and TensorBoard can then be used for tracking/monitoring. One caveat is that users need to go to different ports to check different metrics during experimentation: for instance, actor/task status at 8265 (Ray dashboard), cluster metrics at 8080 (Prometheus), and model metrics at 6006 (TensorBoard). Also, after model training is completed, one needs to know the exact S3 location to reload the TF events and visualize them via TensorBoard. For models that take hours or days to train, having the capability to auto-monitor the progress without babysitting TensorBoard is crucial to improving development efficiency and performing automatic intervention (in addition to algorithm-level configuration such as early stopping and scheduler-level mechanisms such as PBT and AdaptDL).
From the MLOps perspective, running large-scale DL models using the Ray autoscaler would require:
These requirements are partially met by the CloudWatch integration currently in progress at #8967 (cc @Zyiqin-Miranda). The CloudWatch agent will automatically scrape metrics from Prometheus and emit them to CloudWatch, but it won't pick up the metrics tracked by Ray Tune (e.g. progress.csv, tfevents). One option to meet the above requirements is to export TF events to Prometheus (kubeflow/training-operator#722), as sketched below. Is this something the Ray team has considered? Are there other related efforts that would help us achieve the above requirements?
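For concreteness, a hedged sketch of that export path could read the latest scalars from a tfevents directory and push them to a Prometheus Pushgateway (the logdir, metric naming, and gateway address below are illustrative assumptions, not an existing integration):

```python
# Hedged sketch of exporting TF event scalars to a Prometheus Pushgateway,
# in the spirit of the approach referenced above.
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator


def export_tfevents_to_prometheus(logdir, gateway="localhost:9091", job="tune_tfevents"):
    registry = CollectorRegistry()
    accumulator = EventAccumulator(logdir)
    accumulator.Reload()  # load the event file(s) under logdir
    for tag in accumulator.Tags().get("scalars", []):
        events = accumulator.Scalars(tag)
        if not events:
            continue
        metric_name = "tfevent_" + tag.replace("/", "_")
        gauge = Gauge(metric_name, f"Latest value of {tag}", registry=registry)
        gauge.set(events[-1].value)  # export the most recent scalar value
    push_to_gateway(gateway, job=job, registry=registry)
```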