
[tune][autoscaler][observability] One-stop shop for monitoring large-scale DL model training using ray autoscaler #18322

Closed
annaluo676 opened this issue Sep 3, 2021 · 9 comments
Labels
enhancement (Request for new feature and/or capability) · stale (The issue is stale. It will be closed within 7 days unless there is further conversation)

Comments

@annaluo676
Contributor

annaluo676 commented Sep 3, 2021

Describe your feature request

This issue captures some requirements for monitoring large-scale DL model training using ray autoscaler.

When using Ray Tune to run distributed model training, users can specify a local or S3 path to store experiment progress, and TensorBoard can then be used for tracking/monitoring. One caveat is that users need to visit different ports to check different metrics during experimentation: for instance, actor/task status at 8265 (Ray dashboard), cluster metrics at 8080 (Prometheus), and model metrics at 6006 (TensorBoard). Also, after model training is completed, one needs to know the exact S3 location of the TF event files in order to reload them and visualize them in TensorBoard. For models that take hours or days to train, the capability to auto-monitor progress without babysitting TensorBoard is crucial for development efficiency and for automatic intervention (in addition to algorithm-level mechanisms such as early stopping and scheduler-level mechanisms such as PBT and AdaptDL).
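For context, a minimal sketch of the setup described above (assuming the Ray 1.x tune.SyncConfig API; the trainable, bucket, and paths are placeholders):

from ray import tune


def train_fn(config):
    # Placeholder trainable; reported metrics end up in progress.csv / tf events.
    for step in range(100):
        tune.report(acc=0.5 + 0.001 * step)


tune.run(
    train_fn,
    local_dir="~/ray_results",  # local experiment progress (tf events, progress.csv)
    sync_config=tune.SyncConfig(upload_dir="s3://my-bucket/experiments"),  # durable S3 copy
)
# Monitoring then spans several ports: the Ray dashboard on 8265, Prometheus on 8080,
# and `tensorboard --logdir ~/ray_results` on 6006.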

From the MLOps perspective, running large-scale DL model training with the Ray autoscaler would require:

  1. A one-stop shop for monitoring all metric types (both cluster and application level) in dashboards.
  2. The ability to create alarms that trigger actions (e.g., tear down the cluster, stop job execution).
  3. The ability to emit durable metrics and quickly retrieve all related metrics associated with a past experiment.

These requirements are partially met by the CloudWatch integration currently in progress at #8967 (cc @Zyiqin-Miranda). The CloudWatch agent will automatically scrape metrics from Prometheus and emit them to CloudWatch, but it won't pick up the metrics tracked by Ray Tune (e.g., progress.csv, tfevents). One option to meet the above requirements is to export TF events to Prometheus (kubeflow/training-operator#722). Is this something the Ray team has considered? Are there other related efforts that would help us achieve the above requirements?

annaluo676 added the enhancement label on Sep 3, 2021
@annaluo676
Contributor Author

cc @pdames @richardliaw

@richardliaw
Contributor

Would it be possible to leverage an existing Tune callback API for this?

https://docs.ray.io/en/latest/tune/user-guide.html?highlight=callbacks#callbacks

@annaluo676
Contributor Author

IIUC, all Tune APIs can be leveraged, and the resulting metrics will be captured as part of the experiment progress. The gap here is the lack of a mechanism to auto-monitor this progress and trigger associated actions (e.g., terminate the experiment and tear down the cluster), as mentioned in requirement 2 above. While callbacks and other trial stats can be emitted to disk and/or custom remote storage, to get insights on "which task on which GPU became the bottleneck of my experiment at what time", one has to manually retrieve the cluster metrics emitted by Prometheus (persisted at storage 1) and check trial stats (persisted at storage 2) side by side. Requirements 1 and 3 remove the need for this manual (and potentially time-consuming) process and enable us to answer such questions faster and more automatically.

@richardliaw
Contributor

Hmm, I actually think the callbacks are sufficient for this purpose.

  1. A one-stop shop for monitoring all metric types (both cluster and application level) in dashboards.
  2. The ability to emit durable metrics and quickly retrieve all related metrics associated with a past experiment.

Hmm, you should be able to create a callback that emits directly to your Prometheus cluster. Here is an example of a callback in Tune:

https://docs.ray.io/en/latest/_modules/ray/tune/logger.html#LoggerCallback
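For illustration, a rough sketch of what such a callback could look like (this assumes the third-party prometheus_client package; the metric naming and port are arbitrary choices here, not an existing Tune integration):

from typing import Dict

from prometheus_client import Gauge, start_http_server
from ray.tune.logger import LoggerCallback


class PrometheusCallback(LoggerCallback):
    """Illustrative callback: expose numeric Tune results as Prometheus gauges."""

    def __init__(self, port: int = 8000):
        # Serve a /metrics endpoint that Prometheus (or the CloudWatch agent)
        # can scrape alongside the cluster-level metrics.
        start_http_server(port)
        self._gauges: Dict[str, Gauge] = {}

    def log_trial_result(self, iteration: int, trial: "Trial", result: Dict):
        for key, value in result.items():
            if not isinstance(value, (int, float)) or isinstance(value, bool):
                continue  # only numeric scalars map cleanly onto gauges
            name = "tune_" + key.replace("/", "_")  # metric names must be valid identifiers
            gauge = self._gauges.setdefault(
                key, Gauge(name, f"Ray Tune result field {key!r}", ["trial_id"]))
            gauge.labels(trial_id=trial.trial_id).set(value)

A callback like this would be passed to tune.run via callbacks=[PrometheusCallback()], and its /metrics endpoint added as a scrape target.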

  1. The ability to create alarms that trigger actions, (e.g., tear down cluster, stop job execution, etc.).

You should be able to do this too, via the callback API. For example, you can do:


from typing import List

from ray.tune import Callback


class AnomalyCallback(Callback):
    def on_trial_complete(self, iteration: int, trials: List["Trial"],
                          trial: "Trial", **info):
        # Trigger custom logic once a finished trial reports acc above 0.7.
        if trial.last_result.get("acc", 0) > 0.7:
            do_logic()  # placeholder for the alarm/teardown action
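Such a callback is attached when launching the experiment, e.g. tune.run(trainable, callbacks=[AnomalyCallback()]); do_logic here stands in for whatever alarm or teardown action you need.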

You can also use a stopper to terminate the experiment.

https://docs.ray.io/en/latest/_modules/ray/tune/stopper.html#Stopper
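For illustration, a minimal sketch of that approach (the metric name and threshold are placeholders):

from ray.tune import Stopper


class AccuracyStopper(Stopper):
    """Illustrative stopper: end the whole experiment once any trial
    reports an accuracy above the threshold."""

    def __init__(self, acc_threshold: float = 0.95):
        self._acc_threshold = acc_threshold
        self._stop_all = False

    def __call__(self, trial_id: str, result: dict) -> bool:
        # Called on every reported result; returning True stops that trial.
        if result.get("acc", 0.0) >= self._acc_threshold:
            self._stop_all = True
            return True
        return False

    def stop_all(self) -> bool:
        # Returning True here terminates the entire experiment.
        return self._stop_all

It would be passed to tune.run via the stop argument, e.g. tune.run(trainable, stop=AccuracyStopper()).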

Anyway, I would strongly recommend that you try out these APIs, as I believe they should be sufficient. If there are gaps, we will prioritize patches to make sure you can do what you need to do.

@annaluo676
Contributor Author

Thanks for the prompt reply. I've previously used these APIs and agree that they provide decent control over the experiment status. Let me clarify a bit here:

you should be able to create a callback that emits directly to your Prometheus cluster

I think this is aligned with the use-case description. Having a callback integrated with Prometheus + the CloudWatch agent should satisfy a large portion of the ops requirements. Has the Ray team considered adding a flag that emits all Tune metrics to Prometheus by default?

While callback metrics would then be accessible on the Prometheus port, the handy TensorBoard UI lives on a separate port, and the event files get persisted somewhere else. As a result, when Prometheus detects anomaly signals, it still takes manual effort to locate the corresponding event file and dive deep into the experiment. Some form of TensorBoard-Prometheus integration would ease this process. It's not a must-have feature for us at this stage, but it is raised here as an RFC.

The ability to create alarms that trigger actions, (e.g., tear down cluster, stop job execution, etc.).
You should be able to do this too, via the callback API. For example, you can do:...
You can also use a stopper to terminate the experiment.

Indeed, and this has been very handy for most trials that we've run; it is captured as "algorithm level configuration" in the issue description. Essentially, we can monitor system-level metrics via Prometheus and model-specific metrics via the Tune API. Therefore, if all Tune metrics were available in Prometheus, one would have a single go-to place to make a decision like "tear down the cluster if ray_avg_num_scheduled_tasks is below X and acc is below Y."
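Purely as an illustration of that decision flow, assuming Tune metrics were exported to Prometheus as suggested above (the Prometheus address, thresholds, and the tune_acc metric name are made up; ray_avg_num_scheduled_tasks is taken from the sentence above):

import requests

PROMETHEUS_URL = "http://localhost:9090"  # placeholder address of the Prometheus server


def scalar_query(expr: str) -> float:
    """Run an instant PromQL query and return the first sample value."""
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": expr})
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return float(results[0]["value"][1]) if results else float("nan")


def should_tear_down(task_threshold: float, acc_threshold: float) -> bool:
    # Both metrics come from the same Prometheus endpoint once Tune results
    # are exported there; metric names are illustrative.
    scheduled_tasks = scalar_query("ray_avg_num_scheduled_tasks")
    acc = scalar_query("max(tune_acc)")
    return scheduled_tasks < task_threshold and acc < acc_threshold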

@richardliaw
Contributor

@annaluo676 can you clarify what the actual request here is? Is it just a Prometheus callback for Tune?

@annaluo676
Contributor Author

@richardliaw Yes, a Prometheus callback for all metrics Tune emits would be a great feature for cluster and experiment control purposes.

@stale

stale bot commented Jan 21, 2022

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

  • If you'd like to keep the issue open, just leave any comment, and the stale label will be removed!
  • If you'd like to get more attention to the issue, please tag one of Ray's contributors.

You can always ask for help on our discussion forum or Ray's public slack channel.

stale bot added the stale label on Jan 21, 2022
@stale

stale bot commented Feb 8, 2022

Hi again! This issue will be closed because there has been no further activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!

stale bot closed this as completed on Feb 8, 2022