[docs] setting up grafana and prometheus #31129

alanwguo · 2022-12-15T05:28:59Z

Signed-off-by: Alan Guo <aguo@anyscale.com>

Why are these changes needed?

Many users have struggled to set up Prometheus and Grafana. At a minimum, we should do a better job of pointing people to how to set this up for remote Ray clusters and of letting users know about the configuration options they can set.

Related issue number

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: Alan Guo <aguo@anyscale.com>

doc/source/cluster/running-applications/monitoring-and-observability.rst

rkooo567 · 2022-12-15T11:04:17Z

doc/source/cluster/running-applications/monitoring-and-observability.rst

+on the head node of the cluster. However, in order to view the :ref:`Dashboard <ray-dashboard>` metrics on your local
+machine, you must configure the Dashboard UI to embed the metrics graphs via a public address for the Grafana instance.
+
+The `RAY_GRAFANA_HOST` env var can be set when launching Ray to configure how the Dashboard UI embeds the metrics.


Can you add this example to ray-metrics page and just have a link here instead?
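For readers following the thread, the `RAY_GRAFANA_HOST` setting in the diff above can be sketched roughly as follows. This is a minimal illustration, not the example from the PR: the helper names `build_ray_env` and `start_ray_head` and the Grafana address are hypothetical, and the sketch assumes the `ray` CLI is installed on the head node.

```python
import os
import subprocess


def build_ray_env(grafana_host, base_env=None):
    """Return a copy of the environment with RAY_GRAFANA_HOST set.

    grafana_host should be an address that the browser viewing the
    Dashboard can reach, since the Dashboard UI embeds Grafana panels.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["RAY_GRAFANA_HOST"] = grafana_host
    return env


def start_ray_head(grafana_host):
    """Launch the Ray head node with the Grafana address in its environment.

    Assumes the `ray` CLI is installed and on PATH.
    """
    return subprocess.run(
        ["ray", "start", "--head"],
        env=build_ray_env(grafana_host),
        check=True,
    )


# Example call (hypothetical address, not executed here):
# start_ray_head("http://grafana.example.com:3000")
```

Keeping the environment construction separate from the launch makes it easy to inspect the variable before starting Ray.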

architkulkarni

Looks good. Nit: I think we should spell out "env vars" as "environment variables" and capitalize Ray, Prometheus, Grafana, and IP consistently.

scottsun94

Thanks!
The overall structure is a bit confusing to me. I can think of four jobs to be done (JTBDs) related to this part of the documentation. Here are the difficulties I ran into with each one:

learn about metrics.

The metrics page jumps straight to Prometheus. I may not know what "metrics" refers to in Ray's context.

learn how to collect metrics via Prometheus

It's easy to understand how to set it up locally by following the documentation. However, it's still not very straightforward to set it up on a cluster. First, I want to know where to run it. This is on the metrics page ("Alternate Prometheus host location"). Then I want to know how to scrape the metrics, which is on the "cluster monitoring" page.

learn how to set up Grafana

Same issues as Prometheus. I want to know where to run Grafana first. Then I need to know how to configure it to visualize the Prometheus metrics.

learn how to view the embedded metrics in Ray dashboard

The Ray dashboard/metrics page says "It requires that prometheus and grafana is running for your cluster" and sends me to the metrics page. However, the metrics page doesn't make clear what Prometheus and Grafana setup is required for the metrics to show up.

Here are the suggested changes to the structure:

Cluster monitoring

Ray dashboard
Ray CLI
Prometheus metrics
- Just a short intro paragraph with a link to metrics page for more details

Metrics

A short intro (keep the current one).
System metrics
Application-level metrics
Prometheus
- A short intro
- Run Prometheus (locally, on a head node, or outside of the Ray cluster)
- Auto-discovering metrics endpoints
- Manually discovering metrics endpoints
- Customize Prometheus export port
Grafana
- A short intro
- Run Grafana and view graphs for Ray (locally, on a head node, or outside of the Ray cluster)
- Embed Grafana graphs in Ray Dashboard
  - (Prometheus running)
  - (Grafana is able to access Prometheus)
  - (Ray is able to access Grafana)

Ray Dashboard

Metrics view
- "It requires that prometheus and grafana is running for your cluster." + link to "Embed Grafana graphs in Ray Dashboard"
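The prerequisites listed under "Embed Grafana graphs in Ray Dashboard" could be spot-checked with a small script along these lines. This is only a sketch under stated assumptions: the function names are hypothetical, the default addresses are the stock Prometheus (`localhost:9090`) and Grafana (`localhost:3000`) ports rather than anything this PR specifies, and the probes use the services' standard health endpoints (`/-/healthy` and `/api/health`).

```python
import http.client
import urllib.error
import urllib.request


def health_urls(prometheus_host, grafana_host):
    """Build the health-check URLs for Prometheus and Grafana."""
    return {
        "prometheus": prometheus_host.rstrip("/") + "/-/healthy",
        "grafana": grafana_host.rstrip("/") + "/api/health",
    }


def is_reachable(url, timeout=3.0):
    """Return True if an HTTP GET to url succeeds with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or malformed URL.
        return False


def check_stack(prometheus_host="http://localhost:9090",
                grafana_host="http://localhost:3000"):
    """Probe both services from the machine running this script."""
    urls = health_urls(prometheus_host, grafana_host)
    return {name: is_reachable(url) for name, url in urls.items()}
```

A check like this only shows reachability from wherever it runs. To confirm that Grafana can reach Prometheus, or that the Dashboard viewer can reach Grafana, it would need to be run from the corresponding host.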

stale · 2023-01-15T16:44:46Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

rkooo567 · 2023-01-18T00:10:00Z

unstale. And cc @alanwguo

Signed-off-by: Alan Guo <aguo@anyscale.com>

alanwguo · 2023-02-09T05:22:33Z

I think this makes sense but I don't think I have time to make these changes by Friday. I also think we need to redo all the dashboard docs all at once to really re-structure it well and that would require pairing with @rkooo567 at least who is doing parallel doc changes.

I made some updates to include more info on setting up Grafana on a cluster, but I think we should consider a re-write for 2.4.

scottsun94 · 2023-02-09T05:35:51Z

doc/source/ray-observability/ray-metrics.rst

+by setting the `RAY_PROMETHEUS_HOST` env var when launching ray. The env var takes in the address to access Prometheus.
+
+
+Alternate Grafana host location


This is duplicated?

scottsun94 · 2023-02-09T05:36:35Z

SGTM!

Signed-off-by: Alan Guo <aguo@anyscale.com>

doc/source/cluster/running-applications/monitoring-and-observability.rst

doc/source/ray-observability/ray-metrics.rst

rkooo567 · 2023-02-09T22:51:06Z

Looks awesome. Last few comments.

Signed-off-by: Alan Guo <aguo@anyscale.com>

* [docs] setting up grafana and prometheus (#31129)
* Apply suggestions from code review

Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Signed-off-by: Alan Guo <aguo@aguo.software>

---------

Signed-off-by: Alan Guo <aguo@aguo.software>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>

Signed-off-by: elliottower <elliot@elliottower.com>

Improve documentation for setting up grafana and prometheus

108eff5

Signed-off-by: Alan Guo <aguo@anyscale.com>

alanwguo assigned ericl Dec 15, 2022

alanwguo requested review from architkulkarni, wuisawesome and DmitriGekhtman as code owners December 15, 2022 05:29

alanwguo assigned scottsun94 Dec 15, 2022

alanwguo requested review from maxpumperla, pcmoritz and a team as code owners December 15, 2022 05:29

alanwguo assigned rkooo567 Dec 15, 2022

rkooo567 reviewed Dec 15, 2022

architkulkarni approved these changes Dec 15, 2022

ericl removed their assignment Dec 15, 2022

rkooo567 added the @author-action-required label Dec 16, 2022

scottsun94 reviewed Dec 16, 2022

stale bot added the stale label Jan 15, 2023

stale bot removed the stale label Jan 18, 2023

alanwguo added 3 commits February 8, 2023 20:43

fixup

df52d4a

Signed-off-by: Alan Guo <aguo@anyscale.com>

Merge branch 'master' into better-metrics-docs

4ea6782

additional changes

f1e6637

Signed-off-by: Alan Guo <aguo@anyscale.com>

alanwguo removed the @author-action-required label Feb 9, 2023

scottsun94 reviewed Feb 9, 2023

fixup

7031429

Signed-off-by: Alan Guo <aguo@anyscale.com>

alanwguo added the v2.3.0-pick and release-blocker (P0 issue that blocks the release) labels Feb 9, 2023

rkooo567 reviewed Feb 9, 2023

rkooo567 added the @author-action-required label Feb 9, 2023

fixup

0bc4e7b

Signed-off-by: Alan Guo <aguo@anyscale.com>

alanwguo removed the @author-action-required label Feb 10, 2023

rkooo567 approved these changes Feb 10, 2023

richardliaw approved these changes Feb 14, 2023

richardliaw changed the title ~~Improve documentation for setting up grafana and prometheus~~ [docs] setting up grafana and prometheus Feb 14, 2023

richardliaw merged commit b9f7e19 into ray-project:master Feb 14, 2023

alanwguo added a commit to alanwguo/ray that referenced this pull request Feb 16, 2023

[docs] setting up grafana and prometheus (ray-project#31129)

aa3a724

alanwguo mentioned this pull request Feb 16, 2023

[docs] setting up grafana and prometheus #32606

Merged

7 tasks

edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023

[docs] setting up grafana and prometheus (ray-project#31129)

33d40c1

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>

elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023

[docs] setting up grafana and prometheus (ray-project#31129)

353372c

Signed-off-by: elliottower <elliot@elliottower.com>
