[docs] setting up grafana and prometheus #31129

Merged
merged 6 commits into from
Feb 14, 2023

alanwguo

Signed-off-by: Alan Guo <aguo@anyscale.com>

Why are these changes needed?

Many users have struggled to set up Prometheus and Grafana. At a minimum, we should do a better job of pointing people to how to set these up for remote Ray clusters and of letting users know about the configuration options they can set.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Alan Guo <aguo@anyscale.com>
on the head node of the cluster. However, in order to view the :ref:`Dashboard <ray-dashboard>` metrics on your local
machine, you must configure the Dashboard UI to embed the metrics graphs via a public address for the Grafana instance.

The `RAY_GRAFANA_HOST` env var can be set when launching Ray to configure how the Dashboard UI embeds the metrics.
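
As an illustration of the sentence above, here is a minimal sketch of preparing the environment for a Ray launch with `RAY_GRAFANA_HOST` set. The helper name `ray_env_with_grafana` and the address `http://grafana.example.com:3000` are hypothetical placeholders and are not part of this PR; the `ray start --head` call appears only in a comment and is not executed.

```python
import os


def ray_env_with_grafana(grafana_host, base_env=None):
    """Return a copy of the environment with RAY_GRAFANA_HOST set.

    grafana_host is the public address of the Grafana instance.
    base_env defaults to the current process environment.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["RAY_GRAFANA_HOST"] = grafana_host
    return env


# Hypothetical address: replace it with your own Grafana instance.
env = ray_env_with_grafana("http://grafana.example.com:3000")

# To start the head node with this setting (not run here):
#   import subprocess
#   subprocess.run(["ray", "start", "--head"], env=env, check=True)
```

Copying the environment rather than mutating `os.environ` keeps the setting scoped to the launched process.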

Can you add this example to ray-metrics page and just have a link here instead?

@architkulkarni architkulkarni left a comment

Looks good. Nit: I think we should spell out "env vars" as environment variables and be consistent about capitalizing Ray, Prometheus, Grafana, IP

@ericl ericl removed their assignment Dec 15, 2022
@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Dec 16, 2022
@scottsun94 scottsun94 left a comment

Thanks!
The overall structure is a bit confusing to me. I can think of 4 JTBDs (jobs to be done) related to this part of the documentation. Here are the difficulties I ran into with each of them:

  1. Learn about metrics.
  • On the metrics page, we jump to Prometheus first. I may not know what "metrics" refers to in Ray's context.
  2. Learn how to collect metrics via Prometheus.
  • It's easy to understand how to set it up locally by following the documentation. However, it's still not very straightforward how to set it up on a cluster. First, I want to know where to run it. This is on the metrics page ("Alternate Prometheus host location"). Then I want to know how to scrape the metrics, which is on the "cluster monitoring" page.
  3. Learn how to set up Grafana.
  • Same issues as Prometheus. I want to know where to run Grafana first. Then I need to know how to configure it to visualize the Prometheus metrics.
  4. Learn how to view the embedded metrics in the Ray Dashboard.
  • The Ray Dashboard metrics page says "It requires that prometheus and grafana is running for your cluster" and sends me to the metrics page. However, the metrics page doesn't make clear what Prometheus and Grafana setup is required for the metrics to show up.

Here are the suggested changes to the structure:

Cluster monitoring

  • Ray dashboard
  • Ray CLI
  • Prometheus metrics
    • Just a short intro paragraph with a link to metrics page for more details

Metrics

  • A short intro (keep the current one).
  • System metrics
  • Application-level metrics
  • Prometheus
    • A short intro
    • Run Prometheus (locally, on a head node, or outside of the Ray cluster)
    • Auto-discovering metrics endpoints
    • Manually discovering metrics endpoints
    • Customize Prometheus export port
  • Grafana
    • A short intro
    • Run Grafana and view graphs for Ray (locally, on a head node, or outside of the Ray cluster)
    • Embed Grafana graphs in Ray Dashboard
      • (Prometheus running)
      • (Grafana is able to access Prometheus)
      • (Ray is able to access Grafana)

Ray Dashboard

  • Metrics view
    • "It requires that prometheus and grafana is running for your cluster." + link to "Embed Grafana graphs in Ray Dashboard"

stale bot commented Jan 15, 2023

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@stale stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Jan 15, 2023
@rkooo567

unstale. And cc @alanwguo

@stale stale bot removed the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Jan 18, 2023
Signed-off-by: Alan Guo <aguo@anyscale.com>
Signed-off-by: Alan Guo <aguo@anyscale.com>

alanwguo commented Feb 9, 2023

I think this makes sense but I don't think I have time to make these changes by Friday. I also think we need to redo all the dashboard docs all at once to really re-structure it well and that would require pairing with @rkooo567 at least who is doing parallel doc changes.

I made some updates to include more info on setting up Grafana on a cluster, but I think we should consider a re-write for 2.4.

@alanwguo alanwguo removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Feb 9, 2023
by setting the `RAY_PROMETHEUS_HOST` env var when launching ray. The env var takes in the address to access Prometheus.
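
To make "takes in the address to access Prometheus" concrete, here is a small sketch of how a consumer of this setting might resolve the address. The helper `resolve_prometheus_host` is hypothetical and is not Ray's own code, and the fallback `http://localhost:9090` is an assumption made for illustration; check your Ray version for the actual default.

```python
import os

# Assumed fallback, for illustration only; not confirmed by this PR.
ASSUMED_DEFAULT_PROMETHEUS_HOST = "http://localhost:9090"


def resolve_prometheus_host(environ=None):
    """Return the Prometheus address from RAY_PROMETHEUS_HOST, or the fallback.

    A trailing slash is stripped so that paths can be appended consistently.
    """
    environ = os.environ if environ is None else environ
    host = environ.get("RAY_PROMETHEUS_HOST", ASSUMED_DEFAULT_PROMETHEUS_HOST)
    return host.rstrip("/")


# Hypothetical cluster address, passed in explicitly instead of via os.environ.
print(resolve_prometheus_host({"RAY_PROMETHEUS_HOST": "http://10.0.0.5:9090/"}))
```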


Alternate Grafana host location

This is duplicated?

@scottsun94

SGTM!

Signed-off-by: Alan Guo <aguo@anyscale.com>
@alanwguo alanwguo added v2.3.0-pick release-blocker P0 Issue that blocks the release labels Feb 9, 2023

rkooo567 commented Feb 9, 2023

Looks awesome. Last few comments.

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Feb 9, 2023
Signed-off-by: Alan Guo <aguo@anyscale.com>
@alanwguo alanwguo removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Feb 10, 2023
@richardliaw richardliaw changed the title Improve documentation for setting up grafana and prometheus [docs] setting up grafana and prometheus Feb 14, 2023
@richardliaw richardliaw merged commit b9f7e19 into ray-project:master Feb 14, 2023
alanwguo added a commit to alanwguo/ray that referenced this pull request Feb 16, 2023
cadedaniel pushed a commit that referenced this pull request Feb 16, 2023
* [docs] setting up grafana and prometheus (#31129)

* Apply suggestions from code review

Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Signed-off-by: Alan Guo <aguo@aguo.software>

---------

Signed-off-by: Alan Guo <aguo@aguo.software>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
Signed-off-by: elliottower <elliot@elliottower.com>