[docs] setting up grafana and prometheus #32606

Merged
merged 2 commits into from
Feb 16, 2023
Expand Up @@ -87,8 +87,15 @@ below.

.. _multi-node-metrics:

Prometheus
^^^^^^^^^^
Ray supports Prometheus for emitting and recording time-series metrics.
See :ref:`metrics <ray-metrics>` for details on the metrics that Ray emits.
To use Prometheus in a Ray cluster, decide where to host it, then configure
it so that it can scrape the metrics from Ray.

Scraping metrics
################

Ray runs a metrics agent per node to export :ref:`metrics <ray-metrics>` about Ray core as well as
custom user-defined metrics. Each metrics agent collects metrics from the local
Expand Down Expand Up @@ -142,7 +149,7 @@ start``. If using KubeRay, you can specify
``rayStartParams.metrics-export-port`` in the RayCluster configuration file.
The port must be specified on all nodes in the cluster.

If you do not know the IP addresses of the nodes in your Ray cluster,
you can also programmatically discover the endpoints by reading the
Ray Cluster information. Here, we will use a Python script and the
``ray.nodes()`` API to find the metrics agents' URLs, by combining the
Expand Down Expand Up @@ -188,3 +195,67 @@ Ray Cluster information. Here, we will use a Python script and the
'object_store_memory': 2.0},
'alive': True}]
"""


.. _multi-node-metrics-grafana:


Grafana
^^^^^^^
The Ray dashboard integrates with Grafana to show visualizations of time-series metrics.

.. image:: images/graphs.png
:align: center

First, decide where to host Grafana. A common location is the head node of the cluster.
See :ref:`instructions <grafana>` for installing Grafana and using the default Grafana configurations
exported by Ray.

Next, the head node must be able to access Prometheus and Grafana, and the browser of the dashboard user
must be able to access Grafana. Configure these settings using the `RAY_GRAFANA_HOST`, `RAY_PROMETHEUS_HOST`,
and `RAY_GRAFANA_IFRAME_HOST` environment variables.

* Set `RAY_GRAFANA_HOST` to an address that the head node can use to access Grafana.
* Set `RAY_PROMETHEUS_HOST` to an address the head node can use to access Prometheus.
* You can set `RAY_GRAFANA_IFRAME_HOST` to an address that the user's browser can use to access Grafana. By default, `RAY_GRAFANA_IFRAME_HOST` is equal to `RAY_GRAFANA_HOST`.

For example, if the IP of the head node is 55.66.77.88 and Grafana is hosted on port 3000, set the value
to `RAY_GRAFANA_HOST=http://55.66.77.88:3000`.
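As a minimal sketch of the settings above, the following Python snippet builds the environment for a head node and could pass it to ``ray start --head``. The helper names ``build_ray_env`` and ``start_head_node``, the Prometheus port 9090, and the example addresses are assumptions for illustration; only the three environment variable names come from this guide.

```python
import os
import subprocess


def build_ray_env(grafana_host, prometheus_host, iframe_host=None, base_env=None):
    """Return a copy of the environment with the Ray metrics variables set.

    This helper is hypothetical; it only sets the variables named in this guide.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["RAY_GRAFANA_HOST"] = grafana_host
    env["RAY_PROMETHEUS_HOST"] = prometheus_host
    # Only set the iframe host when the browser needs a different address.
    if iframe_host is not None:
        env["RAY_GRAFANA_IFRAME_HOST"] = iframe_host
    return env


def start_head_node(env):
    """Start the Ray head node with the given environment (requires Ray installed)."""
    subprocess.run(["ray", "start", "--head"], env=env, check=True)


if __name__ == "__main__":
    # Example addresses are placeholders; replace them with your own.
    env = build_ray_env("http://55.66.77.88:3000", "http://55.66.77.88:9090")
    for name in ("RAY_GRAFANA_HOST", "RAY_PROMETHEUS_HOST"):
        print(f"{name}={env[name]}")
```

Calling ``start_head_node(env)`` afterwards would launch the head node with these values; it is left out of the example run so that the snippet works without a Ray installation.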


.. _multi-node-metrics-grafana-existing:

Using an existing Grafana instance
##################################

To use an existing Grafana instance, set the `RAY_GRAFANA_HOST` environment variable to the URL of your Grafana instance before starting your Ray cluster. After starting Ray, find the Grafana dashboard JSON at `/tmp/ray/session_latest/metrics/grafana/dashboards/default_grafana_dashboard.json`. `Import this dashboard <https://grafana.com/docs/grafana/latest/dashboards/manage-dashboards/#import-a-dashboard>`_ into your Grafana instance.

If Grafana reports that the datasource is not found, `add a datasource variable <https://grafana.com/docs/grafana/latest/dashboards/variables/add-template-variables/?pg=graf&plcmt=data-sources-prometheus-btn-1#add-a-data-source-variable>`_ and then use the `JSON model view <https://grafana.com/docs/grafana/latest/dashboards/build-dashboards/modify-dashboard-settings/#view-dashboard-json-model>`_ to change every value of the `datasource` key in the imported `default_grafana_dashboard.json` to the name of the variable. For example, if the variable name is `data_source`, all `"datasource"` mappings should be:

.. code-block:: json

"datasource": {
"type": "prometheus",
"uid": "$data_source"
}
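If editing the JSON model by hand is tedious, the same change can be sketched in Python before importing the dashboard. This is an illustration only: the function names are hypothetical, and the code follows the instruction above literally by rewriting every `datasource` key it finds, so review the result before importing it.

```python
import json


def point_datasources_at_variable(node, variable="data_source"):
    """Recursively replace every "datasource" value with a reference to the variable."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "datasource":
                # Reassigning an existing key is safe while iterating.
                node[key] = {"type": "prometheus", "uid": f"${variable}"}
            else:
                point_datasources_at_variable(value, variable)
    elif isinstance(node, list):
        for item in node:
            point_datasources_at_variable(item, variable)
    return node


def rewrite_dashboard_file(src_path, dst_path, variable="data_source"):
    """Load a dashboard JSON file, rewrite its datasources, and save a copy."""
    with open(src_path) as f:
        dashboard = json.load(f)
    point_datasources_at_variable(dashboard, variable)
    with open(dst_path, "w") as f:
        json.dump(dashboard, f, indent=2)


if __name__ == "__main__":
    # A tiny stand-in for the real dashboard structure.
    sample = {"panels": [{"datasource": "Prometheus", "targets": [{"datasource": None}]}]}
    print(json.dumps(point_datasources_at_variable(sample), indent=2))
```

To process the exported file, call ``rewrite_dashboard_file`` with the path to `default_grafana_dashboard.json` and an output path of your choice, then import the output file into Grafana.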

When the existing Grafana instance requires user authentication, the following settings must be in its `configuration file <https://grafana.com/docs/grafana/latest/setup-grafana/configure-grafana/>`_ for it to embed correctly in the Ray dashboard:

.. code-block:: ini

[security]
allow_embedding = true
cookie_secure = true
cookie_samesite = none

If Grafana is exposed through an nginx ingress on a Kubernetes cluster, the following line must be present in the Grafana ingress annotations:

.. code-block:: yaml

nginx.ingress.kubernetes.io/configuration-snippet: |
add_header X-Frame-Options SAMEORIGIN always;

When both Grafana and the Ray cluster are on the same Kubernetes cluster, set `RAY_GRAFANA_HOST` to the external URL of the Grafana ingress. For embedding to succeed, `RAY_GRAFANA_HOST` must be accessible to both the Ray cluster backend and the Ray dashboard frontend:

* On the backend, the *Ray cluster head* runs health checks on Grafana. Hence, `RAY_GRAFANA_HOST` must be accessible from the Kubernetes pod that runs the head node.
* When you access the *Ray dashboard* from a browser, the frontend embeds the Grafana dashboard using the URL specified in `RAY_GRAFANA_HOST`. Hence, `RAY_GRAFANA_HOST` must be accessible from the browser as well.
80 changes: 45 additions & 35 deletions doc/source/ray-observability/ray-metrics.rst
Expand Up @@ -14,6 +14,11 @@ To help monitor Ray applications, Ray
Getting Started
---------------

.. tip::

The instructions below set up Prometheus for a basic workflow of running and accessing the dashboard on your local machine.
For more information about how to run Prometheus on a remote cluster, see :ref:`here <multi-node-metrics>`.

Ray exposes its metrics in Prometheus format. This allows us to easily scrape them using Prometheus.

First, `download Prometheus <https://prometheus.io/download/>`_. Make sure to download the correct binary for your operating system (for example, darwin for macOS).
Expand Down Expand Up @@ -64,6 +69,12 @@ See :ref:`here <multi-node-metrics>` for more information on how to set up Prome

Grafana
-------

.. tip::

The instructions below set up Grafana for a basic workflow of running and accessing the dashboard on your local machine.
For more information about how to run Grafana on a remote cluster, see :ref:`here <multi-node-metrics-grafana>`.

Grafana is a tool that supports more advanced visualizations of Prometheus metrics and
allows you to create custom dashboards with your favorite metrics. Ray exports some default
configurations, which include a default dashboard showing some of the most valuable metrics
Expand Down Expand Up @@ -91,40 +102,8 @@ You can then see the default dashboard by going to dashboards -> manage -> Ray -
.. image:: images/graphs.png
:align: center

See :ref:`here <multi-node-metrics-grafana>` for more information on how to set up Grafana on a Ray Cluster.

.. _system-metrics:

Expand Down Expand Up @@ -249,8 +228,11 @@ If you open this in the browser, you should see the following output:

Please see :ref:`ray.util.metrics <custom-metric-api-ref>` for more details.

Configurations
--------------

Customize Prometheus export port
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default, Ray provides a service discovery file, but you can also scrape metrics directly from the Prometheus ports.
To do that, you may want to set the port on which metrics are exposed to a predefined value.
Expand All @@ -261,6 +243,34 @@ To do that, you may want to customize the port that metrics gets exposed to a pr

Now, you can scrape Ray's metrics using Prometheus via ``<ip>:8080``.
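If you scrape a fixed port on every node, one option is to generate a target list in Prometheus's file-based service discovery format, which is a JSON list of target groups. The sketch below assumes that you already know the node IPs and that every node exports metrics on port 8080; the function name, the IP addresses, and the ``cluster`` label are hypothetical.

```python
import json


def build_file_sd_targets(node_ips, port=8080, labels=None):
    """Build a Prometheus file_sd target list for nodes that share one metrics port."""
    group = {"targets": [f"{ip}:{port}" for ip in node_ips]}
    if labels:
        group["labels"] = dict(labels)
    # file_sd files contain a list of target groups; one group is enough here.
    return [group]


if __name__ == "__main__":
    # Placeholder addresses; replace them with the IPs of your Ray nodes.
    targets = build_file_sd_targets(["10.0.0.1", "10.0.0.2"], labels={"cluster": "ray"})
    print(json.dumps(targets, indent=2))
```

Write the output to a file and reference that file under ``file_sd_configs`` in your Prometheus scrape configuration.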

Alternate Prometheus host location
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can choose to run Prometheus on a non-default port or on a different machine. When doing so,
make sure that Prometheus can scrape the metrics from your Ray nodes by following the instructions :ref:`here <multi-node-metrics>`.

In addition, both Ray and Grafana need to know how to access this Prometheus instance. Configure this
by setting the `RAY_PROMETHEUS_HOST` environment variable when launching Ray. The variable takes the address used to access Prometheus. More
info can be found :ref:`here <multi-node-metrics-grafana>`. By default, Ray assumes Prometheus is hosted at `localhost:9090`.

For example, if Prometheus is hosted on port 9000 on a node with IP 55.66.77.88, set the value to
`RAY_PROMETHEUS_HOST=http://55.66.77.88:9000`.


Alternate Grafana host location
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can choose to run Grafana on a non-default port or on a different machine. If you do, the
:ref:`Dashboard <ray-dashboard>` needs to be configured with a public address for that service so the web page
can load the graphs. Set the `RAY_GRAFANA_HOST` environment variable when launching Ray. The variable takes
the address used to access Grafana. More info can be found :ref:`here <multi-node-metrics-grafana>`. Instructions
for using an existing Grafana instance can be found :ref:`here <multi-node-metrics-grafana-existing>`.

For the Grafana charts to work on the Ray dashboard, the browser of the dashboard user must be able to reach
the Grafana service. If this browser cannot reach Grafana the same way the Ray head node can, you can use a separate
environment variable, `RAY_GRAFANA_IFRAME_HOST`, to customize the host the browser uses to reach Grafana. If it is not set,
Ray uses the value of `RAY_GRAFANA_HOST` by default.

For example, if Grafana is hosted at 55.66.77.88 on port 3000, set the value
to `RAY_GRAFANA_HOST=http://55.66.77.88:3000`.
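The fallback between the two Grafana variables can be sketched as follows. This is an illustration of the behavior described above, not Ray's actual implementation; the function name is hypothetical, and the default of `http://localhost:3000` is an assumption based on Grafana's standard port.

```python
import os

# Assumed default for illustration; Grafana listens on port 3000 by default.
DEFAULT_GRAFANA_HOST = "http://localhost:3000"


def resolve_grafana_hosts(env=None):
    """Return (head_node_host, browser_host) following the fallback described above."""
    env = os.environ if env is None else env
    grafana_host = env.get("RAY_GRAFANA_HOST", DEFAULT_GRAFANA_HOST)
    # The browser-facing host falls back to the head-node host when unset.
    iframe_host = env.get("RAY_GRAFANA_IFRAME_HOST", grafana_host)
    return grafana_host, iframe_host


if __name__ == "__main__":
    head_host, browser_host = resolve_grafana_hosts(
        {"RAY_GRAFANA_HOST": "http://55.66.77.88:3000"}
    )
    print("head node uses:", head_host)
    print("browser uses:", browser_host)
```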

Troubleshooting
---------------
Expand All @@ -284,4 +294,4 @@ Grafana dashboards are not embedded in the Ray dashboard
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you get an error that `RAY_GRAFANA_HOST` is not set up even though you have set it, check that
the URL includes the protocol (e.g. `http://your-grafana-url.com` instead of `your-grafana-url.com`).
Also, make sure that the URL doesn't have a trailing slash (e.g. `http://your-grafana-url.com` instead of `http://your-grafana-url.com/`).
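Both checks can be automated before launching Ray. The following is a minimal sketch with a hypothetical helper name; it tests only the two conditions listed above and does not contact Grafana.

```python
from urllib.parse import urlparse


def check_grafana_host(url):
    """Return a list of problems found in a RAY_GRAFANA_HOST value (empty if none)."""
    problems = []
    parsed = urlparse(url)
    # A usable value needs an http(s) scheme followed by a host.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("missing protocol (expected http:// or https://)")
    if url.endswith("/"):
        problems.append("trailing slash")
    return problems


if __name__ == "__main__":
    for candidate in (
        "your-grafana-url.com",
        "http://your-grafana-url.com/",
        "http://your-grafana-url.com",
    ):
        print(candidate, "->", check_grafana_host(candidate) or "ok")
```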