diff --git a/doc/source/cluster/running-applications/images/graphs.png b/doc/source/cluster/running-applications/images/graphs.png
new file mode 100644
index 000000000000..2cd41f5b9b27
Binary files /dev/null and b/doc/source/cluster/running-applications/images/graphs.png differ
diff --git a/doc/source/cluster/running-applications/monitoring-and-observability.rst b/doc/source/cluster/running-applications/monitoring-and-observability.rst
index 9f5938d94ebc..83f621a214d6 100644
--- a/doc/source/cluster/running-applications/monitoring-and-observability.rst
+++ b/doc/source/cluster/running-applications/monitoring-and-observability.rst
@@ -87,8 +87,15 @@ below.
 
 .. _multi-node-metrics:
 
-Prometheus metrics
-^^^^^^^^^^^^^^^^^^
+Prometheus
+^^^^^^^^^^
+Ray supports Prometheus for emitting and recording time-series metrics.
+See :ref:`metrics ` for more details of the metrics emitted.
+When using Prometheus in a Ray cluster, you must decide where to host Prometheus and then configure
+Prometheus so that it can scrape the metrics from Ray.
+
+Scraping metrics
+################
 Ray runs a metrics agent per node to export :ref:`metrics ` about Ray core as well as custom user-defined metrics. Each metrics agent collects metrics from the local
@@ -142,7 +149,7 @@ start``.
 If using KubeRay, you can specify ``rayStartParams.metrics-export-port`` in the RayCluster configuration file. The port must be specified on all nodes in the cluster.
 
-If you do not know the IP addresses of the nodes in your Ray cluster, 
+If you do not know the IP addresses of the nodes in your Ray cluster,
 you can also programmatically discover the endpoints by reading the
 Ray Cluster information. Here, we will use a Python script and the
 ``ray.nodes()`` API to find the metrics agents' URLs, by combining the
@@ -188,3 +195,67 @@ Ray Cluster information. Here, we will use a Python script and the
 'object_store_memory': 2.0},
 'alive': True}]
 """
+
+
+.. _multi-node-metrics-grafana:
+
+
+Grafana
+^^^^^^^
+The Ray dashboard integrates with Grafana to show visualizations of time-series metrics.
+
+.. image:: images/graphs.png
+    :align: center
+
+First, you must decide where you want to host Grafana. One common choice is to run it on the head node of the cluster.
+See :ref:`here ` for instructions on how to install Grafana and how to use the default Grafana configurations
+exported by Ray.
+
+Next, the head node must be able to access Prometheus and Grafana, and the browser of the dashboard user
+must be able to access Grafana. You can configure these settings using the `RAY_GRAFANA_HOST`, `RAY_PROMETHEUS_HOST`,
+and `RAY_GRAFANA_IFRAME_HOST` environment variables.
+
+* `RAY_GRAFANA_HOST` should be set to an address that the head node can use to access Grafana.
+* `RAY_PROMETHEUS_HOST` should be set to an address that the head node can use to access Prometheus.
+* `RAY_GRAFANA_IFRAME_HOST` can be set to an address that the user's browser can use to access Grafana. By default, `RAY_GRAFANA_IFRAME_HOST` equals `RAY_GRAFANA_HOST`.
+
+For example, if the IP of the head node is 55.66.77.88 and Grafana is hosted on port 3000, set the value
+to `RAY_GRAFANA_HOST=http://55.66.77.88:3000`.
+
+
+.. _multi-node-metrics-grafana-existing:
+
+Using an existing Grafana instance
+##################################
+
+To use an existing Grafana instance, set the `RAY_GRAFANA_HOST` environment variable to the URL of your Grafana instance before starting your Ray cluster. After starting Ray, you can find the Grafana dashboard JSON at `/tmp/ray/session_latest/metrics/grafana/dashboards/default_grafana_dashboard.json`. `Import this dashboard `_ to your Grafana.
+
+If Grafana reports that the datasource is not found, you can `add a datasource variable `_ and, using the `JSON model view `_, change all values of the `datasource` key in the imported `default_grafana_dashboard.json` to the name of the variable. For example, if the variable name is `data_source`, all `"datasource"` mappings should be:
+
+.. code-block:: json
+
+    "datasource": {
+      "type": "prometheus",
+      "uid": "$data_source"
+    }
+
+When the existing Grafana instance requires user authentication, the following settings must be in its `configuration file `_ for it to embed correctly in the Ray dashboard:
+
+.. code-block:: ini
+
+    [security]
+    allow_embedding = true
+    cookie_secure = true
+    cookie_samesite = none
+
+If Grafana is exposed via an nginx ingress on a Kubernetes cluster, the following line should be present in the Grafana ingress annotation:
+
+.. code-block:: yaml
+
+    nginx.ingress.kubernetes.io/configuration-snippet: |
+      add_header X-Frame-Options SAMEORIGIN always;
+
+When both Grafana and the Ray cluster are on the same Kubernetes cluster, it is important to set `RAY_GRAFANA_HOST` to the external URL of the Grafana ingress. For successful embedding, `RAY_GRAFANA_HOST` needs to be accessible to both the Ray cluster backend and the Ray dashboard frontend:
+
+* On the backend, the *Ray cluster head* performs health checks on Grafana. Hence `RAY_GRAFANA_HOST` needs to be accessible from the Kubernetes pod that runs the head node.
+* When you access the *Ray dashboard* from the browser, the frontend embeds the Grafana dashboard using the URL specified in `RAY_GRAFANA_HOST`. Hence `RAY_GRAFANA_HOST` needs to be accessible from the browser as well.
diff --git a/doc/source/ray-observability/ray-metrics.rst b/doc/source/ray-observability/ray-metrics.rst
index d05363102cbe..5a13cb7f2384 100644
--- a/doc/source/ray-observability/ray-metrics.rst
+++ b/doc/source/ray-observability/ray-metrics.rst
@@ -14,6 +14,11 @@ To help monitor Ray applications, Ray
 
 Getting Started
 ---------------
+.. tip::
+
+  The instructions below set up Prometheus for a basic workflow of running and accessing the dashboard on your local machine.
+  For more information about how to run Prometheus on a remote cluster, see :ref:`here `.
+
 Ray exposes its metrics in Prometheus format. This allows us to easily scrape them using Prometheus.
 
 First, `download Prometheus `_. Make sure to download the correct binary for your operating system. (Ex: darwin for mac osx)
@@ -64,6 +69,12 @@ See :ref:`here ` for more information on how to set up Prome
 
 Grafana
 -------
+
+.. tip::
+
+  The instructions below set up Grafana for a basic workflow of running and accessing the dashboard on your local machine.
+  For more information about how to run Grafana on a remote cluster, see :ref:`here `.
+
 Grafana is a tool that supports more advanced visualizations of prometheus metrics and allows you to create custom dashboards with your favorite metrics. Ray exports some default configurations which includes a default dashboard showing some of the most valuable metrics
@@ -91,40 +102,8 @@ You can then see the default dashboard by going to dashboards -> manage -> Ray -
 .. image:: images/graphs.png
     :align: center
 
-Using an existing Grafana instance
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-When you want to use existing Grafana instance, before starting your Ray cluster you will need to setup environment variable `RAY_GRAFANA_HOST` with an URL of your Grafana. After starting Ray, you can find Grafana dashboard json at `/tmp/ray/session_latest/metrics/grafana/dashboards/default_grafana_dashboard.json`. `Import this dashboard `_ to your Grafana.
-
-If Grafana reports that datasource is not found, you can `add a datasource variable `_ and using `JSON model view `_ change all values of `datasource` key in the imported `default_grafana_dashboard.json` to the name of the variable. For example, if the variable name is `data_source`, all `"datasource"` mappings should be:
-
-.. code-block:: json
-
-    "datasource": {
-      "type": "prometheus",
-      "uid": "$data_source"
-    }
-
-When existing Grafana instance requires user authentication, the following settings have to be in its `configuration file `_ to correctly embed in Ray dashboard:
-
-.. code-block:: ini
-
-    [security]
-    allow_embedding = true
-    cookie_secure = true
-    cookie_samesite = none
 
-If Grafana is exposed via nginx ingress on Kubernetes cluster, the following line should be present in the Grafana ingress annotation:
-
-.. code-block:: yaml
-
-    nginx.ingress.kubernetes.io/configuration-snippet: |
-      add_header X-Frame-Options SAMEORIGIN always;
-
-When both Grafana and Ray cluster are on the same Kubernetes cluster, it is important to set `RAY_GRAFANA_HOST` to the external URL of the Grafana ingress. For successful embedding, `RAY_GRAFANA_HOST` needs to be accessible to both Ray cluster backend and Ray dashboard frontend:
-
-* On the backend, *Ray cluster head* does health checks on Grafana. Hence `RAY_GRAFANA_HOST` needs to be accessible in the Kubernetes pod which is running the head node.
-* When accessing *Ray dashboard* from the browser, frontend embeds Grafana dashboard using the URL specified in `RAY_GRAFANA_HOST`. Hence `RAY_GRAFANA_HOST` needs to be accessible from the browser as well.
+See :ref:`here ` for more information on how to set up Grafana on a Ray Cluster.
 
 .. _system-metrics:
 
@@ -249,8 +228,11 @@ If you open this in the browser, you should see the following output:
 
 Please see :ref:`ray.util.metrics ` for more details.
 
+Configurations
+--------------
+
 Customize prometheus export port
---------------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Ray by default provides the service discovery file, but you can directly scrape metrics from prometheus ports.
 To do that, you may want to customize the port that metrics gets exposed to a pre-defined port.
@@ -261,6 +243,34 @@ To do that, you may want to customize the port that metrics gets exposed to a pr
 
 Now, you can scrape Ray's metrics using Prometheus via ``:8080``.
 
+Alternate Prometheus host location
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+You can choose to run Prometheus on a non-default port or on a different machine. When doing so, you should
+make sure that Prometheus can scrape the metrics from your Ray nodes by following the instructions :ref:`here `.
+
+In addition, both Ray and Grafana need to know how to access this Prometheus instance. You can configure this
+by setting the `RAY_PROMETHEUS_HOST` environment variable when launching Ray. The variable takes the address used to access Prometheus. More
+info can be found :ref:`here `. By default, we assume Prometheus is hosted at `localhost:9090`.
+
+For example, if Prometheus is hosted at port 9000 on a node with IP 55.66.77.88, set the value to
+`RAY_PROMETHEUS_HOST=http://55.66.77.88:9000`.
+
+
+Alternate Grafana host location
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+You can choose to run Grafana on a non-default port or on a different machine. If you choose to do this, the
+:ref:`Dashboard ` needs to be configured with a public address to that service so the web page
+can load the graphs. You can do this with the `RAY_GRAFANA_HOST` environment variable when launching Ray. The variable takes
+the address used to access Grafana. More info can be found :ref:`here `. Instructions
+to use an existing Grafana instance can be found :ref:`here `.
+
+For the Grafana charts to work on the Ray dashboard, the browser of the dashboard user must be able to reach
+the Grafana service. If this browser cannot reach Grafana the same way the Ray head node can, you can use a separate
+environment variable, `RAY_GRAFANA_IFRAME_HOST`, to customize the host the browser uses to reach Grafana. If this is not set,
+we use the value of `RAY_GRAFANA_HOST` by default.
+
+For example, if Grafana is hosted at 55.66.77.88 on port 3000, set the value
+to `RAY_GRAFANA_HOST=http://55.66.77.88:3000`.
 
 Troubleshooting
 ---------------
@@ -284,4 +294,4 @@ Grafana dashboards are not embedded in the Ray dashboard
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 If you're getting error that `RAY_GRAFANA_HOST` is not setup despite you've set it up, please check:
 That you've included protocol in the URL (e.g. `http://your-grafana-url.com` instead of `your-grafana-url.com`).
-Also, make sure that url doesn't have trailing slash (e.g. `http://your-grafana-url.com` instead of `http://your-grafana-url.com/`).
\ No newline at end of file
+Also, make sure that url doesn't have trailing slash (e.g. `http://your-grafana-url.com` instead of `http://your-grafana-url.com/`).
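
The two troubleshooting checks above (protocol included, no trailing slash) can be verified before launching Ray. The following is only a minimal sketch under that assumption: ``check_grafana_host`` is a hypothetical helper written for illustration and is not part of Ray, and it tests nothing beyond those two conditions.

```python
from typing import List


def check_grafana_host(value: str) -> List[str]:
    """Return the problems found in a RAY_GRAFANA_HOST-style value.

    Hypothetical helper, not part of Ray. It only checks the two
    conditions named in the troubleshooting text.
    """
    problems = []
    # The URL must include the protocol, e.g. http://your-grafana-url.com
    if not value.startswith(("http://", "https://")):
        problems.append("missing protocol")
    # The URL must not end with a trailing slash.
    if value.endswith("/"):
        problems.append("trailing slash")
    return problems


if __name__ == "__main__":
    for candidate in ("http://55.66.77.88:3000", "55.66.77.88:3000"):
        print(candidate, check_grafana_host(candidate))
```

An empty list means the value passes both checks. It does not confirm that Grafana is actually reachable from the head node or from the browser.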