System monitoring #139

IceKhan13 · 2023-01-26T18:27:59Z

What is the expected behavior?

Add monitoring of the infrastructure and systems

ray clusters monitoring
supporting apps monitoring
kubernetes cluster monitoring

metrics/system is prometheus

Issues in epic:

psschwei · 2023-01-26T18:38:30Z

It'd be worth looking into Open Telemetry to generate the metrics, as that seems to be what a lot of folks are gravitating towards.

Python docs: https://opentelemetry.io/docs/instrumentation/python/getting-started/

IceKhan13 · 2023-01-26T18:46:59Z

Totally agree

Ray out of the box support prometheus metrics export. OpenTelemetry has metric exporter

pacomf · 2023-01-30T13:21:51Z

I agree with OpenTelemetry (We use Jaeger in our Cloud environment) as well

psschwei · 2023-03-16T20:39:08Z

Just found out that kuberay has its own way to install prometheus / grafana... need to look into it more, but I may end up rolling back #275 in favor of their method

pacomf · 2023-03-16T20:43:24Z

We need to take care with that... to try to be as much agnostic possible from other components (In this case, Ray). I mean, if we consider that the kuberay option is the best one and is worthy, go ahead... but we need to have in mind that if in the future we want to replace Ray (for example in favor of Knative or something different) we can do it with few changes because our system is modular and disengaged to specific technologies.

akihikokuroda · 2023-03-16T20:47:51Z

@psschwei @pacomf I'm just working to add Grafana as a sub chart into our helm. I'll add the grafanaEnable switch in the values file so it may not matter. Please let me know if I should not add it. Thanks!

pacomf · 2023-03-16T20:51:08Z

If we can have some flags to enable/disable components and allow run the middleware stack in the same way, it is the path :). So the idea is provide to the users the ability to add or not the components.

psschwei · 2023-03-16T20:58:11Z

@pacomf based on first glance, it's more around how we configure prometheus to scrape the ray pods, so don't think we'll run into issues if we switch backends (though we may need to tweak what we're scraping...)

@akihikokuroda I'm guessing our grafana will need to be able to display both logs and metrics... so as long as the one you're installing can be configured to pick up metrics too, I think that's fine

psschwei · 2023-03-16T21:05:26Z

One change they have is using https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack for the prometheus helm chart whereas I was using https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus ... the one I was using had an extra pushgateway and alertmanager (and didn't have grafana), so need to figure out if we actually need those extra things...

pacomf · 2023-03-16T21:10:00Z

@IceKhan13 @Tansito any thoughts on it?

IceKhan13 · 2023-03-17T01:42:29Z

@psschwei which do you think is easier to manage / configure? Does kube-prometheus reduces amout of work needed when setting up all metrics scraping?

Alertmanager is useful and will be incorporated later along the road and it looks like kube-prometheus has it too.

Also @akihikokuroda added #293 potential conflict with grafana in kube-stack.

Tansito · 2023-03-17T11:28:36Z

I just did a quick review and my comments are the next ones:

From the installation guide what the script does is install prometheus (something that we do in Setup prometheus in the system #275) plus prometheus-community/kube-prometheus-stack and internal configurations.
The installation has different opinionated decisions like the use of the namespace prometheus-system without a way to change it.

Analyzing the comments and the process my assumptions are:

I wouldn't revert Setup prometheus in the system #275 by now at least, taking into account that the kuberay script does the same thing that we do in its pull request.
I would analyze if it's worth it for us to introduce kube-prometheus-stack. It seems the stack that this chart is proposing is similar to our approach but I would like to review it more in depth. In this case I would put on-hold Add Promtail to system #293.
I wouldn't use kuberay approach directly taking into account that the installation is relatively manual and is not very configurable. But I think it is useful to take it into account if we decide to introduce kube-prometheus-stack.
I would prioritize in this case configuration possibilities over the kuberay approach taking into account that we have more applications, referencing what @pacomf was mentioning. Seeing that the steps from kuberay are relatively easy to replicate.

At least this is the vision that I got reviewing everything but I'm totally open to discuss.

psschwei · 2023-03-17T13:04:13Z

There's two parts here:

how we configure Prometheus to scrape metrics
how we install Prometheus

For the first one, I'm not sure there's a better way than what Kuberay recommends (i.e. using ServiceMonitors and PodMonitors). Part of the issue is that the typical way to that Ray recommends doesn't really work on Kubernetes (since Prometheus is running in a separate pod than the Ray pod -- worth noting here that Ray on Kubernetes has some particularities of which this is one). But I think we can adapt them to run in the ray namespace and pull for all ray clusters on the k8s cluster (rather than one per ray cluster as the kuberay docs have it) and make them part of the main helm chart (probably as a template). That would make things pretty automatic.

Of course, the simplest solution here would be to just add the prometheus.io/scrape: true annotation on the kuberay pods, but that wasn't grabbing everything when I did it yesterday (will revisit today before making a final call).

As for how we install Prometheus / which chart we use, the only real differences are the components they install. You still get prometheus with both of them. So if we'll need the alertmanager down the line, then it probably makes sense to stick with what's already there.

psschwei · 2023-03-17T16:03:56Z

Hmm, one issue with the old chart is that it doesn't install the CRDs needed for pod/service monitors...

Tansito · 2023-03-17T16:20:51Z

Because the CRDs come from kube-prometheus-stack crds folder, right? Not from prometheus.

And about your previous comment @psschwei totally agree. What I was trying to share is similar to what you are saying. I don't think we can follow Ray steps but I think it has sense to study an adapt them to our use case. For kube-prometheus-stack I'm reviewing different resources like this video and its repository and it looks promising.

Btw it seems that kube-prometheus-stack should replace prometheus-operator if you read different information around it.

psschwei · 2023-03-17T17:12:25Z

Because the CRDs come from kube-prometheus-stack crds folder, right? Not from prometheus.

@Tansito yes, that's right.

Also, the kube-prometheus-stack since it uses the prometheus operator, whereas the regular prometheus chart doesn't. Given that our strategy generally prefers operators, it's probably better to make the switch.

The metrics looks great btw

pacomf · 2023-03-17T17:28:43Z

Beautiful metrics and dashboard @psschwei 😍

IceKhan13 added enhancement New feature or request epic For epics and projects labels Jan 26, 2023

IceKhan13 added this to the 0.2 milestone Jan 26, 2023

IceKhan13 assigned akihikokuroda and unassigned akihikokuroda Jan 26, 2023

IceKhan13 added the priority: low Low priority label Jan 26, 2023

This was referenced Jan 31, 2023

Execution graph tracing and metrics #149

Closed

Add documentation on using redis, jaeger as external services, helm charts. #160

Closed

IceKhan13 added priority: high High priority and removed priority: low Low priority labels Mar 14, 2023

Tansito added the project: monitoring Label to identify features related with monitoring label Mar 15, 2023

Tansito mentioned this issue Mar 17, 2023

Add Promtail to system #293

Merged

psschwei mentioned this issue Mar 17, 2023

Export ray metrics to prometheus #296

Merged

psschwei self-assigned this Mar 21, 2023

psschwei mentioned this issue Mar 27, 2023

Add dashboard for repository metrics #368

Closed

Tansito mentioned this issue Mar 29, 2023

Add prometheus metrics in gateway #374

Closed

psschwei mentioned this issue May 3, 2023

Replace jaeger exporter with otlp exporter #497

Merged

psschwei closed this as completed in #497 May 4, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

System monitoring #139

System monitoring #139

IceKhan13 commented Jan 26, 2023 •

edited by psschwei

Loading

psschwei commented Jan 26, 2023

IceKhan13 commented Jan 26, 2023

pacomf commented Jan 30, 2023

psschwei commented Mar 16, 2023

pacomf commented Mar 16, 2023

akihikokuroda commented Mar 16, 2023

pacomf commented Mar 16, 2023

psschwei commented Mar 16, 2023

psschwei commented Mar 16, 2023

pacomf commented Mar 16, 2023

IceKhan13 commented Mar 17, 2023 •

edited

Loading

Tansito commented Mar 17, 2023

psschwei commented Mar 17, 2023

psschwei commented Mar 17, 2023

Tansito commented Mar 17, 2023 •

edited

Loading

psschwei commented Mar 17, 2023

pacomf commented Mar 17, 2023

System monitoring #139

System monitoring #139

Comments

IceKhan13 commented Jan 26, 2023 • edited by psschwei Loading

What is the expected behavior?

psschwei commented Jan 26, 2023

IceKhan13 commented Jan 26, 2023

pacomf commented Jan 30, 2023

psschwei commented Mar 16, 2023

pacomf commented Mar 16, 2023

akihikokuroda commented Mar 16, 2023

pacomf commented Mar 16, 2023

psschwei commented Mar 16, 2023

psschwei commented Mar 16, 2023

pacomf commented Mar 16, 2023

IceKhan13 commented Mar 17, 2023 • edited Loading

Tansito commented Mar 17, 2023

psschwei commented Mar 17, 2023

psschwei commented Mar 17, 2023

Tansito commented Mar 17, 2023 • edited Loading

psschwei commented Mar 17, 2023

pacomf commented Mar 17, 2023

IceKhan13 commented Jan 26, 2023 •

edited by psschwei

Loading

IceKhan13 commented Mar 17, 2023 •

edited

Loading

Tansito commented Mar 17, 2023 •

edited

Loading