Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

System monitoring #139

Closed
IceKhan13 opened this issue Jan 26, 2023 · 17 comments · Fixed by #497
Closed

System monitoring #139

IceKhan13 opened this issue Jan 26, 2023 · 17 comments · Fixed by #497
Assignees
Labels
enhancement New feature or request epic For epics and projects priority: high High priority project: monitoring Label to identify features related with monitoring
Milestone

Comments

@IceKhan13
Copy link
Member

IceKhan13 commented Jan 26, 2023

What is the expected behavior?

Add monitoring of the infrastructure and systems

  • ray clusters monitoring
  • supporting apps monitoring
  • kubernetes cluster monitoring

metrics/system is prometheus

Issues in epic:

@IceKhan13 IceKhan13 added enhancement New feature or request epic For epics and projects labels Jan 26, 2023
@IceKhan13 IceKhan13 added this to the 0.2 milestone Jan 26, 2023
@IceKhan13 IceKhan13 added the priority: low Low priority label Jan 26, 2023
@psschwei
Copy link
Collaborator

It'd be worth looking into Open Telemetry to generate the metrics, as that seems to be what a lot of folks are gravitating towards.

Python docs: https://opentelemetry.io/docs/instrumentation/python/getting-started/

@IceKhan13
Copy link
Member Author

Totally agree

Ray out of the box support prometheus metrics export. OpenTelemetry has metric exporter

@pacomf
Copy link
Member

pacomf commented Jan 30, 2023

I agree with OpenTelemetry (We use Jaeger in our Cloud environment) as well

@psschwei
Copy link
Collaborator

Just found out that kuberay has its own way to install prometheus / grafana... need to look into it more, but I may end up rolling back #275 in favor of their method

@pacomf
Copy link
Member

pacomf commented Mar 16, 2023

We need to take care with that... to try to be as much agnostic possible from other components (In this case, Ray). I mean, if we consider that the kuberay option is the best one and is worthy, go ahead... but we need to have in mind that if in the future we want to replace Ray (for example in favor of Knative or something different) we can do it with few changes because our system is modular and disengaged to specific technologies.

@akihikokuroda
Copy link
Collaborator

@psschwei @pacomf I'm just working to add Grafana as a sub chart into our helm. I'll add the grafanaEnable switch in the values file so it may not matter. Please let me know if I should not add it. Thanks!

@pacomf
Copy link
Member

pacomf commented Mar 16, 2023

If we can have some flags to enable/disable components and allow run the middleware stack in the same way, it is the path :). So the idea is provide to the users the ability to add or not the components.

@psschwei
Copy link
Collaborator

@pacomf based on first glance, it's more around how we configure prometheus to scrape the ray pods, so don't think we'll run into issues if we switch backends (though we may need to tweak what we're scraping...)

@akihikokuroda I'm guessing our grafana will need to be able to display both logs and metrics... so as long as the one you're installing can be configured to pick up metrics too, I think that's fine

@psschwei
Copy link
Collaborator

One change they have is using https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack for the prometheus helm chart whereas I was using https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus ... the one I was using had an extra pushgateway and alertmanager (and didn't have grafana), so need to figure out if we actually need those extra things...

@pacomf
Copy link
Member

pacomf commented Mar 16, 2023

@IceKhan13 @Tansito any thoughts on it?

@IceKhan13
Copy link
Member Author

IceKhan13 commented Mar 17, 2023

@psschwei which do you think is easier to manage / configure? Does kube-prometheus reduces amout of work needed when setting up all metrics scraping?

Alertmanager is useful and will be incorporated later along the road and it looks like kube-prometheus has it too.

Also @akihikokuroda added #293 potential conflict with grafana in kube-stack.

@Tansito
Copy link
Member

Tansito commented Mar 17, 2023

I just did a quick review and my comments are the next ones:

  1. From the installation guide what the script does is install prometheus (something that we do in Setup prometheus in the system #275) plus prometheus-community/kube-prometheus-stack and internal configurations.
  2. The installation has different opinionated decisions like the use of the namespace prometheus-system without a way to change it.

Analyzing the comments and the process my assumptions are:

  • I wouldn't revert Setup prometheus in the system #275 by now at least, taking into account that the kuberay script does the same thing that we do in its pull request.
  • I would analyze if it's worth it for us to introduce kube-prometheus-stack. It seems the stack that this chart is proposing is similar to our approach but I would like to review it more in depth. In this case I would put on-hold Add Promtail to system #293.
  • I wouldn't use kuberay approach directly taking into account that the installation is relatively manual and is not very configurable. But I think it is useful to take it into account if we decide to introduce kube-prometheus-stack.
  • I would prioritize in this case configuration possibilities over the kuberay approach taking into account that we have more applications, referencing what @pacomf was mentioning. Seeing that the steps from kuberay are relatively easy to replicate.

At least this is the vision that I got reviewing everything but I'm totally open to discuss.

@psschwei
Copy link
Collaborator

There's two parts here:

  • how we configure Prometheus to scrape metrics
  • how we install Prometheus

For the first one, I'm not sure there's a better way than what Kuberay recommends (i.e. using ServiceMonitors and PodMonitors). Part of the issue is that the typical way to that Ray recommends doesn't really work on Kubernetes (since Prometheus is running in a separate pod than the Ray pod -- worth noting here that Ray on Kubernetes has some particularities of which this is one). But I think we can adapt them to run in the ray namespace and pull for all ray clusters on the k8s cluster (rather than one per ray cluster as the kuberay docs have it) and make them part of the main helm chart (probably as a template). That would make things pretty automatic.

Of course, the simplest solution here would be to just add the prometheus.io/scrape: true annotation on the kuberay pods, but that wasn't grabbing everything when I did it yesterday (will revisit today before making a final call).

As for how we install Prometheus / which chart we use, the only real differences are the components they install. You still get prometheus with both of them. So if we'll need the alertmanager down the line, then it probably makes sense to stick with what's already there.

@psschwei
Copy link
Collaborator

Hmm, one issue with the old chart is that it doesn't install the CRDs needed for pod/service monitors...

@Tansito
Copy link
Member

Tansito commented Mar 17, 2023

Because the CRDs come from kube-prometheus-stack crds folder, right? Not from prometheus.

And about your previous comment @psschwei totally agree. What I was trying to share is similar to what you are saying. I don't think we can follow Ray steps but I think it has sense to study an adapt them to our use case. For kube-prometheus-stack I'm reviewing different resources like this video and its repository and it looks promising.

Btw it seems that kube-prometheus-stack should replace prometheus-operator if you read different information around it.

@psschwei
Copy link
Collaborator

Because the CRDs come from kube-prometheus-stack crds folder, right? Not from prometheus.

@Tansito yes, that's right.

Also, the kube-prometheus-stack since it uses the prometheus operator, whereas the regular prometheus chart doesn't. Given that our strategy generally prefers operators, it's probably better to make the switch.

The metrics looks great btw

image

@pacomf
Copy link
Member

pacomf commented Mar 17, 2023

Beautiful metrics and dashboard @psschwei 😍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request epic For epics and projects priority: high High priority project: monitoring Label to identify features related with monitoring
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants