System monitoring #139
Comments
It'd be worth looking into OpenTelemetry to generate the metrics, as that seems to be what a lot of folks are gravitating towards. Python docs: https://opentelemetry.io/docs/instrumentation/python/getting-started/ |
Totally agree. Ray supports Prometheus metrics export out of the box, and OpenTelemetry has a metric exporter. |
I agree with OpenTelemetry as well (we use Jaeger in our cloud environment). |
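For reference, the Prometheus export discussed above is plain text served over HTTP. Below is a minimal, stdlib-only sketch of that text exposition format. The metric name `demo_jobs_running` and its labels are made up for illustration, label values are not escaped, and this is not the exporter code that Ray or OpenTelemetry actually ship.

```python
from typing import Dict, List, Tuple


def render_gauge(name: str, help_text: str,
                 samples: List[Tuple[Dict[str, str], float]]) -> str:
    """Render one gauge in the Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        # Labels are rendered as key="value" pairs, sorted for stable output.
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


text = render_gauge(
    "demo_jobs_running",  # hypothetical metric name
    "Number of jobs currently running.",
    [({"node_type": "worker"}, 3.0), ({"node_type": "head"}, 1.0)],
)
print(text)
```

A real setup would use an existing client or exporter library rather than hand-rolling the format; the sketch only shows what Prometheus ends up scraping.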
Just found out that kuberay has its own way to install prometheus / grafana... need to look into it more, but I may end up rolling back #275 in favor of their method |
We need to take care with that... to try to be as agnostic as possible to other components (in this case, Ray). I mean, if we consider that the kuberay option is the best one and is worth it, go ahead... but we need to keep in mind that if in the future we want to replace Ray (for example in favor of Knative or something different) we can do it with few changes because our system is modular and decoupled from specific technologies. |
If we can have some flags to enable/disable components and still run the middleware stack in the same way, that is the path :). So the idea is to give users the ability to include each component or not. |
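As a rough sketch of the enable/disable idea above, the snippet below turns a set of boolean flags into `--set key=value` arguments for `helm install`. The flag names and the value keys (`prometheus.enabled`, `grafana.enabled`, `alertmanager.enabled`) are hypothetical and are not taken from the project's actual chart.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MonitoringFlags:
    # All field names are hypothetical, not the project's real chart values.
    prometheus_enabled: bool = True
    grafana_enabled: bool = True
    alertmanager_enabled: bool = False


def helm_set_args(flags: MonitoringFlags) -> List[str]:
    """Translate the flags into `--set key=value` arguments for helm."""
    values = {
        "prometheus.enabled": flags.prometheus_enabled,
        "grafana.enabled": flags.grafana_enabled,
        "alertmanager.enabled": flags.alertmanager_enabled,
    }
    args: List[str] = []
    for key, enabled in values.items():
        # Helm expects lowercase true/false for boolean values.
        args += ["--set", f"{key}={str(enabled).lower()}"]
    return args


args = helm_set_args(MonitoringFlags(grafana_enabled=False))
print(" ".join(args))
```

Keeping each component behind its own flag means the rest of the stack installs the same way whether or not a given component is present.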
@pacomf based on first glance, it's more around how we configure prometheus to scrape the ray pods, so don't think we'll run into issues if we switch backends (though we may need to tweak what we're scraping...) @akihikokuroda I'm guessing our grafana will need to be able to display both logs and metrics... so as long as the one you're installing can be configured to pick up metrics too, I think that's fine |
One change they have is using https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack for the prometheus helm chart whereas I was using https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus ... the one I was using had an extra pushgateway and alertmanager (and didn't have grafana), so need to figure out if we actually need those extra things... |
@IceKhan13 @Tansito any thoughts on it? |
@psschwei which do you think is easier to manage / configure? Does kube-prometheus reduce the amount of work needed when setting up all the metrics scraping? Alertmanager is useful and will be incorporated later down the road, and it looks like kube-prometheus has it too. Also, @akihikokuroda added #293, a potential conflict with grafana in kube-stack. |
I just did a quick review and my comments are the following:
Analyzing the comments and the process, my assumptions are:
At least, this is the picture I got from reviewing everything, but I'm totally open to discussion. |
There are two parts here:
For the first one, I'm not sure there's a better way than what Kuberay recommends (i.e. using ServiceMonitors and PodMonitors). Part of the issue is that the typical way that Ray recommends doesn't really work on Kubernetes (since Prometheus is running in a separate pod from the Ray pod -- worth noting here that Ray on Kubernetes has some particularities, of which this is one). But I think we can adapt them to run in the Of course, the simplest solution here would be to just add the As for how we install Prometheus / which chart we use, the only real differences are the components they install. You still get prometheus with both of them. So if we'll need the alertmanager down the line, then it probably makes sense to stick with what's already there. |
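To make the PodMonitor approach mentioned above more concrete, here is a small sketch that builds a PodMonitor manifest as a Python dict and prints it as JSON (which `kubectl apply -f` also accepts). The pod label `ray.io/node-type` and the port name `metrics` follow KubeRay conventions as far as I know, but treat them, along with the resource name and namespace, as assumptions to check against the actual cluster. The resource is only accepted once the Prometheus operator CRDs are installed.

```python
import json
from typing import Any, Dict


def ray_pod_monitor(name: str, namespace: str, node_type: str,
                    port_name: str = "metrics") -> Dict[str, Any]:
    """Build a PodMonitor that selects Ray pods by node type."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PodMonitor",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Assumed label: pods carrying ray.io/node-type=<head|worker>.
            "selector": {"matchLabels": {"ray.io/node-type": node_type}},
            # Assumed port name: the container port exposing Ray metrics.
            "podMetricsEndpoints": [{"port": port_name}],
        },
    }


manifest = ray_pod_monitor("ray-workers-monitor", "default", "worker")
print(json.dumps(manifest, indent=2))
```

Because the selector only matches on pod labels, swapping Ray for another backend would mostly mean changing the label and port rather than the Prometheus installation itself.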
Hmm, one issue with the old chart is that it doesn't install the CRDs needed for pod/service monitors... |
Because the CRDs come from the kube-prometheus-stack crds folder, right? Not from And about your previous comment @psschwei, I totally agree. What I was trying to share is similar to what you are saying. I don't think we can follow Ray's steps, but I think it makes sense to study and adapt them to our use case. For Btw it seems that |
@Tansito yes, that's right. Also, kube-prometheus-stack uses the prometheus operator, whereas the regular prometheus chart doesn't. Given that our strategy generally prefers operators, it's probably better to make the switch. The metrics look great btw |
Beautiful metrics and dashboard @psschwei 😍 |
What is the expected behavior?
Add monitoring of the infrastructure and systems
metrics/system is prometheus
Issues in epic: