-
Notifications
You must be signed in to change notification settings - Fork 544
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ruler querier service option #5081
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this looks pretty good. I should have some time tomorrow to give it a try locally. In the meantime I've left a few comments.
operations/helm/charts/mimir-distributed/templates/ruler/ruler-dep.yaml
Outdated
Show resolved
Hide resolved
ruler_querier_service: | ||
enabled: false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ruler_querier_service: | |
enabled: false | |
remote_rule_evaluation: | |
enabled: false |
I think this name better aligns with the terminology we use elsewhere, WDYT?
Obviously, we'd need to update all the references in this PR as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to wire this in when the ruler-querier replicas are more than 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this feature we would have 2 evaluation modes but with 2 implementations of remote_rule_evaluation (shared vs dedicated ruler query paths) We should probably distinguish those in the name ? Unless we simply have remote evaluation always deploy a dedicated query path ? @Logiraptor @dimitarvdimitrov
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the changes you have only the dedicated path is used when remote_rule_evaluation.enabled: true
mimir/operations/helm/charts/mimir-distributed/values.yaml
Lines 337 to 340 in 8de84b0
{{- if .Values.ruler_querier_service.enabled }} | |
query_frontend: | |
address: dns:///{{ template "mimir.fullname" . }}-ruler-query-frontend.{{ .Release.Namespace }}.svc:{{ include "mimir.serverHttpListenPort" . }} | |
{{- end }} |
I agree that it should be an decision the user makes - to use shared vs dedicated. In this case the conditional linked above should work for both cases. We should also document on all of remote_rule_evaluation
, ruler_query_frontend
, ruler_querier
, and ruler_query_scheduler
how they interact and how to configure them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to wire this in when the ruler-querier replicas are more than 0?
@dimitarvdimitrov the default values for ruler-querier will always have the replicas > 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But the operator can choose to set the dedicated rulers to 0 replicas and use the shared query path. This will give them e.g. query sharding and query scheduling, which the ruler itself doesn't support; all without having to deploy more resources.
I suggest to make it possible to deploy in both ways - dedicated and shared. Dedicated when the dedicated query-frontend replicas are more than 0 and shared when they are 0.
If you agree, then I'd also suggest to make the dedicated ruler path deploy with 0 replicas. The reason is that we aim to have the chart more compatible with smaller deployments, but still enable larger deployments. A dedicated ruler path is a somewhat advanced feature and many regular operators might not need. Once we do that it would be great to document how to setup the remote ruler evaluation both with dedicated query path and a shared query path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thoughts on this @gonzalez?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution. I reviewed it on a high level and haven't looked at the manifests in detail yet.
I think we should also add docs for using this. We have a page for running the chart in production. This is a good candidate for providing a bit more isolation between rules and the user queries for more sizable clusters. This is the doc I am referring to https://github.com/grafana/mimir/blob/335c64a12f3102fc0399fd9666f7d238ab857168/docs/sources/helm-charts/mimir-distributed/run-production-environment-with-helm/_index.md
@@ -0,0 +1,131 @@ | |||
{{- if .Values.ruler_querier_service.enabled -}} | |||
apiVersion: apps/v1 | |||
kind: Deployment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're having issues with the duplicated nginx deployment because its manifests are copied. WDYT about extracting this as a separate named template and invoking it for the regular querier and the ruler-querier (same for query-frontend...)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand - where are you seeing the duplicate nginx deployments - thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant this nginx deployment
image: {{ .Values.nginx.image.registry }}/{{ .Values.nginx.image.repository }}:{{ .Values.nginx.image.tag }} |
and this one
mimir/operations/helm/charts/mimir-distributed/templates/gateway/gateway-dep.yaml
Line 78 in 18f1dd0
image: {{ .nginx.image.registry }}/{{ .nginx.image.repository }}:{{ .nginx.image.tag }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think it will make it easier to make changes to these deployments in the future. Such as changing a default value in one deployment and making sure it's not lost on the other.
It the two templates are distinct, then when someone opens a PR to update one, there is nothing immediately telling neither the reviewers nor the author that there's another querier deployment that needs the same change. And at that stage there's drift between the two. I think the same applies to the query-frontend and query-scheduler.
This increases the effort for this PR, but I believe it makes the chart slightly more maintainable in the long run
ruler_querier_service: | ||
enabled: false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to wire this in when the ruler-querier replicas are more than 0?
Co-authored-by: Patrick Oyarzun <patrick.oyarzun@grafana.com>
...helm/charts/mimir-distributed/templates/ruler-query-scheduler/ruler-query-scheduler-dep.yaml
Show resolved
Hide resolved
Co-authored-by: Patrick Oyarzun <patrick.oyarzun@grafana.com>
…o use the headless service
I'm looking for this functionality currently, it doesn't look like this PR has been touched in a year. |
If @gonzalez doesn't have the bandwidth to take care of this, I think anyone else should be able to complete this building on top of his work. I haven't looked at the PR in a while, but IIRC the comments I left were the only major blockers to merging this. It looks like there are a lot of conflicts, but they seem minor to me - the majority of files are just autogenerated manifests; |
I created a new PR which tries to add this feature: #7964 |
What this PR does
This PR provides the option to deploy a read path dedicated to the ruler.
This allows you to scale user/dashboards queries independently from the ruler usage.