Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ruler querier service option #5081

Closed
wants to merge 22 commits into from
Closed

Conversation

gonzalez
Copy link
Contributor

What this PR does

This PR provides the option to deploy a read path dedicated to the ruler.
This allows you to scale user/dashboards queries independently from the ruler usage.

@gonzalez gonzalez requested a review from a team as a code owner May 25, 2023 16:43
Copy link
Contributor

@Logiraptor Logiraptor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks pretty good. I should have some time tomorrow to give it a try locally. In the meantime I've left a few comments.

operations/helm/charts/mimir-distributed/CHANGELOG.md Outdated Show resolved Hide resolved
operations/helm/charts/mimir-distributed/values.yaml Outdated Show resolved Hide resolved
operations/helm/charts/mimir-distributed/values.yaml Outdated Show resolved Hide resolved
Comment on lines +3746 to +3747
ruler_querier_service:
enabled: false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ruler_querier_service:
enabled: false
remote_rule_evaluation:
enabled: false

I think this name better aligns with the terminology we use elsewhere, WDYT?

Obviously, we'd need to update all the references in this PR as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to wire this in when the ruler-querier replicas are more than 0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this feature we would have 2 evaluation modes but with 2 implementations of remote_rule_evaluation (shared vs dedicated ruler query paths) We should probably distinguish those in the name ? Unless we simply have remote evaluation always deploy a dedicated query path ? @Logiraptor @dimitarvdimitrov

Copy link
Contributor

@dimitarvdimitrov dimitarvdimitrov May 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the changes you have only the dedicated path is used when remote_rule_evaluation.enabled: true

{{- if .Values.ruler_querier_service.enabled }}
query_frontend:
address: dns:///{{ template "mimir.fullname" . }}-ruler-query-frontend.{{ .Release.Namespace }}.svc:{{ include "mimir.serverHttpListenPort" . }}
{{- end }}

I agree that it should be an decision the user makes - to use shared vs dedicated. In this case the conditional linked above should work for both cases. We should also document on all of remote_rule_evaluation, ruler_query_frontend, ruler_querier, and ruler_query_scheduler how they interact and how to configure them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to wire this in when the ruler-querier replicas are more than 0?

@dimitarvdimitrov the default values for ruler-querier will always have the replicas > 0

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the operator can choose to set the dedicated rulers to 0 replicas and use the shared query path. This will give them e.g. query sharding and query scheduling, which the ruler itself doesn't support; all without having to deploy more resources.

I suggest to make it possible to deploy in both ways - dedicated and shared. Dedicated when the dedicated query-frontend replicas are more than 0 and shared when they are 0.

If you agree, then I'd also suggest to make the dedicated ruler path deploy with 0 replicas. The reason is that we aim to have the chart more compatible with smaller deployments, but still enable larger deployments. A dedicated ruler path is a somewhat advanced feature and many regular operators might not need. Once we do that it would be great to document how to setup the remote ruler evaluation both with dedicated query path and a shared query path.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thoughts on this @gonzalez?

Copy link
Contributor

@dimitarvdimitrov dimitarvdimitrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution. I reviewed it on a high level and haven't looked at the manifests in detail yet.

I think we should also add docs for using this. We have a page for running the chart in production. This is a good candidate for providing a bit more isolation between rules and the user queries for more sizable clusters. This is the doc I am referring to https://github.com/grafana/mimir/blob/335c64a12f3102fc0399fd9666f7d238ab857168/docs/sources/helm-charts/mimir-distributed/run-production-environment-with-helm/_index.md

operations/helm/charts/mimir-distributed/values.yaml Outdated Show resolved Hide resolved
@@ -0,0 +1,131 @@
{{- if .Values.ruler_querier_service.enabled -}}
apiVersion: apps/v1
kind: Deployment
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're having issues with the duplicated nginx deployment because its manifests are copied. WDYT about extracting this as a separate named template and invoking it for the regular querier and the ruler-querier (same for query-frontend...)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand - where are you seeing the duplicate nginx deployments - thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant this nginx deployment

image: {{ .Values.nginx.image.registry }}/{{ .Values.nginx.image.repository }}:{{ .Values.nginx.image.tag }}

and this one

image: {{ .nginx.image.registry }}/{{ .nginx.image.repository }}:{{ .nginx.image.tag }}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it will make it easier to make changes to these deployments in the future. Such as changing a default value in one deployment and making sure it's not lost on the other.

It the two templates are distinct, then when someone opens a PR to update one, there is nothing immediately telling neither the reviewers nor the author that there's another querier deployment that needs the same change. And at that stage there's drift between the two. I think the same applies to the query-frontend and query-scheduler.

This increases the effort for this PR, but I believe it makes the chart slightly more maintainable in the long run

operations/helm/charts/mimir-distributed/CHANGELOG.md Outdated Show resolved Hide resolved
Comment on lines +3746 to +3747
ruler_querier_service:
enabled: false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to wire this in when the ruler-querier replicas are more than 0?

operations/helm/charts/mimir-distributed/values.yaml Outdated Show resolved Hide resolved
gonzalez and others added 2 commits June 1, 2023 08:56
Co-authored-by: Patrick Oyarzun <patrick.oyarzun@grafana.com>
@lasermoth
Copy link
Contributor

I'm looking for this functionality currently, it doesn't look like this PR has been touched in a year.
Are we able to progress it?

@dimitarvdimitrov
Copy link
Contributor

If @gonzalez doesn't have the bandwidth to take care of this, I think anyone else should be able to complete this building on top of his work.

I haven't looked at the PR in a while, but IIRC the comments I left were the only major blockers to merging this.

It looks like there are a lot of conflicts, but they seem minor to me - the majority of files are just autogenerated manifests; make build-helm-tests after rebasing should resolve them

@alex5517
Copy link
Contributor

@dimitarvdimitrov,

I created a new PR which tries to add this feature: #7964

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants