-
Notifications
You must be signed in to change notification settings - Fork 486
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposed OLM-based Monitoring Stack Solution for RH Managed Services and future needs. #866
Proposed OLM-based Monitoring Stack Solution for RH Managed Services and future needs. #866
Conversation
…and future needs. As an effect of the collaboration of Monitoring Team with Service Delivery org in our Monitoring Enhancement Working Group, with @fpetkovski would like to propose solution for immdiate needs for Managed Services (SaaS/Layered Services/Addons). Feedback very welcome! Please keep all offline discussions in this PR so everyone is up-to-date. Signed-by: Bartlomiej Plotka <bwplotka@gmail.com>
95a7553
to
ce1f129
Compare
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
Got bit upset by linter, produced issue: #869 (: (with some ideas how to improve it) |
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
/test markdownlint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very good! Thanks for the effort.
|
||
For simplicity reasons, routing alerts that come from managed service monitoring stacks will be handled by a single Alertmanager. | ||
We can take advantage of the recently added <code>[AlertmanagerConfig](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/api.md#alertmanagerconfig)</code> and allow MTO/MTSRE to create their own routing configuration in a centrally deployed Alertmanager instance. | ||
Finally, the MTSRE team could centrally configure receivers in one place for all managed services and service owners would simply use those receivers in their routing rules. It is worth noting that the last feature depends on [https://github.com/prometheus-operator/prometheus-operator/issues/4033](https://github.com/prometheus-operator/prometheus-operator/issues/4033). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see this issue as a blocker since it's already possible to configure the global Alertmanager configuration with a secret. Having said that supporting the same structures for global and per-namespace Alertmanager configurations would be nice indeed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍🏽
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@simonpasquier is it possible to use both the secret and an AlertmanagerConfig
? I thought that with the AlertmanagerConfig
everything gets scoped to the namespace from which the config is coming. If that is the case, how would a tenant reference a receiver from the common secret in their AlertmanagerConfig
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for working on this! ❤️
Added some suggestions and clarifying questions :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Solid chunk of work 💪
Mostly minor suggestions from my side
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, this sounds really good! Left a few comments/suggestions inline.
I'm not sure whether the following is out of scope for this document or not, but should we consider maintenance of this somewhere? Maybe this risk section?
I assume the Monitoring team will maintain this. Adding an additional operator to the teams maintenance workload seems worth mentioning.
/lgtm |
@cooktheryan our robot overlords have you set as the next reviewer. Is that correct? |
@jeremyeder I wish that was the case but I don't believe I am the correct individual to be making that decision. This should be tagged for someone else |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello, I am part of the MTSRE team and would like to thank you for putting this proposal together. This is awesome and I am looking forward to working together in order to make this new project happen.
I posted a few comments/suggestions, and they all have the same two underlying goals:
- Expose an SLO based interface to our tenants
- Make sure we perform active reconciliation on this interface (operator-pattern)
This is very important as the end goal here is to be able to measure SLOs/SLIs and error budgets. We should not ask our MTO to tweak or touch any prometheus/alertmanager/grafana configs. They need to spend their time doing feature development and bug fixing.
There exists various tools out there that could help us abstract away prometheus/alertmanager and grafana configs:
|
||
#### Story 1 | ||
|
||
As a MTSRE/MTO, I would like to define ServiceMonitors that will scrape metrics from my managed service by a Prometheus-compatible system. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to challenge the interface we want to expose. Instead of ServiceMonitor
it would be so much better to provide an SLO
-based interface. The reasoning is that we want to abstract as much of the monitoring stack as we can. We want MTO to focus on SLOs and SLIs, not being Prometheus experts. The story I suggest is this one:
As a MTSRE/MTO, I would like to define a SLO based CR that will create Prometheus-compatible config to scrape metrics from my managed service by a Prometheus-compatible system.
Example implementation:
PrometheusServiceLevel
CR from the sloth project: https://sloth.dev/examples/kubernetes/getting-started/
|
||
#### Story 3 | ||
|
||
As an MTSRE/MTO, I would like to create alerting and recording rules for Prometheus metrics exposed by my managed service. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again proposing to abstract alerting and recording rules away, in favor of PrometheusServiceLevel
CR or something similar (sloth):
As an MTSRE/MTO, I would like to create SLO-oriented CRs, that will translate to Prometheus alerting and recording rules for my managed service.
Example implementation:
PrometheusServiceLevel
CR from the sloth project: https://sloth.dev/examples/kubernetes/getting-started/
|
||
#### Story 4 | ||
|
||
As an MTSRE, I would like to configure routing for alerts defined on Prometheus metrics exposed by my SaaS/Addon. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again proposing to abstract routing away, in favor of PrometheusServiceLevel CR or something similar (sloth):
As an MTSRE/MTO, I would like to create SLO-oriented CRs, that will configure alerts defined on Prometheus metrics exposed by my SaaS/Addon.
Example implementation:
PrometheusServiceLevel
CR from the sloth project: https://sloth.dev/examples/kubernetes/getting-started/
|
||
#### Story 6 | ||
|
||
As an MTO, I would like to create dashboards for certain metrics across all clusters where the SaaS/Addon is deployed or used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think SLO dashboard can be abstracted as well. Creating grafana dashboards is extremely time consuming, and as an MTSRE engineer, I want to have an homogeneous environment when switching from one addon to another. Plus, we do not want MTO to become grafana dashboard experts, or spend any time on this. This is why I would highly suggest:
As an MTO, I would like to provide an SLO based CR, that automatically creates dashboards that allow me to visualize SLO/SLI and error budgets for certain metrics across all clusters where the SaaS/Addon is deployed or used.
Example implementation:
- Sloth SLO oriented grafana dashboards: https://sloth.dev/dashboards/
#### Story 7 | ||
|
||
As a Customer, I can use UWM without seeing Addons metrics, alerting or recording configurations. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding an extra requirement for MTSRE/MTO:
### Story 8
As an MTO, I would like to provide an SLO based CR, that is actively reconciled by a kubernetes-based operator, so that I can update my SLO/SLI configurations smoothly.
Example implementation:
- sloth operator/controller: https://sloth.dev/usage/kubernetes/
Hi @sugarraysam, thanks for the suggestions, having a higher-level abstraction on top of ServiceMonitors and Dashboards is an awesome idea. Since it comes a bit late into the proposal process and we've already started work on this project, could you create an issue with your suggestions instead? This is the repository for the project: https://github.com/rhobs/monitoring-stack-operator |
@fpetkovski @sugarraysam In the cases I have seen so far, including our own case - ODF MS, the underlying product (that is backing up the service) is the one responsible to deploy/manage/reconcile the monitoring resources (ServiceMonitors, PodMonitors, and PrometheusRuels). If we go with the proposed abstraction as the only way to configure the scraping we will create a situation where these kinds of tenants will have to "jump through many hops", and introduce some ugly workarounds, just to configure scraping. If we look at the big picture, taking into account all the deployed components (from infrastructure to the products themselves), we will see that all these workarounds are just creating a less stable solution. |
I think that we don't have to make a mutually-exclusive decision. We can still support all the existing APIs and add a few ones on top. |
@nb-ohad @fpetkovski I like the idea of exposing multiple interfaces, it makes it flexible and allows a transitioning period. I understand it might be some work to migrate the existing monitoring to an SLO-based approach, but in the long-run this is what we should aim for. Monitoring without SLO does not make much sense. Why are you alerting on X or Y? How does it relate to your users? CPU usage and memory consumption without context are not good measures. The whole industry is shifting towards SLO based monitoring. Will follow up with a feature request (issue) -> https://github.com/rhobs/monitoring-stack-operator (edit: this is what app-sre is doing in app-interface as well |
@sugarraysam I get your point, but your proposal is narrowing the solution to a single use case which is to facilitate service SLO tracking and visibility. There are other use cases for monitoring a managed service, for example, providing product-defined dashboard / OCP console plugins to the customer (as ODF is doing). The metrics that are needed to accomplish that are unrelated to SLO, the SLIs, or even the serviceability of the service. Saying that I don't see the proposed interface as a replacement of the "older" interface, maybe as an enhancement. This means that we should not put any expectation for a transition to happen for some offerings. |
f111642
to
969fb38
Compare
I updated the enhancement with the following changes:
As per your comments @sugarraysam thanks for your idea. It is being discussed in rhobs/observability-operator#13 and I think it's solid thing to implement on top of ServiceMonitors. I don't think we can remove everything else which is not SLO based. Monitoring is more than SLO. It's also about troubleshooting, sometimes billing, data analysis of features etc. So I would not trim all down to SLO usage only, even as MTSRE you have right to ignore anything else. Does it make sense? (: |
969fb38
to
a46d443
Compare
|
||
#### Tech Preview -> GA | ||
|
||
TBD | ||
* More options exposed in Monitoring Stack resource | ||
* We have onboard at lease few Addons on this feature. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"at lease" -> "at least"
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
What's missing on this to get this merged? 🤗 |
I scanned this whole thing again and since the SLO abstraction (which I also highly support) is scoped out, I would lgtm this as-is. |
/approve I think that everybody had a chance to share their feedback and the current version reflects as much as we can where we want to/should go. Thank you all for the constructive comments and thanks @bwplotka and @fpetkovski for this significant proposal :) |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: simonpasquier The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
As an effect of the collaboration of the Monitoring Team with Service Delivery org in our Monitoring Enhancement Working Group, we on behalf of our team we would like to propose with @fpetkovski a solution for immediate needs for Managed Services (SaaS/Layered Services/Addons).
Feedback is very welcome!
PTAL @alanmoran @smarterclayton @kbsingh @pb82 @simonpasquier @brampling
Please keep all offline discussions in this PR so everyone is up-to-date.
Signed-by: Bartlomiej Plotka bwplotka@gmail.com