Added proposal for HyperShift monitoring. #981
Conversation
A general comment – I'm having trouble reading the diagrams, because the terminology isn't consistent with https://hypershift-docs.netlify.app/reference/concepts/
### User Stories

TBD
This has to be fleshed out before this enhancement can be merged.
Do you mind helping us with that? @jeremyeder @csrwng
Looks like no help is coming, so I will try to fill in here what we know so far.
Thanks a lot for taking the time to put this together :)
I have some feedback, but at a high level my biggest concern is that this conflates "Requirements for Hypershift" with "Requirements for an observability product suited for the SD organization". IMHO this enhancement should only describe one of the two and the interface between them, but not both.
Now that I am at the end of the doc, my impression is that this mostly describes requirements for the observability service, not for Hypershift, as the number of changes for Hypershift seems to be tiny:
- Set up management cluster monitoring to remote_write (it is debatable if this is even something Hypershift should do).
- Deploy some kind of agent into the cluster that scrapes service-provider-owned components and remote_writes that into a central metrics service.
Is my understanding correct?
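For illustration of the first point, a minimal sketch of what configuring the management cluster's CMO for remote_write could look like; the URL is a placeholder and credentials handling is deliberately left out, since neither is defined by this enhancement:

```yaml
# Hypothetical sketch only: have the management cluster's Prometheus
# remote_write to a central metrics service.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      remoteWrite:
        - url: "https://rhobs.example.com/api/metrics/v1/hypershift/api/v1/receive"
          # Credentials/TLS settings omitted; they would be provisioned by the
          # service provider (e.g. via a secret), which this sketch does not define.
```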
## Alternatives

#### Pay for Observability Vendor
This is not an alternative to the proposed changes in Hypershift, this is an alternative to the in-house observability product.
I think it's a viable alternative; we already use one for logging AFAIK, no?
I feel it's integral to the solution we propose.
#### What About Layered Service / Addons Monitoring?

This was discussed and solved by [Monitoring Stacks](monitoring-stack-operator.md). In the HyperShift context, Addons requirements are no different from what we assume a Customer would want to do / use. So in the context of this enhancement, Addons are just another CUS.
I can see why this would be a non-goal above. It's easier to say they are just "Customer workloads".
However, I don't think this is the right attitude if we want addons to succeed as a whole.
I'm thinking internal (RH) here, where there's this RHOBS thing that is solving a lot of problems.
Maybe this isn't the right forum, but why wouldn't addons use that service?
Thanks, I see your point, but I don't think that is the message here. I never said Addons are not part of RHOBS. You are already part of it and using it, so the direction is definitely to have you in.
We are only trying to descope too many topics from one proposal. Whether we like it or not, our requirements for addon observability are really no different from customer ones (from what we have gathered so far). Addons are first-class citizens of the SD org in RHOBS, but AFAIK you can deploy the same way to HyperShift + this enhancement AND to normal OSD, and that is the point here, no? (:
Sorry if it came across as "addons" being less important. They are not; they are equally important. Is there a way I could improve the language here?
That makes sense.
Maybe all that's needed here is saying that RHOBS would still be the intended metrics & alerting solution for RH managed services on Hypershift, as per https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/monitoring-stack-operator.md?
Epic work! Similar to what @alvaroaleman already mentioned I think, I'd like to see more details about what needs to be changed in the cluster monitoring operator (CMO) to support the HyperShift deployment model. Describing how CMO works currently in HyperShift would be valuable too (I assume that @csrwng has the best knowledge here).
It would be good to detail which components depend on the monitoring interfaces today (console, metrics API, HPA, ...) and how they will be impacted (for instance, do HyperShift customers have access to the OCP console? If yes, from where would the console retrieve the metrics/alerts?).
Hi All! Please take a look at the latest update to this enhancement. Changes:
### Open Questions

* Who will configure CMO to allow remote writing metrics to RHOBS?
In both the management cluster and guest cluster case I expect CMO to be configured by SD. Specifically by App SRE as they manage the RHOBS instance and know where metrics should be routed, but in collaboration with SRE Platform to work through nuances such as guest clusters with no egress.
- mgmt CMO is standard OSD/ROSA by SREP?
- we should consider solving for guest clusters with no egress because that solves everything?
Let’s take a look at all parts:

1. Hosted control planes do not deploy any monitoring of their own, other than the relevant Service Monitors to be scraped.
2. For all control planes we propose to deploy the [Monitoring Stack Operator](https://github.com/rhobs/monitoring-stack-operator), which will deploy a set of Prometheus/Prometheus Agent instances that send data off-cluster to RHOBS. No local alerting or local querying will be allowed. This path will forward both monitoring data and the part of telemetry relevant for Telemeter.
What is the requirement for Prometheus here? How does the agent not suffice?
I would assume that it's a matter of time to market. The Prometheus operator doesn't support Prometheus in agent mode yet, though we have work in progress (prometheus-operator/prometheus-operator#3989).
In the meantime, I suppose that the monitoring stack operator could deploy a Prometheus server with no rules and without exposing the API endpoints.
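As a rough sketch of that interim approach, assuming the prometheus-operator CRDs; the selectors, namespace, and URL below are made up for illustration:

```yaml
# A Prometheus that only scrapes selected ServiceMonitors and remote_writes
# off-cluster, approximating agent mode: no ruleSelector is set, so no
# alerting/recording rules are loaded, and no route/ingress is created for its query API.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: control-plane-forwarder
  namespace: hosted-cp-example        # hypothetical namespace
spec:
  replicas: 1
  serviceMonitorSelector:
    matchLabels:
      hypershift.openshift.io/monitoring: "true"   # hypothetical label
  remoteWrite:
    - url: "https://rhobs.example.com/api/v1/receive"   # placeholder RHOBS endpoint
```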
👍🏽
To be sure I understand, there will be one "service" per management cluster scraping and forwarding metrics for all hosted control planes?
1. Ensuring control plane service monitors: HOSTEDCP-169
2. Ensuring a forwarding mechanism (e.g. using the monitoring stack and Prometheus/Prometheus Agent) on the management cluster.
3. Configuring it for each hosted control plane automatically, so it remote-writes to RHOBS using the correct credentials and the correct whitelist for both monitoring and telemetry.
4. Allowing deployment of CMO in the data plane: MON-2143
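To illustrate item 1 only, a control-plane ServiceMonitor could look roughly like this; the component name, namespace, labels, and port are hypothetical, not the actual HyperShift manifests:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kube-apiserver                       # hypothetical control plane component
  namespace: hosted-cp-example               # hypothetical hosted control plane namespace
  labels:
    hypershift.openshift.io/monitoring: "true"   # hypothetical label matched by the forwarder
spec:
  selector:
    matchLabels:
      app: kube-apiserver
  endpoints:
    - port: metrics
      interval: 30s
      scheme: https                          # TLS details omitted in this sketch
```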
This is already the case, as @csrwng commented above.
A few points discussed on the sync call this morning that need to be incorporated, if not already:
- need a consistent key on all metrics from a cluster to identify the cluster
- this includes all SRE and managed service metrics
- running telemeter-client to scrape and push worker, SRE, and managed service metrics might be a good solution
There would be changes needed, I think. We need a way to configure which metrics get shipped beyond the standard ones, and a way to set the frequency for scraping and remote writing. I don't know that these need to be different; it probably doesn't make sense for them to be, actually.
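For illustration, both points (a consistent cluster-identifying label and an allowlist of shipped metrics plus intervals) can be expressed in plain Prometheus configuration; all names, values, and the URL below are hypothetical:

```yaml
global:
  scrape_interval: 30s                 # scrape frequency (illustrative value)
  external_labels:
    _id: "example-cluster-uuid"        # consistent key identifying the cluster on every series
remote_write:
  - url: "https://rhobs.example.com/api/v1/receive"
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "up|cluster:.*|sre:.*"  # ship only an agreed-upon allowlist beyond the defaults
        action: keep
```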
Regardless of whether it's telemeter that ships or prom/remoteWrite that ships, SRE will need
CMO adds the
Remote write would just push samples as they get ingested by Prometheus. Though you can control the size of the shards, the time to wait before sending a batch, the max number of samples per batch, ... (see
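For reference, those knobs map to the remote_write queue_config block; the values below are illustrative only, not recommendations:

```yaml
remote_write:
  - url: "https://rhobs.example.com/api/v1/receive"   # placeholder endpoint
    queue_config:
      max_shards: 50                # upper bound on parallel sending shards
      max_samples_per_send: 2000    # max number of samples per batch
      batch_send_deadline: 5s       # max time to wait before sending a non-full batch
      capacity: 10000               # per-shard sample buffer
```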
/remove-lifecycle rotten
Thanks, resolving the CI
Changes:
* Changed data-plane details given feedback from in-cluster team
* Changed how HyperShift admins can forward metrics to RHOBS on data-plane
* Removed details around RHOBS.
/approve

During the last HyperShift meeting, involved parties agreed to move forward and merge the enhancement proposal.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: simonpasquier

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
/label tide/merge-method-squash
/lgtm
@bwplotka: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.