Added proposal for HyperShift monitoring. #981
Conversation
A general comment – I'm having trouble reading the diagrams, because the terminology isn't consistent with https://hypershift-docs.netlify.app/reference/concepts/
### User Stories

TBD
This has to be fleshed out before this enhancement can be merged.
Do you mind helping us with that? @jeremyeder @csrwng
Looks like no help is coming, so I will try to fill in here what we know so far.
Thanks a lot for taking the time to put this together :)
I have some feedback, but at a high level my biggest concern is that this conflates "Requirements for Hypershift" with "Requirements for an observability product suited for the SD organization". IMHO this enhancement should only describe one of the two and the interface between them, but not both.
Now that I am at the end of the doc, my impression is that this mostly describes requirements for the observability service, not for Hypershift, as the number of changes for Hypershift seems to be tiny:
- Set up management cluster monitoring to remote_write (it is debatable if this is even something Hypershift should do).
- Deploy some kind of agent into the cluster that scrapes service-provider-owned components and remote_writes that into a central metrics service.
Is my understanding correct?
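For illustration of the first point, a minimal sketch of what configuring the management cluster's CMO for remote_write could look like; the URL is a placeholder and credentials handling is deliberately left out, since neither is defined by this enhancement:

```yaml
# Hypothetical sketch only: have the management cluster's Prometheus
# remote_write to a central metrics service.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      remoteWrite:
        - url: "https://rhobs.example.com/api/metrics/v1/hypershift/api/v1/receive"
          # Credentials/TLS settings omitted; they would be provisioned by the
          # service provider (e.g. via a secret), which this sketch does not define.
```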
## Alternatives

#### Pay for Observability Vendor
This is not an alternative to the proposed changes in Hypershift, this is an alternative to the in-house observability product.
I think it's a viable alternative; we already use one for logging AFAIK, no?
I feel it's integral to the solution we propose.
#### What About Layered Service / Addons Monitoring?

This was discussed and solved by [Monitoring Stacks](monitoring-stack-operator.md). In the HyperShift context, Addons requirements are no different from what we assume a Customer would want to do / use. So in the context of this enhancement, Addons are just another CUS.
I can see why this would be a non-goal above. It's easier to say they are just "Customer workloads".
However, I don't think this is the right attitude if we want addons to succeed as a whole.
I'm thinking internal (RH) here, where there's this RHOBS thing that is solving a lot of problems.
Maybe this isn't the right forum, but why wouldn't addons use that service?
Thanks, I see your point, but I don't think that is the message here. I never said Addons are not part of RHOBS. You are already part of it and using it, so the direction is definitely to have you in.
We are only trying to descope too many topics from one proposal. Whether we like it or not, our requirements for addon observability are really no different from customer ones (from what we have gathered so far). Addons are first-class citizens of the SD org in RHOBS, but AFAIK you can deploy the same way to HyperShift + this enhancement AND to normal OSD, and that is the point here, no? (:
Sorry if it came across as "addons" being less important. They are not; they are equally important. Is there a way I could improve the language here?
That makes sense.
Maybe all that's needed here is saying that RHOBS would still be the intended metrics & alerting solution for RH managed services on Hypershift, as per https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/monitoring-stack-operator.md?
Epic work! Similar to what @alvaroaleman already mentioned I think, I'd like to see more details about what needs to be changed in the cluster monitoring operator (CMO) to support the HyperShift deployment model. Describing how CMO works currently in HyperShift would be valuable too (I assume that @csrwng has the best knowledge here).
It would be good to detail which components depend on the monitoring interfaces today (console, metrics API, HPA, ...) and how they will be impacted (for instance, do HyperShift customers have access to the OCP console? If yes, from where would the console retrieve the metrics/alerts?).
Hi All! Please take a look at the latest update to this enhancement. Changes:
### Open Questions

* Who will configure CMO to allow remote writing metrics to RHOBS?
In both the management cluster and guest cluster case I expect CMO to be configured by SD. Specifically by App SRE as they manage the RHOBS instance and know where metrics should be routed, but in collaboration with SRE Platform to work through nuances such as guest clusters with no egress.
- mgmt CMO is standard OSD/ROSA by SREP?
- we should consider solving for guest clusters with no egress because that solves everything?
Let’s take a look at all parts:

1. Hosted control planes do not deploy any monitoring of their own, other than the relevant Service Monitors to be scraped.
2. For all control planes we propose to deploy the [Monitoring Stack Operator](https://github.com/rhobs/monitoring-stack-operator), which will deploy a set of Prometheus/Prometheus Agent instances that send data off-cluster to RHOBS. No local alerting or local querying will be allowed. This path will forward both monitoring data and the part of telemetry relevant for Telemeter.
What is the requirement for Prometheus here? How does the agent not suffice?
I would assume that it's a matter of time to market. The Prometheus operator doesn't support Prometheus in agent mode yet, though we have work in progress (prometheus-operator/prometheus-operator#3989).
In the meantime, I suppose that the monitoring stack operator could deploy a Prometheus server with no rules and without exposing the API endpoints.
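As a rough sketch of that interim approach, assuming the prometheus-operator CRDs; the selectors, namespace, and URL below are made up for illustration:

```yaml
# A Prometheus that only scrapes selected ServiceMonitors and remote_writes
# off-cluster, approximating agent mode: no ruleSelector is set, so no
# alerting/recording rules are loaded, and no route/ingress is created for its query API.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: control-plane-forwarder
  namespace: hosted-cp-example        # hypothetical namespace
spec:
  replicas: 1
  serviceMonitorSelector:
    matchLabels:
      hypershift.openshift.io/monitoring: "true"   # hypothetical label
  remoteWrite:
    - url: "https://rhobs.example.com/api/v1/receive"   # placeholder RHOBS endpoint
```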
👍🏽
To be sure I understand, there will be one "service" per management cluster scraping and forwarding metrics for all hosted control planes?
1. Ensuring control plane service monitors: HOSTEDCP-169
2. Ensuring a forwarding mechanism (e.g. using the monitoring stack and Prometheus/Prometheus Agent) on the management cluster.
3. Configuring it for each hosted control plane automatically, so it remote-writes to RHOBS using the correct credentials and the correct whitelist for both monitoring and telemetry.
4. Allowing deployment of CMO in the data plane: MON-2143
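To illustrate item 1 only, a control-plane ServiceMonitor could look roughly like this; the component name, namespace, labels, and port are hypothetical, not the actual HyperShift manifests:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kube-apiserver                       # hypothetical control plane component
  namespace: hosted-cp-example               # hypothetical hosted control plane namespace
  labels:
    hypershift.openshift.io/monitoring: "true"   # hypothetical label matched by the forwarder
spec:
  selector:
    matchLabels:
      app: kube-apiserver
  endpoints:
    - port: metrics
      interval: 30s
      scheme: https                          # TLS details omitted in this sketch
```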
This is already the case, as @csrwng commented above.
A few points discussed on the sync call this morning that need to be incorporated, if not already:
- need a consistent key on all metrics from a cluster to identify the cluster
- this includes all SRE and managed service metrics
- running telemeter-client to scrape and push worker, SRE, and managed service metrics might be a good solution
There would be changes needed, I think. We need a way to configure which metrics get shipped beyond the standard ones, and a way to set the frequency for scraping and remote writing. I don't know that these need to be different; it probably doesn't make sense for them to be, actually.
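For illustration, both points (a consistent cluster-identifying label and an allowlist of shipped metrics plus intervals) can be expressed in plain Prometheus configuration; all names, values, and the URL below are hypothetical:

```yaml
global:
  scrape_interval: 30s                 # scrape frequency (illustrative value)
  external_labels:
    _id: "example-cluster-uuid"        # consistent key identifying the cluster on every series
remote_write:
  - url: "https://rhobs.example.com/api/v1/receive"
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "up|cluster:.*|sre:.*"  # ship only an agreed-upon allowlist beyond the defaults
        action: keep
```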
Regardless of whether it's telemeter that ships or prom/remoteWrite that ships, SRE will need
CMO adds the
Remote write would just push samples as they get ingested by Prometheus. Though you can control the size of the shards, the time to wait before sending a batch, the max number of samples per batch, ... (see
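For reference, those knobs map to the remote_write queue_config block; the values below are illustrative only, not recommendations:

```yaml
remote_write:
  - url: "https://rhobs.example.com/api/v1/receive"   # placeholder endpoint
    queue_config:
      max_shards: 50                # upper bound on parallel sending shards
      max_samples_per_send: 2000    # max number of samples per batch
      batch_send_deadline: 5s       # max time to wait before sending a non-full batch
      capacity: 10000               # per-shard sample buffer
```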
/remove-lifecycle rotten
Thanks, resolving the CI
Changes:
* Changed data-plane details given feedback from in-cluster team
* Changed how HyperShift admins can forward metrics to RHOBS on data-plane
* Removed details around RHOBS.
/approve

During the last HyperShift meeting, involved parties agreed to move forward and merge the enhancement proposal.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: simonpasquier

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
/label tide/merge-method-squash
/lgtm
@bwplotka: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.