
Insights operator up to date gathering #465

Closed

Conversation

martinkunc

This suggests how the Insights Operator could become untied from the release cycle and offer both stability and dynamic changes when adding new data.

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: martinkunc
To complete the pull request process, please assign jwforres
You can assign the PR to them by writing /assign @jwforres in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

### Goals

As part of this enhancement we plan to do the following:
* Release process change. On a fast-track release, publish Insights Operator fast-track changes during the build to a new quay tag, for example 4.5-fast-track.
Contributor

Maybe worth splitting into two enhancements: one allowing the gathering part to run from a different version of the image, and potentially another changing the process to support one way of selecting the images.

It still needs to be described how we are going to keep the upgraded Operator Deployment after the image URL change.
The Cluster Version Operator reverts the Operator Deployment config even if I just add a new container.
There are two possible ways: create a separate, non-CVO-managed Deployment for the Gatherer and make the Gatherer a separate Pod,
or find some way to make CVO keep changes to the second container image after the Gatherer image update. This is still
Contributor

Even if CVO didn't revert the changes periodically, the insights-operator managing a separate "gathering" deployment and its images sounds more compelling to me than it updating its own deployment.

Author

You are right. Because of CVO reconciliation, we would need to create a separate Pod (I guess a ReplicaSet) dynamically and then drive the image version in it.
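
For illustration, a minimal Go sketch (using client-go) of how the main operator could drive the image of such a separately managed gathering workload; the namespace, Deployment name, and container name below are assumptions, not part of the proposal:

```go
// Sketch: the main operator drives the image of a hypothetical, non-CVO-managed
// "insights-gatherer" Deployment; all names here are illustrative.
package driver

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// setGathererImage points the "gatherer" container of the separately managed
// Deployment at a newly selected image pull-spec.
func setGathererImage(ctx context.Context, client kubernetes.Interface, image string) error {
	deployments := client.AppsV1().Deployments("openshift-insights")
	d, err := deployments.Get(ctx, "insights-gatherer", metav1.GetOptions{})
	if err != nil {
		return fmt.Errorf("getting gathering deployment: %w", err)
	}
	for i := range d.Spec.Template.Spec.Containers {
		if d.Spec.Template.Spec.Containers[i].Name == "gatherer" {
			d.Spec.Template.Spec.Containers[i].Image = image
		}
	}
	_, err = deployments.Update(ctx, d, metav1.UpdateOptions{})
	return err
}
```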

### Image selection

There might be various ways to select the right new image.
The initial idea is to check for new versions in quay.io. We would need to create a new tag in quay,
Contributor

Another option I would consider is taking the latest image available in the particular stream.

When CCX needs to release faster:
- To identify affected customers (suggest a workaround),
- To find real feature usage, especially in historical releases (because of all the waiting on merges),
- To fix an IO bug
Contributor

In general this is something that should be handled with more caution before adding gathering endpoints, and with better testing and verification in master. So it's helped by fast fixes (everything is), but it's not solely dependent on this feature to fix.

Instead of the local Gather call, *main* will call a remote Gather API, but will keep all the status handling and uploading
in *main*.

The Insights Operator *gatherer* will become just a REST server on the operator framework, exposing a Gather endpoint.
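
A minimal sketch of what such an in-pod Gather endpoint could look like; the path, port, and archive handling are illustrative assumptions rather than the proposed design:

```go
// Sketch of a gatherer exposing a private, pod-local Gather endpoint.
package main

import (
	"log"
	"net/http"
)

func main() {
	mux := http.NewServeMux()
	// Triggered by the *main* container; the gatherer would run its collectors
	// and stream the resulting archive back in the response body.
	mux.HandleFunc("/gather", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/gzip")
		// ... run the gather functions and write the tar.gz archive to w ...
	})
	// Listen on localhost only so the API stays private to the pod.
	log.Fatal(http.ListenAndServe("127.0.0.1:8081", mux))
}
```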
Contributor

I don't understand this. The gatherer isn't a server, it's a client (to other things) and then writes things to disk. The uploader takes things from disk and moves them to insights.

I do not see a justification for a server endpoint for the gatherer or for the uploader.

Contributor

The idea is that the gatherer would be a separate container (whose image could come from somewhere else) of the same pod: so when talking about the REST API here, it would still be just within the pod rather than exposing a global API.

I don't think we can solve this with one pod (as otherwise, the pod would need to update its own deployment, which doesn't sound very appealing and brings a lot of complications). What could be done is this split:

  • main pod: the entry point for the operator, reads the configuration for the gatherer, chooses the right image, drives the life-cycle, runs one container with the image from the same version as the cluster itself
  • gathering pod: deals with the gathering, has two containers:
    • gatherer: runs from the updated image, responsible for building the payload (could be triggered via a private http API)
    • collector: runs the same version as the main pod, triggers the gathering, uploads the payloads, exposes metrics about the upload.

The core part is that only the gathering container gets updated, to minimize the risks.

There could even be a variant where the gatherer has no internal REST API and would just periodically dump the payloads to disk, as it does today, and the collector would send them out (as sketched below).
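
A rough sketch of the gathering pod in that shared-volume variant; the container names, mount path, and emptyDir choice are assumptions for illustration only:

```go
// Sketch of the two-container gathering pod: an updatable gatherer and a
// cluster-versioned collector sharing an emptyDir volume for the archives.
package gathering

import corev1 "k8s.io/api/core/v1"

func gatheringPodSpec(gathererImage, collectorImage string) corev1.PodSpec {
	archives := corev1.VolumeMount{Name: "archives", MountPath: "/var/lib/insights-archives"}
	return corev1.PodSpec{
		Volumes: []corev1.Volume{{
			Name:         "archives",
			VolumeSource: corev1.VolumeSource{EmptyDir: &corev1.EmptyDirVolumeSource{}},
		}},
		Containers: []corev1.Container{
			{
				// Only this container gets updated to the fast-track image.
				Name:         "gatherer",
				Image:        gathererImage,
				VolumeMounts: []corev1.VolumeMount{archives},
			},
			{
				// Same version as the main pod: triggers gathering, uploads
				// the payloads, and exposes metrics about the upload.
				Name:         "collector",
				Image:        collectorImage,
				VolumeMounts: []corev1.VolumeMount{archives},
			},
		},
	}
}
```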

@bbrowning

On the surface, if the Insights operator wants to ship updates separately from OpenShift releases, the typical answer would be to ship the Insights operator via OLM instead of bundling it in OpenShift releases. This proposal doesn't address why that wasn't considered, or why we're looking at a separate way to ship updated container images for this operator outside of either the typical CVO or OLM processes.

* Release process change. On a fast-track release, publish Insights Operator fast-track changes during the build to a new quay tag, for example 4.5-fast-track.
The rest of the release process will behave like today, releasing to 4.x.

There will be two containers in the Insights Operator Pod (a *main* container with the driver and a *gatherer* container with the Insights Operator).
Contributor

This sounds like trying to bypass being part of the payload (which updates at a certain rate) and seems to align with a lifecycle delivered via OLM. This proposal should probably discuss "insights-operator delivered via OLM" as an alternative to this design, which would solve those problems more effectively.

Contributor

Those are good points. Questions, given we have an option for the insights-operator to be delivered via OLM:

  • would there be a way to have the OLM one installed by default?
  • or would we keep the OLM one as optional, and still have the CVO present by default?
  • or could the insights-operator run an image from OLM instead of the CVO one, but still the same pod?

Contributor

Alternatively, the insights-operator could collect payloads from external gatherers (while having its own by default), with these gatherers being provided by OLM: it could be an updated version of the IO itself, but there might be more, especially around other OLM operators for managed services.

Author

I don't know much about the lifecycle of OLM, or whether it would allow faster updates. But combined with a fallback to standard gathering it makes sense.

I was thinking we could keep the current gathering and saving of archives to a volume, but move to a shared volume, where whoever is interested could provide more files to upload under some schema.
For example we have insights-2020-09-15-090843.tar.gz, but any other operator could leave "some-other-data-[DATE].tar.gz" here, which the main IO would just upload (the latest by date).
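
A small sketch of how the main IO could pick the newest archive from such a shared volume; the selection rule (newest by modification time) and the helper name are assumptions:

```go
// Sketch: find the newest archive on the shared volume so the main IO can
// upload payloads left there by other operators as well.
package collector

import (
	"os"
	"path/filepath"
	"time"
)

// latestArchive returns the path of the newest *.tar.gz in dir, or "" when
// the volume holds no archives yet.
func latestArchive(dir string) (string, error) {
	matches, err := filepath.Glob(filepath.Join(dir, "*.tar.gz"))
	if err != nil {
		return "", err
	}
	var newest string
	var newestTime time.Time
	for _, path := range matches {
		info, err := os.Stat(path)
		if err != nil {
			continue // the archive may have been rotated away in the meantime
		}
		if info.ModTime().After(newestTime) {
			newest, newestTime = path, info.ModTime()
		}
	}
	return newest, nil
}
```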

Author

I realised that the suggestion might look confusing because it mixes deployment using the standard CVO-operated operator with a custom deployment reaching quay manually. The reason for this was that the CVO-based operator would act as the "slow" branch, to which we would fall back from the "insights operator" downloaded by the custom mechanism.

It looks like using CVO and an OLM-managed fast track would be more suitable. Would you see any problems if the classic CVO operator installed the operator using OLM and let it gather (eventually falling back to the CVO-released one)?

The initial idea is to check for new versions in quay.io. We would need to create a new tag in quay,
for example fast-track, and we would be checking
```
quay.io/openshift/origin-insights-operator:4-5-fast-track
```

Member

We need to support clusters where all images are mirrored to a private registry. I didn't see anything in the enhancement that talks about how it will work in those environments. All images in the core platform installation must be referenced by SHA today to work with ImageContentSourcePolicy.

Note: I am using the origin-insights-operator image; I am not sure if there is a better URL to use for the OCP image.
Member

All pull-spec translation for downstream (OCP) happens as part of our build processes. This is what the "image-references" file in the operator repos is for. But these processes are all designed to do translation to SHA pull-specs (no tags) for supporting mirrored registry environments.
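
For illustration only, a sketch of how a driver could resolve the fast-track tag to a by-digest pull-spec before handing it to the gathering deployment; the go-containerregistry library and the hard-coded reference are assumptions, and this does not by itself answer the mirroring question above:

```go
// Sketch: resolve the fast-track tag to a pinned digest so the image can be
// referenced by SHA; the library choice and names are assumptions.
package driver

import (
	"fmt"

	"github.com/google/go-containerregistry/pkg/crane"
)

const fastTrackRef = "quay.io/openshift/origin-insights-operator:4-5-fast-track"

// newGathererImage returns a by-digest pull-spec when the fast-track tag no
// longer points at the digest currently running, or "" when nothing changed.
func newGathererImage(currentDigest string) (string, error) {
	digest, err := crane.Digest(fastTrackRef)
	if err != nil {
		return "", fmt.Errorf("resolving %s: %w", fastTrackRef, err)
	}
	if digest == currentDigest {
		return "", nil
	}
	return "quay.io/openshift/origin-insights-operator@" + digest, nil
}
```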

Contributor

What would the possibilities be? I guess extracting the image references from the latest ocp-release image would be one option (not a very optimal one); having an API that would point to the right image would be the other. But I tend to agree that going the OLM way would cause less hassle.

(a 4.4.z fast-track cluster might get the gatherer from 4.4.z2 once available): the 4.4.5 gatherer would still need
to go through the backporting process to verify it works with that particular stream. Reverting on failure to the
"release - non fast-track" image will be in place.
3. Unification of fast-track streams. Instead of checking 4.5-fast-track on quay, the driver will look for a unified fast-track tag.
Member

This (and 2 above) sounds like OLM channels. I'm concerned this enhancement is re-implementing OLM's out-of-band update model, but for a single image within a component that is part of the core release payload. The underlying problem you are describing sounds like you need to install an OLM operator by default as part of a cluster installation.

Author

Good points, jwforres; we would need to reconsider the approach to be more OLM-specific. Thanks for all the good comments!

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 17, 2020
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 16, 2021
@markmc
Contributor

markmc commented Jan 25, 2021

@sdodson, @bparees, and I met @iNecas and others responsible for insights-operator to briefly discuss this problem space last week.

They explained how - in the months since this enhancement was first proposed - it has become ever more clear that in order to achieve their support experience objectives, they need to be able to react to newly discovered issues in real-world clusters by implementing new Insights rules, and these rules often require additional data to be gathered from clusters.

In other words, the motivation for this enhancement remains relevant - if the only avenue for rolling out data gathering changes across releases is via an OCP release image update, then we'll be severely restricted in our ability to respond quickly.

At a high level, separating the data collection code from the rest of the operator makes the most sense to me. If we can minimize the risk of a data gathering change disrupting the cluster, then allowing the operator to dynamically pull down those changes seems reasonable.

/assign sdodson bparees markmc

@markmc
Contributor

markmc commented Jan 25, 2021

Some specific points to consider when refreshing the proposal:

  • Make this enhancement very specifically about updating the data gathering code in order to allow Insights rules to identify a newly discovered issue on real-world clusters

  • Detail the prior art of how updates are pushed for data gathering code for the RHEL insights-client

  • Explain how you will minimize the risk of a data gathering change causing disruption to the cluster

  • Take a close look at how must-gather works and whether you can build on it - e.g. the way a gather image gets run and the data copied out; the way multiple gather images are shipped; how it might be useful if users could use oc adm must-gather to run the insights gather image to take a look at what would be uploaded

  • Consider whether 'oc adm inspect' is something you build on (like must-gather does) - perhaps it could be enhanced to gain anonymization support, for example

  • Discuss where and how new data gathering code will be published - e.g. should the operator hit the insights API to find the latest gather image it should use

  • If you want to keep fixes for the operator itself in scope for the enhancement, then OLM seems like the right way to go. If you decide not to go the OLM route, add an explanation to the alternatives section about it.

@sdodson
Member

sdodson commented Feb 1, 2021

* Discuss where and how new data gathering code will be published - e.g. should the operator hit the insights API to find the latest gather image it should use

And if we lose the strong assurances provided by a signed release manifest by fetching code from a remote endpoint, how will we provide the same code-signing guarantees?

I think it'd be helpful to elaborate on why there are situations where a quick turnaround on gathering updates is the only option. If we had infinite understanding of the product and the ability to deliver rules in the next minor release, are there scenarios where we still cannot predict data we may need to gather in the future?

Are we just substituting flexibility in gathering code for better system design? For instance, thinking about the recent sudo vulnerabilities: if that were something Insights attempted to identify, designing a system that maps RHCOS versions to the version of sudo present in those versions is better than writing a rule which looks at the sudo package directly, because it provides a platform for all package-based vulnerability alerting without fleet updates.

If we pursue either OLM or fetching gathering rules from a remote location, what facilities do we have to halt that rollout in the event that we discover problems with the new rules? CVO-managed rollouts can be halted via Cincinnati and in the future will have more advanced progressive rollout characteristics.

Some of the questions from reviewers focused on the ability to run in disconnected environments; however, it seems that Insights provides no autonomous reporting, so do we know whether or not it's common for clusters to require disconnected image mirroring but still have access to our backend services?

@tremes
Contributor

tremes commented Feb 16, 2021

@markmc, @sdodson Thanks for the update. I will reopen this topic in my team (and with Ivan), and I think we will create a new enhancement proposal because Martin Kunc has left the company.

@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
