
Insights operator up to date gathering #465

Closed

Conversation

martinkunc

This suggests how the Insights Operator could become untied from the release cycle and offer both stability and dynamic changes when adding new data.

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: martinkunc
To complete the pull request process, please assign jwforres
You can assign the PR to them by writing /assign @jwforres in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

### Goals

As part of this enhancement we plan to do the following:
* Release process change. On a fast-track release, publish Insights Operator fast-track changes during the build to a new quay tag, for example 4.5-fast-track.
Contributor

Maybe worth splitting into two enhancements: one allowing the gathering part to run from a different version of the image, and potentially another changing the process to support one way of selecting the images.

It still needs to be described how we are going to keep the upgraded Operator Deployment after the image URL change.
The Cluster Version Operator reverts the Operator Deployment config even if I just add a new container.
There are two possible ways: create a separate, non-CVO-managed Deployment for the Gatherer and make the Gatherer a separate Pod,
or find some way to make CVO keep changes to the second container image after the Gatherer image update. This is still
Contributor

Even if CVO didn't revert the changes periodically, the insights-operator managing a separate "gathering" deployment and its images sounds more compelling to me than it updating its own deployment.

Author

You are right. Because of CVO reconciliation, we would need to create a separate Pod (I guess a ReplicaSet) dynamically and then drive the image version in it.
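
For illustration, a minimal Go sketch (using client-go) of how the main operator could drive the image of such a separately managed gathering workload; the namespace, Deployment name, and container name below are assumptions, not part of the proposal:

```go
// Sketch: the main operator drives the image of a hypothetical, non-CVO-managed
// "insights-gatherer" Deployment; all names here are illustrative.
package driver

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// setGathererImage points the "gatherer" container of the separately managed
// Deployment at a newly selected image pull-spec.
func setGathererImage(ctx context.Context, client kubernetes.Interface, image string) error {
	deployments := client.AppsV1().Deployments("openshift-insights")
	d, err := deployments.Get(ctx, "insights-gatherer", metav1.GetOptions{})
	if err != nil {
		return fmt.Errorf("getting gathering deployment: %w", err)
	}
	for i := range d.Spec.Template.Spec.Containers {
		if d.Spec.Template.Spec.Containers[i].Name == "gatherer" {
			d.Spec.Template.Spec.Containers[i].Image = image
		}
	}
	_, err = deployments.Update(ctx, d, metav1.UpdateOptions{})
	return err
}
```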

### Image selection

There might be various ways to select the right new image.
The initial idea is to check for new versions in quay.io. We would need to create a new tag in quay,
Contributor

Another option I would consider is taking the latest image available in the particular stream.

When CCX needs to release faster:
- To identify affected customers (suggest a workaround),
- To find real feature usage, especially in historical releases (because of all the waiting on merges),
- To fix an IO bug
Contributor

In general this is something that should be handled with more caution before adding gathering endpoints, and with better testing and verification in master. So it's helped by fast fixes (everything is), but it's not solely dependent on this feature to fix.

Instead of the local Gather call, *main* will call a remote Gather API, but will keep all the status handling and uploading
in *main*.

The Insights Operator *gatherer* will become just a REST server on the operator framework, exposing a Gather endpoint.
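
A minimal sketch of what such an in-pod Gather endpoint could look like; the path, port, and archive handling are illustrative assumptions rather than the proposed design:

```go
// Sketch of a gatherer exposing a private, pod-local Gather endpoint.
package main

import (
	"log"
	"net/http"
)

func main() {
	mux := http.NewServeMux()
	// Triggered by the *main* container; the gatherer would run its collectors
	// and stream the resulting archive back in the response body.
	mux.HandleFunc("/gather", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/gzip")
		// ... run the gather functions and write the tar.gz archive to w ...
	})
	// Listen on localhost only so the API stays private to the pod.
	log.Fatal(http.ListenAndServe("127.0.0.1:8081", mux))
}
```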
Contributor

I don't understand this. The gatherer isn't a server, it's a client (to other things) and then writes things to disk. The uploader takes things from disk and moves them to insights.

I do not see a justification for a server endpoint for the gatherer or for the uploader.

Contributor

The idea is that the gatherer would be a separate container (whose image could come from somewhere else) of the same pod: so when talking about the REST API here, it would still be just within the pod rather than exposing a global API.

I don't think we can solve this with one pod (as otherwise, the pod would need to update its own deployment, which doesn't sound very appealing and brings a lot of complications). What could be done is this split:

  • main pod: the entry point for the operator, reads the configuration for the gatherer, chooses the right image, drives the life-cycle, runs one container with the image from the same version as the cluster itself
  • gathering pod: deals with the gathering, has two containers:
    • gatherer: runs from the updated image, responsible for building the payload (could be triggered via a private http API)
    • collector: runs the same version as the main pod, triggers the gathering, uploads the payloads, exposes metrics about the upload.

The core part is that only the gathering container gets updated, to minimize the risks.

There could even be a variant where the gatherer has no internal REST API and would just periodically dump the payloads to disk, as it does today, and the collector would send them out (as sketched below).
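
A rough sketch of the gathering pod in that shared-volume variant; the container names, mount path, and emptyDir choice are assumptions for illustration only:

```go
// Sketch of the two-container gathering pod: an updatable gatherer and a
// cluster-versioned collector sharing an emptyDir volume for the archives.
package gathering

import corev1 "k8s.io/api/core/v1"

func gatheringPodSpec(gathererImage, collectorImage string) corev1.PodSpec {
	archives := corev1.VolumeMount{Name: "archives", MountPath: "/var/lib/insights-archives"}
	return corev1.PodSpec{
		Volumes: []corev1.Volume{{
			Name:         "archives",
			VolumeSource: corev1.VolumeSource{EmptyDir: &corev1.EmptyDirVolumeSource{}},
		}},
		Containers: []corev1.Container{
			{
				// Only this container gets updated to the fast-track image.
				Name:         "gatherer",
				Image:        gathererImage,
				VolumeMounts: []corev1.VolumeMount{archives},
			},
			{
				// Same version as the main pod: triggers gathering, uploads
				// the payloads, and exposes metrics about the upload.
				Name:         "collector",
				Image:        collectorImage,
				VolumeMounts: []corev1.VolumeMount{archives},
			},
		},
	}
}
```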

@bbrowning

On the surface, if the Insights operator wants to ship updates separately from OpenShift releases, the typical answer would be to ship the Insights operator via OLM instead of bundling it in OpenShift releases. This proposal doesn't address why that wasn't considered, or why we're looking at a separate way to ship updated container images for this operator outside of either the typical CVO or OLM processes.

* Release process change. On a fast-track release, publish Insights Operator fast-track changes during the build to a new quay tag, for example 4.5-fast-track.
The rest of the release process will behave like today, releasing to 4.x.

There will be two containers in the Insights Operator Pod (a *main* container with the driver and a *gatherer* container with the Insights Operator).
Contributor

This sounds like trying to bypass being part of the payload (which updates at a certain rate) and seems to align with a lifecycle delivered via OLM. This proposal should probably discuss "insights-operator delivered via OLM" as an alternative to this design, which would solve those problems more effectively.

Contributor

Those are good points. Questions, given we have an option for the insights-operator to be delivered via OLM:

  • would there be a way to have the OLM one installed by default?
  • or would we keep the OLM one as optional, and still have the CVO present by default?
  • or could the insights-operator run an image from OLM instead of the CVO one, but still the same pod?

Contributor

Alternatively, the insights-operator could collect payloads from external gatherers (while having its own by default), with these gatherers being provided by OLM: it could be an updated version of the IO itself, but there might be more, especially around other OLM operators for managed services.

Author

I don't know much about the lifecycle of OLM, or whether it would allow faster updates. But combined with a fallback to standard gathering it makes sense.

I was thinking we could keep the current gathering and saving of archives to a volume, but move to a shared volume, where whoever is interested could provide more files to upload under some schema.
For example we have insights-2020-09-15-090843.tar.gz, but any other operator could leave "some-other-data-[DATE].tar.gz" here, which the main IO would just upload (the latest by date).
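
A small sketch of how the main IO could pick the newest archive from such a shared volume; the selection rule (newest by modification time) and the helper name are assumptions:

```go
// Sketch: find the newest archive on the shared volume so the main IO can
// upload payloads left there by other operators as well.
package collector

import (
	"os"
	"path/filepath"
	"time"
)

// latestArchive returns the path of the newest *.tar.gz in dir, or "" when
// the volume holds no archives yet.
func latestArchive(dir string) (string, error) {
	matches, err := filepath.Glob(filepath.Join(dir, "*.tar.gz"))
	if err != nil {
		return "", err
	}
	var newest string
	var newestTime time.Time
	for _, path := range matches {
		info, err := os.Stat(path)
		if err != nil {
			continue // the archive may have been rotated away in the meantime
		}
		if info.ModTime().After(newestTime) {
			newest, newestTime = path, info.ModTime()
		}
	}
	return newest, nil
}
```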

Author

I realised that the suggestion might look confusing because it mixes deployment using the standard CVO-operated operator with a custom deployment reaching quay manually. The reason for this was that the CVO-based operator would act as the "slow" branch, to which we would fall back from the "insights operator" downloaded by the custom mechanism.

It looks like using CVO and an OLM-managed fast track would be more suitable. Would you see any problems if the classic CVO operator installed the operator using OLM and let it gather (eventually falling back to the CVO-released one)?

The initial idea is to check for new versions in quay.io. We would need to create a new tag in quay,
for example fast-track, and we would be checking
```
quay.io/openshift/origin-insights-operator:4-5-fast-track
```

Member

We need to support clusters where all images are mirrored to a private registry. I didn't see anything in the enhancement that talks about how it will work in those environments. All images in the core platform installation must be referenced by SHA today to work with ImageContentSourcePolicy.

Note: I am using the origin-insights-operator image; I am not sure if there is a better URL to use for the OCP image.
Member

All pull-spec translation for downstream (OCP) happens as part of our build processes. This is what the "image-references" file in the operator repos is for. But these processes are all designed to do translation to SHA pull-specs (no tags) for supporting mirrored registry environments.
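
For illustration only, a sketch of how a driver could resolve the fast-track tag to a by-digest pull-spec before handing it to the gathering deployment; the go-containerregistry library and the hard-coded reference are assumptions, and this does not by itself answer the mirroring question above:

```go
// Sketch: resolve the fast-track tag to a pinned digest so the image can be
// referenced by SHA; the library choice and names are assumptions.
package driver

import (
	"fmt"

	"github.com/google/go-containerregistry/pkg/crane"
)

const fastTrackRef = "quay.io/openshift/origin-insights-operator:4-5-fast-track"

// newGathererImage returns a by-digest pull-spec when the fast-track tag no
// longer points at the digest currently running, or "" when nothing changed.
func newGathererImage(currentDigest string) (string, error) {
	digest, err := crane.Digest(fastTrackRef)
	if err != nil {
		return "", fmt.Errorf("resolving %s: %w", fastTrackRef, err)
	}
	if digest == currentDigest {
		return "", nil
	}
	return "quay.io/openshift/origin-insights-operator@" + digest, nil
}
```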

Contributor

What would the possibilities be? I guess extracting the image references from the latest ocp-release image would be one option (not a very optimal one); having an API that would point to the right image would be the other. But I tend to agree that going the OLM way would cause less hassle.

(a 4.4.z fast-track cluster might get the gatherer from 4.4.z2 once available): the 4.4.5 gatherer would still need
to go through the backporting process to verify it works with that particular stream. Reverting on failure to the
"release - non fast-track" image will be in place.
3. Unification of fast-track streams. Instead of checking 4.5-fast-track on quay, the driver will look for a unified fast-track tag.
Member

This (and 2 above) sounds like OLM channels. I'm concerned this enhancement is re-implementing OLM's out-of-band update model, but for a single image within a component that is part of the core release payload. The underlying problem you are describing sounds like you need to install an OLM operator by default as part of a cluster installation.

Author

Good points, jwforres; we would need to reconsider the approach to be more OLM-specific. Thanks for all the good comments!

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 17, 2020
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 16, 2021
@markmc
Contributor

markmc commented Jan 25, 2021

@sdodson, @bparees, and I met @iNecas and others responsible for insights-operator to briefly discuss this problem space last week.

They explained how - in the months since this enhancement was first proposed - it has become ever more clear that in order to achieve their support experience objectives, they need to be able to react to newly discovered issues in real-world clusters by implementing new Insights rules, and these rules often require additional data to be gathered from clusters.

In other words, the motivation for this enhancement remains relevant - if the only avenue for rolling out data gathering changes across releases is via an OCP release image update, then we'll be severely restricted in our ability to respond quickly.

At a high level, separating the data collection code from the rest of the operator makes the most sense to me. If we can minimize the risk of a data gathering change disrupting the cluster, then allowing the operator to dynamically pull down those changes seems reasonable.

/assign sdodson bparees markmc

@markmc
Contributor

markmc commented Jan 25, 2021

Some specific points to consider when refreshing the proposal:

  • Make this enhancement very specifically about updating the data gathering code in order to allow Insights rules to identify a newly discovered issue on real-world clusters

  • Detail the prior art of how updates are pushed for data gathering code for the RHEL insights-client

  • Explain how you will minimize the risk of a data gathering change causing disruption to the cluster

  • Take a close look at how must-gather works and whether you can build on it - e.g. the way a gather image gets run and the data copied out; the way multiple gather images are shipped; how it might be useful if users could use oc adm must-gather to run the insights gather image to take a look at what would be uploaded

  • Consider whether 'oc adm inspect' is something you build on (like must-gather does) - perhaps it could be enhanced to gain anonymization support, for example

  • Discuss where and how new data gathering code will be published - e.g. should the operator hit the insights API to find the latest gather image it should use

  • If you want to keep fixes for the operator itself in scope for the enhancement, then OLM seems like the right way to go. If you decide not to go the OLM route, add an explanation to the alternatives section about it.

@sdodson
Member

sdodson commented Feb 1, 2021

* Discuss where and how new data gathering code will be published - e.g. should the operator hit the insights API to find the latest gather image it should use

And if we lose the strong assurances provided by a signed release manifest by fetching code from a remote endpoint, how will we provide the same code-signing guarantees?

I think it'd be helpful to elaborate on why there are situations where a quick turnaround on gathering updates is the only option. If we had infinite understanding of the product and the ability to deliver rules in the next minor release, are there scenarios where we still cannot predict data we may need to gather in the future?

Are we just substituting flexibility in gathering code for better system design? For instance, thinking about the recent sudo vulnerabilities: if that were something Insights attempted to identify, designing a system that maps RHCOS versions to the version of sudo present in those versions is better than writing a rule which looks at the sudo package directly, because it provides a platform for all package-based vulnerability alerting without fleet updates.

If we pursue either OLM or fetching gathering rules from a remote location, what facilities do we have to halt that rollout in the event that we discover problems with the new rules? CVO-managed rollouts can be halted via Cincinnati and in the future will have more advanced progressive rollout characteristics.

Some of the questions from reviewers focused on the ability to run in disconnected environments; however, it seems that Insights provides no autonomous reporting, so do we know whether or not it's common for clusters to require disconnected image mirroring but still have access to our backend services?

@tremes
Contributor

tremes commented Feb 16, 2021

@markmc, @sdodson Thanks for the update. I will reopen this topic in my team (and with Ivan), and I think we will create a new enhancement proposal because Martin Kunc has left the company.

@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
