Insights operator up to date gathering #465
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: martinkunc.
### Goals

As part of this enhancement we plan to do the following:
* Release process change. On a fast-track release, publish Insights Operator fast-track changes during the build to a new quay tag, for example 4.5-fast-track.
Maybe worth splitting this into two enhancements: one allowing the gathering part to run from a different version of the image, and potentially another changing the process to support one way of selecting the images.
It still needs to be described how we are going to keep the upgraded Operator Deployment after the image URL change.
The Cluster Version Operator reverts the Operator Deployment config even if I just add a new container.
There are two possible ways: create a separate, non-CVO-managed Deployment for the Gatherer and make the Gatherer a separate Pod,
or find some way to make CVO keep changes to the second container image after the Gatherer image update. This is still an open question.
Even if CVO didn't revert the changes periodically, having the insights-operator manage a separate "gathering" deployment and its images sounds more compelling to me than having it update its own deployment.
You are right. Because of CVO reconciliation, we would need to create a separate Pod (I guess a ReplicaSet) dynamically and then drive the image version in it.
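For illustration only, here is a minimal Go sketch (using client-go) of what "creating a separate, non-CVO-managed workload and driving its image" could look like. The Deployment name, namespace, labels, and the fast-track pull spec are assumptions for the sketch, not details from the proposal.

```go
package main

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// ensureGathererDeployment creates a non-CVO-managed Deployment running only
// the gatherer image, so the operator can drive the image version without CVO
// reverting it. Names and labels here are illustrative.
func ensureGathererDeployment(ctx context.Context, client kubernetes.Interface, namespace, image string) error {
	replicas := int32(1)
	d := &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "insights-gatherer",
			Namespace: namespace,
		},
		Spec: appsv1.DeploymentSpec{
			Replicas: &replicas,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app": "insights-gatherer"},
			},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{
					Labels: map[string]string{"app": "insights-gatherer"},
				},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Name:  "gatherer",
						Image: image, // driven by the operator, e.g. a fast-track image
					}},
				},
			},
		},
	}
	_, err := client.AppsV1().Deployments(namespace).Create(ctx, d, metav1.CreateOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		fmt.Println("not running in a cluster:", err)
		return
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	if err := ensureGathererDeployment(context.Background(), client,
		"openshift-insights", "quay.io/openshift/origin-insights-operator:4-5-fast-track"); err != nil {
		fmt.Println("create failed:", err)
	}
}
```

An update path (patching only the container image on subsequent reconciles) would sit next to this, but the key point is the same: the workload is owned by the operator, not by CVO.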
### Image selection

There might be various ways to select the right new image.
The initial idea is to check for new versions in quay.io. We would need to create a new tag in quay,
Another option I would consider is taking the latest image available in the particular stream.
When CCX needs to release faster:
- To identify affected customers (suggest a workaround),
- To find real feature usage, especially in historical releases (because of all that waiting on merges),
- To fix an IO bug.
In general, this is something that should be handled by more caution before adding gathering endpoints, and by better testing and verification in master. So it's helped by fast fixes (everything is), but it's not solely dependent on this feature to fix.
The *main* will, instead of calling local Gather, call remote Gather as an API, but will keep all the status handling and uploading
in *main*.

The Insights Operator *gatherer* will become just a REST server on the operator framework, exposing a Gather endpoint.
I don't understand this. The gatherer isn't a server, it's a client (to other things) and then writes things to disk. The uploader takes things from disk and moves them to insights.
I do not see a justification for a server endpoint for the gatherer or for the uploader.
The idea is the gatherer would be a separate container (where the image could come from somewhere else) of the same pod: so when talking about the REST API here, it would still be just within the pod rather than exposing a global API.
I don't think we can solve this with one pod (as otherwise, the pod would need to update its own deployment, which doesn't sound very appealing and brings a lot of complications). What could be done is this split:
- main pod: the entry point for the operator, reads the configuration for the gatherer, chooses the right image, drives the life-cycle, runs one container with the image from the same version as the cluster itself
- gathering pod: deals with the gathering, has two containers:
- gatherer: runs from the updated image, responsible for building the payload (could be triggered via a private HTTP API)
- collector: runs the same version as the main pod, triggers the gathering, uploads the payloads, exposes metrics about the upload.
The core part is that only the gathering container gets updated, to minimize the risks.
There could even be a variant where the gatherer has no internal REST API and would just periodically dump the payloads to disk, as it does today, with the collector sending them out.
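As a concrete illustration of this split, here is a minimal Go sketch of the collector side. The localhost endpoint, port, and shared-volume path are assumptions for illustration, not details from the proposal.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

// gatherURL is a hypothetical private, pod-internal endpoint exposed by the
// gatherer container; the path and port are assumptions for illustration only.
const gatherURL = "http://127.0.0.1:8080/gather"

// triggerGather asks the gatherer container to build a payload and writes the
// returned archive into the shared volume, where the collector's existing
// upload logic can pick it up.
func triggerGather(sharedDir string) (string, error) {
	client := &http.Client{Timeout: 5 * time.Minute}
	resp, err := client.Post(gatherURL, "application/json", nil)
	if err != nil {
		return "", fmt.Errorf("triggering gather: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("gatherer returned %s", resp.Status)
	}

	// Name the archive the way the operator does today,
	// e.g. insights-2020-09-15-090843.tar.gz.
	name := fmt.Sprintf("insights-%s.tar.gz", time.Now().Format("2006-01-02-150405"))
	path := sharedDir + "/" + name

	out, err := os.Create(path)
	if err != nil {
		return "", err
	}
	defer out.Close()
	if _, err := io.Copy(out, resp.Body); err != nil {
		return "", err
	}
	return path, nil
}

func main() {
	path, err := triggerGather("/var/lib/insights-operator")
	if err != nil {
		fmt.Fprintln(os.Stderr, "gathering failed, falling back to built-in gatherer:", err)
		return
	}
	fmt.Println("payload ready for upload:", path)
}
```

Keeping the archive naming and the upload path in the collector is what lets the gatherer image change without touching the code that talks to the Insights backend.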
On the surface, if the Insights operator wants to ship updates separate from the OpenShift releases, the typical answer would be to ship Insights operator via OLM instead of bundled in OpenShift releases. I don't see it addressed in this proposal why that wasn't considered and why we're looking at a separate way to ship updated container images for this operator outside of either the typical CVO or OLM processes?
* Release process change. On a fast-track release, publish Insights Operator fast-track changes during the build to a new quay tag, for example 4.5-fast-track.
The rest of the release process will behave like today, releasing to 4.x.

There will be two containers in the Insights Operator Pod (a *main* container with the driver and a *gatherer* container with the Insights Operator).
This sounds like trying to bypass being part of the payload (which updates at a certain rate) and seems to align with a lifecycle delivered via OLM. This proposal should probably discuss "insights-operator delivered via OLM" as an alternative to this design, which would solve those problems more effectively.
Those are good points. Questions given we have an option for insights-operator to be delivered via OLM:
- would there be a way to have the OLM one installed by default?
- or would we keep the OLM one as optional, and still have the CVO one present by default?
- or could the insights-operator run an image from OLM instead of the CVO one, but still the same pod?
Alternatively, the insights-operator could collect payloads from external gatherers (while having its own by default), and these gatherers would be provided by OLM: it could be an updated version of IO itself, but there might be more, especially around other OLM operators for managed services.
I don't know much about the lifecycle of OLM, or whether it would allow faster updates. But combined with a fallback to standard gathering it makes sense.
I was thinking we could keep the current gathering, but move saving the archives to a shared volume, where whoever is interested could provide more files to upload in some schema.
For example we have insights-2020-09-15-090843.tar.gz, but any other operator could leave "some-other-data-[DATE].tar.gz" there, and the main IO would just upload it (latest by date).
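A minimal sketch of how the main operator could pick the latest archive from such a shared volume, assuming the volume is a plain directory and that modification time is a good enough proxy for the date embedded in the file names (both assumptions, not details from this thread):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// latestArchive returns the newest *.tar.gz file in the shared volume,
// using modification times as a stand-in for the date embedded in names such
// as insights-2020-09-15-090843.tar.gz or some-other-data-<DATE>.tar.gz.
func latestArchive(dir string) (string, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return "", err
	}
	var newest string
	var newestMod int64
	for _, e := range entries {
		if e.IsDir() || !strings.HasSuffix(e.Name(), ".tar.gz") {
			continue
		}
		info, err := e.Info()
		if err != nil {
			continue
		}
		if mod := info.ModTime().UnixNano(); newest == "" || mod > newestMod {
			newest = e.Name()
			newestMod = mod
		}
	}
	if newest == "" {
		return "", fmt.Errorf("no archives found in %s", dir)
	}
	return filepath.Join(dir, newest), nil
}

func main() {
	path, err := latestArchive("/var/lib/insights-operator")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println("next archive to upload:", path)
}
```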
I realised that the suggestion might look confusing because it mixes deployment via the standard CVO-operated operator with a custom deployment reaching quay manually. The reason for this was that the CVO-based operator would act as the "slow" branch, which we would fall back to from the "insights operator" downloaded by the custom mechanism.
It looks like combining CVO with an OLM-managed fast track would be more suitable. Would you see any problems with the classic CVO operator installing an operator using OLM and letting it gather (eventually falling back to the CVO-released one)?
The initial idea is to check for new versions in quay.io. We would need to create a new tag in quay,
for example fast-track, and we would be checking
```
quay.io/openshift/origin-insights-operator:4-5-fast-track
```
We need to support clusters where all images are mirrored to a private registry. I didn't see anything in the enhancement that talks about how it will work in those environments. All images in the core platform installation must be referenced by SHA today to work with ImageContentSourcePolicy.
```
quay.io/openshift/origin-insights-operator:4-5-fast-track
```

Note: I am using the origin-insights-operator image; I am not sure if there is a better URL to use for the OCP image.
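For illustration of how the tag-based idea could be reconciled with the by-SHA and mirroring concerns raised in this thread, here is a hedged Go sketch that resolves a tag to its manifest digest via the standard Docker Registry v2 API. Anonymous access is assumed and token handling is omitted; none of this is prescribed by the proposal.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
)

// resolveDigest asks the registry for the manifest digest behind a tag, so the
// driver could pin the gatherer container to an immutable by-digest pull spec
// (e.g. quay.io/openshift/origin-insights-operator@sha256:...), which is what
// mirrored-registry environments expect. Assumes the registry speaks the
// Docker Registry v2 API and returns the Docker-Content-Digest header.
func resolveDigest(registry, repo, tag string) (string, error) {
	url := fmt.Sprintf("https://%s/v2/%s/manifests/%s", registry, repo, tag)
	req, err := http.NewRequest(http.MethodHead, url, nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("Accept", "application/vnd.docker.distribution.manifest.v2+json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %s for %s", resp.Status, url)
	}

	digest := resp.Header.Get("Docker-Content-Digest")
	if digest == "" {
		return "", fmt.Errorf("registry did not return a digest for %s", url)
	}
	return digest, nil
}

func main() {
	digest, err := resolveDigest("quay.io", "openshift/origin-insights-operator", "4-5-fast-track")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Printf("pinned pull spec: quay.io/openshift/origin-insights-operator@%s\n", digest)
}
```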
All pull-spec translation for downstream (OCP) happens as part of our build processes. This is what the "image-references" file in the operator repos is for. But these processes are all designed to do translation to SHA pull-specs (no tags) for supporting mirrored registry environments.
What would be the possibilities? I guess extracting the image references from the latest ocp-release image would be one option (not a very optimal one); having an API that would point to the right image would be another. But I tend to agree that going the OLM way would cause less hassle.
(a 4.4.z fast-track cluster might get the gatherer from 4.4.z2 once available): the 4.4.5 gatherer would still need
to go through the backporting process to verify it works with that particular stream. Reverting on failure to
"release - non fast-track" will be in place.
3. Unification of fast-track streams. Instead of checking 4.5-fast-track on quay, the driver will be looking for a unified fast-track.
This (and 2 above) sound like OLM channels. I'm concerned this enhancement is re-implementing OLM's out of band update model but for a single image within a component that is part of the core release payload. The underlying problem you are describing sounds like you need to install an OLM operator by default as part of a cluster installation.
Good points, jwforres; we would need to reconsider the approach to be more OLM-specific. Thanks for all the good comments!
@sdodson, @bparees, and I met @iNecas and others responsible for insights-operator to briefly discuss this problem space last week. They explained how - in the months since this enhancement was first proposed - it has become ever more clear that in order to achieve their support experience objectives, they need to be able to react to newly discovered issues in real-world clusters by implementing new Insights rules, and these rules often require additional data to be gathered from clusters. In other words, the motivation for this enhancement remains relevant - if the only avenue for rolling out data gathering changes across releases is via an OCP release image update, then we'll be severely restricted in our ability to respond quickly. At a high level, separating the data collection code from the rest of the operator makes the most sense to me. If we can minimize the risk of a data gathering change disrupting the cluster, then allowing the operator to dynamically pull down those changes seems reasonable. /assign sdodson bparees markmc
Some specific points to consider when refreshing the proposal:

- If we lose the strong assurances provided by a signed release manifest by fetching code from a remote endpoint, how will we provide the same code signing guarantees?
- I think it'd be helpful to elaborate on why there are situations where quick turnaround on gathering updates is the only option. If we had infinite understanding of the product and the ability to deliver rules in the next minor release, are there scenarios where we still cannot predict data we may need to gather in the future? Are we just substituting flexibility in gathering code for better system design? For instance, thinking about the recent sudo vulnerabilities: if that were something Insights attempted to identify, designing a system that maps RHCOS versions to the version of sudo present in those versions is better than writing a rule which looks at the sudo package directly, because it provides a platform for all package-based vulnerability alerting without fleet updates.
- If we pursue either OLM or fetching gathering rules from a remote location, what facilities do we have to halt that rollout in the event that we discover problems with the new rules? CVO-managed rollouts can be halted via Cincinnati and in the future will have more advanced progressive rollout characteristics.
- Some of the questions from reviewers focused on the ability to run in disconnected environments; however, it seems that Insights provides no autonomous reporting. Do we know whether or not it's common for clusters to require disconnected image mirroring but have access to our backend services?
@openshift-bot: Closed this PR.
This proposal suggests how the Insights Operator could become untied from the release cycle and offer both stability and dynamic changes when adding new data to gather.