Bug 1942271: Gather openshift-cluster-version pods and events #381
Conversation
@wking: This pull request references Bugzilla bug 1941901, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug.
Requesting review from QA contact. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@wking: This pull request references Bugzilla bug 1942271, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug.
Requesting review from QA contact. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from d258760 to 8699181.
```go
c <- gatherResult{nil, []error{err}}
return

klog.V(2).Infof("Unable to find pods in namespace %s for cluster-version operator", namespace)
return records, nil
```
Maybe return the error here as well? By the way, what is the expected number of pods in this namespace? Only one?
Yeah, in most cases there will only be one. Returning nil here is pattern-matching the existing error-swallowing. If you want me to include this error in the return, do you want me to also patch that existing spot to return its error?
Ahh ok. Well it's true that it's not so critical for the gatherer, so we can swallow it.
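(For context, here is a minimal sketch of the two options being weighed in this thread. The function shape, `coreClient`, and the import paths are assumptions reconstructed from the diff fragments above, not the PR's exact code.)

```go
package clusterversion

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	corev1client "k8s.io/client-go/kubernetes/typed/core/v1"
	"k8s.io/klog/v2"

	"github.com/openshift/insights-operator/pkg/record"
)

// gatherCVOPods is a sketch, not the PR's code: it lists the pods in the
// given namespace and turns each one into a record.
func gatherCVOPods(ctx context.Context, coreClient corev1client.CoreV1Interface, namespace string) ([]record.Record, []error) {
	pods, err := coreClient.Pods(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, []error{err}
	}
	var records []record.Record
	if len(pods.Items) == 0 {
		// What the PR does: log at V(2) and swallow, matching the existing
		// error-swallowing pattern. The alternative raised above would be
		// to return an error instead, e.g.
		//   return nil, []error{fmt.Errorf("no pods in namespace %s", namespace)}
		klog.V(2).Infof("Unable to find pods in namespace %s for cluster-version operator", namespace)
		return records, nil
	}
	for i := range pods.Items {
		pod := &pods.Items[i]
		records = append(records, record.Record{
			Name: fmt.Sprintf("config/pod/%s/%s", pod.Namespace, pod.Name),
			Item: record.JSONMarshaller{Object: pod},
		})
	}
	return records, nil
}
```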
```go
pod := &pods.Items[i]

// TODO: shift after IsHealthyPod
records = append(records, record.Record{Name: fmt.Sprintf("config/pod/%s/%s", pod.Namespace, pod.Name), Item: record.JSONMarshaller{Object: pod}})
```
OK, but what is the plan with the shift? Removing things that were already in the archive can always cause some problems. I guess it's hopefully not a big deal in this case, but it's still good to know the plan :)
Plan is:
1. Land this PR for rhbz#1942271.
2. Backport this PR to 4.7.
3. Collect a few weeks of Insights tarballs.
4. Audit them for CVO tolerations, to see if Insights-reporting users have been making any changes we're concerned about clobbering.
5. Follow-up insights PR to address the TODOs here, so we only collect the CVO pod when it's unhealthy.
6. In parallel with 5, patch the CVO to fix rhbz#1941901.
I'm agnostic about whether 5 gets a bug and a backport to 4.7.z. Would be easy enough to hang on a third "insights operator collects CVO when the pod is healthy" bug series if we wanted to.
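(To make step 5 concrete: a hypothetical sketch of resolving the `// TODO: shift after IsHealthyPod` comment, written as a delta on the loop in the sketch earlier in this thread. `isHealthyPod` is an assumed helper, mirroring the health check the cluster-operators gatherer applies; the name and signature are assumptions, not the repository's actual API.)

```go
// Hypothetical follow-up for the TODO: only archive CVO pods that are
// unhealthy, once the tolerations audit no longer needs every pod.
now := time.Now()
for i := range pods.Items {
	pod := &pods.Items[i]
	if isHealthyPod(pod, now) {
		continue // skip healthy pods after the audit window closes
	}
	records = append(records, record.Record{
		Name: fmt.Sprintf("config/pod/%s/%s", pod.Namespace, pod.Name),
		Item: record.JSONMarshaller{Object: pod},
	})
}
```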
OK, thanks. So there's a plan to backport this to 4.7 as well, if I understand correctly.
@wking Can I ask you to run `make gen-doc`?
…Version

There's no reason to fetch the ClusterVersion twice, even if we are creating two Records based on its content. This also sets the stage for gathering additional items like sad cluster-version operator pods. I personally don't see a need to call out the fact that ClusterVersion included clusterID, but Tomas wanted it [1].

[1]: openshift#381 (comment)
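(Roughly the shape this commit message describes, sketched under assumptions: `configClient` is an OpenShift `ConfigV1Interface`, and using `record.JSONMarshaller` for the bare cluster ID is a simplification, not necessarily how the PR actually writes `config/id`.)

```go
// Fetch the ClusterVersion once and derive both records from it.
func gatherClusterVersion(ctx context.Context, configClient configv1client.ConfigV1Interface) ([]record.Record, []error) {
	config, err := configClient.ClusterVersions().Get(ctx, "version", metav1.GetOptions{})
	if err != nil {
		return nil, []error{err}
	}
	return []record.Record{
		{Name: "config/version", Item: record.JSONMarshaller{Object: config}},
		// config/id carries the bare cluster UUID (the 36-byte file in the
		// CI tar listing later in this thread).
		{Name: "config/id", Item: record.JSONMarshaller{Object: config.Spec.ClusterID}},
	}, nil
}
```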
Force-pushed from ad12c5c to 518c25a.
Checking an earlier CI run:

```
$ tar tvz < "$(ls | tail -n1)" | grep 'config/id\|config/version\|cluster-version'
-rw-r----- 0/0 7934 2021-03-24 04:36 events/openshift-cluster-version.json
-rw-r----- 0/0 7691 2021-03-24 04:36 config/pod/openshift-cluster-version/cluster-version-operator-bdf7f95f8-68qx2.json
-rw-r----- 0/0   36 2021-03-24 04:36 config/id
-rw-r----- 0/0 1975 2021-03-24 04:36 config/version.json
$ tar xOz config/pod/openshift-cluster-version/cluster-version-operator-bdf7f95f8-68qx2.json < "$(ls | tail -n1)" | jq -cS '.spec.tolerations[]'
{"effect":"NoSchedule","key":"node-role.kubernetes.io/master","operator":"Exists"}
{"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
{"effect":"NoSchedule","key":"node.kubernetes.io/network-unavailable","operator":"Exists"}
{"effect":"NoExecute","key":"node.kubernetes.io/not-ready","operator":"Exists","tolerationSeconds":120}
{"effect":"NoExecute","key":"node.kubernetes.io/unreachable","operator":"Exists","tolerationSeconds":120}
{"effect":"NoSchedule","key":"node.kubernetes.io/memory-pressure","operator":"Exists"}
```

So that looks good to me.
/retest
I ran it locally against a cluster-bot cluster. The …
/lgtm
/retest

Please review the full test history for this PR and help us cut down flakes.
@wking please rebase.
Using the native client code directly, instead of through gather's wrapper, is not much of a convenience hit. And this gives the client space to improve its client persistence going forward. Completes the transition begun in 73d3cfd (Move gather function execution into the Gatherer, also removes the link between gathering and insightsclient, 2020-12-03, openshift#279) to decouple the client from the gatherer. This also allows me to make getClusterVersion internal on the gather side, so I can shuffle its API a bit in future commits without breaking any consumers.
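(A sketch of what "using the native client code directly" can look like, assuming a `*rest.Config` is available from the operator's setup code; the function and variable names here are illustrative, not the PR's.)

```go
import (
	configv1client "github.com/openshift/client-go/config/clientset/versioned/typed/config/v1"
	"k8s.io/client-go/rest"
)

// newConfigClient builds the OpenShift config client directly, rather than
// through gather's wrapper, leaving room to tune the rest.Config (rate
// limits, timeouts) per client going forward.
func newConfigClient(kubeConfig *rest.Config) (configv1client.ConfigV1Interface, error) {
	return configv1client.NewForConfig(kubeConfig)
}
```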
Force-pushed from 518c25a to 2f6d165.
Using similar logic to what gatherClusterOperators uses, but for the ClusterVersion operator's namespace. Usually we'd only gather the pod YAML if the pod was failing, but in order to help audit tolerations for [1], at the moment I'm gathering it every time.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1941901
Generated with: $ make gen-doc
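(And the events half of the pod-and-events gather described in the commit message above, sketched with assumed names; the record name matches the events/openshift-cluster-version.json path in the CI tar listing earlier in this thread.)

```go
// Sketch: gather all events in the CVO namespace into a single record.
func gatherCVOEvents(ctx context.Context, coreClient corev1client.CoreV1Interface) ([]record.Record, []error) {
	events, err := coreClient.Events("openshift-cluster-version").List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, []error{err}
	}
	return []record.Record{{
		Name: "events/openshift-cluster-version",
		Item: record.JSONMarshaller{Object: events},
	}}, nil
}
```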
Force-pushed from 2f6d165 to aec63b2.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: tremes, wking. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
@wking: All pull requests linked via external trackers have merged: Bugzilla bug 1942271 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/cherrypick release-4.7
@wking: #381 failed to apply on top of branch "release-4.7":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Using similar logic to what gatherClusterOperators uses, but for the ClusterVersion operator's namespace. Usually we'd only gather the pod YAML if the pod was failing, but in order to help audit tolerations for rhbz#1941901, at the moment I'm gathering it every time.

With two initial pivots to set the stage:
Details on everything in the respective commit messages.