Bug 1942271: Gather openshift-cluster-version pods and events #381
Conversation
@wking: This pull request references Bugzilla bug 1941901, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug.
Requesting review from QA contact. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@wking: This pull request references Bugzilla bug 1942271, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug.
Requesting review from QA contact. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from d258760 to 8699181.
```go
c <- gatherResult{nil, []error{err}}
return

klog.V(2).Infof("Unable to find pods in namespace %s for cluster-version operator", namespace)
return records, nil
```
Maybe return the error here as well? By the way, what is the expected number of pods in this namespace? Only one?
Yeah, in most cases there will only be one. Returning nil here is pattern-matching the existing error-swallowing. If you want me to include this error in the return, do you want me to also patch that existing spot to return its error?
Ahh ok. Well it's true that it's not so critical for the gatherer, so we can swallow it.
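(For context, here is a minimal sketch of the two options being weighed in this thread. The function shape, `coreClient`, and the import paths are assumptions reconstructed from the diff fragments above, not the PR's exact code.)

```go
package clusterversion

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	corev1client "k8s.io/client-go/kubernetes/typed/core/v1"
	"k8s.io/klog/v2"

	"github.com/openshift/insights-operator/pkg/record"
)

// gatherCVOPods is a sketch, not the PR's code: it lists the pods in the
// given namespace and turns each one into a record.
func gatherCVOPods(ctx context.Context, coreClient corev1client.CoreV1Interface, namespace string) ([]record.Record, []error) {
	pods, err := coreClient.Pods(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, []error{err}
	}
	var records []record.Record
	if len(pods.Items) == 0 {
		// What the PR does: log at V(2) and swallow, matching the existing
		// error-swallowing pattern. The alternative raised above would be
		// to return an error instead, e.g.
		//   return nil, []error{fmt.Errorf("no pods in namespace %s", namespace)}
		klog.V(2).Infof("Unable to find pods in namespace %s for cluster-version operator", namespace)
		return records, nil
	}
	for i := range pods.Items {
		pod := &pods.Items[i]
		records = append(records, record.Record{
			Name: fmt.Sprintf("config/pod/%s/%s", pod.Namespace, pod.Name),
			Item: record.JSONMarshaller{Object: pod},
		})
	}
	return records, nil
}
```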
```go
pod := &pods.Items[i]

// TODO: shift after IsHealthyPod
records = append(records, record.Record{Name: fmt.Sprintf("config/pod/%s/%s", pod.Namespace, pod.Name), Item: record.JSONMarshaller{Object: pod}})
```
OK, but what is the plan with the shift? Removing things that were already in the archive can always cause some problems. I guess it's hopefully not a big deal in this case, but it's still good to know the plan :)
Plan is:
1. Land this PR for rhbz#1942271.
2. Backport this PR to 4.7.
3. Collect a few weeks of Insights tarballs.
4. Audit them for CVO tolerations, to see if Insights-reporting users have been making any changes we're concerned about clobbering.
5. Follow-up insights PR to address the TODOs here, so we only collect the CVO pod when it's unhealthy.
6. In parallel with 5, patch the CVO to fix rhbz#1941901.
I'm agnostic about whether 5 gets a bug and a backport to 4.7.z. Would be easy enough to hang on a third "insights operator collects CVO when the pod is healthy" bug series if we wanted to.
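(To make step 5 concrete: a hypothetical sketch of resolving the `// TODO: shift after IsHealthyPod` comment, written as a delta on the loop in the sketch earlier in this thread. `isHealthyPod` is an assumed helper, mirroring the health check the cluster-operators gatherer applies; the name and signature are assumptions, not the repository's actual API.)

```go
// Hypothetical follow-up for the TODO: only archive CVO pods that are
// unhealthy, once the tolerations audit no longer needs every pod.
now := time.Now()
for i := range pods.Items {
	pod := &pods.Items[i]
	if isHealthyPod(pod, now) {
		continue // skip healthy pods after the audit window closes
	}
	records = append(records, record.Record{
		Name: fmt.Sprintf("config/pod/%s/%s", pod.Namespace, pod.Name),
		Item: record.JSONMarshaller{Object: pod},
	})
}
```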
OK, thanks. So there's a plan to backport this to 4.7 as well, if I understand correctly.
@wking Can I ask you to run `make gen-doc`?
…Version

There's no reason to fetch the ClusterVersion twice, even if we are creating two Records based on its content. This also sets the stage for gathering additional items like sad cluster-version operator pods. I personally don't see a need to call out the fact that ClusterVersion included clusterID, but Tomas wanted it [1].

[1]: openshift#381 (comment)
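(Roughly the shape this commit message describes, sketched under assumptions: `configClient` is an OpenShift `ConfigV1Interface`, and using `record.JSONMarshaller` for the bare cluster ID is a simplification, not necessarily how the PR actually writes `config/id`.)

```go
// Fetch the ClusterVersion once and derive both records from it.
func gatherClusterVersion(ctx context.Context, configClient configv1client.ConfigV1Interface) ([]record.Record, []error) {
	config, err := configClient.ClusterVersions().Get(ctx, "version", metav1.GetOptions{})
	if err != nil {
		return nil, []error{err}
	}
	return []record.Record{
		{Name: "config/version", Item: record.JSONMarshaller{Object: config}},
		// config/id carries the bare cluster UUID (the 36-byte file in the
		// CI tar listing later in this thread).
		{Name: "config/id", Item: record.JSONMarshaller{Object: config.Spec.ClusterID}},
	}, nil
}
```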
Force-pushed from ad12c5c to 518c25a.
Checking an earlier CI run:

```
$ tar tvz < "$(ls | tail -n1)" | grep 'config/id\|config/version\|cluster-version'
-rw-r----- 0/0 7934 2021-03-24 04:36 events/openshift-cluster-version.json
-rw-r----- 0/0 7691 2021-03-24 04:36 config/pod/openshift-cluster-version/cluster-version-operator-bdf7f95f8-68qx2.json
-rw-r----- 0/0   36 2021-03-24 04:36 config/id
-rw-r----- 0/0 1975 2021-03-24 04:36 config/version.json
$ tar xOz config/pod/openshift-cluster-version/cluster-version-operator-bdf7f95f8-68qx2.json < "$(ls | tail -n1)" | jq -cS '.spec.tolerations[]'
{"effect":"NoSchedule","key":"node-role.kubernetes.io/master","operator":"Exists"}
{"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
{"effect":"NoSchedule","key":"node.kubernetes.io/network-unavailable","operator":"Exists"}
{"effect":"NoExecute","key":"node.kubernetes.io/not-ready","operator":"Exists","tolerationSeconds":120}
{"effect":"NoExecute","key":"node.kubernetes.io/unreachable","operator":"Exists","tolerationSeconds":120}
{"effect":"NoSchedule","key":"node.kubernetes.io/memory-pressure","operator":"Exists"}
```

So that looks good to me.
/retest
I ran it locally against a cluster-bot cluster. The …
/lgtm
/retest

Please review the full test history for this PR and help us cut down flakes.
@wking please rebase.
Using the native client code directly, instead of through gather's wrapper, is not much of a convenience hit. And this gives the client space to improve its client persistence going forward. Completes the transition begun in 73d3cfd (Move gather function execution into the Gatherer, also removes the link between gathering and insightsclient, 2020-12-03, openshift#279) to decouple the client from the gatherer. This also allows me to make getClusterVersion internal on the gather side, so I can shuffle its API a bit in future commits without breaking any consumers.
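(A sketch of what "using the native client code directly" can look like, assuming a `*rest.Config` is available from the operator's setup code; the function and variable names here are illustrative, not the PR's.)

```go
import (
	configv1client "github.com/openshift/client-go/config/clientset/versioned/typed/config/v1"
	"k8s.io/client-go/rest"
)

// newConfigClient builds the OpenShift config client directly, rather than
// through gather's wrapper, leaving room to tune the rest.Config (rate
// limits, timeouts) per client going forward.
func newConfigClient(kubeConfig *rest.Config) (configv1client.ConfigV1Interface, error) {
	return configv1client.NewForConfig(kubeConfig)
}
```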
Force-pushed from 518c25a to 2f6d165.
Using similar logic to what gatherClusterOperators uses, but for the ClusterVersion operator's namespace. Usually we'd only gather the pod YAML if the pod was failing, but in order to help audit tolerations for [1], at the moment I'm gathering it every time.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1941901
Generated with: $ make gen-doc
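(And the events half of the pod-and-events gather described in the commit message above, sketched with assumed names; the record name matches the events/openshift-cluster-version.json path in the CI tar listing earlier in this thread.)

```go
// Sketch: gather all events in the CVO namespace into a single record.
func gatherCVOEvents(ctx context.Context, coreClient corev1client.CoreV1Interface) ([]record.Record, []error) {
	events, err := coreClient.Events("openshift-cluster-version").List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, []error{err}
	}
	return []record.Record{{
		Name: "events/openshift-cluster-version",
		Item: record.JSONMarshaller{Object: events},
	}}, nil
}
```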
Force-pushed from 2f6d165 to aec63b2.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: tremes, wking. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
@wking: All pull requests linked via external trackers have merged: Bugzilla bug 1942271 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/cherrypick release-4.7
@wking: #381 failed to apply on top of branch "release-4.7":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Using similar logic to what gatherClusterOperators uses, but for the ClusterVersion operator's namespace. Usually we'd only gather the pod YAML if the pod was failing, but in order to help audit tolerations for rhbz#1941901, at the moment I'm gathering it every time.

With two initial pivots to set the stage:
Details on everything in the respective commit messages.