Bug 1942271: Gather openshift-cluster-version pods and events #381

Merged

Conversation

wking
Member

@wking wking commented Mar 24, 2021

Using similar logic to what gatherClusterOperators uses, but for the ClusterVersion operator's namespace. Usually we'd only gather the pod YAML if the pod was failing, but in order to help audit tolerations for rhbz#1941901, at the moment I'm gathering it every time.

With two initial pivots to set the stage:

  • pkg/gather/clusterconfig: Collapse GatherClusterID into GatherClusterVersion
  • pkg/insights/insightsclient: Inline client handling for ClusterVersion

Details on everything in the respective commit messages.

@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. label Mar 24, 2021
@openshift-ci-robot
Contributor

@wking: This pull request references Bugzilla bug 1941901, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.8.0) matches configured target release for branch (4.8.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @jianlinliu

In response to this:

Bug 1941901: Gather openshift-cluster-version pods and events

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Mar 24, 2021
@wking wking changed the title Bug 1941901: Gather openshift-cluster-version pods and events Bug 1942271: Gather openshift-cluster-version pods and events Mar 24, 2021
@openshift-ci-robot
Contributor

@wking: This pull request references Bugzilla bug 1942271, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.8.0) matches configured target release for branch (4.8.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @psimovec

In response to this:

Bug 1942271: Gather openshift-cluster-version pods and events


@wking wking force-pushed the gather-cluster-version-pods branch 4 times, most recently from d258760 to 8699181 Compare March 24, 2021 03:35
c <- gatherResult{nil, []error{err}}
return
klog.V(2).Infof("Unable to find pods in namespace %s for cluster-version operator", namespace)
return records, nil
Contributor

Maybe return the error here as well? By the way, what is the expected number of pods in this namespace? Only one?

Member Author

Yeah, in most cases there will only be one. Returning nil here pattern-matches this existing error-swallowing. If you want me to include this error in the return, do you want me to also patch that one to return its error?

Contributor

Ahh ok. Well it's true that it's not so critical for the gatherer, so we can swallow it.

pod := &pods.Items[i]

// TODO: shift after IsHealthyPod
records = append(records, record.Record{Name: fmt.Sprintf("config/pod/%s/%s", pod.Namespace, pod.Name), Item: record.JSONMarshaller{Object: pod}})
Contributor

OK but what is the plan with the shift? It can always cause some problems if you remove some things that were already in the archive. I guess it's hopefully not a big deal in this case, but it's still good to know the plan :)

Member Author

Plan is:

  1. Land this PR for rhbz#1942271.
  2. Backport this PR to 4.7.
  3. Collect a few weeks of Insights tarballs.
  4. Audit them for CVO tolerations, to see if Insights-reporting users have been making any changes we're concerned about clobbering.
  5. Follow-up insights PR to address the TODOs here, so we only collect the CVO pod when it's unhealthy.
  6. In parallel with 5, patch the CVO to fix rhbz#1941901.

I'm agnostic about whether 5 gets a bug and a backport to 4.7.z. Would be easy enough to hang on a third "insights operator collects CVO when the pod is healthy" bug series if we wanted to.

Contributor

OK, thanks. So there's a plan to backport this to 4.7 as well, if I understand correctly.
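
Step 5 of the plan in this thread (collect the CVO pod only when it is unhealthy) could look something like the sketch below. Everything here is assumed for illustration: the `Pod` type is a simplified stand-in for a Kubernetes pod, and `isHealthy` is an invented rule, not the `IsHealthyPod` helper named in the TODO, whose signature is not shown in this PR.

```go
package main

import "fmt"

// Pod is a simplified, hypothetical stand-in for a Kubernetes pod.
type Pod struct {
	Namespace string
	Name      string
	Ready     bool
	Restarts  int
}

// isHealthy is an assumed health rule for this sketch only.
func isHealthy(p Pod) bool {
	return p.Ready && p.Restarts == 0
}

// recordNames returns the archive paths for the given pods. With
// onlyUnhealthy set, healthy pods are skipped (the follow-up behavior);
// without it, every pod is recorded (the behavior of this PR).
func recordNames(pods []Pod, onlyUnhealthy bool) []string {
	var names []string
	for _, p := range pods {
		if onlyUnhealthy && isHealthy(p) {
			continue
		}
		names = append(names, fmt.Sprintf("config/pod/%s/%s", p.Namespace, p.Name))
	}
	return names
}

func main() {
	pods := []Pod{
		{Namespace: "openshift-cluster-version", Name: "cvo-healthy", Ready: true},
		{Namespace: "openshift-cluster-version", Name: "cvo-crashing", Ready: false, Restarts: 7},
	}
	fmt.Println("always:        ", recordNames(pods, false))
	fmt.Println("unhealthy only:", recordNames(pods, true))
}
```

Switching from the first mode to the second removes a path that earlier archives contained, which is the archive-compatibility concern raised at the top of this thread.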

@tremes
Contributor

tremes commented Mar 25, 2021

@wking Can I ask you to run make gen-doc (to update our simple doc in docs/gathered-data.md) in your branch? It's also slightly related to my comment at https://github.com/openshift/insights-operator/pull/381/files#r601169083. I'll approve the PR then :) Thanks.

wking added a commit to wking/insights-operator that referenced this pull request Mar 26, 2021
…Version

There's no reason to fetch the ClusterVersion twice, even if we are
creating two Records based on its content.  This also sets the stage
for gathering additional items like sad cluster-version operator pods.

I personally don't see a need to call out the fact that ClusterVersion
included clusterID, but Tomas wanted it [1].

[1]: openshift#381 (comment)
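
The idea in this commit message, fetching the ClusterVersion once and deriving both records from it, might be sketched as below. The types and the `fetcher` function are hypothetical stand-ins rather than the operator's real API, and the record names are assumed from the archive paths discussed in this PR.

```go
package main

import "fmt"

// ClusterVersion is a simplified, hypothetical stand-in for the real resource.
type ClusterVersion struct {
	ClusterID string
	Desired   string
}

// Record is a hypothetical stand-in for the operator's record type.
type Record struct {
	Name string
	Item string
}

// fetcher is an assumed abstraction over the API call that reads ClusterVersion.
type fetcher func() (*ClusterVersion, error)

// gatherClusterVersion performs a single fetch and builds both the version
// record and the cluster ID record from the same object.
func gatherClusterVersion(fetch fetcher) ([]Record, error) {
	cv, err := fetch()
	if err != nil {
		return nil, err
	}
	return []Record{
		{Name: "config/version", Item: cv.Desired},
		{Name: "config/id", Item: cv.ClusterID},
	}, nil
}

func main() {
	calls := 0
	fetch := func() (*ClusterVersion, error) {
		calls++ // count API round trips to show there is only one
		return &ClusterVersion{ClusterID: "00000000-0000-0000-0000-000000000000", Desired: "4.8.0"}, nil
	}
	records, err := gatherClusterVersion(fetch)
	if err != nil {
		fmt.Println("gather failed:", err)
		return
	}
	fmt.Println(len(records), "records from", calls, "fetch")
}
```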
@wking wking force-pushed the gather-cluster-version-pods branch 2 times, most recently from ad12c5c to 518c25a Compare March 26, 2021 17:28
@wking
Member Author

wking commented Mar 26, 2021

Can I ask you to run make gen-doc...

Done with 8699181 -> 518c25a.

@wking
Member Author

wking commented Mar 26, 2021

Checking an earlier CI run, which was b15a9131-b10c-499b-a60a-dea5810c7a73:

$ tar tvz < "$(ls | tail -n1)" | grep 'config/id\|config/version\|cluster-version'
-rw-r----- 0/0            7934 2021-03-24 04:36 events/openshift-cluster-version.json
-rw-r----- 0/0            7691 2021-03-24 04:36 config/pod/openshift-cluster-version/cluster-version-operator-bdf7f95f8-68qx2.json
-rw-r----- 0/0              36 2021-03-24 04:36 config/id
-rw-r----- 0/0            1975 2021-03-24 04:36 config/version.json
$ tar xOz config/pod/openshift-cluster-version/cluster-version-operator-bdf7f95f8-68qx2.json < "$(ls | tail -n1)" | jq -cS '.spec.tolerations[]'
{"effect":"NoSchedule","key":"node-role.kubernetes.io/master","operator":"Exists"}
{"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
{"effect":"NoSchedule","key":"node.kubernetes.io/network-unavailable","operator":"Exists"}
{"effect":"NoExecute","key":"node.kubernetes.io/not-ready","operator":"Exists","tolerationSeconds":120}
{"effect":"NoExecute","key":"node.kubernetes.io/unreachable","operator":"Exists","tolerationSeconds":120}
{"effect":"NoSchedule","key":"node.kubernetes.io/memory-pressure","operator":"Exists"}

So that looks good to me.
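
For auditing many tarballs, the jq check above could be automated along these lines. This is only a sketch under stated assumptions: the `Toleration` struct mirrors just the JSON fields visible in the output above, `unexpectedTolerations` and `fingerprint` are hypothetical helpers, and the `example.com/custom` key is invented to show a mismatch.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Toleration mirrors only the JSON fields seen in the gathered pod record.
type Toleration struct {
	Effect            string `json:"effect,omitempty"`
	Key               string `json:"key,omitempty"`
	Operator          string `json:"operator,omitempty"`
	TolerationSeconds *int64 `json:"tolerationSeconds,omitempty"`
}

// podDoc picks out .spec.tolerations from a pod JSON document.
type podDoc struct {
	Spec struct {
		Tolerations []Toleration `json:"tolerations"`
	} `json:"spec"`
}

// fingerprint builds a comparable string for a toleration.
func fingerprint(t Toleration) string {
	seconds := "-"
	if t.TolerationSeconds != nil {
		seconds = fmt.Sprint(*t.TolerationSeconds)
	}
	return fmt.Sprintf("%s|%s|%s|%s", t.Key, t.Operator, t.Effect, seconds)
}

// unexpectedTolerations returns the tolerations in podJSON that are absent
// from the baseline, i.e. candidates for user customization.
func unexpectedTolerations(podJSON []byte, baseline []Toleration) ([]Toleration, error) {
	var doc podDoc
	if err := json.Unmarshal(podJSON, &doc); err != nil {
		return nil, err
	}
	known := make(map[string]bool, len(baseline))
	for _, t := range baseline {
		known[fingerprint(t)] = true
	}
	var extra []Toleration
	for _, t := range doc.Spec.Tolerations {
		if !known[fingerprint(t)] {
			extra = append(extra, t)
		}
	}
	return extra, nil
}

func main() {
	baseline := []Toleration{
		{Effect: "NoSchedule", Key: "node-role.kubernetes.io/master", Operator: "Exists"},
	}
	// Invented sample document: the second toleration is not in the baseline.
	podJSON := []byte(`{"spec":{"tolerations":[
		{"effect":"NoSchedule","key":"node-role.kubernetes.io/master","operator":"Exists"},
		{"effect":"NoSchedule","key":"example.com/custom","operator":"Exists"}
	]}}`)
	extra, err := unexpectedTolerations(podJSON, baseline)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, t := range extra {
		fmt.Println("unexpected toleration:", fingerprint(t))
	}
}
```

In a real audit the baseline would be the six default tolerations listed above, and the input would be each archive's config/pod/openshift-cluster-version/ record.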

@tremes
Contributor

tremes commented Mar 27, 2021

/retest

@tremes
Contributor

tremes commented Mar 29, 2021

I ran it locally against a cluster-bot cluster. The config/version.json and config/id resources are gathered in the same way. The cluster-version-operator pod definition (from the openshift-cluster-version namespace) is always gathered.

@tremes
Contributor

tremes commented Mar 29, 2021

/lgtm

@openshift-ci-robot openshift-ci-robot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Mar 29, 2021
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 29, 2021
@tremes
Contributor

tremes commented Mar 30, 2021

@wking please rebase.

Using the native client code directly, instead of through gather's
wrapper, is not much of a convenience hit.  And this gives the client
space to improve its client persistence going forward.  Completes the
transition begun in 73d3cfd (Move gather function execution into
the Gatherer, also removes the link between gathering and
insightsclient, 2020-12-03, openshift#279) to decouple the client from the
gatherer.  This also allows me to make getClusterVersion internal on
the gather side, so I can shuffle its API a bit in future commits
without breaking any consumers.
…Version

There's no reason to fetch the ClusterVersion twice, even if we are
creating two Records based on its content.  This also sets the stage
for gathering additional items like sad cluster-version operator pods.

I personally don't see a need to call out the fact that ClusterVersion
included clusterID, but Tomas wanted it [1].

[1]: openshift#381 (comment)
@wking wking force-pushed the gather-cluster-version-pods branch from 518c25a to 2f6d165 Compare April 5, 2021 18:12
@openshift-ci-robot openshift-ci-robot removed lgtm Indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Apr 5, 2021
@wking
Member Author

wking commented Apr 5, 2021

Rebased with 518c25a -> 2f6d165

Using similar logic to what gatherClusterOperators uses, but for the
ClusterVersion operator's namespace.  Usually we'd only gather the pod
YAML if the pod was failing, but in order to help audit tolerations
for [1], at the moment I'm gathering it every time.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1941901
Generated with:

  $ make gen-doc
@wking wking force-pushed the gather-cluster-version-pods branch from 2f6d165 to aec63b2 Compare April 6, 2021 01:55
@tremes
Contributor

tremes commented Apr 6, 2021

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 6, 2021
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tremes, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 79935d0 into openshift:master Apr 6, 2021
@openshift-ci-robot
Contributor

@wking: All pull requests linked via external trackers have merged:

Bugzilla bug 1942271 has been moved to the MODIFIED state.

In response to this:

Bug 1942271: Gather openshift-cluster-version pods and events


@wking wking deleted the gather-cluster-version-pods branch April 6, 2021 18:20
@wking
Member Author

wking commented Apr 9, 2021

/cherrypick release-4.7

@openshift-cherrypick-robot

@wking: #381 failed to apply on top of branch "release-4.7":

Applying: pkg/insights/insightsclient: Inline client handling for ClusterVersion
Using index info to reconstruct a base tree...
M	pkg/gather/clusterconfig/version.go
M	pkg/insights/insightsclient/insightsclient.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/insights/insightsclient/insightsclient.go
Auto-merging pkg/gather/clusterconfig/version.go
Applying: pkg/gather/clusterconfig: Collapse GatherClusterID into GatherClusterVersion
Using index info to reconstruct a base tree...
M	pkg/gather/clusterconfig/0_gatherer.go
M	pkg/gather/clusterconfig/version.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/gather/clusterconfig/version.go
CONFLICT (content): Merge conflict in pkg/gather/clusterconfig/version.go
Auto-merging pkg/gather/clusterconfig/0_gatherer.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0002 pkg/gather/clusterconfig: Collapse GatherClusterID into GatherClusterVersion
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherrypick release-4.7


Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.
6 participants