Conditional gatherer of logs of unhealthy pods #509

Merged

Conversation

natiiix
Contributor

@natiiix natiiix commented Sep 22, 2021

This PR adds a new conditional gatherer that gathers pod logs based on firing alerts. It looks for the KubePodCrashLooping and KubePodNotReady alerts and collects the appropriate logs from the corresponding pods/containers.
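
As a rough illustration of the flow described above, the following Go sketch maps firing alerts to the pods and containers whose logs would be collected. The `Alert` and `LogTarget` types and the `logTargets` function are hypothetical names invented for this example, not the operator's actual code, and the label keys `namespace`, `pod`, and `container` are assumed to be present on the alerts.

```go
package main

import "fmt"

// Alert is a hypothetical, simplified view of a firing alert.
type Alert struct {
	Name   string
	Labels map[string]string
}

// LogTarget identifies whose logs to collect (hypothetical type).
// Container may be empty, meaning "no specific container".
type LogTarget struct {
	Namespace string
	Pod       string
	Container string
}

// logTargets returns one target per firing alert whose name is in watched
// and whose labels carry both a namespace and a pod.
func logTargets(firing []Alert, watched map[string]bool) []LogTarget {
	var targets []LogTarget
	for _, a := range firing {
		if !watched[a.Name] {
			continue // not an alert this gatherer reacts to
		}
		ns, pod := a.Labels["namespace"], a.Labels["pod"]
		if ns == "" || pod == "" {
			continue // cannot locate the pod without both labels
		}
		targets = append(targets, LogTarget{
			Namespace: ns,
			Pod:       pod,
			Container: a.Labels["container"],
		})
	}
	return targets
}

func main() {
	watched := map[string]bool{"KubePodCrashLooping": true, "KubePodNotReady": true}
	firing := []Alert{
		{Name: "KubePodCrashLooping", Labels: map[string]string{
			"namespace": "openshift-example", "pod": "example-0", "container": "main"}},
		{Name: "SomeOtherAlert", Labels: map[string]string{
			"namespace": "openshift-example", "pod": "example-1"}},
	}
	for _, t := range logTargets(firing, watched) {
		fmt.Printf("%+v\n", t)
	}
}
```

Passing the watched alert names as a parameter keeps the sketch independent of any particular alert, which matches the stated intent of reacting to alerts rather than to one fixed condition.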

Categories

  • Bugfix
  • Enhancement
  • Backporting
  • Others (CI, Infrastructure, Documentation)

Sample Archive

  • pkg/gatherers/conditional/gathering_rule.schema.json

Documentation

  • docs/gathered-data.md

Unit Tests

  • pkg/gatherers/conditional/gather_logs_of_unhealthy_pods_test.go

Privacy

Yes. There is no sensitive data in the newly collected information.

We are already gathering pod logs elsewhere and this gatherer should only be triggered for pods/alerts from openshift-* namespaces.
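
The namespace restriction mentioned above could be expressed as a simple prefix check. This is only a sketch: the helper name `isOpenShiftNamespace` is made up for illustration, and the PR text does not say whether the restriction is enforced in code or by the alerting rules themselves.

```go
package main

import (
	"fmt"
	"strings"
)

// isOpenShiftNamespace reports whether ns matches the openshift-* pattern,
// that is, whether it starts with the literal prefix "openshift-".
// Hypothetical helper, not taken from the PR.
func isOpenShiftNamespace(ns string) bool {
	return strings.HasPrefix(ns, "openshift-")
}

func main() {
	for _, ns := range []string{"openshift-monitoring", "default", "openshift"} {
		fmt.Printf("%s: %t\n", ns, isOpenShiftNamespace(ns))
	}
}
```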

Changelog

No.

Breaking Changes

No, there are only additions; no modification to existing gathering.

References

Jira Task: https://issues.redhat.com/browse/CCXDEV-5499

@openshift-ci

openshift-ci bot commented Sep 22, 2021

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 22, 2021
@natiiix natiiix force-pushed the conditional-unhealthy-pods-logs branch 2 times, most recently from 507a66d to 1760e64 Compare September 29, 2021 07:16
@natiiix natiiix force-pushed the conditional-unhealthy-pods-logs branch 2 times, most recently from 4f9d16c to acdb78b Compare October 11, 2021 08:08
@sferich888

Is log collection something we really want in 'insights'?

cc: @smarterclayton - I know you and I have talked on this in the past; have we changed our stance/scope on what insights should focus on?

@tremes
Contributor

tremes commented Oct 12, 2021

@sferich888 I think the answer is yes. We already have some gatherers collecting various container logs, and I believe some of them are really important, e.g., https://github.com/openshift/insights-operator/blob/master/docs/gathered-data.md#clusteroperatorpodsandevents. Other examples are:

Also note that these logs in this PR are gathered only when the corresponding alert is firing.

@natiiix natiiix marked this pull request as ready for review October 12, 2021 08:28
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 12, 2021
@natiiix natiiix changed the title [WIP] Conditional gatherer of logs of unhealthy pods Conditional gatherer of logs of unhealthy pods Oct 12, 2021
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 12, 2021
containerFilter := ""
if alertContainer != "" {
	containerFilter = fmt.Sprintf("^%s$", alertContainer)
}
Contributor

It's just cosmetic detail, but maybe doing:

		if alertContainer, ok := alertLabels["container"]; ok {
			containerFilter = fmt.Sprintf("^%s$", alertContainer)
		}

would be more straightforward

Contributor Author

@tremes You're right. The process that led to this code had many steps where this was all done very differently, so as I was moving parts of the code around, I didn't realize that alertContainer was no longer used further in the code (it used to be).
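
For readers following this thread, here is a minimal, self-contained Go sketch of the container filter being discussed. The function name `containerFilter` is hypothetical, and the use of `regexp.QuoteMeta` is an addition of this sketch rather than something shown in the PR. One detail worth noting: the comma-ok form from the suggestion yields `^$` when the label is present but empty, whereas the original snippet yields an empty filter in that case; the sketch keeps the original behaviour.

```go
package main

import (
	"fmt"
	"regexp"
)

// containerFilter returns an anchored regular expression that matches exactly
// the container named in the alert labels, or "" when the label is missing or
// empty. Hypothetical helper for illustration only.
func containerFilter(alertLabels map[string]string) string {
	name, ok := alertLabels["container"]
	if !ok || name == "" {
		return ""
	}
	// QuoteMeta escapes regex metacharacters so the name is matched literally
	// (an assumption of this sketch, not part of the reviewed code).
	return fmt.Sprintf("^%s$", regexp.QuoteMeta(name))
}

func main() {
	fmt.Printf("%q\n", containerFilter(map[string]string{"container": "main"}))
	fmt.Printf("%q\n", containerFilter(map[string]string{"pod": "example-0"}))
}
```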

@tremes
Contributor

tremes commented Oct 12, 2021

I didn't try to reproduce the alerts to trigger this new gatherer. I added a few minor suggestions, but it looks very good! Thanks. You need to satisfy the linting.

@natiiix
Contributor Author

natiiix commented Oct 12, 2021

/test lint

@natiiix
Contributor Author

natiiix commented Oct 13, 2021

/retest

@natiiix
Contributor Author

natiiix commented Oct 13, 2021

/test unit

@natiiix
Contributor Author

natiiix commented Oct 13, 2021

/test unit

@natiiix natiiix force-pushed the conditional-unhealthy-pods-logs branch from 8509358 to 345c9e6 Compare October 13, 2021 11:26
@@ -27,6 +27,8 @@ const (
	GatherAPIRequestCounts GatheringFunctionName = "api_request_counts_of_resource_from_alert"
)

const GatherLogsOfUnhealthyPods GatheringFunctionName = "logs_of_unhealthy_pods"
Contributor

We could probably make it more specific as in api_request_counts_of_resource_from_alert, but dunno.

Contributor Author

Well, I would like it if this gatherer could be used in general for any alert where we want to gather logs from the corresponding pod, so the only specification I could add would be that it's based on alerts.

Contributor

@Serhii1011010 Serhii1011010 left a comment

looks great

@natiiix
Contributor Author

natiiix commented Oct 25, 2021

/test e2e-agnostic-upgrade

@natiiix
Contributor Author

natiiix commented Oct 26, 2021

/test e2e-agnostic-upgrade

@xJustin
Contributor

xJustin commented Nov 1, 2021

/label docs-approved

@openshift-ci openshift-ci bot added the docs-approved Signifies that Docs has signed off on this PR label Nov 1, 2021
@sferich888

/label px-approved

@openshift-ci openshift-ci bot added the px-approved Signifies that Product Support has signed off on this PR label Nov 1, 2021
@quarckster
Contributor

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Nov 1, 2021
Contributor

@rluders rluders left a comment

/lgtm

@rluders
Contributor

rluders commented Nov 2, 2021

/label lgtm

@openshift-ci

openshift-ci bot commented Nov 2, 2021

@rluders: The label(s) /label lgtm cannot be applied. These labels are supported: platform/aws, platform/azure, platform/baremetal, platform/google, platform/libvirt, platform/openstack, ga, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, px-approved, docs-approved, qe-approved, downstream-change-needed, backport-risk-assessed, cherry-pick-approved

In response to this:

/label lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 2, 2021
@openshift-ci

openshift-ci bot commented Nov 2, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: natiiix, rluders

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 22ea9c9 into openshift:master Nov 2, 2021
Labels
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • docs-approved: Signifies that Docs has signed off on this PR
  • lgtm: Indicates that a PR is ready to be merged.
  • px-approved: Signifies that Product Support has signed off on this PR
  • qe-approved: Signifies that QE has signed off on this PR
9 participants