---
title: Must-Gather
authors:
  - "@deads2k"
reviewers:
  - "@derekwaynecarr"
  - "@soltysh"
  - "@mfojtik"
approvers:
  - "@derekwaynecarr"
creation-date: 2019-09-09
last-updated: 2019-09-09
status: implemented
see-also:
replaces:
superseded-by:
---

Must-Gather

Taken from (since deleted) must-gather.md

Release Signoff Checklist

  • Enhancement is implementable
  • Design details are appropriately documented from clear requirements
  • Test plan is defined
  • Graduation criteria for dev preview, tech preview, GA
  • User-facing documentation is created in [openshift/docs] 

Summary

To debug something broken in the cluster, it is important to have a single command that an unskilled customer can run to gather all the information we may need to solve the problem. If you're familiar with sosreport, the idea is to have that, but focused on a Kubernetes cluster instead of a host. We need to avoid the version skew complexity and the scaling problems inherent in a shared repo with input from multiple products.

Motivation

Gather everything. Software breaks, and you aren't smart enough to know exactly what you need to debug the problem ahead of time; you figure that out after you've debugged the problem. This tool is about the first shotgun gathering, so you only have to ask the customer once.

It must be simple. You're gathering because your software is buggy and hard to use. The more complex your gathering software, the more likely it is that the gathering software fails too. Simplify your gathering software by only using a matching version of the gathering tool. This simplifies code and test matrices so that your tool always works. For instance, OpenShift payloads include the exact level of gathering used to debug that payload.

Own your destiny. You own shipping your own software. If you can be trusted to ship your software, you can be trusted to ship a gathering tool to match it; don't let your gathering be gated by another product. It may seem easier to start that way, but ultimately you'll end up constrained by different motivations, styles, and cadences. If you can ship one image, you can ship a second one.

Goals

  1. Gathering should exactly match a single version of the product it is inspecting.
  2. Different products should be responsible for gathering for their own components.
  3. Introspection for clusteroperators should be largely automatic, covering a broad range of generic use-cases.
  4. A single, low-arg client command for users.
  5. In a failing cluster, gathering should be maximized to collect everything it can even when part of it fails.
  6. CEE should own the gather script itself, since they are the first consumers. 

Non-Goals

Proposal

must-gather for OpenShift is a combination of three tools:

  1. A client-side inspect command that works like a super-get. It has semantic understanding of some resources and traverses links to gather interesting information beyond the current object: pods in a namespace and the logs for those pods, for instance. Currently this is openshift-must-gather inspect, but we are porting it to oc adm as experimental in 4.3. We may change and extend arguments over time, but the intent of the command will remain.
  2. The openshift-must-gather image, produced from https://github.com/openshift/must-gather. The entry point is a /gather bash script owned by CEE (not the developers) that describes what to gather. It is tightly coupled to the openshift payload and only contains logic to gather information from that payload. We have e2e tests that make sure this functions.
  3. oc adm must-gather --image, a client-side tool that runs any must-gather compatible image by creating a pod, running the /usr/bin/gather binary, and then rsyncing back /must-gather along with the logs of the pod. Typical invocations are sketched below.
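For orientation, these are typical invocations; the component image pullspec shown here is only an illustration, and flags may evolve while the command is experimental:

```bash
# Default gathering: runs the openshift-must-gather image that matches the cluster payload.
oc adm must-gather

# Gathering with a component-specific image (pullspec is a placeholder, not a real image).
oc adm must-gather --image=quay.io/example/example-operator-must-gather:latest
```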

inspect

oc adm inspect is a noteworthy command because of the way it traverses and gathers information. Instead of being truly generic, it has a generic fallback, but it understands many resources so that you can express an intent like, "look at this cluster operator".

oc adm inspect clusteroperator/kube-apiserver does...

  1. Queue the resource
  2. Get and dump the resource (clusteroperator)
  3. Check against a well-known list of resources to do custom logic for
  4. If custom logic is found, queue more resources to iterate through
  5. Perform the custom logic. There are several special cases today:
    1. clusteroperators
      1. get all config.openshift.io resources
      2. queue all of the clusteroperator's related resources listed under .status.relatedObjects
    2. namespaces
      1. queue everything in the all API category
      2. queue secrets, configmaps, events, and PVCs (these are standard, core kube resources)
    3. routes
      1. elide secret content from routes
    4. secrets
      1. elide secret content from secrets. Some keys are known to be non-secret, though (ca.crt or tls.crt, for instance)
    5. pods
      1. get all current and previous container logs
      2. take a best guess to find a metrics endpoint
      3. take a best guess to find a healthz endpoint and all sub-healthz endpoints
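A concrete way to exercise this traversal locally; the destination directory name is arbitrary:

```bash
# Inspect a single cluster operator and everything reachable from its relatedObjects.
oc adm inspect clusteroperator/kube-apiserver --dest-dir=./inspect.local

# The same traversal also works for a namespace or an individual resource.
oc adm inspect ns/openshift-kube-apiserver --dest-dir=./inspect.local
```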

must-gather Images

To provide your own must-gather image, it must meet the following requirements (a minimal sketch of a conforming gather script follows the list):

  1. Must have a zero-arg, executable file at /usr/bin/gather that does your default gathering
  2. Must produce data to be copied back at /must-gather. The data must not contain any sensitive data. We don't strip PII, only secret information.
  3. Must produce a text file /must-gather/version that indicates the product (first line) and the version (second line, major.minor.micro.qualifier), so that programmatic analysis can be developed.
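As an illustration only (the product name, version, and gathered namespace below are placeholders, not a prescribed layout), a minimal /usr/bin/gather satisfying these requirements might look like:

```bash
#!/bin/bash
# Minimal example entry point for a custom must-gather image.
# Everything written under /must-gather is copied back to the user.
BASE_COLLECTION_PATH="/must-gather"
mkdir -p "${BASE_COLLECTION_PATH}"

# Required version marker: product on the first line, version on the second.
echo "example.com/example-operator-must-gather" > "${BASE_COLLECTION_PATH}/version"
echo "4.3.0" >> "${BASE_COLLECTION_PATH}/version"

# Gather whatever your component needs; here we lean on `oc adm inspect`
# to collect the operator's namespace, but any collection logic works.
oc adm inspect ns/example-operator --dest-dir="${BASE_COLLECTION_PATH}" || true

# Exit successfully even on partial failure so that whatever was
# collected still gets copied back (see goal 5 above).
exit 0
```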

local fall-back

If the oc adm must-gather tool's pod cannot be scheduled or run on the cluster, the tool will, after a timeout, fall back to running oc adm inspect clusteroperators locally.

User Stories [optional]

Story 1

Story 2

Implementation Details/Notes/Constraints [optional]

What are the caveats to the implementation? What are some important details that didn't come across above? Go into as much detail as necessary here. This might be a good place to talk about core concepts and how they relate.

Risks and Mitigations

What are the risks of this proposal and how do we mitigate them? Think broadly. For example, consider both security and how this will impact the larger OKD ecosystem. How will security be reviewed, and by whom? How will UX be reviewed, and by whom? Consider including folks who also work outside your immediate sub-project.

Design Details

This is subject to change, but today we do this by running the must-gather image in an init container and then having a main container that sleeps forever. We download the result and then delete the namespace to clean up.
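Roughly, and assuming hypothetical resource names, the flow that oc adm must-gather automates looks like:

```bash
# Hypothetical approximation of what `oc adm must-gather` does under the covers.
NS="openshift-must-gather-example"   # temporary namespace; the real name is generated

oc create namespace "${NS}"
# ...create a pod whose init container runs the image's /usr/bin/gather,
#    followed by a main container that sleeps so the pod stays around...
oc rsync -n "${NS}" must-gather-pod:/must-gather ./must-gather.local   # download the result
oc delete namespace "${NS}"          # cleanup
```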

Output Format

 The output of a must-gather image is up to the component producing the image. This is how openshift/must-gather is currently organized. 

├── audit_logs
│   ├── kube-apiserver
│   │   ├── zipped audit files from each master here
│   ├── openshift-apiserver
│   │   ├── zipped audit files from each master here
├── cluster-scoped-resources
│   ├── <API_GROUP_NAME>
│   │   ├── <API_RESOURCE_PLURAL>.yaml
│   │   └── <API_RESOURCE_PLURAL>
│   │       └── individually referenced resources here
│   ├── config.openshift.io
│   │   ├── authentications.yaml
│   │   ├── apiservers.yaml
│   │   ├── builds.yaml
│   │   ├── clusteroperators.yaml
│   │   ├── clusterversions.yaml
│   │   ├── consoles.yaml
│   │   ├── dnses.yaml
│   │   ├── featuregates.yaml
│   │   ├── images.yaml
│   │   ├── infrastructures.yaml
│   │   ├── ingresses.yaml
│   │   ├── networks.yaml
│   │   ├── oauths.yaml
│   │   ├── projects.yaml
│   │   ├── schedulers.yaml
│   │   └── support.yaml
│   ├── core
│   │   └── nodes
│   ├── machineconfiguration.openshift.io
│   │   ├── machineconfigpools
│   │   └── machineconfigs
│   ├── network.openshift.io
│   │   ├── clusternetworks
│   │   └── hostsubnets
│   ├── oauth.openshift.io
│   │   └── oauthclients
│   ├── operator.openshift.io
│   │   ├── authentications
│   │   ├── consoles
│   │   ├── kubeapiservers
│   │   ├── kubecontrollermanagers
│   │   ├── kubeschedulers
│   │   ├── openshiftapiservers
│   │   ├── openshiftcontrollermanagers
│   │   ├── servicecas
│   │   └── servicecatalogcontrollermanagers
│   ├── rbac.authorization.k8s.io
│   │   ├── clusterrolebindings
│   │   └── clusterroles
│   ├── samples.operator.openshift.io
│   │   └── configs
│   └── storage.k8s.io
│       └── storageclasses
├── host_service_logs
│   └── masters
│       ├── crio_service.log
│       └── kubelet_service.log
└── namespaces
    ├── <NAMESPACE>
    │   ├── <API_GROUP_NAME>
    │   |   ├── <API_RESOURCE_PLURAL>.yaml
    │   |   └── <API_RESOURCE_PLURAL>
    │   |       └── individually referenced resources here
    │   └── pods
    │       └── <POD_NAME>
    │           ├── <POD_NAME>.yaml
    │           └── <CONTAINER_NAME>
    │               └── <CONTAINER_NAME>
    │                   ├── healthz
    │                   |   └── <SUB_HEALTH>
    │                   ├── logs
    │                   |   ├── current.log
    │                   |   └── previous.log
    │                   └── metrics.json
    ├── default
    │   ├── apps
    │   │   ├── daemonsets.yaml
    │   │   ├── deployments.yaml
    │   │   ├── replicasets.yaml
    │   │   └── statefulsets.yaml
    │   ├── apps.openshift.io
    │   │   └── deploymentconfigs.yaml
    │   ├── autoscaling
    │   │   └── horizontalpodautoscalers.yaml
    │   ├── batch
    │   │   ├── cronjobs.yaml
    │   │   └── jobs.yaml
    │   ├── build.openshift.io
    │   │   ├── buildconfigs.yaml
    │   │   └── builds.yaml
    │   ├── core
    │   │   ├── configmaps.yaml
    │   │   ├── events.yaml
    │   │   ├── pods.yaml
    │   │   ├── replicationcontrollers.yaml
    │   │   ├── secrets.yaml
    │   │   └── services.yaml
    │   ├── default.yaml
    │   ├── image.openshift.io
    │   │   └── imagestreams.yaml
    │   └── route.openshift.io
    │       └── routes.yaml
...

Test Plan

There is an e2e test that makes sure the command always exits successfully and that certain aspects of the content are always present.

Graduation Criteria

Upgrade / Downgrade Strategy

The image is included in the payload, but has no content running in a cluster to upgrade.

Version Skew Strategy

The oc command must tolerate a version skew of +/- one release, like normal oc commands.

Implementation History

Drawbacks

Alternatives

Infrastructure Needed [optional]