| title | authors | reviewers | approvers | creation-date | last-updated | status | see-also | replaces | superseded-by |
|---|---|---|---|---|---|---|---|---|---|
| Must-Gather | | | | 2019-09-09 | 2019-09-09 | implemented | | | |


- From https://github.com/openshift/enhancements/pull/7/must-gather.md
- From https://github.com/openshift/enhancements/pull/18/must-gather.md
- From https://github.com/openshift/enhancements/pull/77/must-gather.md
- From https://github.com/openshift/enhancements/pull/135/must-gather.md

- Enhancement is `implementable`
- Design details are appropriately documented from clear requirements
- Test plan is defined
- Graduation criteria for dev preview, tech preview, GA
- User-facing documentation is created in [openshift/docs]

To debug something broken in the cluster, it is important to have a single command that an unskilled customer can run to gather all the information we may need to solve the problem. If you're familiar with sosreport, the idea is to have that, but focused on a Kubernetes cluster instead of a host. We need to avoid the version-skew complexity and the scaling problems inherent in a shared repo with input from multiple products.

- Gather everything. Software breaks and you aren't smart enough to know exactly what you need to debug the problem ahead of time. You figure that out after you've debugged the problem. This tool is about the first shotgun gathering, so you only have to ask the customer once.
- It must be simple. You're gathering because your software is buggy and hard to use. The more complex your gathering software, the more likely it is that the gathering software fails too. Simplify your gathering software by only using a matching version of the gathering tool. This simplifies code and test matrices so that your tool always works. For instance, OpenShift payloads include the exact level of gathering used to debug that payload.
- Own your destiny. You own shipping your own software. If you can be trusted to ship your software, you can be trusted to ship a gathering tool to match it; don't let your gathering be gated by another product. It may seem easier to start that way, but ultimately you'll end up constrained by different motivations, styles, and cadences. If you can ship one image, you can ship a second one.

- Gathering should exactly match a single version of the product it is inspecting.
- Different products should be responsible for gathering for their own components.
- Introspection for clusteroperators should be largely automatic, covering a broad range of generic use-cases.
- A single, low-arg client command for users.
- In a failing cluster, gathering should be maximized to collect everything it can even when part of it fails.
- CEE should own the gather script itself, since they are the first consumers. 


`must-gather` for OpenShift is a combination of three tools:

- A client-side `inspect` command that works like a super-get. It has semantic understanding of some resources and traverses links to get interesting information beyond the current resource: pods in namespaces and the logs for those pods, for instance. Currently this is `openshift-must-gather inspect`, but we are porting this to `oc adm` as experimental in 4.3. We may change and extend arguments over time, but the intent of the command will remain.
- The `openshift-must-gather` image, produced from https://github.com/openshift/must-gather. The entry point is a `/gather` bash script owned by CEE (not the developers) that describes what to gather. It is tightly coupled to the OpenShift payload and only contains logic to gather information from that payload. We have e2e tests that make sure this functions.
- `oc adm must-gather --image`, a client-side tool that runs any must-gather compatible image by creating a pod, running the `/usr/bin/gather` binary, rsyncing back the `/must-gather` directory, and including the logs of the pod. An example invocation is shown below.
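
For illustration, typical invocations look like the following. The component image reference is hypothetical, and the exact flags and output directory naming of `oc adm must-gather` may vary between `oc` releases:

```sh
# Default gathering: run the payload's must-gather image and copy the
# results into a local directory (typically named must-gather.local.<suffix>).
oc adm must-gather

# Run a component-provided must-gather image instead of the default one.
# The image reference here is only an example.
oc adm must-gather --image=quay.io/example/cool-operator-must-gather:latest

# Choose where the gathered data lands on the local machine.
oc adm must-gather --dest-dir=./gather-output
```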

`oc adm inspect` is a noteworthy command because of the way that it traverses and gathers information. Instead of being truly generic, it has a generic fallback, but it understands many resources so that you can express an intent like "look at this cluster operator". `oc adm inspect clusteroperator/kube-apiserver` does the following:

- Queue the resource
- Get and dump the resource (clusteroperator)
- Check against a well-known list of resources to do custom logic for
- If custom logic is found, queue more resources to iterate through
- Perform the custom logic

There are several special cases today (a rough command-line sketch of the traversal follows the list below):
- clusteroperators
  - get all config.openshift.io resources
  - queue all of the clusteroperator's related resources listed under `.status.relatedObjects`
- namespaces
  - queue everything in the `all` API category
  - queue secrets, configmaps, events, and PVCs (these are standard, core kube resources)
- routes
  - elide secret content from routes
- secrets
  - elide secret content from secrets. Some keys are known to be non-secret though (ca.crt or tls.crt, for instance)
- pods
  - get all current and previous container logs
  - take a best guess to find a metrics endpoint
  - take a best guess to find a healthz endpoint and all sub-healthz endpoints
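
The traversal itself is implemented in Go inside `oc`; the sketch below only approximates those steps with plain `oc` commands to make them concrete. The namespace and pod names are examples, and unlike `inspect` these raw commands do not elide secret content:

```sh
# Get and dump the requested resource.
oc get clusteroperator/kube-apiserver -o yaml

# Custom logic for clusteroperators: follow .status.relatedObjects to find
# the namespaces and other resources to queue next.
oc get clusteroperator/kube-apiserver \
  -o jsonpath='{range .status.relatedObjects[*]}{.group}{" "}{.resource}{" "}{.name}{"\n"}{end}'

# Custom logic for namespaces: the "all" API category plus the standard
# core resources (inspect elides secret values; this raw command does not).
oc get all,secrets,configmaps,events,persistentvolumeclaims \
  -n openshift-kube-apiserver -o yaml

# Custom logic for pods: current and previous container logs.
oc logs -n openshift-kube-apiserver kube-apiserver-master-0 --all-containers
oc logs -n openshift-kube-apiserver kube-apiserver-master-0 --all-containers --previous
```
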
To provide your own must-gather image, it must do the following (a minimal example script is sketched after this list):

- Have a zero-arg, executable file at `/usr/bin/gather` that does your default gathering
- Produce data to be copied back at `/must-gather`. The data must not contain any sensitive data; we don't strip PII information, only secret information.
- Produce a text file at `/must-gather/version` that indicates the product (first line) and the version (second line, `major.minor.micro.qualifier`), so that programmatic analysis can be developed.
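
Here is a minimal sketch of a conforming gather entry point. The product name, version, namespace, and clusteroperator name are all invented for the example; a real image gathers whatever is relevant to its component:

```sh
#!/bin/bash
# Minimal example /usr/bin/gather script (all names are illustrative).
set -euo pipefail

BASE_COLLECTION_PATH=/must-gather
mkdir -p "${BASE_COLLECTION_PATH}"

# First line: product, second line: version (major.minor.micro.qualifier),
# so that programmatic analysis can identify this output.
printf 'example-operator\n4.3.0\n' > "${BASE_COLLECTION_PATH}/version"

# Reuse the generic inspect traversal for the component's resources.
oc adm inspect --dest-dir="${BASE_COLLECTION_PATH}" ns/openshift-example-operator
oc adm inspect --dest-dir="${BASE_COLLECTION_PATH}" clusteroperator/example-operator

# Flush writes so the data is complete before it is copied off the pod.
sync
```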

If the `oc adm must-gather` tool's pod cannot be scheduled or run on the cluster, the `oc adm must-gather` tool will, after a timeout, fall back to running `oc adm inspect clusteroperators` locally.
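
That fallback is roughly equivalent to running the inspect command directly from the client machine; the destination directory below is just an example:

```sh
# Client-side fallback gathering when the gather pod cannot run in the cluster.
oc adm inspect clusteroperators --dest-dir=./inspect.local
```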



What are the caveats to the implementation? What are some important details that didn't come across above? Go into as much detail as necessary here. This might be a good place to talk about core concepts and how they relate.

What are the risks of this proposal and how do we mitigate them? Think broadly. For example, consider both security and how this will impact the larger OKD ecosystem. How will security be reviewed and by whom? How will UX be reviewed and by whom? Consider including folks that also work outside your immediate sub-project.
This is subject to change, but today we do this by running the must-gather image in an init container and then having a container that sleeps forever. We download the result and then delete the namespace to clean up.
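
In rough terms, with illustrative names (the real namespace and pod names are generated by `oc`, and this flow is, as noted, subject to change), the client-side flow looks something like:

```sh
# 1. Create a temporary namespace and a pod whose init container runs the
#    must-gather image; the main container just sleeps so the pod stays
#    around for the copy step. (Pod creation is elided here for brevity.)
oc create namespace must-gather-example

# 2. Copy the gathered data off the pod once the init container finishes.
mkdir -p ./must-gather.local
oc rsync -n must-gather-example must-gather-pod:/must-gather ./must-gather.local

# 3. Clean up by deleting the temporary namespace.
oc delete namespace must-gather-example
```
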
 The output of a must-gather image is up to the component producing the image. This is how openshift/must-gather is currently organized. 
├── audit_logs
│ ├── kube-apiserver
│ │ ├── zipped audit files from each master here
│ ├── openshift-apiserver
│ │ ├── zipped audit files from each master here
├── cluster-scoped-resources
│ ├── <API_GROUP_NAME>
│ │ ├── <API_RESOURCE_PLURAL>.yaml
│ │ └── <API_RESOURCE_PLURAL>
│ │ └── individually referenced resources here
│ ├── config.openshift.io
│ │ ├── authentications.yaml
│ │ ├── apiservers.yaml
│ │ ├── builds.yaml
│ │ ├── clusteroperators.yaml
│ │ ├── clusterversions.yaml
│ │ ├── consoles.yaml
│ │ ├── dnses.yaml
│ │ ├── featuregates.yaml
│ │ ├── images.yaml
│ │ ├── infrastructures.yaml
│ │ ├── ingresses.yaml
│ │ ├── networks.yaml
│ │ ├── oauths.yaml
│ │ ├── projects.yaml
│ │ ├── schedulers.yaml
│ │ └── support.yaml
│ ├── core
│ │ └── nodes
│ ├── machineconfiguration.openshift.io
│ │ ├── machineconfigpools
│ │ └── machineconfigs
│ ├── network.openshift.io
│ │ ├── clusternetworks
│ │ └── hostsubnets
│ ├── oauth.openshift.io
│ │ └── oauthclients
│ ├── operator.openshift.io
│ │ ├── authentications
│ │ ├── consoles
│ │ ├── kubeapiservers
│ │ ├── kubecontrollermanagers
│ │ ├── kubeschedulers
│ │ ├── openshiftapiservers
│ │ ├── openshiftcontrollermanagers
│ │ ├── servicecas
│ │ └── servicecatalogcontrollermanagers
│ ├── rbac.authorization.k8s.io
│ │ ├── clusterrolebindings
│ │ └── clusterroles
│ ├── samples.operator.openshift.io
│ │ └── configs
│ └── storage.k8s.io
│ └── storageclasses
├── host_service_logs
│ └── masters
│ ├── crio_service.log
│ └── kubelet_service.log
└── namespaces
├── <NAMESPACE>
│ ├── <API_GROUP_NAME>
│ │ ├── <API_RESOURCE_PLURAL>.yaml
│ │ └── <API_RESOURCE_PLURAL>
│ │ └── individually referenced resources here
│ └── pods
│ └── <POD_NAME>
│ ├── <POD_NAME>.yaml
│ └── <CONTAINER_NAME>
│ └── <CONTAINER_NAME>
│ ├── healthz
│ │ └── <SUB_HEALTH>
│ ├── logs
│ │ ├── current.log
│ │ └── previous.log
│ └── metrics.json
├── default
│ ├── apps
│ │ ├── daemonsets.yaml
│ │ ├── deployments.yaml
│ │ ├── replicasets.yaml
│ │ └── statefulsets.yaml
│ ├── apps.openshift.io
│ │ └── deploymentconfigs.yaml
│ ├── autoscaling
│ │ └── horizontalpodautoscalers.yaml
│ ├── batch
│ │ ├── cronjobs.yaml
│ │ └── jobs.yaml
│ ├── build.openshift.io
│ │ ├── buildconfigs.yaml
│ │ └── builds.yaml
│ ├── core
│ │ ├── configmaps.yaml
│ │ ├── events.yaml
│ │ ├── pods.yaml
│ │ ├── replicationcontrollers.yaml
│ │ ├── secrets.yaml
│ │ └── services.yaml
│ ├── default.yaml
│ ├── image.openshift.io
│ │ └── imagestreams.yaml
│ └── route.openshift.io
│ └── routes.yaml
...
There is an e2e test that makes sure the command always exits successfully and that certain aspects of the content are always present.
The image is included in the payload, but has no content running in a cluster to upgrade.
The `oc` command must skew +/- one like normal commands.