ci-operator/templates/openshift/installer/cluster-launch-installer-e2e: Gather node console logs on AWS #6189
Conversation
+1, this would be extremely useful for debugging nodes that are not rejoining the cluster after a reboot.
Hmm, doesn't seem to have worked. Maybe I need to pull off the …
Pushed 6fdc5420a -> 8f450154f adding a …
We might need the `--latest` option. That does not explain the empty file, though.
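For context, a minimal sketch of the call being discussed, assuming the AWS CLI is installed and credentials plus region are already configured; the instance ID and artifact path are placeholders, not taken from the template:

```bash
# Placeholder instance ID and output path, for illustration only.
instance="i-0123456789abcdef0"
mkdir -p /tmp/artifacts/nodes

# --latest asks EC2 for the most recent console output instead of a cached
# snapshot; an empty file can still mean the instance produced no output yet.
aws ec2 get-console-output --instance-id "${instance}" --latest \
  --query Output --output text \
  > "/tmp/artifacts/nodes/${instance}.console.txt"
```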
Hmm, must not be installing into the …
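If the problem is the install location, a quick hedged check (assuming the same `easy_install --user` / `pip install --user` flow as the snippet reviewed further down) would be:

```bash
# --user installs put console scripts under ~/.local/bin on Linux, which is
# not on PATH by default, so `aws` may exist but still not be found.
ls "${HOME}/.local/bin" || true
command -v aws || echo "aws not on PATH yet"
export PATH="${HOME}/.local/bin:${PATH}"
command -v aws
```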
Pushed 8f450154f -> 4e1c32b98 setting …
…e: Gather node console logs on AWS

To help debug things like [1]:

    Dec 2 16:31:41.298: INFO: cluster upgrade is Failing: Cluster operator kube-apiserver is reporting a failure: NodeControllerDegraded: The master node(s) "ip-10-0-136-232.ec2.internal" not ready ... Kubelet stopped posting node status.

where a node goes down but does not come back up far enough to reconnect as a node. Eventually, we'll address this with machine-health checks, killing the non-responsive machine and automatically replacing it with a new one. That's currently waiting on an etcd operator that can handle reconnecting control-plane machines automatically. But in the short term, and possibly still in the long term, it's nice to collect what we can from the broken machine to understand why it didn't come back up. This code isn't specific to broken machines, but collecting console logs from all nodes should cover us in the broken-machine case as well.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1778904
Closer :). Pushed 4e1c32b98 -> e102a16 to address `You must specify a region. You can also configure your region by running "aws configure".`
/retest

/retest
@wking: The following tests failed, say /retest to rerun them:

…

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
```bash
export PATH="${HOME}/.local/bin:${PATH}"
easy_install --user pip  # our Python 2.7.5 is even too old for ensurepip
pip install --user awscli
export AWS_REGION="$(python -c 'import json; data = json.load(open("/tmp/artifacts/installer/metadata.json")); print(data["aws"]["region"])')"
```
Perhaps `AWS_REGION` should be stored during setup in `/tmp/artifacts` somewhere, so that other containers would not have to re-discover it (e.g. line 323).
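A minimal sketch of that suggestion, assuming the setup container already has the region in `AWS_REGION` and that `/tmp/artifacts` is the shared artifacts volume; the `aws-region.txt` file name is hypothetical, not from the PR:

```bash
# Setup container: persist the discovered region once.
echo "${AWS_REGION}" > /tmp/artifacts/installer/aws-region.txt

# Later containers (e.g. teardown): reuse the stored value instead of
# re-parsing metadata.json and re-installing helpers.
if [[ -s /tmp/artifacts/installer/aws-region.txt ]]; then
  AWS_REGION="$(cat /tmp/artifacts/installer/aws-region.txt)"
  export AWS_REGION
fi
```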
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vrutkovs, wking

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
@wking: Updated the following 3 configmaps: …

In response to this: …

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Wait, was this working? The latest CRI-O run still had: …
Oh snap, the bot labelled this as lgtm; I didn't know my GitHub approval works the same as `/lgtm`. Mind making a new PR with a fix then?
…e2e: Set AWS_DEFAULT_REGION

The command prefers that form [1], and doesn't work with AWS_REGION [2].

[1]: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-envvars.html
[2]: openshift#6189 (comment)
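A minimal sketch of the change being described, assuming the region has already been read from the installer's metadata.json as in the reviewed snippet; AWS CLI v1 honors AWS_DEFAULT_REGION, while exporting only AWS_REGION leaves it asking for a region:

```bash
# Same region discovery as before (Python 2 friendly one-liner).
region="$(python -c 'import json; print(json.load(open("/tmp/artifacts/installer/metadata.json"))["aws"]["region"])')"

# AWS CLI v1 reads AWS_DEFAULT_REGION; AWS_REGION alone triggers
# "You must specify a region."
export AWS_DEFAULT_REGION="${region}"
```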
Bringing over a number of changes which have landed in ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml as of openshift/release@016eb4ed27 (Merge pull request openshift/release#6505 from hongkailiu/clusterReaders, 2019-12-19).

One series was improved kill logic:

* openshift/release@9cd158adf3 (template: Use a more correct kill command, 2019-12-03, openshift/release#6223).
* openshift/release@d0744e520d (exit with 0 even if kill failed, 2019-12-09, openshift/release#6295).

Another series was around AWS instance console logs:

* openshift/release@e102a16d89 (ci-operator/templates/openshift/installer/cluster-launch-installer-e2e: Gather node console logs on AWS, 2019-12-02, openshift/release#6189).
* openshift/release@26fde70045 (ci-operator/templates/openshift/installer/cluster-launch-installer-e2e: Set AWS_DEFAULT_REGION, 2019-12-04, openshift/release#6249).

And there was also:

* openshift/release@cdf97164aa (templates: Add large and xlarge variants, 2019-11-25, openshift/release#6081).
* openshift/release@8cbef5e4a7 (ci-operator/templates/openshift/installer/cluster-launch-installer-e2e: Error-catching for Google OAuth pokes, 2019-12-02, openshift/release#6190).
* openshift/release@ad29eda8dd (template: Gather the prometheus target metadata during teardown, 2019-12-12, openshift/release#6379).
History of this logic:

* Initially landed for nodes in e102a16 (ci-operator/templates/openshift/installer/cluster-launch-installer-e2e: Gather node console logs on AWS, 2019-12-02, openshift#6189).
* Grew --text in 6ec5bf3 (installer artifacts: keep text version of instance output, 2020-01-02, openshift#6536).
* Grew machine handling in a469f53 (ci-operator/templates/openshift/installer/cluster-launch-installer-e2e: Gather console logs from Machine providerID too, 2020-01-29, openshift#6906).
* The node-provider-IDs creation was ported to steps in e2fd5c7 (step-registry: update deprovision step, 2020-01-30, openshift#6708), but without any consumer for the collected file.

The aws-instance-ids.txt injection allows install-time steps to register additional instances for later console collection (for a proxy instance, bootstrap instance, etc.), as sketched below.

Approvers are:

* Myself and Vadim, who have touched this logic in the past.
* Alberto, representing the machine-API space that needs console logs to debug failed-to-boot issues.
* Colin, representing the RHCOS/machine-config space that needs console logs to debug RHCOS issues.

TMPDIR documentation is based on POSIX [1].

[1]: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html#tag_08_03
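A minimal sketch of the aws-instance-ids.txt flow described above; the one-ID-per-line format and the `${SHARED_DIR}` location are assumptions, not taken from the step registry:

```bash
# Install-time step (proxy, bootstrap, ...): register an extra instance for
# later console collection.  One instance ID per line is assumed.
echo "i-0aaaaaaaaaaaaaaaa" >> "${SHARED_DIR:-/tmp/shared}/aws-instance-ids.txt"

# Gather/deprovision step: fetch console output for every registered instance.
mkdir -p /tmp/artifacts/console-logs
while read -r instance; do
  aws ec2 get-console-output --instance-id "${instance}" --latest \
    --query Output --output text \
    > "/tmp/artifacts/console-logs/${instance}.log" || true
done < "${SHARED_DIR:-/tmp/shared}/aws-instance-ids.txt"
```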
To help debug things like rhbz#1778904:

    Dec 2 16:31:41.298: INFO: cluster upgrade is Failing: Cluster operator kube-apiserver is reporting a failure: NodeControllerDegraded: The master node(s) "ip-10-0-136-232.ec2.internal" not ready ... Kubelet stopped posting node status.

where a node goes down but does not come back up far enough to reconnect as a node.

Eventually, we'll address this with machine-health checks, killing the non-responsive machine and automatically replacing it with a new one. That's currently waiting on an etcd operator that can handle reconnecting control-plane machines automatically. But in the short term, and possibly still in the long term, it's nice to collect what we can from the broken machine to understand why it didn't come back up. This code isn't specific to broken machines, but collecting console logs from all nodes should cover us in the broken-machine case as well.
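To make the approach concrete, a minimal sketch of per-node console collection, assuming AWS node `.spec.providerID` values end in the EC2 instance ID and that the AWS CLI, credentials, and region are already set up; the real template may differ in details:

```bash
mkdir -p /tmp/artifacts/nodes

# providerID looks like aws:///us-east-1b/i-0123456789abcdef0; keep the last
# path segment, i.e. the EC2 instance ID.
oc get nodes -o jsonpath='{range .items[*]}{.spec.providerID}{"\n"}{end}' |
  sed 's|.*/||' > /tmp/node-provider-IDs.txt

while read -r instance; do
  # Console output is often the only clue when a node never rejoins the cluster.
  aws ec2 get-console-output --instance-id "${instance}" --latest \
    --query Output --output text \
    > "/tmp/artifacts/nodes/${instance}.console.txt" || true
done < /tmp/node-provider-IDs.txt
```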