
pull-kubernetes-kubemark-e2e-gce-scale fails when it should not and gathers too little info #24349

Closed
MikeSpreitzer opened this issue Nov 15, 2021 · 8 comments · Fixed by #24353
Assignees: marseel
Labels: kind/bug, sig/scalability

Comments

MikeSpreitzer (Member) commented Nov 15, 2021

What happened:
I submitted a PR (kubernetes/kubernetes#106325) that changes only a unit test and asked for the pull-kubernetes-kubemark-e2e-gce-scale job to be run on it. The job timed out after 20 hours and gathered much less information than expected. #24303 helped, but some artifacts are still missing (such as apiserver logs). For an example, see https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/106325/pull-kubernetes-kubemark-e2e-gce-scale/1459459749048225792. Near the end of build-log.txt we see that both run-e2e.sh and log-dump.sh failed:

2021/11/14 06:02:22 main.go:331: Something went wrong: encountered 2 errors: [error during /home/prow/go/src/k8s.io/perf-tests/run-e2e.sh cluster-loader2 --nodes=5000 --provider=kubemark --report-dir=/logs/artifacts --testconfig=testing/load/config.yaml --testconfig=testing/access-tokens/config.yaml (interrupted): exit status 1 error during /workspace/log-dump.sh /logs/artifacts gs://sig-scalability-logs/pull-kubernetes-kubemark-e2e-gce-scale/1459459749048225792 (interrupted): exit status 1]

Before that, the log contains many complaints about things not being found.

For another example, see https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/106085/pull-kubernetes-kubemark-e2e-gce-scale/1459763740323876864 (for kubernetes/kubernetes#106085). That run failed after a bit less than 13 hours and also gathered fewer artifacts than usual.

What you expected to happen:
I expected the test to pass on harmless PRs, and in any case to gather far more diagnostic information about what happened.

How to reproduce it (as minimally and precisely as possible):
Look at the results of any recent run of that job. Try it on any PR.
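
For reference, a run can typically be requested on a kubernetes/kubernetes PR with the usual Prow trigger comment (assuming the job is wired up as an optional presubmit there, as the report implies):

/test pull-kubernetes-kubemark-e2e-gce-scale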

Please provide links to example occurrences, if any:
See above for an easy one.

Anything else we need to know?:

@MikeSpreitzer MikeSpreitzer added the kind/bug Categorizes issue or PR as related to a bug. label Nov 15, 2021
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Nov 15, 2021
MikeSpreitzer (Member, Author) commented:

@kubernetes/sig-scalability

MikeSpreitzer (Member, Author) commented:

/sig scalability

@k8s-ci-robot k8s-ci-robot added sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 15, 2021
@MikeSpreitzer MikeSpreitzer changed the title pull-kubernetes-kubemark-e2e-gce-scale still fails to collect a lot of information pull-kubernetes-kubemark-e2e-gce-scale fails when it should not and gathers too little info Nov 15, 2021
aojea (Member) commented Nov 15, 2021

The linked job run starts at:

2021/11/13 09:56:45 main.go:344: Limiting testing to 20h0m0s

However, during the test it suddenly loses the connection to the environment:

E1113 14:05:29.891404 347222 wait_for_controlled_pods.go:568] WaitForControlledPodsRunning: test-jm7xw0-9/small-deployment-101 timed out
error dialing prow@34.75.88.222:22: 'dial tcp 34.75.88.222:22: connect: connection timed out', retrying
W1113 14:07:08.081363 347222 etcd_metrics.go:208] empty etcd cert or key, using http
E1113 14:08:11.569245 347222 kubemark.go:51] error when trying to SSH to master machine. Skipping probe. error getting SSH client to prow@34.75.88.222:22: 'dial tcp 34.75.88.222:22: connect: connection timed out'
error dialing prow@34.75.88.222:22: 'dial tcp 34.75.88.222:22: connect: connection timed out', retrying
error dialing prow@34.75.88.222:22: 'dial tcp 34.75.88.222:22: connect: connection timed out', retrying
E1113 14:11:34.321274 347222 etcd_metrics.go:147] EtcdMetrics: failed to collect etcd database size

The rest of the logs are just timeouts.
@wojtek-t @spiffxp is it possible that something is cleaning up those environments or breaking the communication?
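
A quick spot-check of whether the master VM behind that address still existed at that point might look like the following (assuming access to the shared GCE project; the name filter and project placeholder are illustrative):

ssh -o ConnectTimeout=10 prow@34.75.88.222 true
gcloud compute instances list --project=<shared-scalability-project> --filter="name~kubemark-master"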

wojtek-t (Member) commented:

The project is shared across a couple of optional presubmits.

It's possible that someone triggered one of those presubmits at that time, which brought down the previous test...

wojtek-t (Member) commented:

@marseel @mborsz - FYI

marseel (Member) commented Nov 15, 2021

I found logs that confirm the master was deleted around the time when the connection-timed-out errors started:

compute.googleapis.com
v1.compute.instances.delete
…ces/e2e-e4d7a2f0f1-93f0a-kubemark-master
pr-kubekins@kubernetes-je…
audit_log, method: "v1.compute.instances.delete", principal_email: "pr-kubekins@kubernetes-jenkins-pull.iam.gserviceaccount.com"
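
For reference, a Cloud Audit Logs query along these lines can surface such deletions (the project name is a placeholder and the time window is illustrative):

gcloud logging read 'protoPayload.methodName="v1.compute.instances.delete" AND protoPayload.authenticationInfo.principalEmail="pr-kubekins@kubernetes-jenkins-pull.iam.gserviceaccount.com"' --project=<shared-scalability-project> --freshness=2d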

I will migrate these jobs today to the new infrastructure, and we will have a Boskos pool, so it shouldn't happen again.

/assign @marseel

wojtek-t (Member) commented:

Thanks @marseel!

MikeSpreitzer (Member, Author) commented Nov 15, 2021

Was everybody collectively responsible for having only one pull-kubernetes-kubemark-e2e-gce-scale job running at a time? If so then it was a well-kept secret. Are there any other secrets like that?
