
pull-kubernetes-kubemark-e2e-gce-scale fails when it should not and gathers too little info #24349

Closed
MikeSpreitzer opened this issue Nov 15, 2021 · 8 comments · Fixed by #24353
Assignees: marseel
Labels: kind/bug, sig/scalability

Comments

MikeSpreitzer (Member) commented Nov 15, 2021

What happened:
I submitted a PR (kubernetes/kubernetes#106325) that changes only a unit test and asked for the pull-kubernetes-kubemark-e2e-gce-scale job to be run on it. The job timed out after 20 hours and gathered much less information than expected. #24303 helped, but some artifacts are still missing (such as apiserver logs). For an example, see https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/106325/pull-kubernetes-kubemark-e2e-gce-scale/1459459749048225792. Near the end of build-log.txt we see that both run-e2e.sh and log-dump.sh failed:

2021/11/14 06:02:22 main.go:331: Something went wrong: encountered 2 errors: [error during /home/prow/go/src/k8s.io/perf-tests/run-e2e.sh cluster-loader2 --nodes=5000 --provider=kubemark --report-dir=/logs/artifacts --testconfig=testing/load/config.yaml --testconfig=testing/access-tokens/config.yaml (interrupted): exit status 1 error during /workspace/log-dump.sh /logs/artifacts gs://sig-scalability-logs/pull-kubernetes-kubemark-e2e-gce-scale/1459459749048225792 (interrupted): exit status 1]

Before that, the log contains many complaints about things not being found.

For another example, see https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/106085/pull-kubernetes-kubemark-e2e-gce-scale/1459763740323876864 (for kubernetes/kubernetes#106085). That run failed after a bit less than 13 hours and also gathered fewer artifacts than usual.

What you expected to happen:
I expected the test to pass on harmless PRs, and in any case to gather far more diagnostic information about what happened.

How to reproduce it (as minimally and precisely as possible):
Look at the results of any recent run of that job. Try it on any PR.
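
For reference, a run can typically be requested on a kubernetes/kubernetes PR with the usual Prow trigger comment (assuming the job is wired up as an optional presubmit there, as the report implies):

/test pull-kubernetes-kubemark-e2e-gce-scale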

Please provide links to example occurrences, if any:
See above for an easy one.

Anything else we need to know?:

@MikeSpreitzer MikeSpreitzer added the kind/bug Categorizes issue or PR as related to a bug. label Nov 15, 2021
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Nov 15, 2021
MikeSpreitzer (Member, Author) commented:

@kubernetes/sig-scalability

MikeSpreitzer (Member, Author) commented:

/sig scalability

@k8s-ci-robot k8s-ci-robot added sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 15, 2021
@MikeSpreitzer MikeSpreitzer changed the title pull-kubernetes-kubemark-e2e-gce-scale still fails to collect a lot of information pull-kubernetes-kubemark-e2e-gce-scale fails when it should not and gathers too little info Nov 15, 2021
aojea (Member) commented Nov 15, 2021

The linked job run starts at:

2021/11/13 09:56:45 main.go:344: Limiting testing to 20h0m0s

However, during the test it suddenly loses the connection to the environment:

E1113 14:05:29.891404 347222 wait_for_controlled_pods.go:568] WaitForControlledPodsRunning: test-jm7xw0-9/small-deployment-101 timed out
error dialing prow@34.75.88.222:22: 'dial tcp 34.75.88.222:22: connect: connection timed out', retrying
W1113 14:07:08.081363 347222 etcd_metrics.go:208] empty etcd cert or key, using http
E1113 14:08:11.569245 347222 kubemark.go:51] error when trying to SSH to master machine. Skipping probe. error getting SSH client to prow@34.75.88.222:22: 'dial tcp 34.75.88.222:22: connect: connection timed out'
error dialing prow@34.75.88.222:22: 'dial tcp 34.75.88.222:22: connect: connection timed out', retrying
error dialing prow@34.75.88.222:22: 'dial tcp 34.75.88.222:22: connect: connection timed out', retrying
E1113 14:11:34.321274 347222 etcd_metrics.go:147] EtcdMetrics: failed to collect etcd database size

The rest of the logs are just timeouts.
@wojtek-t @spiffxp is it possible that something is cleaning up those environments or breaking the communication?
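
A quick spot-check of whether the master VM behind that address still existed at that point might look like the following (assuming access to the shared GCE project; the name filter and project placeholder are illustrative):

ssh -o ConnectTimeout=10 prow@34.75.88.222 true
gcloud compute instances list --project=<shared-scalability-project> --filter="name~kubemark-master"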

wojtek-t (Member) commented:

The project is shared across a couple of optional presubmits.

It's possible that someone triggered one of those presubmits at that time, which brought down the previous test...

wojtek-t (Member) commented:

@marseel @mborsz - FYI

marseel (Member) commented Nov 15, 2021

I found logs that confirm the master was deleted around the time when the connection-timed-out errors started:

compute.googleapis.com
v1.compute.instances.delete
…ces/e2e-e4d7a2f0f1-93f0a-kubemark-master
pr-kubekins@kubernetes-je…
audit_log, method: "v1.compute.instances.delete", principal_email: "pr-kubekins@kubernetes-jenkins-pull.iam.gserviceaccount.com"
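
For reference, a Cloud Audit Logs query along these lines can surface such deletions (the project name is a placeholder and the time window is illustrative):

gcloud logging read 'protoPayload.methodName="v1.compute.instances.delete" AND protoPayload.authenticationInfo.principalEmail="pr-kubekins@kubernetes-jenkins-pull.iam.gserviceaccount.com"' --project=<shared-scalability-project> --freshness=2d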

I will migrate these jobs today to the new infrastructure, and we will have a Boskos pool, so it shouldn't happen again.

/assign @marseel

wojtek-t (Member) commented:

Thanks @marseel!

MikeSpreitzer (Member, Author) commented Nov 15, 2021

Was everybody collectively responsible for having only one pull-kubernetes-kubemark-e2e-gce-scale job running at a time? If so then it was a well-kept secret. Are there any other secrets like that?
