
[release-1.22] pull-kubernetes-integration is failing #105436

Closed
ialidzhikov opened this issue Oct 3, 2021 · 45 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
kind/failing-test: Categorizes issue or PR as related to a consistently or frequently failing test.
priority/critical-urgent: Highest priority. Must be actively worked on as someone's top priority right now.
sig/api-machinery: Categorizes an issue or PR as relevant to SIG API Machinery.
sig/testing: Categorizes an issue or PR as relevant to SIG Testing.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@ialidzhikov (Contributor)

Which jobs are failing:

pull-kubernetes-integration

See the runs related to #105139 - https://prow.k8s.io/pr-history/?org=kubernetes&repo=kubernetes&pr=105139

Which test(s) are failing:

Not clear. For some of the runs I don't even see a particular test failure.

Since when has it been failing:

Not clear.

Testgrid link:

See the runs related to #105139 - https://prow.k8s.io/pr-history/?org=kubernetes&repo=kubernetes&pr=105139

Reason for failure:

Not clear.

Relevant SIG

/sig testing

@ialidzhikov ialidzhikov added the kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. label Oct 3, 2021
@k8s-ci-robot k8s-ci-robot added sig/testing Categorizes an issue or PR as relevant to SIG Testing. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 3, 2021
@ehashman (Member) commented Oct 4, 2021

/priority critical-urgent
/triage accepted

https://storage.googleapis.com/k8s-triage/index.html?ci=0&pr=1&job=pull-kubernetes-integration

#105452 (comment)

@k8s-ci-robot k8s-ci-robot added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 4, 2021
@spiffxp (Member) commented Oct 4, 2021

Maybe related: we saw the presubmit timing out a bunch back in July, and it seems like that mostly stopped by 2021-07-09. But we never closed out that investigation, and I suspect these are timeouts, based on spyglass lying (its UI shows pass but the job is failing), ref: #103512

@ehashman (Member) commented Oct 4, 2021

Could this be networking related?

W1004 19:01:13.392630  118232 reflector.go:441] k8s.io/client-go/informers/factory.go:134: watch of *v1beta1.PriorityLevelConfiguration ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
E1004 19:01:07.379813  118232 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Endpoints: Get "https://127.0.0.1:44191/api/v1/endpoints?allowWatchBookmarks=true&resourceVersion=19603&timeout=6m30s&timeoutSeconds=390&watch=true": http2: client connection lost
E1004 19:01:22.793215  118232 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.ServiceAccount: Get "https://127.0.0.1:42523/api/v1/serviceaccounts?allowWatchBookmarks=true&resourceVersion=20170&timeout=5m19s&timeoutSeconds=319&watch=true": http2: client connection lost
E1004 19:01:16.434912  118232 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.ServiceAccount: Get "https://127.0.0.1:44191/api/v1/serviceaccounts?allowWatchBookmarks=true&resourceVersion=19610&timeout=7m57s&timeoutSeconds=477&watch=true": http2: client connection lost
W1004 19:01:14.622372  118232 reflector.go:441] k8s.io/kube-aggregator/pkg/client/informers/externalversions/factory.go:117: watch of *v1.APIService ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
E1004 19:01:37.310019  118232 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Pod: Get "https://127.0.0.1:35693/api/v1/pods?allowWatchBookmarks=true&resourceVersion=24561&timeout=5m33s&timeoutSeconds=333&watch=true": http2: client connection lost
E1004 19:01:25.106278  118232 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.ValidatingWebhookConfiguration: Get "https://127.0.0.1:43013/apis/admissionregistration.k8s.io/v1/validatingwebhookconfigurations?allowWatchBookmarks=true&resourceVersion=18618&timeout=5m3s&timeoutSeconds=303&watch=true": http2: client connection lost
E1004 19:01:33.548156  118232 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Service: Get "https://127.0.0.1:37093/api/v1/services?allowWatchBookmarks=true&resourceVersion=21826&timeout=6m39s&timeoutSeconds=399&watch=true": http2: client connection lost
E1004 19:01:33.127381  118232 reflector.go:138] k8s.io/kubernetes/pkg/controlplane/controller/clusterauthenticationtrust/cluster_authentication_trust_controller.go:444: Failed to watch *v1.ConfigMap: Get "https://127.0.0.1:45715/api/v1/namespaces/kube-system/configmaps?allowWatchBookmarks=true&resourceVersion=818&timeout=8m30s&timeoutSeconds=510&watch=true": http2: client connection lost
E1004 19:01:43.283532  118232 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.ValidatingWebhookConfiguration: Get "https://127.0.0.1:37093/apis/admissionregistration.k8s.io/v1/validatingwebhookconfigurations?allowWatchBookmarks=true&resourceVersion=21114&timeout=7m16s&timeoutSeconds=436&watch=true": http2: client connection lost
E1004 19:01:41.620012  118232 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.MutatingWebhookConfiguration: Get "https://127.0.0.1:43031/apis/admissionregistration.k8s.io/v1/mutatingwebhookconfigurations?allowWatchBookmarks=true&resourceVersion=19493&timeout=7m1s&timeoutSeconds=421&watch=true": http2: client connection lost
E1004 19:01:49.585794  118232 reflector.go:138] k8s.io/kube-aggregator/pkg/client/informers/externalversions/factory.go:117: Failed to watch *v1.APIService: Get "https://127.0.0.1:43013/apis/apiregistration.k8s.io/v1/apiservices?allowWatchBookmarks=true&resourceVersion=18814&timeout=7m42s&timeoutSeconds=462&watch=true": http2: client connection lost
W1004 19:01:42.463442  118232 reflector.go:441] k8s.io/client-go/informers/factory.go:134: watch of *v1.Namespace ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
E1004 19:01:45.587331  118232 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.ServiceAccount: Get "https://127.0.0.1:39851/api/v1/serviceaccounts?allowWatchBookmarks=true&resourceVersion=13750&timeout=5m17s&timeoutSeconds=317&watch=true": http2: client connection lost
W1004 19:01:59.024099  118232 reflector.go:441] k8s.io/client-go/informers/factory.go:134: watch of *v1beta1.FlowSchema ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W1004 19:02:01.132622  118232 clientconn.go:1326] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: i/o timeout". Reconnecting...
W1004 19:01:54.388525  118232 reflector.go:441] k8s.io/client-go/informers/factory.go:134: watch of *v1.StorageClass ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
SIGQUIT: quit
PC=0x475ee1 m=0 sigcode=0

<giant stacktrace>

from https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/105452/pull-kubernetes-integration/1445094410348924928/build-log.txt (note: the file is 114 MB)
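
For reference, the reflector warnings above come from client-go shared informers whose long-lived watch connections to the local test apiserver were dropped. A minimal, hypothetical sketch of the kind of informer setup that emits them (the port and resync period are placeholders, not the integration tests' actual wiring):

```go
package main

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// The integration tests talk to an apiserver on a random local port;
	// 44191 is just one of the ports from the log above. TLS/auth config omitted.
	cfg := &rest.Config{Host: "https://127.0.0.1:44191"}
	client := kubernetes.NewForConfigOrDie(cfg)

	stop := make(chan struct{})
	defer close(stop)

	// Each informer runs a reflector that issues long-lived watch requests;
	// when the underlying HTTP/2 connection drops, reflector.go logs errors like
	// the ones above and re-establishes the watch.
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	factory.Core().V1().Endpoints().Informer()
	factory.Core().V1().ServiceAccounts().Informer()
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
}
```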

@ehashman (Member) commented Oct 4, 2021

It looks like === FAIL: test/integration/apiserver/admissionwebhook TestWebhookTimeoutWithoutWatchCache (unknown) is timing out and everything afterwards is breaking

build-log.txt

This is a reasonably sized snippet of the 110+ MB build log from https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/105452/pull-kubernetes-integration/1445094410348924928/build-log.txt

@ehashman (Member) commented Oct 4, 2021

/sig api-machinery
cc @kubernetes/sig-api-machinery-bugs

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. kind/bug Categorizes issue or PR as related to a bug. labels Oct 4, 2021
@spiffxp (Member) commented Oct 4, 2021

tl;dr suggest trying to see what changed ~2021-09-27 since that seems to be around when the job stopped passing

Scraped the last 800 runs of pull-kubernetes-integration and filtered down to those that ran against release-1.22. First few entries from that list:

1440397390203850752, 1632252591, 1632254857, SUCCESS, presubmit, release-1.22, https://github.com/kubernetes/kubernetes/pull/105154, https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/105154/pull-kubernetes-integration/1440397390203850752
1440505394534289408, 1632278338, 1632280852, FAILURE, presubmit, release-1.22, https://github.com/kubernetes/kubernetes/pull/105137, https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/105137/pull-kubernetes-integration/1440505394534289408
1440676558653100032, 1632319159, 1632321510, SUCCESS, presubmit, release-1.22, https://github.com/kubernetes/kubernetes/pull/105137, https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/105137/pull-kubernetes-integration/1440676558653100032
1440911029000212480, 1632375145, 1632376170, FAILURE, presubmit, release-1.22, https://github.com/kubernetes/kubernetes/pull/105139, https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/105139/pull-kubernetes-integration/1440911029000212480
1442364975170457600, 1632721783, 1632724274, FAILURE, presubmit, release-1.22, https://github.com/kubernetes/kubernetes/pull/104988, https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/104988/pull-kubernetes-integration/1442364975170457600
1442415842519158784, 1632733833, 1632736174, SUCCESS, presubmit, release-1.22, https://github.com/kubernetes/kubernetes/pull/104988, https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/104988/pull-kubernetes-integration/1442415842519158784
1442486689749536768, 1632750738, 1632752926, SUCCESS, presubmit, release-1.22, https://github.com/kubernetes/kubernetes/pull/105277, https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/105277/pull-kubernetes-integration/1442486689749536768
1442958977880035328, 1632863313, 1632865629, FAILURE, presubmit, release-1.22, https://github.com/kubernetes/kubernetes/pull/105139, https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/105139/pull-kubernetes-integration/1442958977880035328
1442974113156042752, 1632867067, 1632869682, FAILURE, presubmit, release-1.22, https://github.com/kubernetes/kubernetes/pull/105331, https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/105331/pull-kubernetes-integration/1442974113156042752
1443111120528216064, 1632899587, 1632902313, FAILURE, presubmit, release-1.22, https://github.com/kubernetes/kubernetes/pull/105139, https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/105139/pull-kubernetes-integration/1443111120528216064

#105277 had the last successful run, https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/105277/pull-kubernetes-integration/1442486689749536768, started at 2021-09-27 13:52:18 +0000 UTC

#104988 is the PR that merged closest to that, at Mon Sep 27 02:51:36 2021 -0700, but if I understand UTC offsets correctly, that is a few hours before the last successful run

#105154 is the next PR to merge but appears unrelated; it updates storage e2es

#105139 has the first failing run, https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/105139/pull-kubernetes-integration/1442958977880035328, started at 2021-09-28 21:08:33 +0000 UTC
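
A minimal sketch of how runs like these can be checked programmatically, assuming prow's usual finished.json layout (timestamp, result, passed) in the public kubernetes-jenkins GCS bucket; the two build IDs are taken from the table above:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type finished struct {
	Timestamp int64  `json:"timestamp"`
	Result    string `json:"result"`
	Passed    bool   `json:"passed"`
}

func main() {
	runs := []struct{ pr, build string }{
		{"105277", "1442486689749536768"}, // last SUCCESS on release-1.22 per the table above
		{"105139", "1442958977880035328"}, // first FAILURE
	}
	for _, r := range runs {
		url := fmt.Sprintf(
			"https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/%s/pull-kubernetes-integration/%s/finished.json",
			r.pr, r.build)
		resp, err := http.Get(url)
		if err != nil {
			fmt.Println(r.build, "error:", err)
			continue
		}
		var f finished
		if err := json.NewDecoder(resp.Body).Decode(&f); err != nil {
			fmt.Println(r.build, "decode error:", err)
		} else {
			fmt.Println(r.build, f.Result, time.Unix(f.Timestamp, 0).UTC())
		}
		resp.Body.Close()
	}
}
```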

@liggitt (Member) commented Oct 4, 2021

https://testgrid.k8s.io/sig-release-1.22-blocking#integration-1.22&width=5&graph-metrics=test-duration-minutes shows test variance increasing a lot in the 9/27 - 9/28 timeframe

@liggitt (Member) commented Oct 4, 2021

Comparing the last successful and first failed jobs' podinfo, the first failed job has:

				"prow.k8s.io/retest": "true",

what does that do?

@spiffxp (Member) commented Oct 4, 2021

@liggitt

https://testgrid.k8s.io/sig-release-1.22-blocking#integration-1.22&width=5&graph-metrics=test-duration-minutes shows test variance increasing a lot in the 9/27 - 9/28 timeframe

I think that's the integration tests becoming less I/O-limited; that was around the time we migrated k8s-infra-prow-build to a local-ssd-backed node pool, ref: kubernetes/k8s.io#1187 (comment)

@liggitt (Member) commented Oct 4, 2021

				"prow.k8s.io/retest": "true",

what does that do?

nm, it just records whether the job was triggered by a retest (https://github.com/kubernetes/test-infra/blob/master/prow/kube/prowjob.go#L64-L65)

@liggitt (Member) commented Oct 4, 2021

kubekins jumped from gcr.io/k8s-staging-test-infra/kubekins-e2e:v20210917-ee1e7c845b-1.22 to gcr.io/k8s-staging-test-infra/kubekins-e2e:v20210928-2a55334641-1.22

no other notable changes between the job podinfos

@spiffxp (Member) commented Oct 4, 2021

@liggitt (Member) commented Oct 4, 2021

opened a test PR with the last-merged PR to release-1.22 reverted at #105473

I'm not sure how that PR would have passed the presubmit to get in if it were the issue, but just to double-check.

edit: test-integration failed in the revert, so it doesn't look like the change at HEAD is related

@spiffxp (Member) commented Oct 4, 2021

Looking through kubernetes/test-infra@ee1e7c8...2a55334 I'm pretty sure kubernetes/test-infra#23757 would have found its way into the image, which mangles iptables a bit if docker-in-docker is enabled.

I'm not sure that actually takes effect for this job though, and I'd be confused why it only affects the release-1.22 branch.

@spiffxp (Member) commented Oct 4, 2021

It does take effect; I see "configure iptables to set MTU" at the top of https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/105139/pull-kubernetes-integration/1442958977880035328

The Google Cloud incident that motivated this change is over (kubernetes/test-infra#23741), so theoretically we should be safe to revert this change. I had been dragging my feet because I didn't have confidence we weren't going to run into it again.

I remain confused as to why it would only affect release-1.22, if this is the culprit

EDIT: I opened a PR to start us toward reverting the change: kubernetes/test-infra#23885

Will need to build a kubekins from the new bootstrap image, and can then selectively disable the workaround for the integration job via an env var change to see if that makes a difference

Tomorrow I would be willing to set the env var to false to disable the workaround for all jobs

@spiffxp (Member) commented Oct 4, 2021

Another angle to take: why is the periodic / CI job passing where the presubmit is failing?

  • The CI job has a memory limit of 20Gi where the presubmit has a memory limit of 15Gi
  • The CI job runs with env SHORT=--short=false

I would be more inclined to lend credence to these differences if I saw presubmit/periodic differences in other release branches
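
For context on the SHORT=--short=false difference above: Kubernetes integration tests gate their slow cases on Go's standard -short flag, so the CI job runs more tests than the presubmit. A generic illustration of that mechanism (not an actual Kubernetes test):

```go
package integration_test

import (
	"testing"
	"time"
)

// With -short (what the presubmit effectively runs) the slow path is skipped;
// the CI job's SHORT=--short=false runs it.
func TestSlowIntegrationCase(t *testing.T) {
	if testing.Short() {
		t.Skip("skipping slow test in -short mode")
	}
	time.Sleep(2 * time.Second) // stand-in for an expensive scenario
}
```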

@liggitt (Member) commented Oct 6, 2021

  • drop parallelism by setting KUBE_INTEGRATION_TEST_MAX_CONCURRENCY=1, failed with a 357MB junit.xml

that's really surprising to me ... do you have a link to that run? the parallel runs/tests were what I saw exhausting ephemeral ports in the past

@spiffxp (Member) commented Oct 6, 2021

5197769 - dropping concurrency down to 1 ended up with a 357MB (!!) junit xml file which spyglass is refusing to parse (as it should, that's way too large) (ref: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/105488/pull-kubernetes-integration/1445512318795386880)

Grepping for ^FAIL in the log, I pulled out:

test/integration/client TestExecPluginRotationViaInformer
test/integration/controlplane TestStatus
test/integration/pods TestPodEphemeralContainersDisabled

@liggitt (Member) commented Oct 6, 2021

looking at the stdout in the serial run, I see panic.go only in TestQuota:

grep panic.go serialrun.stdout 
{"Time":"2021-10-05T22:36:33.364456722Z","Action":"output","Package":"k8s.io/kubernetes/test/integration/quota","Test":"TestQuota","Output":"I1005 22:36:33.364376   64799 panic.go:965] \"HTTP\" verb=\"POST\" URI=\"/api/v1/namespaces/quotaed/pods\" latency=\"2m25.369478025s\" userAgent=\"quota.test/v0.0.0 (linux/amd64) kubernetes/$Format\" audit-ID=\"a6ad587a-1fe9-48f0-afba-248a92b86aae\" srcIP=\"127.0.0.1:41978\" resp=500 statusStack=\"\\ngoroutine 7281 [running]:\\nk8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).recordStatus(0xc004351c80, 0x1f4)\\n\\t/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog/httplog.go:274 +0xcf\\nk8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).WriteHeader(0xc004351c80, 0x1f4)\\n\\t/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog/httplog.go:252 +0x35\\nk8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*baseTimeoutWriter).WriteHeader(0xc004882200, 0x1f4)\\n\\t/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/a"}
{"Time":"2021-10-05T22:36:33.365107218Z","Action":"output","Package":"k8s.io/kubernetes/test/integration/quota","Test":"TestQuota","Output":"I1005 22:36:33.365019   64799 panic.go:965] \"HTTP\" verb=\"POST\" URI=\"/api/v1/namespaces/quotaed/pods\" latency=\"2m25.282338743s\" userAgent=\"quota.test/v0.0.0 (linux/amd64) kubernetes/$Format\" audit-ID=\"75de8fd4-2715-452c-9072-f6bf349befc7\" srcIP=\"127.0.0.1:41950\" resp=500 statusStack=\"\\ngoroutine 7320 [running]:\\nk8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).recordStatus(0xc004e24100, 0x1f4)\\n\\t/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog/httplog.go:274 +0xcf\\nk8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).WriteHeader(0xc004e24100, 0x1f4)\\n\\t/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog/httplog.go:252 +0x35\\nk8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*baseTimeoutWriter).WriteHeader(0xc004882300, 0x1f4)\\n\\t/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/a"}
{"Time":"2021-10-05T22:36:33.439372328Z","Action":"output","Package":"k8s.io/kubernetes/test/integration/quota","Test":"TestQuota","Output":"I1005 22:36:33.395482   64799 panic.go:965] \"HTTP\" verb=\"POST\" URI=\"/api/v1/namespaces/quotaed/pods\" latency=\"2m25.323783326s\" userAgent=\"quota.test/v0.0.0 (linux/amd64) kubernetes/$Format\" audit-ID=\"6ed4cc53-a55e-492c-8968-9ba4e0d628c5\" srcIP=\"127.0.0.1:41966\" resp=500 statusStack=\"\\ngoroutine 7317 [running]:\\nk8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).recordStatus(0xc004e24000, 0x1f4)\\n\\t/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog/httplog.go:274 +0xcf\\nk8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).WriteHeader(0xc004e24000, 0x1f4)\\n\\t/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog/httplog.go:252 +0x35\\nk8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*baseTimeoutWriter).WriteHeader(0xc004882280, 0x1f4)\\n\\t/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/a"}

and it looks like TestQuota output is ~50% of the total output (~700 MB):

grep TestQuota serialrun.stdout  | wc -c
 730919143

it looks like the resource quota controller goes nuts at about 22:33:39.331724:

{"Time":"2021-10-05T22:33:39.33173509Z","Action":"output","Package":"k8s.io/kubernetes/test/integration/quota","Test":"TestQuota","Output":"I1005 22:33:39.331724   64799 resource_quota_controller.go:194] Resource quota controller queued all resource quota for full calculation of usage\n"}
{"Time":"2021-10-05T22:33:39.331789854Z","Action":"output","Package":"k8s.io/kubernetes/test/integration/quota","Test":"TestQuota","Output":"I1005 22:33:39.331759   64799 resource_quota_controller.go:194] Resource quota controller queued all resource quota for full calculation of usage\n"}
{"Time":"2021-10-05T22:33:39.331807336Z","Action":"output","Package":"k8s.io/kubernetes/test/integration/quota","Test":"TestQuota","Output":"I1005 22:33:39.331796   64799 resource_quota_controller.go:194] Resource quota controller queued all resource quota for full calculation of usage\n"}
{"Time":"2021-10-05T22:33:39.331815556Z","Action":"output","Package":"k8s.io/kubernetes/test/integration/quota","Test":"TestQuota","Output":"I1005 22:33:39.331807   64799 resource_quota_controller.go:194] Resource quota controller queued all resource quota for full calculation of usage\n"}
{"Time":"2021-10-05T22:33:39.331824314Z","Action":"output","Package":"k8s.io/kubernetes/test/integration/quota","Test":"TestQuota","Output":"I1005 22:33:39.331819   64799 resource_quota_controller.go:194] Resource quota controller queued all resource quota for full calculation of usage\n"}
{"Time":"2021-10-05T22:33:39.331832705Z","Action":"output","Package":"k8s.io/kubernetes/test/integration/quota","Test":"TestQuota","Output":"I1005 22:33:39.331828   64799 resource_quota_controller.go:194] Resource quota controller queued all resource quota for full calculation of usage\n"}
{"Time":"2021-10-05T22:33:39.331918832Z","Action":"output","Package":"k8s.io/kubernetes/test/integration/quota","Test":"TestQuota","Output":"I1005 22:33:39.331850   64799 resource_quota_controller.go:194] Resource quota controller queued all resource quota for full calculation of usage\n"}

that's logged about 2.5 million times in about 3.5 minutes, apparently in a hot loop

@liggitt (Member) commented Oct 6, 2021

it looks like the integration tests set up the quota controller with a 0-duration replenishment resync period, which the controller interprets as "resync all quotas as fast as possible, all the time"... that didn't change recently, but some additional quota tests were added in 1.22 around July that might have made this pathological behavior overwhelm CI... it doesn't explain why this is different on master or in periodic jobs, though, or the sudden shift around 9/28 in 1.22
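
An illustrative sketch (a simplified, hypothetical stand-in, not the actual quota controller code) of why a zero resync period degenerates into a hot loop: with a period of 0 the requeue loop never sleeps between iterations, so the "queued all resource quota for full calculation of usage" message gets emitted as fast as a core can spin.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	resyncPeriod := 0 * time.Second // what the test wiring effectively passes
	stop := time.After(50 * time.Millisecond)

	requeues := 0
	for {
		select {
		case <-stop:
			fmt.Printf("requeued all quotas %d times in 50ms\n", requeues)
			return
		default:
		}
		// ... enqueue every ResourceQuota for a full usage recalculation,
		// logging "queued all resource quota for full calculation of usage" ...
		requeues++
		time.Sleep(resyncPeriod) // 0 => no pause between iterations, so the loop spins
	}
}
```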

@liggitt (Member) commented Oct 6, 2021

fwiw, from scraping the stdout artifacts of the last passing runs, these are the tests with >10k output lines:

master:

10547 TestUnschedulablePodBecomesSchedulable
15303 TestAdoption
18485 TestConfigConsumerFight
18719 TestPersistentVolumeControllerStartup
18792 TestPreemptionRaces
26886 TestPersistentVolumeMultiPVsPVCs
29415 TestPersistentVolumeProvisionMultiPVCs
31166 TestVolumeBindingDynamicStressFast
41283 TestVolumeBindingDynamicStressSlow
41588 TestVolumeBindingStressWithSchedulerResync
43103 TestVolumeBindingStress
298114 TestQuotaLimitService
4181892 TestQuota

1.22:

13097 TestAdoption
15897 TestPersistentVolumeMultiPVsPVCs
18218 TestPersistentVolumeControllerStartup
18533 TestPreemptionRaces
22433 TestPersistentVolumeProvisionMultiPVCs
24636 TestVolumeBindingDynamicStressFast
28755 TestVolumeBindingStress
33113 TestVolumeBindingDynamicStressSlow
38043 TestVolumeBindingStressWithSchedulerResync
144523 TestQuotaLimitService
1812887 TestQuota
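
A minimal sketch of how per-test output-line counts like these could be produced, assuming the scraped stdout artifacts are go test -json streams like the snippets quoted above (the Action/Test field names come from those snippets):

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

type event struct {
	Action string `json:"Action"`
	Test   string `json:"Test"`
}

func main() {
	counts := map[string]int{}
	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 0, 1024*1024), 16*1024*1024) // some output lines are very long
	for sc.Scan() {
		var e event
		if err := json.Unmarshal(sc.Bytes(), &e); err != nil || e.Test == "" {
			continue // skip non-JSON lines and package-level events
		}
		if e.Action == "output" {
			counts[e.Test]++
		}
	}
	for test, n := range counts {
		if n > 10000 {
			fmt.Println(n, test)
		}
	}
}
```

It would be run along the lines of: go run count.go < serialrun.stdout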

@spiffxp (Member) commented Oct 6, 2021

it's perplexing what made this so dramatically turn red on presubmits on this branch while staying ~green on master and in periodic runs

Here's my pet theory. The improvement in IOPS from the build cluster using local-ssd instead of pd-ssd changed the performance profile of the tests, allowing more of them to run more quickly and thus hit whatever issue there is between etcd and the apiserver.

We don't see this in CI because it runs with --short=false, meaning some of the slow/expensive tests that are skipped during presubmit are run, which happens to space things out just enough to avoid this.

We don't see this in master because it runs with go1.17 and different code, which may change the performance of the tests.

We don't see this in previous release branches because they don't have the podsecurity tests.

We could confirm this theory by creating a node pool backed by pd-ssd instead of local-ssd, and pinning the 1.22 integration canary presubmit to it with taints/tolerations. That's a fair amount of babysitting, and I'm not sure what it gains in root-causing the underlying issue, which is why I haven't done it yet. But, if you think it would help to confirm why just presubmit and why just release-1.22, I can perform this experiment tomorrow.

@aojea (Member) commented Oct 7, 2021

We don't see this in master because it runs with go1.17 and different code, which may change the performance of the tests.

How difficult would it be to rule out the golang version diff?

@liggitt (Member) commented Oct 7, 2021

if test-integration is on the edge, I think it's more likely master is working because it dropped a lot of v1beta1 integration tests (#105506 (comment))

to unblock 1.22, I'd probably recommend switching the presubmit config to match the periodic config just for 1.22, which seems to be green (bump memory slightly, run all integration tests instead of skipping slow ones)

@aojea (Member) commented Oct 7, 2021

The memory on the canary is already the same:

https://github.com/kubernetes/test-infra/blob/897043f76a81e13483ebf2e44eecdf13345c1aa1/config/jobs/kubernetes/sig-release/release-branch-jobs/1.22.yaml#L1754-L1772

It was missing the env variable; kubernetes/test-infra#23932 adds it. Let's see if that makes the difference.

@liggitt (Member) commented Oct 7, 2021

it looks like that resolved the canary runs on the open PRs I saw

@spiffxp (Member) commented Oct 7, 2021

https://prow.k8s.io/job-history/gs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-integration-1-22-canary - OK, I'm willing to port this to the integration job, with a comment linking to this issue to explain why 1.22 gets a different job config

@spiffxp (Member) commented Oct 7, 2021

Opened kubernetes/test-infra#23936

@spiffxp (Member) commented Oct 7, 2021

I swept through an arbitrary batch of 1.22 PRs with the approve, lgtm, and cherrypick-approved labels and retested them. The first batch I did passed, so I've removed and triggered retest on all of them.

I have not touched any PRs without those labels

@spiffxp (Member) commented Oct 8, 2021

/close
The job is passing now. kubernetes/test-infra#23936 sort of summarizes the string of comments in here as far as what we did and why.

Ultimately though, I believe the integration tests remain brittle and that the underlying connection issue between etcd/apiserver (which I'm going to keep linking to #103512 for) should be addressed if this crops up again.

@k8s-ci-robot (Contributor)

@spiffxp: Closing this issue.

In response to this:

/close
The job is passing now. kubernetes/test-infra#23936 sort of summarizes the string of comments in here as far as what we did and why.

Ultimately though, I believe the integration tests remain brittle and that the underlying connection issue between etcd/apiserver (which I'm going to keep linking to #103512 for) should be addressed if this crops up again.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
