[release-1.22] pull-kubernetes-integration is failing #105436
/priority critical-urgent
https://storage.googleapis.com/k8s-triage/index.html?ci=0&pr=1&job=pull-kubernetes-integration
Tough to confirm when this started, since there's no easy mechanism to filter presubmit history by base branch. I'm trying to script something together to scrape GCS; so far I see that several PRs have had failed pull-kubernetes-integration runs.
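For reference, here's a rough sketch (in Go, against the public kubernetes-jenkins bucket) of the kind of scraping I mean; the pr-logs prefix and the finished.json fields follow Prow's usual upload layout, but treat the details as assumptions rather than the actual script:

```go
// Rough sketch: walk pull-kubernetes-integration runs in the public
// kubernetes-jenkins bucket and report pass/fail from each finished.json.
// The prefix below covers all PRs, so in practice you'd narrow it down
// (or start from the job's "directory" index) before running this.
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"strings"

	"cloud.google.com/go/storage"
	"google.golang.org/api/iterator"
	"google.golang.org/api/option"
)

// finished mirrors the fields we care about from Prow's finished.json.
type finished struct {
	Timestamp int64  `json:"timestamp"`
	Passed    bool   `json:"passed"`
	Result    string `json:"result"`
}

func main() {
	ctx := context.Background()
	// The bucket is world-readable, so no credentials are needed.
	client, err := storage.NewClient(ctx, option.WithoutAuthentication())
	if err != nil {
		panic(err)
	}
	bkt := client.Bucket("kubernetes-jenkins")

	it := bkt.Objects(ctx, &storage.Query{Prefix: "pr-logs/pull/"})
	for {
		attrs, err := it.Next()
		if err == iterator.Done {
			break
		}
		if err != nil {
			panic(err)
		}
		// Only look at finished.json for the job we care about.
		if !strings.Contains(attrs.Name, "/pull-kubernetes-integration/") ||
			!strings.HasSuffix(attrs.Name, "/finished.json") {
			continue
		}
		rc, err := bkt.Object(attrs.Name).NewReader(ctx)
		if err != nil {
			continue
		}
		var f finished
		if err := json.NewDecoder(rc).Decode(&f); err == nil {
			fmt.Printf("%s passed=%v result=%s\n", attrs.Name, f.Passed, f.Result)
		}
		rc.Close()
	}
}
```

Filtering presubmit runs by base branch would additionally require reading something like each run's prowjob.json or clone-records.json, which is exactly what makes this awkward to do by hand.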
Maybe related: we saw the presubmit timing out a bunch back in July, and it seems like that mostly stopped by 2020-07-09. But we never closed out investigating, and I suspect these are timeouts, based on spyglass lying (its UI shows pass but the job is failing); ref: #103512
Could this be networking related?
From https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/105452/pull-kubernetes-integration/1445094410348924928/build-log.txt (note: the file size is 114 MB).
This is a reasonably sized snippet of the 110+ MB build log from https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/105452/pull-kubernetes-integration/1445094410348924928/build-log.txt
/sig api-machinery
tl;dr: suggest trying to see what changed around 2021-09-27, since that seems to be about when the job stopped passing.
Scraped the last 800 runs of pull-kubernetes-integration and filtered down to those that ran against release-1.22. The first few entries from that list:
- #105277 had the last successful run: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/105277/pull-kubernetes-integration/1442486689749536768
- #104988 is the closest PR to merge relative to that run
- #105154 is the next PR to merge, but appears unrelated (it updates storage e2es)
- #105139 had the first failing run: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/105139/pull-kubernetes-integration/1442958977880035328
https://testgrid.k8s.io/sig-release-1.22-blocking#integration-1.22&width=5&graph-metrics=test-duration-minutes shows test variance increasing a lot in the 9/27 - 9/28 timeframe.
Comparing the last successful and the first failed job's podinfo, the first failed job has:
What does that do?
I think that's the integration tests becoming less I/O-limited; that was around the time we migrated k8s-infra-prow-build to a local-ssd-backed nodepool. Ref: kubernetes/k8s.io#1187 (comment)
nm, it just records whether the job was triggered by a retest (https://github.com/kubernetes/test-infra/blob/master/prow/kube/prowjob.go#L64-L65)
kubekins jumped from gcr.io/k8s-staging-test-infra/kubekins-e2e:v20210917-ee1e7c845b-1.22 to gcr.io/k8s-staging-test-infra/kubekins-e2e:v20210928-2a55334641-1.22. No other notable changes between the job podinfos.
Opened a test PR with the last-merged PR to release-1.22 reverted: #105473. I'm not sure how that change would have passed the presubmit to get in if it were the issue, but just to double check.
Edit: test-integration failed in the revert, so it doesn't look like the change at HEAD is related.
Looking through kubernetes/test-infra@ee1e7c8...2a55334, I'm pretty sure kubernetes/test-infra#23757 would have found its way into the image; it mangles iptables a bit if docker-in-docker is enabled. I'm not sure that actually takes effect for this job, though, and I'd be confused why it only affects the release-1.22 branch.
It does take effect; I see it in the job output.
The Google Cloud incident that motivated this change is over (kubernetes/test-infra#23741), so theoretically we should be safe to revert it. I had been dragging my feet because I didn't have confidence we weren't going to run into it again. I remain confused why only release-1.22 is affected, if this is the culprit.
EDIT: I opened a PR to start us toward reverting the change: kubernetes/test-infra#23885. We will need to build a kubekins from the new bootstrap image, and can then selectively disable the workaround for the integration job via an env var change to see if that makes a difference. Tomorrow I would be willing to set the env var to false to disable the workaround for all jobs.
Another angle to take: why is the periodic / CI job passing where the presubmit is failing?
I would be more inclined to lend credence to these differences if I saw presubmit/periodic differences in other release branches.
That's really surprising to me... do you have a link to that run? The parallel runs/tests were what I saw exhausting ephemeral ports in the past.
5197769 - dropping concurrency down to 1 ended up with a 357 MB (!!) junit xml file, which spyglass is refusing to parse (as it should; that's way too large) (ref: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/105488/pull-kubernetes-integration/1445512318795386880). Grepping for ^FAIL, I pulled a list of failures out of the log.
Looking at the stdout in the serial run, I see panic.go only in TestQuota, and it looks like TestQuota output is 50% of the total output (~700 MB).
It looks like the resource quota controller goes nuts starting at about 22:33:39.331724: the same message is logged about 2.5 million times in about 3.5 minutes, apparently in a hot loop.
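In case it helps anyone else digging through logs this size, here's a small Go sketch of the kind of scan involved: count a repeated message and collect the ^FAIL lines. The file name and the message string are placeholders, not the actual strings from this job.

```go
// Sketch: scan a very large build-log.txt, count occurrences of a repeated
// message, and collect lines that start with "FAIL".
// The path and needle below are placeholders.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("build-log.txt")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	// Some apiserver/controller log lines are far longer than the default
	// 64KB token limit, so give the scanner a bigger buffer.
	sc.Buffer(make([]byte, 0, 1024*1024), 16*1024*1024)

	needle := "resource quota controller message" // placeholder message
	count := 0
	var fails []string
	for sc.Scan() {
		line := sc.Text()
		if strings.Contains(line, needle) {
			count++
		}
		if strings.HasPrefix(line, "FAIL") {
			fails = append(fails, line)
		}
	}
	if err := sc.Err(); err != nil {
		panic(err)
	}
	fmt.Printf("%q seen %d times\n", needle, count)
	for _, l := range fails {
		fmt.Println(l)
	}
}
```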
It looks like the integration tests set up the quota controller with a 0-duration replenishment resync period, which the controller interprets as "resync all quotas as fast as possible, all the time". That didn't change recently, but some additional quota tests added in 1.22 around July might have made this pathological behavior overwhelm CI. It doesn't explain why this is different in master or in the periodic jobs, though, or the sudden shift around 9/28 in 1.22.
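To illustrate why a 0-duration period degenerates like this (a minimal sketch using apimachinery's wait.Until, not the real quota controller wiring): with a zero period the worker gets re-invoked back-to-back, with no pause between invocations.

```go
// Minimal sketch, not the real quota controller: with a zero resync
// period, wait.Until re-runs the worker with no pause in between.
package main

import (
	"fmt"
	"sync/atomic"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	stop := make(chan struct{})
	var resyncs atomic.Int64

	// Stand-in for "resync every quota"; with a 0 period this runs as
	// fast as the CPU allows, i.e. a hot loop.
	go wait.Until(func() { resyncs.Add(1) }, 0, stop)

	time.Sleep(1 * time.Second)
	close(stop)
	fmt.Printf("worker invoked %d times in 1s\n", resyncs.Load())
}
```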
FWIW, from scraping stdout artifacts from the last passing runs, I compared which tests had >10k output lines on master vs. on 1.22.
Here's my pet theory: the improvement in IOPS provided by the build cluster using local-ssd instead of pd-ssd has allowed the performance profile of the tests to change, allowing more of them to run more quickly, and thus hitting whatever issue there is between etcd and apiserver.
We don't see this in CI because the periodic job runs with a different configuration. We don't see this in master because it runs with go1.17 and different code, which may change the performance of the tests. We don't see this in previous release branches because they don't have the podsecurity tests.
We could confirm this theory by creating a node pool backed by pd-ssd instead of local-ssd and pinning the 1.22 integration canary presubmit to it with taints/tolerations (see the sketch below). That's a fair amount of babysitting, and I'm not sure what it gains in root-causing the underlying issue, which is why I haven't done it yet. But if you think it would help confirm why just the presubmit and why just release-1.22, I can perform this experiment tomorrow.
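For concreteness on the pinning part: Prow job specs embed a corev1.PodSpec, so steering the canary onto a dedicated pd-ssd node pool would come down to a nodeSelector plus a matching toleration, roughly like the sketch below. The pool name and taint key/value are made up for illustration.

```go
// Illustrative only: the node-pool name and taint key/value are made up.
// Pinning the canary job to a dedicated pd-ssd node pool amounts to
// setting these two fields on the job's pod spec.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	spec := corev1.PodSpec{
		// Only schedule onto nodes in the (hypothetical) pd-ssd pool...
		NodeSelector: map[string]string{
			"cloud.google.com/gke-nodepool": "pool-pd-ssd", // assumed pool name
		},
		// ...and tolerate the (hypothetical) taint that keeps other jobs off it.
		Tolerations: []corev1.Toleration{{
			Key:      "dedicated",
			Operator: corev1.TolerationOpEqual,
			Value:    "integration-canary",
			Effect:   corev1.TaintEffectNoSchedule,
		}},
		Containers: []corev1.Container{{
			Name:  "test",
			Image: "gcr.io/k8s-staging-test-infra/kubekins-e2e:v20210928-2a55334641-1.22",
		}},
	}
	fmt.Printf("%+v\n", spec)
}
```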
How difficult would it be to rule out the golang version diff?
If test-integration is on the edge, I think it's more likely that master is working because it dropped a lot of v1beta1 integration tests (#105506 (comment)). To unblock 1.22, I'd probably recommend switching the presubmit config to match the periodic config just for 1.22, which seems to be green (bump memory slightly, run all integration tests instead of skipping slow ones).
Memory on the canary is already the same; it was missing the env variable, though: kubernetes/test-infra#23932. Let's see if that makes the difference.
It looks like that resolved the canary runs on the open PRs I saw.
https://prow.k8s.io/job-history/gs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-integration-1-22-canary - OK, I'm willing to port this to the integration job, with a comment linking to this issue to explain why 1.22 gets a different job config.
Opened kubernetes/test-infra#23936
I swept through the open PRs carrying the relevant labels; I have not touched any PRs without those labels.
/close
Ultimately though, I believe the integration tests remain brittle, and that the underlying connection issue between etcd/apiserver (which I'm going to keep linking to #103512 for) should be addressed if this crops up again.
@spiffxp: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Which jobs are failing:
pull-kubernetes-integration
See the runs related to #105139 - https://prow.k8s.io/pr-history/?org=kubernetes&repo=kubernetes&pr=105139
Which test(s) are failing:
Not clear. For some of the runs I don't even see a particular test failure.
Since when has it been failing:
Not clear.
Testgrid link:
See the runs related to #105139 - https://prow.k8s.io/pr-history/?org=kubernetes&repo=kubernetes&pr=105139
Reason for failure:
Not clear.
Relevant SIG:
/sig testing