node e2e tests - projects out of quota #17714

karan · 2020-05-26T18:53:07Z

Many sig-node tests are failing consistently due to CPU exhaustion:

I0526 15:56:55.801] unable to create gce instance with running docker daemon for image ubuntu-gke-1804-1-16-v20200330.  could not create instance tmp-node-e2e-076c4ac7-ubuntu-gke-1804-1-16-v20200330: [&{Code:QUOTA_EXCEEDED Location: Message:Quota 'CPUS' exceeded.  Limit: 100.0 in region us-west1. ForceSendFields:[] NullFields:[]}]

They are tracked in https://docs.google.com/spreadsheets/d/1mEU8B2_PmMwwgp-_xnyp7QYMBwcLoA9NNlHwDyMvO0Y/edit#gid=0.

For one of the projects, looks like it's at 100 instances already:

$ gcloud compute instances list --project cri-containerd-node-e2e | wc -l
100

A good starting point will be to:

- get a list of all these projects
- inspect all tests using these projects and how many CPUs they need
- come up with a plan for getting out of this problem (we should definitely increase quota and/or shard across more projects and/or reduce resource usage by tests)


karan · 2020-05-26T18:53:21Z

/kind cleanup
/sig node
/sig testing

helenfeng737 · 2020-05-26T18:53:50Z

/cc @ZhiFeng1993

karan · 2020-05-26T18:55:28Z

/assign karan

MHBauer · 2020-05-26T19:09:55Z

Also addresses exhaustion.

karan · 2020-05-26T21:30:10Z

Looking at the following:

Job	Link	Project
containerd-node-e2e-1.2	https://testgrid.k8s.io/sig-node-containerd#containerd-node-e2e-1.2	cri-containerd-node-e2e
containerd-node-e2e-1.3	https://testgrid.k8s.io/sig-node-containerd#containerd-node-e2e-1.3	cri-containerd-node-e2e
containerd-node-features	https://testgrid.k8s.io/sig-node-containerd#containerd-node-features	cri-containerd-node-e2e
containerd-node-e2e-features-1.2	https://testgrid.k8s.io/sig-node-containerd#containerd-node-e2e-features-1.2	cri-containerd-node-e2e
containerd-node-e2e-features-1.3	https://testgrid.k8s.io/sig-node-containerd#containerd-node-e2e-features-1.3	cri-containerd-node-e2e
image-validation-node-e2e	https://testgrid.k8s.io/sig-node-containerd#image-validation-node-e2e	cri-containerd-node-e2e
image-validation-node-features	https://testgrid.k8s.io/sig-node-containerd#image-validation-node-features	cri-containerd-node-e2e
node-e2e	https://testgrid.k8s.io/sig-node-containerd#node-e2e	cri-containerd-node-e2e
node-e2e-benchmark	https://testgrid.k8s.io/sig-node-containerd#node-e2e-benchmark	cri-containerd-node-e2e
node-e2e-features	https://testgrid.k8s.io/sig-node-containerd#node-e2e-features	cri-containerd-node-e2e
node-e2e-flaky	https://testgrid.k8s.io/sig-node-containerd#node-e2e-flaky	cri-containerd-node-e2e
node-e2e-serial	https://testgrid.k8s.io/sig-node-containerd#node-e2e-serial	cri-containerd-node-e2e

All of them use the same project and the same zone (us-west1-b), so they all draw on the same us-west1 regional quota. All of these run with n1-standard-1 (1 vCPU per VM), and it seems like in the last couple of hours the project has been constantly at 100 CPUs in use.

An easy fix would be to spread the tests over other regions as well (us-central1 has another 100 CPU quota.)
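
As a rough illustration of the spread, the jobs in the table above could be split round-robin across two zones. This is only a sketch: the helper name `spread_jobs` is hypothetical, `us-central1-b` is assumed as the second zone, and the real change would be made in the job configs, not in a script.

```python
from itertools import cycle

# Job names copied from the table above.
JOBS = [
    "containerd-node-e2e-1.2",
    "containerd-node-e2e-1.3",
    "containerd-node-features",
    "containerd-node-e2e-features-1.2",
    "containerd-node-e2e-features-1.3",
    "image-validation-node-e2e",
    "image-validation-node-features",
    "node-e2e",
    "node-e2e-benchmark",
    "node-e2e-features",
    "node-e2e-flaky",
    "node-e2e-serial",
]


def spread_jobs(jobs, zones):
    """Assign jobs to zones round-robin and return {zone: [job, ...]}."""
    assignment = {zone: [] for zone in zones}
    for job, zone in zip(jobs, cycle(zones)):
        assignment[zone].append(job)
    return assignment


if __name__ == "__main__":
    # Assumed zone pair; us-central1-b is not confirmed as the final choice.
    for zone, jobs in spread_jobs(JOBS, ["us-west1-b", "us-central1-b"]).items():
        print(zone, len(jobs), jobs)
```

Round-robin only evens out the number of jobs per zone. It says nothing about how many VMs each job starts, so the actual CPU usage per region would still need to be checked.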

spiffxp · 2020-05-26T23:52:47Z

I would confirm whether those instances are from live running tests, or whether they're detritus left over from tests that couldn't ssh to the nodes they'd created and forgot to clean up. I LGTM'ed the PR to try a new region, but if the reason is the latter, you may hit quota again.

See kubernetes/kubernetes#89892 (comment), where we tested this out by clearing out VMs for a project that appeared to have network issues.
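
One rough way to tell live instances from leftovers is to look at instance age. The sketch below assumes the JSON from `gcloud compute instances list --project <project> --format=json` has already been loaded into a list of dicts, and that anything older than a chosen cutoff is suspect. The two-hour cutoff and the function name `find_stale_instances` are assumptions for illustration, not something the test harness defines.

```python
from datetime import datetime, timedelta, timezone


def find_stale_instances(instances, now, max_age=timedelta(hours=2)):
    """Return the names of instances created more than max_age before now."""
    stale = []
    for inst in instances:
        # creationTimestamp is RFC 3339 text with a UTC offset;
        # a trailing "Z" is normalized so fromisoformat accepts it.
        text = inst["creationTimestamp"].replace("Z", "+00:00")
        created = datetime.fromisoformat(text)
        if now - created > max_age:
            stale.append(inst["name"])
    return stale


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the real project.
    sample = [
        {"name": "tmp-node-e2e-old", "creationTimestamp": "2020-05-26T08:00:00.000-07:00"},
        {"name": "tmp-node-e2e-new", "creationTimestamp": "2020-05-26T11:30:00.000-07:00"},
    ]
    now = datetime(2020, 5, 26, 19, 0, tzinfo=timezone.utc)
    print(find_stale_instances(sample, now))
```

Age alone is a heuristic: a long serial or benchmark run could legitimately exceed the cutoff, so the result is a list to inspect rather than a list to delete.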

MHBauer · 2020-05-27T15:28:17Z

That's a great comment to look at, thanks!


karan · 2020-06-02T22:13:47Z

Seems like the quota issue is resolved: I'm seeing 4-6 VMs in the project right now, so it doesn't seem like VMs are being orphaned.

The new issue is around docker not running (for which I'll cut a new issue).

MHBauer · 2020-06-04T20:11:40Z

I'm not sure this is quite finished: for https://testgrid.k8s.io/sig-node-containerd#pull-node-e2e I see the IN_USE_ADDRESSES quota exceeded.

I0604 19:06:58.636] >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
I0604 19:06:58.636] > START TEST >
I0604 19:06:58.636] >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
I0604 19:06:58.637] Start Test Suite on Host
I0604 19:06:58.637]
I0604 19:06:58.637] Failure Finished Test Suite on Host
I0604 19:06:58.637] unable to create gce instance with running docker daemon for image ubuntu-gke-1804-1-15-v20200602. could not create instance tmp-node-e2e-f2333f3a-ubuntu-gke-1804-1-15-v20200602: [&{Code:QUOTA_EXCEEDED Location: Message:Quota 'IN_USE_ADDRESSES' exceeded. Limit: 69.0 in region us-west1. ForceSendFields:[] NullFields:[]}]
I0604 19:06:58.638] <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
I0604 19:06:58.638] < FINISH TEST <
I0604 19:06:58.638] <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
I0604 19:06:58.638]
W0604 19:06:58.739] E0604 19:06:58.635696 5017 run_remote.go:779] Error deleting instance "tmp-node-e2e-f2333f3a-ubuntu-gke-1804-1-15-v20200602": googleapi: Error 404: The resource 'projects/k8s-c8d-pr-node-e2e/zones/us-west1-b/instances/tmp-node-e2e-f2333f3a-ubuntu-gke-1804-1-15-v20200602' was not found, notFound
W0604 19:06:58.836] E0604 19:06:58.836212 5017 run_remote.go:779] Error deleting instance "tmp-node-e2e-f2333f3a-cos-77-12371-284-0": googleapi: Error 404: The resource 'projects/k8s-c8d-pr-node-e2e/zones/us-west1-b/instances/tmp-node-e2e-f2333f3a-cos-77-12371-284-0' was not found, notFound
I0604 19:06:58.937]
I0604 19:06:58.937] >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
I0604 19:06:58.937] > START TEST >
I0604 19:06:58.937] >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
I0604 19:06:58.938] Start Test Suite on Host
I0604 19:06:58.938]
I0604 19:06:58.938] Failure Finished Test Suite on Host
I0604 19:06:58.938] unable to create gce instance with running docker daemon for image cos-77-12371-284-0. could not create instance tmp-node-e2e-f2333f3a-cos-77-12371-284-0: [&{Code:QUOTA_EXCEEDED Location: Message:Quota 'IN_USE_ADDRESSES' exceeded. Limit: 69.0 in region us-west1. ForceSendFields:[] NullFields:[]}]
I0604 19:06:58.938] <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
I0604 19:06:58.939] < FINISH TEST <
I0604 19:06:58.939] <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

karan · 2020-06-04T20:18:46Z

--gcp-project=k8s-c8d-pr-node-e2e

We should cut a new issue IMO and check all projects we use and their IP quota.

If we allow up to 100 VMs in a project, we should have at least that many IPs in quota as well.
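
A per-project check could read the regional quotas and compare them. This sketch assumes the output of `gcloud compute regions describe us-west1 --project <project> --format=json` has been parsed into a dict whose `quotas` list holds `metric`, `limit`, and `usage` entries. The function names and the 90% threshold are hypothetical, and it assumes each test VM takes one in-use address.

```python
def near_limit(region, threshold=0.9):
    """Return the metrics whose usage is at or above threshold * limit."""
    return sorted(
        q["metric"]
        for q in region.get("quotas", [])
        if q["limit"] > 0 and q["usage"] / q["limit"] >= threshold
    )


def ip_quota_covers_vms(region, max_vms):
    """True if the IN_USE_ADDRESSES limit allows at least max_vms addresses."""
    by_metric = {q["metric"]: q for q in region.get("quotas", [])}
    ip_quota = by_metric.get("IN_USE_ADDRESSES")
    return ip_quota is not None and ip_quota["limit"] >= max_vms


if __name__ == "__main__":
    # Limits match the error messages in this thread; usage values are made up.
    region = {
        "quotas": [
            {"metric": "CPUS", "limit": 100.0, "usage": 100.0},
            {"metric": "IN_USE_ADDRESSES", "limit": 69.0, "usage": 69.0},
        ]
    }
    print(near_limit(region))
    print(ip_quota_covers_vms(region, max_vms=100))
```

Running the same check over every project and region the jobs use would show where the address limit is lower than the number of VMs the CPU quota allows.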

MHBauer · 2020-06-04T20:43:05Z

I do not have any idea of how to do this project checking.

dims · 2020-06-04T20:52:54Z

We should get guidance from @spiffxp and @BenTheElder

BenTheElder · 2020-06-04T21:24:37Z

Aaron and I don't have much to do with this project currently. When SIGs ask for GCE quota, we ask them to use the boskos pools to rent a project for one test run at a time, which makes cleanup very straightforward (destroy all project contents) and capacity planning easy (monitor how full / empty the pools are).

Using a single project is a bit of an anti-pattern.
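
To make the rent-a-project idea concrete, here is a toy in-memory pool. It is not the boskos API: the class, method, and attribute names are made up for illustration and only mirror the flow described above (take a free project, run one test, wipe the project, hand it back).

```python
from contextlib import contextmanager


class ProjectPool:
    """Toy pool of project names; hypothetical, not the boskos client."""

    def __init__(self, projects):
        self.free = list(projects)
        self.busy = set()

    def acquire(self):
        if not self.free:
            raise RuntimeError("pool exhausted: no free projects")
        project = self.free.pop(0)
        self.busy.add(project)
        return project

    def release(self, project, cleanup):
        # Wipe everything in the project before it can be rented again.
        cleanup(project)
        self.busy.remove(project)
        self.free.append(project)


@contextmanager
def rented_project(pool, cleanup):
    """Rent one project for the duration of a single test run."""
    project = pool.acquire()
    try:
        yield project
    finally:
        pool.release(project, cleanup)


if __name__ == "__main__":
    pool = ProjectPool(["example-project-1", "example-project-2"])
    with rented_project(pool, cleanup=lambda p: print("cleaning", p)) as project:
        print("running test in", project)
```

Because cleanup runs in the `finally` branch, a failed test still returns a wiped project, and the number of entries in `free` is a direct measure of remaining capacity.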

MHBauer · 2020-06-04T21:50:23Z

So we need to figure out why it's tied to a specific gcp project, and figure out a way to nullify that requirement. Does that sound right?

MHBauer · 2020-06-04T21:56:43Z

And move to using a boskos pool for (any?/all?) tests that currently have a specific project/region? Cause yea, it would make sense to me that we don't care where these things run, just that they do.

MHBauer · 2020-06-04T21:59:15Z

Okay, thanks for the hint:
I see #7769 and I will use #8165 as reference and try to follow up.

karan added the kind/bug label May 26, 2020

k8s-ci-robot added the kind/cleanup, sig/node, and sig/testing labels May 26, 2020

karan changed the title ~~node e2e tests - project out of quota~~ node e2e tests - projects out of quota May 26, 2020

k8s-ci-robot assigned karan May 26, 2020

karan mentioned this issue May 26, 2020

Run some containerd tests in us-central1-b #17717

Merged

karan closed this as completed Jun 2, 2020
