node e2e tests - projects out of quota #17714

Closed
karan opened this issue May 26, 2020 · 16 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing.

Comments

@karan
Contributor

karan commented May 26, 2020

Many sig-node tests are failing consistently due to CPU exhaustion:

I0526 15:56:55.801] unable to create gce instance with running docker daemon for image ubuntu-gke-1804-1-16-v20200330.  could not create instance tmp-node-e2e-076c4ac7-ubuntu-gke-1804-1-16-v20200330: [&{Code:QUOTA_EXCEEDED Location: Message:Quota 'CPUS' exceeded.  Limit: 100.0 in region us-west1. ForceSendFields:[] NullFields:[]}]

They are tracked in https://docs.google.com/spreadsheets/d/1mEU8B2_PmMwwgp-_xnyp7QYMBwcLoA9NNlHwDyMvO0Y/edit#gid=0.

For one of the projects, looks like it's at 100 instances already:

$ gcloud compute instances list --project cri-containerd-node-e2e | wc -l
100
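One caveat on the count above: in its default table format, `gcloud compute instances list` prints a header row, so `wc -l` may overstate the number of instances by one. Below is a minimal Python sketch of a header-aware count. The function name and the sample text are illustrative, not taken from the project.

```python
def count_instances(table_output: str) -> int:
    """Count instance rows in `gcloud compute instances list` table output.

    Assumes the default table format, whose first non-empty line is a
    header starting with NAME. Blank lines are ignored.
    """
    rows = [line for line in table_output.splitlines() if line.strip()]
    if rows and rows[0].split()[0] == "NAME":
        rows = rows[1:]  # drop the header row
    return len(rows)


# Illustrative sample, not real project output.
sample = """\
NAME                      ZONE        MACHINE_TYPE   STATUS
tmp-node-e2e-aaaa-cos     us-west1-b  n1-standard-1  RUNNING
tmp-node-e2e-bbbb-ubuntu  us-west1-b  n1-standard-1  RUNNING
"""
print(count_instances(sample))  # prints 2
```

Passing `--format="value(name)"` to gcloud avoids the header entirely, in which case a plain line count is accurate.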

A good starting point will be to:

  • get a list of all these projects
  • inspect all tests using these projects and how many CPUs they need
  • come up with a plan for getting out of this problem (we should definitely increase quota and/or shard across more projects and/or reduce resource usage by tests)
@karan karan added the kind/bug Categorizes issue or PR as related to a bug. label May 26, 2020
@karan
Contributor Author

karan commented May 26, 2020

/kind cleanup
/sig node
/sig testing

@k8s-ci-robot k8s-ci-robot added kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels May 26, 2020
@helenfeng737
Contributor

helenfeng737 commented May 26, 2020

/cc @ZhiFeng1993

@karan karan changed the title node e2e tests - project out of quota node e2e tests - projects out of quota May 26, 2020
@karan
Contributor Author

karan commented May 26, 2020

/assign karan

@MHBauer
Contributor

MHBauer commented May 26, 2020

Also addresses exhaustion.

@karan
Contributor Author

karan commented May 26, 2020

Looking at the following:

| Job | Link | Project |
| --- | --- | --- |
| containerd-node-e2e-1.2 | https://testgrid.k8s.io/sig-node-containerd#containerd-node-e2e-1.2 | cri-containerd-node-e2e |
| containerd-node-e2e-1.3 | https://testgrid.k8s.io/sig-node-containerd#containerd-node-e2e-1.3 | cri-containerd-node-e2e |
| containerd-node-features | https://testgrid.k8s.io/sig-node-containerd#containerd-node-features | cri-containerd-node-e2e |
| containerd-node-e2e-features-1.2 | https://testgrid.k8s.io/sig-node-containerd#containerd-node-e2e-features-1.2 | cri-containerd-node-e2e |
| containerd-node-e2e-features-1.3 | https://testgrid.k8s.io/sig-node-containerd#containerd-node-e2e-features-1.3 | cri-containerd-node-e2e |
| image-validation-node-e2e | https://testgrid.k8s.io/sig-node-containerd#image-validation-node-e2e | cri-containerd-node-e2e |
| image-validation-node-features | https://testgrid.k8s.io/sig-node-containerd#image-validation-node-features | cri-containerd-node-e2e |
| node-e2e | https://testgrid.k8s.io/sig-node-containerd#node-e2e | cri-containerd-node-e2e |
| node-e2e-benchmark | https://testgrid.k8s.io/sig-node-containerd#node-e2e-benchmark | cri-containerd-node-e2e |
| node-e2e-features | https://testgrid.k8s.io/sig-node-containerd#node-e2e-features | cri-containerd-node-e2e |
| node-e2e-flaky | https://testgrid.k8s.io/sig-node-containerd#node-e2e-flaky | cri-containerd-node-e2e |
| node-e2e-serial | https://testgrid.k8s.io/sig-node-containerd#node-e2e-serial | cri-containerd-node-e2e |

All of them use the same project and the same zone (us-west1-b, in region us-west1). All of these run on n1-standard-1, and it seems that for the last couple of hours the project has been constantly at its 100-CPU limit.

An easy fix would be to spread the tests over other regions as well (us-central1 has another 100 CPU quota.)
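A rough sketch of that idea in Python: place each job in whichever region has the most CPU quota left. Only the two 100-CPU quotas come from this thread. The per-job CPU figures are placeholders, since peak usage per job has not been measured here, and `spread_jobs` is an illustrative name.

```python
def spread_jobs(job_cpus: dict, region_quota: dict) -> dict:
    """Greedily place each job in the region with the most spare CPU quota.

    job_cpus maps job name -> peak CPUs the job needs (assumed values).
    region_quota maps region -> CPU quota.
    Returns job name -> region; raises if a job fits nowhere.
    """
    remaining = dict(region_quota)
    placement = {}
    # Place the largest jobs first so they get the most headroom.
    for job, cpus in sorted(job_cpus.items(), key=lambda kv: (-kv[1], kv[0])):
        # Ties between regions are broken by region name, for determinism.
        region = max(remaining, key=lambda r: (remaining[r], r))
        if remaining[region] < cpus:
            raise ValueError(f"no region has {cpus} CPUs left for {job}")
        remaining[region] -= cpus
        placement[job] = region
    return placement


# Placeholder CPU needs; the quotas are the ones mentioned above.
jobs = {
    "node-e2e": 40,
    "node-e2e-serial": 30,
    "node-e2e-features": 30,
    "node-e2e-flaky": 20,
}
quotas = {"us-west1": 100, "us-central1": 100}
placement = spread_jobs(jobs, quotas)
for job in sorted(placement):
    print(job, "->", placement[job])
```

With these placeholder numbers each region ends up with 60 CPUs of demand. This only helps if the instances belong to live tests; leaked VMs would fill a second region just as well.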

@spiffxp
Member

spiffxp commented May 26, 2020

I would confirm whether those instances are from live running tests, or whether they're detritus left over from tests that couldn't ssh to the nodes they'd created and forgot to clean up. I LGTM'ed the PR to try a new region, but if the reason is the latter, you may hit quota again.

See kubernetes/kubernetes#89892 (comment), where we tested this out by clearing out VMs for a project that appeared to have network issues.
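One way to tell live instances from leftovers is by age. The sketch below assumes the instance list was exported with `gcloud compute instances list --format=json`, which includes `name` and an RFC 3339 `creationTimestamp` for each instance. The three-hour cutoff and the sample data are assumptions for illustration, not project settings.

```python
from datetime import datetime, timedelta, timezone


def find_stale_instances(instances, now, max_age=timedelta(hours=3)):
    """Return names of tmp-node-e2e instances older than max_age.

    instances: list of dicts with "name" and "creationTimestamp".
    now: a timezone-aware datetime to measure age against.
    """
    stale = []
    for inst in instances:
        if not inst["name"].startswith("tmp-node-e2e-"):
            continue  # ignore anything the node e2e runner did not create
        created = datetime.fromisoformat(inst["creationTimestamp"])
        if now - created > max_age:
            stale.append(inst["name"])
    return stale


# Hypothetical instances; names and timestamps are made up.
instances = [
    {"name": "tmp-node-e2e-aaaa-cos",
     "creationTimestamp": "2020-05-26T08:56:55.801-07:00"},
    {"name": "tmp-node-e2e-bbbb-ubuntu",
     "creationTimestamp": "2020-05-26T15:30:00.000-07:00"},
    {"name": "unrelated-vm",
     "creationTimestamp": "2020-05-01T00:00:00.000-07:00"},
]
now = datetime(2020, 5, 26, 23, 0, tzinfo=timezone.utc)
print(find_stale_instances(instances, now))  # prints ['tmp-node-e2e-aaaa-cos']
```

If the stale list is long, the quota problem is mostly a cleanup problem, and moving to another region only buys time.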

@MHBauer
Contributor

MHBauer commented May 27, 2020 via email

@karan
Contributor Author

karan commented Jun 2, 2020

Seems like the quota issue is resolved and I'm seeing 4-6 VMs in the project right now. So it doesn't seem like VMs are being orphaned.

The new issue is around docker not running (for which I'll cut a new issue).

@karan karan closed this as completed Jun 2, 2020
@MHBauer
Contributor

MHBauer commented Jun 4, 2020

I'm not sure this is quite finished. For https://testgrid.k8s.io/sig-node-containerd#pull-node-e2e,
I see the IN_USE_ADDRESSES quota exceeded.

I0604 19:06:58.636] >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
I0604 19:06:58.636] > START TEST >
I0604 19:06:58.636] >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
I0604 19:06:58.637] Start Test Suite on Host
I0604 19:06:58.637]
I0604 19:06:58.637] Failure Finished Test Suite on Host
I0604 19:06:58.637] unable to create gce instance with running docker daemon for image ubuntu-gke-1804-1-15-v20200602. could not create instance tmp-node-e2e-f2333f3a-ubuntu-gke-1804-1-15-v20200602: [&{Code:QUOTA_EXCEEDED Location: Message:Quota 'IN_USE_ADDRESSES' exceeded. Limit: 69.0 in region us-west1. ForceSendFields:[] NullFields:[]}]
I0604 19:06:58.638] <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
I0604 19:06:58.638] < FINISH TEST <
I0604 19:06:58.638] <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
I0604 19:06:58.638]
W0604 19:06:58.739] E0604 19:06:58.635696 5017 run_remote.go:779] Error deleting instance "tmp-node-e2e-f2333f3a-ubuntu-gke-1804-1-15-v20200602": googleapi: Error 404: The resource 'projects/k8s-c8d-pr-node-e2e/zones/us-west1-b/instances/tmp-node-e2e-f2333f3a-ubuntu-gke-1804-1-15-v20200602' was not found, notFound
W0604 19:06:58.836] E0604 19:06:58.836212 5017 run_remote.go:779] Error deleting instance "tmp-node-e2e-f2333f3a-cos-77-12371-284-0": googleapi: Error 404: The resource 'projects/k8s-c8d-pr-node-e2e/zones/us-west1-b/instances/tmp-node-e2e-f2333f3a-cos-77-12371-284-0' was not found, notFound
I0604 19:06:58.937]
I0604 19:06:58.937] >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
I0604 19:06:58.937] > START TEST >
I0604 19:06:58.937] >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
I0604 19:06:58.938] Start Test Suite on Host
I0604 19:06:58.938]
I0604 19:06:58.938] Failure Finished Test Suite on Host
I0604 19:06:58.938] unable to create gce instance with running docker daemon for image cos-77-12371-284-0. could not create instance tmp-node-e2e-f2333f3a-cos-77-12371-284-0: [&{Code:QUOTA_EXCEEDED Location: Message:Quota 'IN_USE_ADDRESSES' exceeded. Limit: 69.0 in region us-west1. ForceSendFields:[] NullFields:[]}]
I0604 19:06:58.938] <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
I0604 19:06:58.939] < FINISH TEST <
I0604 19:06:58.939] <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
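
To see at a glance which quotas a run tripped, error lines like the ones above can be parsed mechanically. This is a minimal Python sketch that assumes the message wording shown in these logs; `parse_quota_errors` is an illustrative name, not part of the test runner.

```python
import re

# Matches e.g. "Quota 'CPUS' exceeded.  Limit: 100.0 in region us-west1"
QUOTA_RE = re.compile(
    r"Quota '(?P<metric>[A-Z_]+)' exceeded\.\s+"
    r"Limit: (?P<limit>[0-9.]+) in region (?P<region>[a-z0-9-]+)"
)


def parse_quota_errors(log_text):
    """Return sorted unique (metric, limit, region) tuples found in a log."""
    found = {
        (m["metric"], float(m["limit"]), m["region"])
        for m in QUOTA_RE.finditer(log_text)
    }
    return sorted(found)


# Message fragments copied from the logs in this thread.
log = (
    "[&{Code:QUOTA_EXCEEDED Location: Message:Quota 'IN_USE_ADDRESSES' "
    "exceeded. Limit: 69.0 in region us-west1. ForceSendFields:[]}]\n"
    "[&{Code:QUOTA_EXCEEDED Location: Message:Quota 'CPUS' "
    "exceeded.  Limit: 100.0 in region us-west1. ForceSendFields:[]}]\n"
    "[&{Code:QUOTA_EXCEEDED Location: Message:Quota 'IN_USE_ADDRESSES' "
    "exceeded. Limit: 69.0 in region us-west1. ForceSendFields:[]}]\n"
)
for metric, limit, region in parse_quota_errors(log):
    print(metric, limit, region)
```

Repeated failures for the same quota collapse into one entry, so the output lists each exhausted quota once.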

@karan
Contributor Author

karan commented Jun 4, 2020

--gcp-project=k8s-c8d-pr-node-e2e

We should cut a new issue IMO and check all projects we use and their IP quota.

If we allow up to 100 VMs in a project, we should have at least that many IPs in quota as well.
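
That check could be scripted per project. The sketch below assumes the quota list comes from the `quotas` field of `gcloud compute regions describe REGION --format=json` (entries with `metric`, `limit`, and `usage`), and that every VM takes one in-use address. The sample usage figures are made up; the 69-address limit is the one from the log above.

```python
def address_quota_shortfall(quotas, max_vms):
    """Return how many IN_USE_ADDRESSES are missing to run max_vms VMs.

    quotas: list of {"metric", "limit", "usage"} dicts for one region.
    Assumes one in-use address per VM. Returns 0 if the quota suffices.
    """
    limits = {q["metric"]: q["limit"] for q in quotas}
    address_limit = limits.get("IN_USE_ADDRESSES", 0)
    return max(0, int(max_vms - address_limit))


# Illustrative quota entries for one region of one project.
quotas = [
    {"metric": "CPUS", "limit": 100.0, "usage": 6.0},
    {"metric": "IN_USE_ADDRESSES", "limit": 69.0, "usage": 69.0},
]
print(address_quota_shortfall(quotas, max_vms=100))  # prints 31
```

Running this over every project and region the jobs use would show where the address quota lags behind the VM count we expect to allow.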

@MHBauer
Contributor

MHBauer commented Jun 4, 2020

I do not have any idea of how to do this project checking.

@dims
Member

dims commented Jun 4, 2020

We should get guidance from @spiffxp and @BenTheElder

@BenTheElder
Member

BenTheElder commented Jun 4, 2020

Aaron and I don't have much to do with this project currently. When SIGs ask for GCE quota, we ask them to use the boskos pools to rent a project for one test run at a time, which makes cleanup very straightforward (destroy all project contents) and capacity planning easy (monitor how full or empty the pools are).

Using a single project is a bit of an anti-pattern.
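
To make the pool idea concrete, here is a toy in-memory model in Python. It is not the Boskos API; the class, method names, and project names are hypothetical and only illustrate the lease, release, and cleanup cycle described above.

```python
class ProjectPool:
    """Toy model of a lease pool of GCP projects (not the Boskos API)."""

    def __init__(self, projects):
        self.free = list(projects)  # ready to be leased
        self.busy = {}              # project -> owner of the lease
        self.dirty = []             # released, awaiting cleanup

    def acquire(self, owner):
        """Lease one free project to a single test run."""
        if not self.free:
            raise RuntimeError("pool exhausted")
        project = self.free.pop(0)
        self.busy[project] = owner
        return project

    def release(self, project):
        """Hand a project back; it must be wiped before reuse."""
        del self.busy[project]
        self.dirty.append(project)

    def clean(self, wipe):
        """Wipe every dirty project and return it to the free list."""
        while self.dirty:
            project = self.dirty.pop(0)
            wipe(project)  # e.g. delete everything in the project
            self.free.append(project)

    def utilization(self):
        """Fraction of projects currently leased."""
        total = len(self.free) + len(self.busy) + len(self.dirty)
        return len(self.busy) / total if total else 0.0


pool = ProjectPool(["proj-a", "proj-b"])  # hypothetical project names
leased = pool.acquire("node-e2e-run-1")
print(leased, pool.utilization())  # prints proj-a 0.5
pool.release(leased)
wiped = []
pool.clean(wiped.append)
print(pool.free)  # prints ['proj-b', 'proj-a']
```

Because each run owns a whole project, cleanup never has to guess which VMs are leftovers, and the utilization figure is the only number needed for capacity planning.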

@MHBauer
Contributor

MHBauer commented Jun 4, 2020

So we need to figure out why it's tied to a specific gcp project, and figure out a way to nullify that requirement. Does that sound right?

@MHBauer
Contributor

MHBauer commented Jun 4, 2020

And move to using a boskos pool for (any?/all?) tests that currently have a specific project/region? Cause yea, it would make sense to me that we don't care where these things run, just that they do.

@MHBauer
Contributor

MHBauer commented Jun 4, 2020

Okay, thanks for the hint:
I see #7769 and I will use #8165 as reference and try to follow up.
