
Deploy webhook by default and enable e2e #2066

Merged
stevenhorsman merged 4 commits into confidential-containers:main from webhook-e2e on Oct 21, 2024

Conversation

bpradipt
Member

No description provided.

@bpradipt bpradipt marked this pull request as ready for review September 27, 2024 14:21
@bpradipt bpradipt requested a review from a team as a code owner September 27, 2024 14:21
@bpradipt
Member Author

@wainersm @stevenhorsman this is an attempt to deploy the webhook as part of the installation, so that the number of peer pods that can be created is constrained by the per-node limit. A Docker provider-specific test is added as well. I can enable the test for other providers, but before doing that I wanted to get your feedback on the approach.
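
To make the constraint concrete, here is a minimal, hypothetical Go sketch of the effect such a mutating webhook has on a peer-pod spec; the extended resource name kata.peerpods.io/vm and the mutation details are assumptions for illustration, not taken from this PR.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// peerPodResource is the extended resource each peer pod is assumed to request.
const peerPodResource corev1.ResourceName = "kata.peerpods.io/vm"

// mutatePeerPod mimics the webhook's effect: it injects a one-unit limit for
// the extended resource, so a node advertising N units in status.allocatable
// can schedule at most N peer pods at a time.
func mutatePeerPod(pod *corev1.Pod) {
	c := &pod.Spec.Containers[0]
	if c.Resources.Limits == nil {
		c.Resources.Limits = corev1.ResourceList{}
	}
	c.Resources.Limits[peerPodResource] = resource.MustParse("1")
}

func main() {
	pod := &corev1.Pod{Spec: corev1.PodSpec{Containers: []corev1.Container{{Name: "app"}}}}
	mutatePeerPod(pod)
	q := pod.Spec.Containers[0].Resources.Limits[peerPodResource]
	fmt.Printf("%s=%s\n", peerPodResource, q.String())
}
```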

@stevenhorsman
Member

Do you think this is needed before the 0.10.0 release, or can it wait until after?

@bpradipt
Member Author

@stevenhorsman this can wait.

Deploy webhook as part of install.

Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
@bpradipt
Member Author

bpradipt commented Oct 9, 2024

@stevenhorsman @wainersm I added a commit to use consts for some test images and to pull the test images from quay.io to avoid the Docker Hub rate limiting. I can create a separate PR as well, but thought of including it here since this PR is already being reviewed.
Let me know if you want me to create a separate PR for the above change.
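
As a rough illustration of that commit's idea (the constant names and quay.io image references below are placeholders, not the actual values from the commit):

```go
// Hypothetical sketch of the "consts for test images" change.
package e2e

const (
	// Pulling from quay.io instead of docker.io avoids Docker Hub's
	// anonymous pull rate limits in CI runs.
	BusyboxTestImage = "quay.io/prometheus/busybox:latest"
	NginxTestImage   = "quay.io/libpod/alpine_nginx:latest"
)
```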

@stevenhorsman
Member

stevenhorsman commented Oct 9, 2024

@stevenhorsman @wainersm I added a commit to use consts for some test images and to pull the test images from quay.io to avoid the Docker Hub rate limiting. I can create a separate PR as well, but thought of including it here since this PR is already being reviewed. Let me know if you want me to create a separate PR for the above change.

I'm not against it, but I'm pretty disappointed that I made a similar change back in July and you didn't approve it, as you wanted them moved out to a separate file, and then when I worked on it and hit problems I was ignored 🤷

@bpradipt
Member Author

bpradipt commented Oct 9, 2024

@stevenhorsman @wainersm I added a commit to use consts for some test images and to pull the test images from quay.io to avoid the Docker Hub rate limiting. I can create a separate PR as well, but thought of including it here since this PR is already being reviewed. Let me know if you want me to create a separate PR for the above change.

I'm not against it, but I'm pretty disappointed that I made a similar change back in July and you didn't approve it, as you wanted them moved out to a separate file, and then when I worked on it and hit problems I was ignored 🤷

Oh, sorry about that. I faced so many issues with the Docker rate limit that I had to manually create and push the image to quay.io. I'll remove the changes.

@stevenhorsman
Member

Oh, sorry about that. I faced so many issues with the Docker rate limit that I had to manually create and push the image to quay.io. I'll remove the changes.

For what it's worth, I think the image updates should go in, and I think they're better than our current code, which is why I made a similar PR. I regret that it was blocked earlier, not that you've done it here.

@bpradipt
Member Author

bpradipt commented Oct 9, 2024

@stevenhorsman @wainersm I added a commit to use consts for some test images and to pull the test images from quay.io to avoid the Docker Hub rate limiting. I can create a separate PR as well, but thought of including it here since this PR is already being reviewed. Let me know if you want me to create a separate PR for the above change.

I'm not against it, but I'm pretty disappointed that I made a similar change back in July and you didn't approve it, as you wanted them moved out to a separate file, and then when I worked on it and hit problems I was ignored 🤷

Oh, sorry about that. I faced so many issues with the Docker rate limit that I had to manually create and push the image to quay.io. I'll remove the changes.

I have removed the commit, and I apologise again. I may have thought about moving the image names to a separate file, but instead added them as constants to the same file.

@bpradipt bpradipt force-pushed the webhook-e2e branch 2 times, most recently from 25e3e74 to 60d8c46 on October 9, 2024 15:04
Member

@stevenhorsman stevenhorsman left a comment

LGTM. Thanks!

@bpradipt bpradipt requested review from mkulke, wainersm and a team October 9, 2024 16:27
Collaborator

@mkulke mkulke left a comment

AFAIU, there is no CI-enabled libvirt test that would tell us whether the changes work as expected. If we install the webhook as a default part of the provisioner, would it make sense to execute a simple test in the libvirt e2e suite?

@bpradipt
Member Author

AFAIU, there is no CI-enabled libvirt test that would tell us whether the changes work as expected. If we install the webhook as a default part of the provisioner, would it make sense to execute a simple test in the libvirt e2e suite?

Yeah, makes sense. I'll add a test for libvirt.

@mkulke mkulke added the test_e2e_libvirt Run Libvirt e2e tests label Oct 16, 2024
@mkulke
Collaborator

mkulke commented Oct 16, 2024

there seems to be a problem with the cert-manager installation, see test run

@stevenhorsman
Member

there seems to be a problem with the cert-manager installation, see test run

I tried running this manually and it seemed to work, so I'll re-run it just in case there was some slowness in the CI.

@stevenhorsman
Member

So in my gh hosted runner trial, I added trace output and got the following error:

time="2024-10-17T12:43:04Z" level=info msg="Installing cert-manager"
time="2024-10-17T12:43:07Z" level=trace msg="/usr/bin/make -C ../webhook deploy-cert-manager, output: make[1]: Entering directory '/home/runner/work/cloud-api-adaptor/cloud-api-adaptor/src/webhook'\ncurl -fsSL -o cmctl [https://github.com/cert-manager/cmctl/releases/latest/download/cmctl_linux_amd64\nchmod](https://github.com/cert-manager/cmctl/releases/latest/download/cmctl_linux_amd64/nchmod) +x cmctl\n# Deploy cert-manager\nkubectl apply -f [https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml\nerror:](https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml/nerror:) error validating \"[https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml\](https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml/)": error validating data: failed to download openapi: Get \"http://localhost:8080/openapi/v2?timeout=32s\": dial tcp [::1]:8080: connect: connection refused; if you choose to ignore these errors, turn validation off with --validate=false\nmake[1]: *** [Makefile:130: deploy-cert-manager] Error 1\nmake[1]: Leaving directory '/home/runner/work/cloud-api-adaptor/cloud-api-adaptor/src/webhook'\n"
F1017 12:43:07.161828   18745 env.go:369] Setup failure: exit status 2
FAIL	github.com/confidential-containers/cloud-api-adaptor/src/cloud-api-adaptor/test/e2e	419.964s

However, I'm not sure if this is helpful, or whether it's the same issue we might be hitting here, as the gh-runner is tight on disk space.

@bpradipt
Member Author

So in my gh hosted runner trial, I added trace output and got the following error:

time="2024-10-17T12:43:04Z" level=info msg="Installing cert-manager"
time="2024-10-17T12:43:07Z" level=trace msg="/usr/bin/make -C ../webhook deploy-cert-manager, output: make[1]: Entering directory '/home/runner/work/cloud-api-adaptor/cloud-api-adaptor/src/webhook'\ncurl -fsSL -o cmctl [https://github.com/cert-manager/cmctl/releases/latest/download/cmctl_linux_amd64\nchmod](https://github.com/cert-manager/cmctl/releases/latest/download/cmctl_linux_amd64/nchmod) +x cmctl\n# Deploy cert-manager\nkubectl apply -f [https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml\nerror:](https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml/nerror:) error validating \"[https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml\](https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml/)": error validating data: failed to download openapi: Get \"http://localhost:8080/openapi/v2?timeout=32s\": dial tcp [::1]:8080: connect: connection refused; if you choose to ignore these errors, turn validation off with --validate=false\nmake[1]: *** [Makefile:130: deploy-cert-manager] Error 1\nmake[1]: Leaving directory '/home/runner/work/cloud-api-adaptor/cloud-api-adaptor/src/webhook'\n"
F1017 12:43:07.161828   18745 env.go:369] Setup failure: exit status 2
FAIL	github.com/confidential-containers/cloud-api-adaptor/src/cloud-api-adaptor/test/e2e	419.964s

However, I'm not sure if this is helpful, or whether it's the same issue we might be hitting here, as the gh-runner is tight on disk space.

Maybe something is different in the way the GitHub Action is running, which is why we only see the error in the GitHub Actions workflow.
I have added a commit to install cert-manager via kcli_cluster.sh. Let's see.

@bpradipt
Member Author

With this different flow, at least the error seems to be clear

mutatingwebhookconfiguration.admissionregistration.k8s.io/cert-manager-webhook created
validatingwebhookconfiguration.admissionregistration.k8s.io/cert-manager-webhook created
error: timed out waiting for the condition on endpoints/cert-manager

Retrying with increased timeout for cert-manager deployment

@bpradipt
Member Author

With this different flow, at least the error seems to be clear

mutatingwebhookconfiguration.admissionregistration.k8s.io/cert-manager-webhook created
validatingwebhookconfiguration.admissionregistration.k8s.io/cert-manager-webhook created
error: timed out waiting for the condition on endpoints/cert-manager

Retrying with increased timeout for cert-manager deployment

The initial install of cert-manager succeeded. However, running make -C ../webhook deploy-cert-manager via provision.go failed. This mostly reapplies the YAMLs, which is pretty quick and should have succeeded. Increased the kubectl wait timeout and rechecking.

@bpradipt
Member Author

@stevenhorsman Now I see this error: "Error from server: error when creating "https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml": etcdserver: request timed out"
Is installing cert-manager overwhelming the k8s cluster, so that the config needs to be changed?

@stevenhorsman
Member

@stevenhorsman Now I see this error: "Error from server: error when creating "https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml": etcdserver: request timed out" Is installing cert-manager overwhelming the k8s cluster, so that the config needs to be changed?

I think it's more likely that the runner is struggling - the default kcli set-up is 4 vCPUs and 6GB RAM for each of the nodes, but the az-ubuntu-2204 runner only has 4 vCPUs available, so maybe the extra stuff means kcli is starting to try and use more than is available?

@bpradipt bpradipt force-pushed the webhook-e2e branch 2 times, most recently from 109dac7 to 6acd28b on October 18, 2024 13:17
@bpradipt
Member Author

@stevenhorsman @mkulke I tried a few options and nothing works in CI with respect to setting up cert-manager with libvirt. It looks like it's something to do with resource availability (see the earlier errors on etcd server timeouts). There are no issues in local runs.
What's the way forward?

@stevenhorsman
Member

stevenhorsman commented Oct 18, 2024

@stevenhorsman @mkulke I tried a few options and nothing works in CI with respect to setting up cert-manager with libvirt. It looks like it's something to do with resource availability (see the earlier errors on etcd server timeouts). There are no issues in local runs. What's the way forward?

Could we try using a Standard_D8_v4 VM instead to see if that helps, as it might help us rule in/out the resource pressure theory? I guess that would need to be updated in garm?

We could also try testing it locally on a 4 vCPU VM to see if we can reproduce the errors there?

@bpradipt
Member Author

@stevenhorsman @mkulke I tried a few options and nothing works in CI with respect to setting up cert-manager with libvirt. It looks like it's something to do with resource availability (see the earlier errors on etcd server timeouts). There are no issues in local runs. What's the way forward?

Could we try using a Standard_D8_v4 VM instead to see if that helps, as it might help us rule in/out the resource pressure theory? I guess that would need to be updated in garm?

We could also try testing it locally on a 4 vCPU VM to see if we can reproduce the errors there?

Do you have the CPU and memory spec of the runner?

@stevenhorsman
Member

Do you have the CPU and memory spec of the runner?

The current runner is: https://cloudprice.net/vm/Standard_D4s_v4

@mkulke
Collaborator

mkulke commented Oct 18, 2024

Yes, I guess we can do that, but I'm not sure we want the instance to be bumped to 2x the size; I think the same runner pool is also used by different jobs. We could create a discrete pool az-ubuntu2204-large or something and switch the runner type on the libvirt workflow.

But that puts our plans to switch to github-hosted runners on hold, I suppose? I'm a bit surprised to see that we cannot make it work with 16GB of RAM.

@stevenhorsman
Member

Yes, I guess we can do that, but I'm not sure we want the instance to be bumped to 2x the size; I think the same runner pool is also used by different jobs. We could create a discrete pool az-ubuntu2204-large or something and switch the runner type on the libvirt workflow.

But that puts our plans to switch to github-hosted runners on hold, I suppose? I'm a bit surprised to see that we cannot make it work with 16GB of RAM.

I don't know whether the CPU or RAM is the bottleneck, so I figured just trying 8 vCPUs and 32GB would let us know if it's resources at all, and then (if we can find matching profiles) trying 8 vCPUs and 16GB and/or 4 vCPUs and 32GB RAM would let us know which one is under pressure. I can try those combinations next week with VMs manually, though, to see if that gives more info.

@mkulke
Collaborator

mkulke commented Oct 21, 2024

I don't know whether the CPU or RAM is the bottleneck, so I figured just trying 8 vCPUs and 32GB would let us know if it's resources at all, and then (if we can find matching profiles) trying 8 vCPUs and 16GB and/or 4 vCPUs and 32GB RAM would let us know which one is under pressure. I can try those combinations next week with VMs manually, though, to see if that gives more info.

I bumped the pool's instance type to Standard_D8s_v4.

@mkulke
Collaborator

mkulke commented Oct 21, 2024

That didn't help when re-running the test. I would suggest copying the workflow, throwing everything CoCo-related out, and just trying to install cert-manager after kubeadm sets up the cluster.


@stevenhorsman
Member

I also checked it locally on a 4 CPU, 16GB VM and the cert-manager install worked fine there, so maybe the resource usage idea is a bust?

@bpradipt
Member Author

I also checked it locally on a 4 CPU, 16GB VM and the cert-manager install worked fine there, so maybe the resource usage idea is a bust?

I have also been unable to recreate this locally.
The reason for guessing it's due to resource availability is the etcd server timeout error:
#2066 (comment)

@mkulke in one of the previous runs, I had installed cert-manager just after the kubeadm cluster install, but the CI provisioner still failed while checking for the service endpoints:
#2066 (comment)

I added more logs and some hacks to retry. Let's see.

@bpradipt
Member Author

I logged the errors instead of using Trace, and see this. It's trying to access localhost, which might indicate an issue with the kubeconfig.

time="2024-10-21T13:55:30Z" level=info msg="Error  in install cert-manager: exit status 2: make[1]: Entering directory '/home/runner/actions-runner/_work/cloud-api-adaptor/cloud-api-adaptor/src/webhook'\ncurl -fsSL -o cmctl [https://github.com/cert-manager/cmctl/releases/latest/download/cmctl_linux_amd64\nchmod](https://github.com/cert-manager/cmctl/releases/latest/download/cmctl_linux_amd64/nchmod) +x cmctl\n# Deploy cert-manager\nkubectl apply -f [https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml\nerror:](https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml/nerror:) error validating \"[https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml\](https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml/)": error validating data: failed to download openapi: Get \"http://localhost:8080/openapi/v2?timeout=32s\": dial tcp 127.0.0.1:8080: connect: connection refused; if you choose to ignore these errors, turn validation off with --validate=false\nmake[1]: *** [Makefile:130: deploy-cert-manager] Error 1\nmake[1]: Leaving directory '/home/runner/actions-runner/_work/cloud-api-adaptor/cloud-api-adaptor/src/webhook'\n"

Enable it for docker provider

Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
TestLibvirtCreateWithCpuAndMemRequestLimit to check if
peerpod resource is added by the webhook

Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
This is specifically for slow systems where the wait time
can be higher

Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
@bpradipt
Member Author

@mkulke @stevenhorsman the KUBECONFIG setting was the culprit. The previous run was successful. Redoing it after tidying up the commits.

@bpradipt bpradipt requested a review from mkulke October 21, 2024 17:42
@stevenhorsman stevenhorsman merged commit ea8518a into confidential-containers:main Oct 21, 2024
28 checks passed
@bpradipt bpradipt deleted the webhook-e2e branch October 22, 2024 05:36