
Deploy webhook by default and enable e2e #2066

Merged
stevenhorsman merged 4 commits into confidential-containers:main from webhook-e2e on Oct 21, 2024

Conversation

bpradipt
Member

No description provided.

@bpradipt bpradipt marked this pull request as ready for review September 27, 2024 14:21
@bpradipt bpradipt requested a review from a team as a code owner September 27, 2024 14:21
@bpradipt
Member Author

@wainersm @stevenhorsman this is an attempt to deploy the webhook as part of the installation, so that the number of peer pods that can be created is constrained by the per-node limit. A Docker provider-specific test is added as well. I can enable the test for other providers, but before doing that I wanted to get your feedback on the approach.
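
To make the constraint concrete, here is a minimal, hypothetical Go sketch of the effect such a mutating webhook has on a peer-pod spec; the extended resource name kata.peerpods.io/vm and the mutation details are assumptions for illustration, not taken from this PR.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// peerPodResource is the extended resource each peer pod is assumed to request.
const peerPodResource corev1.ResourceName = "kata.peerpods.io/vm"

// mutatePeerPod mimics the webhook's effect: it injects a one-unit limit for
// the extended resource, so a node advertising N units in status.allocatable
// can schedule at most N peer pods at a time.
func mutatePeerPod(pod *corev1.Pod) {
	c := &pod.Spec.Containers[0]
	if c.Resources.Limits == nil {
		c.Resources.Limits = corev1.ResourceList{}
	}
	c.Resources.Limits[peerPodResource] = resource.MustParse("1")
}

func main() {
	pod := &corev1.Pod{Spec: corev1.PodSpec{Containers: []corev1.Container{{Name: "app"}}}}
	mutatePeerPod(pod)
	q := pod.Spec.Containers[0].Resources.Limits[peerPodResource]
	fmt.Printf("%s=%s\n", peerPodResource, q.String())
}
```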

@stevenhorsman
Member

Do you think this is needed before the 0.10.0 release, or can it wait until after?

@bpradipt
Member Author

@stevenhorsman this can wait.

Deploy webhook as part of install.

Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
@bpradipt
Member Author

bpradipt commented Oct 9, 2024

@stevenhorsman @wainersm I added a commit to use consts for some test images and to pull the test images from quay.io to avoid the Docker Hub rate limiting. I can create a separate PR as well, but thought of including it here since this PR is already being reviewed.
Let me know if you want me to create a separate PR for the above change.
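
As a rough illustration of that commit's idea (the constant names and quay.io image references below are placeholders, not the actual values from the commit):

```go
// Hypothetical sketch of the "consts for test images" change.
package e2e

const (
	// Pulling from quay.io instead of docker.io avoids Docker Hub's
	// anonymous pull rate limits in CI runs.
	BusyboxTestImage = "quay.io/prometheus/busybox:latest"
	NginxTestImage   = "quay.io/libpod/alpine_nginx:latest"
)
```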

@stevenhorsman
Member

stevenhorsman commented Oct 9, 2024

@stevenhorsman @wainersm I added a commit to use consts for some test images and to pull the test images from quay.io to avoid the Docker Hub rate limiting. I can create a separate PR as well, but thought of including it here since this PR is already being reviewed. Let me know if you want me to create a separate PR for the above change.

I'm not against it, but I'm pretty disappointed that I made a similar change back in July and you didn't approve it, as you wanted them moved out to a separate file, and then when I worked on it and hit problems I was ignored 🤷

@bpradipt
Member Author

bpradipt commented Oct 9, 2024

@stevenhorsman @wainersm I added a commit to use consts for some test images and to pull the test images from quay.io to avoid the Docker Hub rate limiting. I can create a separate PR as well, but thought of including it here since this PR is already being reviewed. Let me know if you want me to create a separate PR for the above change.

I'm not against it, but I'm pretty disappointed that I made a similar change back in July and you didn't approve it, as you wanted them moved out to a separate file, and then when I worked on it and hit problems I was ignored 🤷

Oh, sorry about that. I faced so many issues with the Docker rate limit that I had to manually create and push the image to quay.io. I'll remove the changes.

@stevenhorsman
Member

Oh, sorry about that. I faced so many issues with the Docker rate limit that I had to manually create and push the image to quay.io. I'll remove the changes.

For what it's worth, I think the image updates should go in, and I think they're better than our current code, which is why I made a similar PR. I regret that it was blocked earlier, not that you've done it here.

@bpradipt
Member Author

bpradipt commented Oct 9, 2024

@stevenhorsman @wainersm I added a commit to use consts for some test images and to pull the test images from quay.io to avoid the Docker Hub rate limiting. I can create a separate PR as well, but thought of including it here since this PR is already being reviewed. Let me know if you want me to create a separate PR for the above change.

I'm not against it, but I'm pretty disappointed that I made a similar change back in July and you didn't approve it, as you wanted them moved out to a separate file, and then when I worked on it and hit problems I was ignored 🤷

Oh, sorry about that. I faced so many issues with the Docker rate limit that I had to manually create and push the image to quay.io. I'll remove the changes.

I have removed the commit, and I apologise again. I may have thought about moving the image names to a separate file, but instead added them as constants to the same file.

@bpradipt bpradipt force-pushed the webhook-e2e branch 2 times, most recently from 25e3e74 to 60d8c46 on October 9, 2024 15:04
Member

@stevenhorsman stevenhorsman left a comment

LGTM. Thanks!

@bpradipt bpradipt requested review from mkulke, wainersm and a team October 9, 2024 16:27
Collaborator

@mkulke mkulke left a comment

AFAIU, there is no CI-enabled libvirt test that would tell us whether the changes work as expected. If we install the webhook as a default part of the provisioner, would it make sense to execute a simple test in the libvirt e2e suite?

@bpradipt
Member Author

AFAIU, there is no CI-enabled libvirt test that would tell us whether the changes work as expected. If we install the webhook as a default part of the provisioner, would it make sense to execute a simple test in the libvirt e2e suite?

Yeah, makes sense. I'll add a test for libvirt.

@mkulke mkulke added the test_e2e_libvirt Run Libvirt e2e tests label Oct 16, 2024
@mkulke
Collaborator

mkulke commented Oct 16, 2024

there seems to be a problem with the cert-manager installation, see test run

@stevenhorsman
Member

there seems to be a problem with the cert-manager installation, see test run

I tried running this manually and it seemed to work, so I'll re-run it just in case there was some slowness in the CI.

@stevenhorsman
Member

So in my gh hosted runner trial, I added trace output and got the following error:

time="2024-10-17T12:43:04Z" level=info msg="Installing cert-manager"
time="2024-10-17T12:43:07Z" level=trace msg="/usr/bin/make -C ../webhook deploy-cert-manager, output: make[1]: Entering directory '/home/runner/work/cloud-api-adaptor/cloud-api-adaptor/src/webhook'\ncurl -fsSL -o cmctl [https://github.com/cert-manager/cmctl/releases/latest/download/cmctl_linux_amd64\nchmod](https://github.com/cert-manager/cmctl/releases/latest/download/cmctl_linux_amd64/nchmod) +x cmctl\n# Deploy cert-manager\nkubectl apply -f [https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml\nerror:](https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml/nerror:) error validating \"[https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml\](https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml/)": error validating data: failed to download openapi: Get \"http://localhost:8080/openapi/v2?timeout=32s\": dial tcp [::1]:8080: connect: connection refused; if you choose to ignore these errors, turn validation off with --validate=false\nmake[1]: *** [Makefile:130: deploy-cert-manager] Error 1\nmake[1]: Leaving directory '/home/runner/work/cloud-api-adaptor/cloud-api-adaptor/src/webhook'\n"
F1017 12:43:07.161828   18745 env.go:369] Setup failure: exit status 2
FAIL	github.com/confidential-containers/cloud-api-adaptor/src/cloud-api-adaptor/test/e2e	419.964s

However, I'm not sure if this is helpful, or whether it's the same issue we might be hitting here, as the gh-runner is tight on disk space.

@bpradipt
Member Author

So in my gh hosted runner trial, I added trace output and got the following error:

time="2024-10-17T12:43:04Z" level=info msg="Installing cert-manager"
time="2024-10-17T12:43:07Z" level=trace msg="/usr/bin/make -C ../webhook deploy-cert-manager, output: make[1]: Entering directory '/home/runner/work/cloud-api-adaptor/cloud-api-adaptor/src/webhook'\ncurl -fsSL -o cmctl [https://github.com/cert-manager/cmctl/releases/latest/download/cmctl_linux_amd64\nchmod](https://github.com/cert-manager/cmctl/releases/latest/download/cmctl_linux_amd64/nchmod) +x cmctl\n# Deploy cert-manager\nkubectl apply -f [https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml\nerror:](https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml/nerror:) error validating \"[https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml\](https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml/)": error validating data: failed to download openapi: Get \"http://localhost:8080/openapi/v2?timeout=32s\": dial tcp [::1]:8080: connect: connection refused; if you choose to ignore these errors, turn validation off with --validate=false\nmake[1]: *** [Makefile:130: deploy-cert-manager] Error 1\nmake[1]: Leaving directory '/home/runner/work/cloud-api-adaptor/cloud-api-adaptor/src/webhook'\n"
F1017 12:43:07.161828   18745 env.go:369] Setup failure: exit status 2
FAIL	github.com/confidential-containers/cloud-api-adaptor/src/cloud-api-adaptor/test/e2e	419.964s

However, I'm not sure if this is helpful, or whether it's the same issue we might be hitting here, as the gh-runner is tight on disk space.

Maybe something is different in the way the GitHub Action is running, which is why we only see the error in the GitHub Actions workflow.
I have added a commit to install cert-manager via kcli_cluster.sh. Let's see.

@bpradipt
Member Author

With this different flow, at least the error seems to be clear

mutatingwebhookconfiguration.admissionregistration.k8s.io/cert-manager-webhook created
validatingwebhookconfiguration.admissionregistration.k8s.io/cert-manager-webhook created
error: timed out waiting for the condition on endpoints/cert-manager

Retrying with increased timeout for cert-manager deployment

@bpradipt
Member Author

With this different flow, at least the error seems to be clear

mutatingwebhookconfiguration.admissionregistration.k8s.io/cert-manager-webhook created
validatingwebhookconfiguration.admissionregistration.k8s.io/cert-manager-webhook created
error: timed out waiting for the condition on endpoints/cert-manager

Retrying with increased timeout for cert-manager deployment

The initial install of cert-manager succeeded. However, running make -C ../webhook deploy-cert-manager via provision.go failed. This mostly reapplies the YAMLs, which is pretty quick and should have succeeded. Increased the kubectl wait timeout and rechecking.

@bpradipt
Member Author

@stevenhorsman Now I see this error: "Error from server: error when creating "https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml": etcdserver: request timed out"
Is installing cert-manager overwhelming the k8s cluster, so that the config needs to be changed?

@stevenhorsman
Member

@stevenhorsman Now I see this error: "Error from server: error when creating "https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml": etcdserver: request timed out" Is installing cert-manager overwhelming the k8s cluster, so that the config needs to be changed?

I think it's more likely that the runner is struggling - the default kcli set-up is 4 vCPUs and 6GB RAM for each of the nodes, but the az-ubuntu-2204 runner only has 4 vCPUs available, so maybe the extra stuff means kcli is starting to try and use more than is available?

@bpradipt bpradipt force-pushed the webhook-e2e branch 2 times, most recently from 109dac7 to 6acd28b on October 18, 2024 13:17
@bpradipt
Member Author

@stevenhorsman @mkulke I tried a few options and nothing works in CI with respect to setting up cert-manager with libvirt. It looks like it's something to do with resource availability (see the earlier errors on etcd server timeouts). There are no issues in local runs.
What's the way forward?

@stevenhorsman
Member

stevenhorsman commented Oct 18, 2024

@stevenhorsman @mkulke I tried a few options and nothing works in CI with respect to setting up cert-manager with libvirt. It looks like it's something to do with resource availability (see the earlier errors on etcd server timeouts). There are no issues in local runs. What's the way forward?

Could we try using a Standard_D8_v4 VM instead to see if that helps, as it might help us rule in/out the resource pressure theory? I guess that would need to be updated in garm?

We could also try testing it locally on a 4 vCPU VM to see if we can reproduce the errors there?

@bpradipt
Member Author

@stevenhorsman @mkulke I tried a few options and nothing works in CI with respect to setting up cert-manager with libvirt. It looks like it's something to do with resource availability (see the earlier errors on etcd server timeouts). There are no issues in local runs. What's the way forward?

Could we try using a Standard_D8_v4 VM instead to see if that helps, as it might help us rule in/out the resource pressure theory? I guess that would need to be updated in garm?

We could also try testing it locally on a 4 vCPU VM to see if we can reproduce the errors there?

Do you have the CPU and memory spec of the runner?

@stevenhorsman
Member

Do you have the CPU and memory spec of the runner?

The current runner is: https://cloudprice.net/vm/Standard_D4s_v4

@mkulke
Collaborator

mkulke commented Oct 18, 2024

Yes, I guess we can do that, but I'm not sure we want the instance to be bumped to 2x the size; I think the same runner pool is also used by different jobs. We could create a discrete pool az-ubuntu2204-large or something and switch the runner type on the libvirt workflow.

But that puts our plans to switch to github-hosted runners on hold, I suppose? I'm a bit surprised to see that we cannot make it work with 16GB of RAM.

@stevenhorsman
Member

Yes, I guess we can do that, but I'm not sure we want the instance to be bumped to 2x the size; I think the same runner pool is also used by different jobs. We could create a discrete pool az-ubuntu2204-large or something and switch the runner type on the libvirt workflow.

But that puts our plans to switch to github-hosted runners on hold, I suppose? I'm a bit surprised to see that we cannot make it work with 16GB of RAM.

I don't know whether the CPU or RAM is the bottleneck, so I figured just trying 8 vCPUs and 32GB would let us know if it's resources at all, and then (if we can find matching profiles) trying 8 vCPUs and 16GB and/or 4 vCPUs and 32GB RAM would let us know which one is under pressure. I can try those combinations next week with VMs manually, though, to see if that gives more info.

@mkulke
Collaborator

mkulke commented Oct 21, 2024

I don't know whether the CPU or RAM is the bottleneck, so I figured just trying 8 vCPUs and 32GB would let us know if it's resources at all, and then (if we can find matching profiles) trying 8 vCPUs and 16GB and/or 4 vCPUs and 32GB RAM would let us know which one is under pressure. I can try those combinations next week with VMs manually, though, to see if that gives more info.

I bumped the pool's instance type to Standard_D8s_v4.

@mkulke
Collaborator

mkulke commented Oct 21, 2024

That didn't help when re-running the test. I would suggest copying the workflow, throwing everything CoCo-related out, and just trying to install cert-manager after kubeadm sets up the cluster.


@stevenhorsman
Member

I also checked it locally on a 4 CPU, 16GB VM and the cert-manager install worked fine there, so maybe the resource usage idea is a bust?

@bpradipt
Member Author

I also checked it locally on a 4 CPU, 16GB VM and the cert-manager install worked fine there, so maybe the resource usage idea is a bust?

I have also been unable to recreate this locally.
The reason for guessing it's due to resource availability is the etcd server timeout error:
#2066 (comment)

@mkulke in one of the previous runs, I had installed cert-manager just after the kubeadm cluster install, but the CI provisioner still failed while checking for the service endpoints:
#2066 (comment)

I added more logs and some hacks to retry. Let's see.

@bpradipt
Member Author

I logged the errors instead of using Trace, and see this. It's trying to access localhost, which might indicate an issue with the kubeconfig.

time="2024-10-21T13:55:30Z" level=info msg="Error  in install cert-manager: exit status 2: make[1]: Entering directory '/home/runner/actions-runner/_work/cloud-api-adaptor/cloud-api-adaptor/src/webhook'\ncurl -fsSL -o cmctl [https://github.com/cert-manager/cmctl/releases/latest/download/cmctl_linux_amd64\nchmod](https://github.com/cert-manager/cmctl/releases/latest/download/cmctl_linux_amd64/nchmod) +x cmctl\n# Deploy cert-manager\nkubectl apply -f [https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml\nerror:](https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml/nerror:) error validating \"[https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml\](https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml/)": error validating data: failed to download openapi: Get \"http://localhost:8080/openapi/v2?timeout=32s\": dial tcp 127.0.0.1:8080: connect: connection refused; if you choose to ignore these errors, turn validation off with --validate=false\nmake[1]: *** [Makefile:130: deploy-cert-manager] Error 1\nmake[1]: Leaving directory '/home/runner/actions-runner/_work/cloud-api-adaptor/cloud-api-adaptor/src/webhook'\n"

Enable it for docker provider

Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
TestLibvirtCreateWithCpuAndMemRequestLimit to check if
peerpod resource is added by the webhook

Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
This is specifically for slow systems where the wait time
can be higher

Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
@bpradipt
Member Author

@mkulke @stevenhorsman the KUBECONFIG setting was the culprit. The previous run was successful. Redoing it after tidying up the commits.

@bpradipt bpradipt requested a review from mkulke October 21, 2024 17:42
@stevenhorsman stevenhorsman merged commit ea8518a into confidential-containers:main Oct 21, 2024
28 checks passed
@bpradipt bpradipt deleted the webhook-e2e branch October 22, 2024 05:36