Deploy webhook by default and enable e2e #2066
Conversation
Force-pushed from e3961b1 to 256a0fb
@wainersm @stevenhorsman this is an attempt to deploy the webhook as part of the installation, so that the number of peer pods that can be created is constrained by the per-node limit. A docker-provider-specific test is added as well. I can enable the test for other providers, but before doing so I wanted to get your feedback on the approach.
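For context on how a webhook can enforce a per-node cap: a common approach is for the mutating webhook to replace the container's CPU/memory requests with a pod-VM extended resource, so the scheduler limits peer pods per node to the quantity of that resource the node advertises. A hedged sketch of what a mutated pod spec might look like (the extended-resource name and image are placeholders for illustration; the actual names live in the webhook code):

```yaml
# Hypothetical mutated pod spec: the webhook swaps out CPU/memory
# requests for a per-VM extended resource, so the scheduler caps
# peer pods per node at whatever quantity of that resource the
# node advertises as allocatable.
apiVersion: v1
kind: Pod
metadata:
  name: peer-pod-example
spec:
  runtimeClassName: kata-remote
  containers:
  - name: app
    image: quay.io/example/app:latest   # placeholder image
    resources:
      limits:
        kata.peerpods.io/vm: "1"        # assumed extended-resource name
```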
Do you think this is needed before the 0.10.0 release, or can it wait until after?
@stevenhorsman this can wait.
Deploy webhook as part of install.
Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
@stevenhorsman @wainersm I added a commit to use consts for some test images and to use test images from quay.io to avoid the rate limiting. I can create a separate PR as well, but thought of including it with this one as it's being reviewed.
I'm not against it, but I'm pretty disappointed that I made a similar change back in July and you didn't approve it, as you wanted them moved out to a separate file, and then when I worked on that and hit problems I was ignored 🤷
Oh, sorry about that. I faced so many issues with the Docker rate limit that I had to manually create and push the image to quay.io. I'll remove the changes.
For what it's worth, I think the image updates should go in, and I think they're better than our current code, which is why I made a similar PR. I regret that it was blocked earlier, not that you've done it here.
I have removed the commit, and apologies again. I may have thought about moving the image names to a separate file and instead added them as constants to the same file.
Force-pushed from 25e3e74 to 60d8c46
LGTM. Thanks!
AFAIU, there is no CI-enabled libvirt test that would tell us whether the changes work as expected. If we install the webhook as a default part of the provisioner, would it make sense to execute a simple test in the libvirt e2e suite?
Yeah, makes sense. I'll add a test for libvirt.
There seems to be a problem with the cert-manager installation; see the test run.
I tried running this manually and it seemed to work, so I'll re-run it just in case there was some slowness in the CI.
So in my GitHub-hosted runner trial, I added trace output and got the following error:
However, I'm not sure if this is helpful, or whether it's the same issue we might be hitting here, as the gh-runner is tight on disk space.
Maybe something is different in the way the GitHub Action is running; that's why we see the error only in the GitHub Actions workflow.
With this different flow, at least the error seems to be clearer.
Retrying with an increased timeout for the cert-manager deployment.
Initial install of cert-manager succeeded. However, when running …
@stevenhorsman Now I see this error: "Error from server: error when creating "https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml": etcdserver: request timed out"
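One mitigation for transient etcdserver timeouts like this is to wrap the cert-manager apply in a retry loop with a backoff. A minimal sketch of such a helper (the `retry` function and the `flaky` demo command are hypothetical illustrations, not code from this PR; in CI the command would be the actual `kubectl apply -f .../cert-manager.yaml`):

```shell
# Hypothetical retry helper: run a command up to N times until it succeeds.
retry() {
  attempts=$1; shift
  i=1
  until "$@"; do
    if [ "$i" -ge "$attempts" ]; then
      echo "giving up after $i attempts" >&2
      return 1
    fi
    i=$((i + 1))
    sleep 0   # in CI you would back off, e.g. sleep $((i * 10))
  done
}

# Demo: a command that fails until its third invocation,
# standing in for a kubectl apply hitting transient etcd timeouts.
n=0
flaky() {
  n=$((n + 1))
  [ "$n" -ge 3 ]
}

retry 5 flaky && echo "succeeded after $n attempts"
```

The same wrapper could also surround the wait for the cert-manager webhook endpoints to become ready, which is often the slower step on constrained runners.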
I think it's more likely that the runner is struggling: the default kcli set-up is 4 vCPU and 6 GB RAM for each of the nodes, but the az-ubuntu-2204 runner only has 4 vCPUs available, so maybe the extra stuff means kcli is starting to try and use more than is available?
Force-pushed from 109dac7 to 6acd28b
@stevenhorsman @mkulke I tried a few options and nothing works in CI w.r.t. setting up cert-manager with libvirt. It looks like it's something to do with resource availability (see the earlier errors on etcd server timeouts). There are no issues in local runs.
Could we try using a Standard_D8_v4 VM instead to see if that helps? It might help us rule the resource-pressure theory in or out. I guess that would need to be updated in garm? We could also try testing it locally on a 4 vCPU VM to see if we can reproduce the errors there.
Do you have the CPU and memory spec of the runner?
The current runner is: https://cloudprice.net/vm/Standard_D4s_v4
Yes, I guess we can do that, but I'm not sure we want the instance to be bumped to 2x the size; I think the same runner pool is also used by different jobs. We could create a discrete pool, az-ubuntu2204-large or something, and switch the runner type on the libvirt workflow. But that puts our plans to switch to GitHub-hosted runners on hold, I suppose? I'm a bit surprised to see that we cannot make it work with 16 GB of RAM.
I don't know whether the CPU or the RAM is the bottleneck, so I figured just trying 8 vCPU and 32 GB would let us know if it's resources at all, and then (if we can find matching profiles) trying 8 vCPU/16 GB and/or 4 vCPU/32 GB would let us know which one is under pressure. I can try those combinations next week with VMs manually, though, to see if that gives more info.
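To make the resource-pressure theory concrete: with the default kcli sizing quoted earlier in the thread (4 vCPU / 6 GB per node) even two nodes already oversubscribe the Standard_D4s_v4 runner's 4 vCPUs, while on paper fitting into its 16 GB of RAM. A back-of-envelope sketch (the node count of 2 is an assumption; the per-node and runner figures are the ones quoted above):

```shell
# Rough capacity check: does the default kcli cluster fit the runner?
NODES=2             # assumed: one control-plane + one worker
VCPU_PER_NODE=4     # default kcli sizing from the thread
MEM_PER_NODE_GB=6
RUNNER_VCPU=4       # Standard_D4s_v4
RUNNER_MEM_GB=16

need_vcpu=$((NODES * VCPU_PER_NODE))
need_mem_gb=$((NODES * MEM_PER_NODE_GB))
echo "cluster wants ${need_vcpu} vCPU / ${need_mem_gb} GB; runner has ${RUNNER_VCPU} vCPU / ${RUNNER_MEM_GB} GB"

if [ "$need_vcpu" -gt "$RUNNER_VCPU" ]; then
  echo "vCPU oversubscribed by $((need_vcpu - RUNNER_VCPU))"
fi
if [ "$need_mem_gb" -gt "$RUNNER_MEM_GB" ]; then
  echo "memory oversubscribed by $((need_mem_gb - RUNNER_MEM_GB)) GB"
fi
```

CPU oversubscription alone usually just slows things down, but combined with etcd's latency-sensitive fsync path it could plausibly produce the request timeouts seen above.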
I bumped the pool's instance type to Standard_D8s_v4.
I also checked it on a 4 CPU, 16 GB VM locally, and the cert-manager install worked fine there, so maybe the resource-usage idea is a bust?
I have also been unable to recreate this locally. @mkulke, in one of the previous runs I had installed cert-manager just after the kubeadm cluster install, but the CI provisioner still failed while checking for the service endpoints. I've added more logs and some hacks to retry. Let's see.
I logged the errors instead of using Trace, and saw this. It's trying to access localhost, which might indicate an issue with the kubeconfig.
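The localhost symptom can be checked directly by inspecting which API server the kubeconfig points at. A minimal sketch (the demo kubeconfig content and file name are made up for illustration; in CI you would point the check at the provisioner's real kubeconfig):

```shell
# Sketch: detect a kubeconfig whose API server points at localhost
# instead of the cluster VM's address.
KUBECONFIG_FILE=demo-kubeconfig.yaml

# Made-up kubeconfig fragment for illustration only.
cat > "$KUBECONFIG_FILE" <<'EOF'
clusters:
- cluster:
    server: https://127.0.0.1:6443
  name: peer-pods
EOF

server=$(grep '[[:space:]]*server:' "$KUBECONFIG_FILE" | awk '{print $2}')
echo "API server: $server"
case "$server" in
  *127.0.0.1*|*localhost*)
    echo "WARNING: kubeconfig points at localhost, not the cluster VM" ;;
esac
rm -f "$KUBECONFIG_FILE"
```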
Enable it for docker provider
Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
TestLibvirtCreateWithCpuAndMemRequestLimit to check if the peerpod resource is added by the webhook
Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
This is specifically for slow systems where the wait time can be higher Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
@mkulke @stevenhorsman the KUBECONFIG setting was the culprit. The previous run was successful. Redoing it again after tidying up the commits.