fix(lxc): race condition in container clone / reboot operations #2320
Conversation
add waiting for status "running" after update that triggered reboot

Signed-off-by: Stanislav Shamilov <shamilovstas@protonmail.com>
It seems that some tests running in parallel and using the same container were causing some of the lock timeouts on the Proxmox host used for testing. This commit splits a single test, parts of which were run in parallel and used the same container, into several independent tests using their own containers. This should reduce the overall flakiness of the tests.

Besides, a few tests were fixed. The tests for ipv4/ipv6 were using the Linux bridge "vmbr1", which wasn't created beforehand. Some of the WaitForContainerStatus calls done after updating the container were also removed, since the previous commit introduced a wait in the `updateContainer` method for the container to finish rebooting.

Also, the template used for running containers was changed from Ubuntu 24.04 to Alpine 3.22. This makes downloading the template faster, as the Alpine template is much smaller.

Signed-off-by: Stanislav Shamilov <shamilovstas@protonmail.com>
```go
	if e := containerAPI.WaitForContainerStatus(ctx, "running"); e != nil {
		return diag.FromErr(e)
	}
}
```
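For reference, a minimal sketch of where such a wait could sit in the resource's update path; the `containerClient` interface and its method signatures below are assumptions for illustration, not the provider's actual client:

```go
package main

import (
	"context"

	"github.com/hashicorp/terraform-plugin-sdk/v2/diag"
)

// containerClient is a hypothetical stand-in for the provider's container
// API client; only the two calls used below are modelled.
type containerClient interface {
	RebootContainer(ctx context.Context) error
	WaitForContainerStatus(ctx context.Context, status string) error
}

// rebootAndWait triggers a reboot and blocks until the container reports
// "running" again, so the next operation in the same apply cannot race a
// still-running reboot and hit the container's config lock.
func rebootAndWait(ctx context.Context, api containerClient) diag.Diagnostics {
	if err := api.RebootContainer(ctx); err != nil {
		return diag.FromErr(err)
	}
	if err := api.WaitForContainerStatus(ctx, "running"); err != nil {
		return diag.FromErr(err)
	}
	return nil
}
```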
I tried that before, but it didn't change anything: the container is reported as "running" almost immediately after the reboot call. I guess it depends on the PVE host, though. On slower hosts / slower IO, or in some edge cases, that might actually help.
So do you think I should leave it or delete it?
If this adds stability in your tests, then let's leave it.
still racy tho:
=== CONT TestAccResourceContainerIpv4Ipv6
=== CONT TestAccResourceContainerMountPoint
=== NAME TestAccResourceContainerDnsBlock
resource_container_test.go:329: Step 1/6 error: Error running apply: exit status 1
Error: error starting container: error starting container: received an HTTP 500 response - Reason: CT 121253 already running
with proxmox_virtual_environment_container.test_container,
on terraform_plugin_test.tf line 25, in resource "proxmox_virtual_environment_container" "test_container":
25: resource "proxmox_virtual_environment_container" "test_container" {
--- PASS: TestAccResourceContainerIpv4Ipv6 (6.61s)
--- PASS: TestAccResourceContainerMountPoint (10.95s)
=== NAME TestAccResourceContainer
resource_container_test.go:50: Step 3/3 error: Error running apply: exit status 1
Error: error updating container: received an HTTP 500 response - Reason: can't lock file '/run/lock/lxc/pve-config-135546.lock' - got timeout
with proxmox_virtual_environment_container.test_container,
on terraform_plugin_test.tf line 25, in resource "proxmox_virtual_environment_container" "test_container":
25: resource "proxmox_virtual_environment_container" "test_container" {
=== NAME TestAccResourceContainerHostname
resource_container_test.go:565: Error running post-test destroy, there may be dangling resources: exit status 1
Error: error waiting for container shut down: task "UPID:pve:001BABE4:02948904:69132015:vzshutdown:104154:root@pam:" failed to complete with exit code: can't lock file '/run/lock/lxc/pve-config-104154.lock' - got timeout
--- FAIL: TestAccResourceContainerHostname (21.57s)
--- FAIL: TestAccResourceContainerDnsBlock (21.97s)
--- PASS: TestAccResourceContainerClone (27.57s)
I think WaitForContainerStatus(...) is still needed in tests. It's possible that when we call it right after a reboot, the actual reboot hasn't completed yet; it seems to be asynchronous. So the call in the update might return "running" for a container that hasn't actually been rebooted just yet.
No harm in keeping it there tho.
I'll make the changes in this PR as I'm testing this branch atm.
As for WaitForContainerStatus, I removed it from the tests, as the "update" method of the resource now waits for the container if it was rebooted.
As for the flakiness thing, I described it here: #2313 (comment)
Overall, my changes didn't remove the flakiness completely, but they seem to have reduced it. Before them, I couldn't run the tests successfully even once. The remaining flakiness might be a symptom of some bug in the "update" method of the resource, but I'm not sure about that.
Alright, the proper fix here is to wait for the reboot / create / etc. task to complete.
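A minimal sketch of that approach, assuming a hypothetical `taskClient` poller (PVE tasks are addressed by a UPID and report status "running" until they stop, then expose an exit status such as "OK"; the provider's actual tasks API may differ):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// taskClient is a hypothetical stand-in for the provider's tasks API.
type taskClient interface {
	// GetTaskStatus returns the task's status ("running" or "stopped")
	// and, once stopped, its exit status.
	GetTaskStatus(ctx context.Context, upid string) (status, exitStatus string, err error)
}

// waitForTask polls a PVE task UPID until it stops, then checks its exit
// status, so reboot / create / clone operations are known to be finished
// before the next call takes the container's config lock.
func waitForTask(ctx context.Context, api taskClient, upid string) error {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()

	for {
		status, exit, err := api.GetTaskStatus(ctx, upid)
		if err != nil {
			return err
		}
		if status != "running" {
			if exit != "OK" {
				return fmt.Errorf("task %s failed: %s", upid, exit)
			}
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}
```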
Signed-off-by: Pavel Boldyrev <627562+bpg@users.noreply.github.com>
bpg left a comment
Thanks for raising this @shamilovstas!
I've played with the tests a bit and I think I've fixed the root cause of the flakiness -- a missing wait on the container's reboot / clone tasks.
Seems to be working well now 🤞🏼
LGTM! 🚀
Contributor's Note
- `/docs` for any user-facing features or additions.
- `/fwprovider/tests` for any new or updated resources / data sources.
- `make example` to verify that the change works as expected.

It seems that some tests running in parallel and using the same container were causing some of the lock timeouts on the Proxmox host used for testing. The single acceptance test, parts of which were run in parallel and used the same container, was split into several independent tests using their own containers. This should reduce the overall flakiness of the tests.

Besides, a few tests were fixed. The tests for ipv4/ipv6 were using the Linux bridge "vmbr1", which wasn't created beforehand. Some of the WaitForContainerStatus calls done after updating the container were also removed, since the `containerUpdate` method now waits for the container to be rebooted if a reboot was triggered.

Also, the template used for running containers was changed from Ubuntu 24.04 to Alpine 3.22. This makes downloading the template faster, as the Alpine template is much smaller.
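As a sketch of the splitting pattern described above (the test names, the `testContainer` helper, and the VM IDs are illustrative, not the PR's actual code): each test owns its own container, so tests marked `t.Parallel()` never contend for the same `/run/lock/lxc/pve-config-<id>.lock`.

```go
package tests

import "testing"

// testContainer is a hypothetical helper that would create a container
// with the given unique VM ID and run the acceptance steps against it.
func testContainer(t *testing.T, vmID int) {
	t.Helper()
	_ = vmID // acceptance steps elided in this sketch
}

// Each test operates on a distinct CT, so parallel runs never take the
// same config lock on the PVE host.
func TestAccResourceContainerDNS(t *testing.T) {
	t.Parallel()
	testContainer(t, 100101)
}

func TestAccResourceContainerHostname(t *testing.T) {
	t.Parallel()
	testContainer(t, 100102)
}
```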
Proof Of Work
Community Note
Closes #2313