Skip to content

Conversation

@shamilovstas
Copy link
Contributor

@shamilovstas shamilovstas commented Nov 4, 2025

Contributor's Note

  • I have added / updated documentation in /docs for any user-facing features or additions.
  • I have added / updated acceptance tests in /fwprovider/tests for any new or updated resources / data sources.
  • I have ran make example to verify that the change works as expected.

It seems that some tests running in parallel and using the same container were causing some of the lock timeouts on the proxmox host used for testing. The single acceptance test, parts of which were run in parallel and used the same container, into several independent tests using their own containers. This should reduce the overall flakiness of the tests.

Besides, a few tests were fixed. The tests for ipv4/ipv6 were using linux bridge "vmbr1", which wasn't created beforehand.
Also, removed some of the WaitForContainerStatus calls done after updating the container since containerUpdate method now waits for the container to be rebooted if the reboot was triggered.

Also, the template used for running containers was changed from Ubuntu 24.04 to Alpine 3.22. This makes the downloading of the template faster as the Alpine template is much smaller.

Proof Of Work

❯ for i in {1..10}; do echo "Run $i/10"; ./testacc TestAccResourceContainer; done
Run 1/10
ok  	github.com/bpg/terraform-provider-proxmox/fwprovider/test	54.947s
Run 2/10
ok  	github.com/bpg/terraform-provider-proxmox/fwprovider/test	55.139s
Run 3/10
ok  	github.com/bpg/terraform-provider-proxmox/fwprovider/test	52.196s
Run 4/10
ok  	github.com/bpg/terraform-provider-proxmox/fwprovider/test	46.814s
Run 5/10
ok  	github.com/bpg/terraform-provider-proxmox/fwprovider/test	51.870s
Run 6/10
ok  	github.com/bpg/terraform-provider-proxmox/fwprovider/test	53.305s
Run 7/10
ok  	github.com/bpg/terraform-provider-proxmox/fwprovider/test	51.909s
Run 8/10
ok  	github.com/bpg/terraform-provider-proxmox/fwprovider/test	49.757s
Run 9/10
ok  	github.com/bpg/terraform-provider-proxmox/fwprovider/test	51.239s
Run 10/10
ok  	github.com/bpg/terraform-provider-proxmox/fwprovider/test	55.078s

Community Note

  • Please vote on this pull request by adding a 👍 reaction to the original pull request comment to help the community and maintainers prioritize this request
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for pull request followers and do not help prioritize the request

Closes #2313

add waiting for status "running" after update that triggered reboot

Signed-off-by: Stanislav Shamilov <shamilovstas@protonmail.com>
It seems that some tests running in parallel and using the same container were causing some of the lock timeouts on the proxmox host used for testing. This commit splits a single test, parts of which were run in parallel and used the same container, into several independent tests using their own containers. This should reduce the overall flakiness of the tests.
Besides, a few tests were fixed. The tests for ipv4/ipv6 were using linux bridge "vmbr1", which wasn't created beforehand. Also, removed some of the WaitForContainerStatus calls done after updating the container since previous commit introduced a wait in the `updateContainer` method which waits for the container to be rebooted.
Also, the template used for running containers was changed from Ubuntu 24.04 to Alpine 3.22. This makes the downloading of the template faster as the Alpine template is much smaller.

Signed-off-by: Stanislav Shamilov <shamilovstas@protonmail.com>
if e := containerAPI.WaitForContainerStatus(ctx, "running"); e != nil {
return diag.FromErr(e)
}
}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried that before, but it didn’t change anything, the container is reported as “running” almost immediately after the reboot call. I guess it depends on the PVE host though. On slower hosts / IO or in some edge cases that might actually help.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So do you think I should leave it or delete it?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this adds stability in your tests, then let's leave it.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still racy tho:

=== CONT  TestAccResourceContainerIpv4Ipv6
=== CONT  TestAccResourceContainerMountPoint
=== NAME  TestAccResourceContainerDnsBlock
    resource_container_test.go:329: Step 1/6 error: Error running apply: exit status 1

        Error: error starting container: error starting container: received an HTTP 500 response - Reason: CT 121253 already running

          with proxmox_virtual_environment_container.test_container,
          on terraform_plugin_test.tf line 25, in resource "proxmox_virtual_environment_container" "test_container":
          25: 				resource "proxmox_virtual_environment_container" "test_container" {

--- PASS: TestAccResourceContainerIpv4Ipv6 (6.61s)
--- PASS: TestAccResourceContainerMountPoint (10.95s)
=== NAME  TestAccResourceContainer
    resource_container_test.go:50: Step 3/3 error: Error running apply: exit status 1

        Error: error updating container: received an HTTP 500 response - Reason: can't lock file '/run/lock/lxc/pve-config-135546.lock' - got timeout

          with proxmox_virtual_environment_container.test_container,
          on terraform_plugin_test.tf line 25, in resource "proxmox_virtual_environment_container" "test_container":
          25: 				resource "proxmox_virtual_environment_container" "test_container" {

=== NAME  TestAccResourceContainerHostname
    resource_container_test.go:565: Error running post-test destroy, there may be dangling resources: exit status 1

        Error: error waiting for container shut down: task "UPID:pve:001BABE4:02948904:69132015:vzshutdown:104154:root@pam:" failed to complete with exit code: can't lock file '/run/lock/lxc/pve-config-104154.lock' - got timeout

--- FAIL: TestAccResourceContainerHostname (21.57s)
--- FAIL: TestAccResourceContainerDnsBlock (21.97s)
--- PASS: TestAccResourceContainerClone (27.57s)

I think WaitForContainerStatus(...) is still needed in tests. It's possible that when we call it right after a reboot, the actual reboot hasn't completed yet, it seems to be asynchronous. So the call in the update might return "running" for a container that hasn't actually been rebooted just yet.

No harm in keeping it there tho.

I'll make the changes in this PR as I'm testing this branch atm.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for the WaitContainerForStatus, I removed it from the tests as the "update" method of the resource now waits for the container if it was rebooted.

As for the flakiness thing, I described it here: #2313 (comment)

Copy link
Contributor Author

@shamilovstas shamilovstas Nov 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, my changes didn't remove flakiness completely, but seem to have reduced it. Before it, I couldn't run the tests successfully even once. The remaining flakiness might be a symptom of some bug in "update" method of the resource, but I'm not sure about that

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alright, the proper fix here is to wait for reboot / create / etc task to complete

@shamilovstas shamilovstas marked this pull request as ready for review November 11, 2025 11:20
@bpg bpg changed the title Lxc acceptance tests fix(lxc): race condition in container create / update operations Nov 11, 2025
@bpg bpg changed the title fix(lxc): race condition in container create / update operations fix(lxc): race condition in container clone / reboot operations Nov 11, 2025
Signed-off-by: Pavel Boldyrev <627562+bpg@users.noreply.github.com>
bpg
bpg previously approved these changes Nov 11, 2025
Copy link
Owner

@bpg bpg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for raising this @shamilovstas!

I've played with the tests a bit and I think I've fixed the root cause of the flakiness -- a missing wait on container's reboot / clones tasks.
Seems to be working well now 🤞🏼

LGTM! 🚀

Signed-off-by: Pavel Boldyrev <627562+bpg@users.noreply.github.com>
@bpg bpg merged commit ce358d5 into bpg:main Nov 11, 2025
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

LXC flaky acceptance tests

2 participants