
2.9.13 - Terraform hangs on "still creating" when deploying multiple VMs #705

Closed
deep-blue-pulsar opened this issue Feb 27, 2023 · 14 comments


@deep-blue-pulsar

Hi folks. Apologies if I'm missing any information, as I'm still learning the ropes with Terraform. I've been trying for the past day to deploy 4 cloud-init VMs to Proxmox, but no matter how much I tweak things I can't get the provider to progress further than the first VM.

This is what my main.tf looks like:

resource "proxmox_vm_qemu" "control_plane" {
  count             = 1
  name              = "control-plane-${count.index}"
  target_node       = "${var.pm_node}"

  clone             = "ubuntu-2004-cloudinit-template"
  full_clone        = "true"

  os_type           = "cloud-init"
  cores             = 4
  sockets           = "1"
  cpu               = "host"
  memory            = 2048
  scsihw            = "virtio-scsi-pci"
  bootdisk          = "scsi0"
  agent             = 1

  disk {
    slot            = 0
    size            = "20G"
    type            = "scsi"
    storage         = "local-lvm"
    iothread        = 1
  }

  network {
    model           = "virtio"
    bridge          = "vmbr0"
    tag             = 20
  }

  # cloud-init settings
  ipconfig0         = "ip=10.10.20.3${count.index}/24,gw=10.10.20.1"
  nameserver        = "10.10.20.52,10.10.20.53"
  sshkeys = file("${var.ssh_key_file}")
}

resource "proxmox_vm_qemu" "worker_nodes" {
  count             = 3
  name              = "worker-${count.index}"
  target_node       = "${var.pm_node}"

  clone             = "ubuntu-2004-cloudinit-template"
  full_clone        = "true"

  os_type           = "cloud-init"
  cores             = 4
  sockets           = "1"
  cpu               = "host"
  memory            = 4096
  scsihw            = "virtio-scsi-pci"
  bootdisk          = "scsi0"

  disk {
    slot            = 0
    size            = "20G"
    type            = "scsi"
    storage         = "local-lvm"
    iothread        = 1
  }

  network {
    model           = "virtio"
    bridge          = "vmbr0"
    tag             = 20
  }

  # cloud-init settings
  ipconfig0         = "ip=10.10.20.4${count.index}/24,gw=10.10.20.1"
  nameserver        = "10.10.20.52,10.10.20.53"
  sshkeys = file("${var.ssh_key_file}")
}

I've provisioned the template to have the qemu-guest-agent installed. Whenever I run the apply, it starts creating the first VM but never gets past it. Also, when it tries to boot the VM, it does so 3 times and errors out because the VM is already running from the first request it sent:

2023-02-27T18:16:35.110-0500 [ERROR] provider.terraform-provider-proxmox_v2.9.13: Response contains error diagnostic: @caller=github.com/hashicorp/terraform-plugin-go@v0.14.3/tfprotov5/internal/diag/diagnostics.go:55 diagnostic_summary="VM 104 already running" tf_req_id=5907450b-2183-b8a3-9f60-8ed35e8aa6fb tf_provider_addr=registry.terraform.io/telmate/proxmox tf_resource_type=proxmox_vm_qemu @module=sdk.proto diagnostic_detail= diagnostic_severity=ERROR tf_proto_version=5.3 tf_rpc=ApplyResourceChange timestamp=2023-02-27T18:16:35.110-0500
2023-02-27T18:16:35.120-0500 [ERROR] vertex "proxmox_vm_qemu.worker_nodes[0]" error: VM 104 already running

Anyone else having a similar issue?

@mantony9000

mantony9000 commented Mar 5, 2023

yep, this is a huge blocker for both lxc and vm creation atm,
what version of proxmox and terraform are you using?
I'm using proxmox 7.1-7
and terraform 1.3.3

module.proxmox.proxmox_lxc.pihole: Still creating... [10s elapsed]
module.proxmox.proxmox_lxc.pihole: Still creating... [20s elapsed]
module.proxmox.proxmox_lxc.pihole: Still creating... [30s elapsed]
module.proxmox.proxmox_lxc.pihole: Still creating... [40s elapsed]
module.proxmox.proxmox_lxc.pihole: Still creating... [50s elapsed]
module.proxmox.proxmox_lxc.pihole: Still creating... [1m0s elapsed]
module.proxmox.proxmox_lxc.pihole: Still creating... [1m10s elapsed]
module.proxmox.proxmox_lxc.pihole: Still creating... [1m20s elapsed]
module.proxmox.proxmox_lxc.pihole: Still creating... [1m30s elapsed]
module.proxmox.proxmox_lxc.pihole: Still creating... [1m40s elapsed]
module.proxmox.proxmox_lxc.pihole: Still creating... [1m50s elapsed]
module.proxmox.proxmox_lxc.pihole: Still creating... [2m0s elapsed]


getting the same for vm creation

@mantony9000

mantony9000 commented Mar 5, 2023

I'm a moron; change the template and try again and it'll work.
Mine was failing because I was using the Alpine template (the provisioners failed due to no SSH).

Use an updated template: https://github.com/TechByTheNerd/cloud-image-for-proxmox/tree/main/ubuntu

@rterbush

rterbush commented Mar 22, 2023

This actually appears to be a real problem. I am unable to deploy more than one container resource, seemingly because Terraform runs resource creation in parallel by default and PVE cannot handle multiple requests at the same time. Perhaps a lock issue on the clone source?

Here is an interesting thread, putting the responsibility on the provider to manage this.
hashicorp/terraform-plugin-sdk#67

I've tried both for_each and count to do this. The only way to get this to work, at least partially, is by running Terraform with --parallelism=1 (exact command below).

Would love to be proved wrong here.
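
For reference, the invocation that at least partially works for me is just the stock Terraform CLI flag, nothing provider specific:

terraform apply --parallelism=1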

resource "proxmox_lxc" "data_lxc" {
  count         = 2
  hostname      = "data-${count.index + 1}"
  target_node   = "pve3"
  password      = var.container_password
  clone         = 501
  full          = true
  cores         = 4
  memory        = 2048
  swap          = 1024
  start         = true
  onboot        = true
  unprivileged  = true
  hastate       = "ignored"
  vmid          = count.index + 221

  rootfs {
    storage = "containers"
    size = "10G"
  }

  mountpoint {
    key     = "0"
    slot    = 0
    storage = "salt-states"
    mp      = "/srv"
    size    = "100M"
    shared	= true
  }

  mountpoint {
    key     = "1"
    slot    = 1
    storage = "data-library"
    mp      = "/mnt/data"
    size    = "100G"
    shared	= true
  }

  network {
    name = "eth0"
    bridge = "vmbr0"
    ip = "10.10.9.${count.index + 221}/24"
    gw = "10.10.9.254"
  }

  provisioner "remote-exec" {
    script = "provision.sh"

    connection {
      type = "ssh"
      user = "root"
      host = "10.10.9.${count.index + 221}"
      private_key = var.ssh_private_key
    }
  }
}

@rterbush

Just discovered the pm_parallel parameter (sketch below). It does avoid the need to set parallelism on the command line, but the bigger issue is not being able to run these container deployments in parallel at all. Not sure if a delay might solve this, or a workaround that gets away from using clone.
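
For anyone looking for it, this is roughly where the setting lives; a minimal sketch of the provider block, with the API URL as a placeholder and credentials elided:

provider "proxmox" {
  pm_api_url  = "https://pve.example.local:8006/api2/json"  # placeholder
  # pm_user / pm_password or an API token go here as usual
  pm_parallel = 1
}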

@rterbush

rterbush commented Mar 22, 2023

And I can confirm that pm_parallel does not work as expected. #310
Also related #173

@mantony9000

mantony9000 commented Mar 26, 2023

@rterbush you cannot clone in parallel: https://forum.proxmox.com/threads/parallel-cloning.75902/
Proxmox needs to lock the storage during a clone. Might be a Proxmox API limitation?

@rterbush

rterbush commented Mar 26, 2023

@CaptainPizzaPirate thanks for the link. It does explain the issue on the Proxmox side.

The point of my earlier comment is that setting pm_parallel in the Terraform config does not limit parallel processing of the deployment. It seems the --parallelism=1 command line flag is required to get this to work, so I too (as did the OP) question whether the provider setting works as intended.

Actually, it was the poster in one of the other referenced issues who made the statement that they did not believe this works as intended.

@mantony9000

@rterbush & @deep-blue-pulsar
I understand that pm_parallel is having issues;
however, can you confirm whether the qemu-guest-agent is installed on your disk image?
Please add these vars to your provider block to enable the debug trace:

  pm_log_file         = "terraform-provider-proxmox.log"
  pm_debug            = true
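
To be clear, they sit inside the existing provider block, roughly like this (your pm_api_url and credentials stay as they are):

provider "proxmox" {
  # ...existing pm_api_url and credentials...
  pm_debug    = true
  pm_log_file = "terraform-provider-proxmox.log"
}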

Can you then post the log trace from the file terraform-provider-proxmox.log?

@rterbush

@CaptainPizzaPirate in my case, I am deploying lxc containers, so not applicable.

@mantony9000

mantony9000 commented Apr 30, 2023

@rterbush I understand that; however, if the clone template does not have the agent, it will also stall. You can verify this by enabling the logs to confirm it's not actually an image issue.
It can also stall if your image does not have OpenSSH. We have no idea what container template you are using; for example, if it's Alpine, it does not ship with SSH and will stall too, because it can't run the provisioners:

clone = 501

@rterbush

@CaptainPizzaPirate maybe I am not understanding you completely.

I am not running QEMU in these deployments. The containers are cloned from LXC templates, so there is no QEMU involved and no qemu agent to install and run.

A lot of water has gone under the bridge since I reported/confirmed this behavior of not being able to control parallelism via the proxmox provider config. I acknowledge that there is a limitation in the Proxmox API with regard to locking clones. However, setting pm_parallel=1 does not work around this issue; I must set --parallelism=1 at the terraform command line. https://registry.terraform.io/providers/Telmate/proxmox/2.7.4/docs#pm_parallel

My config has also changed a lot in order to work around other bugs such as #753, so I am now also specifying hwaddr for each container provisioned (roughly as in the sketch below). It's not clear if I can easily recreate this without substantially reverting my config. I will give it a try without the command line flag and report back if I can recreate it. But again, this has less to do with the container config and more to do with an upstream issue in the provider itself.
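
For context, the hwaddr workaround looks roughly like this inside each container's network block; the MAC scheme here is purely illustrative:

  network {
    name   = "eth0"
    bridge = "vmbr0"
    ip     = "10.10.9.${count.index + 221}/24"
    gw     = "10.10.9.254"
    hwaddr = format("AA:DE:AD:BE:EF:%02X", count.index + 221)  # illustrative locally-administered MAC per container
  }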

@mantony9000

@rterbush
I'm saying there's something wrong with the LXC template, and the logs will confirm that.

@github-actions

This issue is stale because it has been open for 60 days with no activity. Please update the provider to the latest version and, if the issue persists, provide full configuration and debug logs.

@github-actions

github-actions bot commented Jul 6, 2023

This issue was closed because it has been inactive for 5 days since being marked as stale.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Jul 6, 2023.