
Create multiple Azure VMs with async: some VMs get the error "Failed to determine PowerState of virtual machine" #1365

Closed
dgiesselbach opened this issue Dec 8, 2023 · 7 comments · Fixed by #1412
Labels
has_pr (PR fixes have been made) · medium_priority (Medium priority) · question (Further information is requested)

Comments


dgiesselbach commented Dec 8, 2023

SUMMARY

We create several Azure VMs at the same time (via `loop` with `loop_control: pause: 2`) using `async: 1800` and `poll: 0` (fire and forget), and then check with `ansible.builtin.async_status` whether the VMs have finished deploying.

When we deploy 250 VMs, at least 3 of them return the message "Failed to determine PowerState of virtual machine <vm name>". Unfortunately, the automatic retry does not work this way, because the error message is reported in `msg` rather than under `stderr_lines` or `stderr`.

Is this a bug? Could this message be moved to `stderr_lines`/`stderr` so that the async retry would kick in and capture the VM status in the next API request?
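(As a stop-gap while the message only appears in `msg`, one can make the failure visible to Ansible explicitly. This is a minimal sketch, not part of the collection, and assumes the message text stays stable:)

```yaml
# Hypothetical workaround: treat the PowerState message in "msg" as a
# failure so it can be acted upon, instead of waiting for it to appear
# in stderr/stderr_lines.
- name: Check VM creation job status
  ansible.builtin.async_status:
    jid: "{{ item.ansible_job_id }}"
  register: job_result
  until: job_result.finished
  retries: 300
  delay: 15
  loop: "{{ create_vm_instances.results }}"
  failed_when: >-
    'Failed to determine PowerState' in (job_result.msg | default(''))
  ignore_errors: true  # collect all failures, handle them afterwards
```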

ISSUE TYPE
  • Bug Report
COMPONENT NAME

azure.azcollection.azure_rm_virtualmachine

COLLECTION VERSION
Collection         Version
------------------ -------
azure.azcollection 2.0.0
ANSIBLE VERSION
ansible [core 2.15.4]
  config file = /etc/ansible/ansible.cfg
  configured module search path = ['/runner/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python3.9/site-packages/ansible
  ansible collection location = /runner/.ansible/collections:/usr/share/ansible/collections
  executable location = /usr/bin/ansible
  python version = 3.9.16 (main, May 31 2023, 12:21:58) [GCC 8.5.0 20210514 (Red Hat 8.5.0-18)] (/usr/bin/python3.9)
  jinja version = 3.1.2
  libyaml = True
CONFIGURATION
CONFIG_FILE() = /etc/ansible/ansible.cfg
OS / ENVIRONMENT
STEPS TO REPRODUCE

Large deployment with 250 VMs via {{ vmCollection }}

    - name: Create VMs
      azure.azcollection.azure_rm_virtualmachine:
        resource_group: "rg...."
        name: "vm-test-vm"
        short_hostname: "test-vm"
        vm_size: "Standard_D4s_v5"
        admin_username: "admin"
        admin_password: "..."
        availability_set: "avail...."
        network_interfaces: "nic..."
        managed_disk_type: "Premium_LRS"
        os_disk_name: "os..."
        os_disk_caching: "ReadWrite"
        os_type: "Windows"
        license_type: "Windows_Client"
        image:
          name: "img..."
          resource_group: "rg..."
        tags:
        started: true
      loop: "{{ vmCollection }}"
      loop_control:
        pause: 5
      register: create_vm_instances
      async: 1800
      poll: 0  # Fire and forget (never poll)

    - name: Pause for 3 minutes to create vms on azure backend
      ansible.builtin.pause:
        minutes: 3

    - name: Check VM creation job status
      ansible.builtin.async_status:
        jid: "{{ item.ansible_job_id }}"
      register: job_result
      until: job_result.finished
      retries: 300
      delay: 15
      loop: "{{ create_vm_instances.results }}"
      loop_control:
        pause: 1
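For reference, the structure of the `vmCollection` variable is not included in the report; judging from the `item` field in the registered results below, each entry presumably looks like this (inferred, not part of the original post):

```yaml
# Assumed shape of vmCollection, inferred from the "item" field in the
# output below (AvailSetNumber and Name are the only keys visible there):
vmCollection:
  - AvailSetNumber: 1
    Name: "test-vm"
  # ... one entry per VM ...
```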
EXPECTED RESULTS

The msg should be moved (or additionally reported) in stderr/stderr_lines:
{
"started": 1,
"finished": 1,
"stdout": "",
"stderr": "Failed to determine PowerState of virtual machine vm-test-vm",
"stdout_lines": [],
"stderr_lines": ["Failed to determine PowerState of virtual machine vm-test-vm"],
....

ACTUAL RESULTS

{
"started": 1,
"finished": 1,
"stdout": "",
"stderr": "",
"stdout_lines": [],
"stderr_lines": [],
"ansible_job_id": "j380889400388.25999",
"results_file": "/root/.ansible_async/j380889400388.25999",
"msg": "Failed to determine PowerState of virtual machine vm-test-vm",
"invocation": {
"module_args": {
"resource_group": "rg...",
"name": "vm-test-vm",
"short_hostname": "test-vm",
"vm_size": "Standard_D4s_v5",
"admin_username": "admin",
"admin_password": "VALUE_SPECIFIED_IN_NO_LOG_PARAMETER",
"availability_set": "avail....",
"network_interfaces": "nic-test-vm",
"managed_disk_type": "Premium_LRS",
"os_disk_name": "disk.....",
"os_disk_caching": "ReadWrite",
"os_type": "Windows",
"license_type": "Windows_Client",
"image": {
"name": "img....",
"resource_group": "rg....."
},
"tags": {
},
"started": true,
"network_interface_names": [
"nic-test-vm"
],
"auth_source": "auto",
"cloud_environment": "AzureCloud",
"api_profile": "latest",
"append_tags": true,
"state": "present",
"max_price": -1,
"ssh_password_enabled": true,
"storage_container_name": "vhds",
"public_ip_allocation_method": "Static",
"remove_on_absent": [
"all"
],
"created_nsg": true,
"allocated": true,
"restarted": false,
"force": false,
"generalized": false,
"accept_terms": false,
"profile": null,
"subscription_id": null,
"client_id": null,
"secret": null,
"tenant": null,
"ad_user": null,
"password": null,
"cert_validation_mode": null,
"adfs_authority_url": null,
"log_mode": null,
"log_path": null,
"x509_certificate_path": null,
"thumbprint": null,
"custom_data": null,
"location": null,
"priority": null,
"eviction_policy": null,
"ssh_public_keys": null,
"storage_account_name": null,
"storage_blob_name": null,
"os_disk_size_gb": null,
"os_disk_encryption_set": null,
"proximity_placement_group": null,
"open_ports": null,
"virtual_network_resource_group": null,
"virtual_network_name": null,
"subnet_name": null,
"data_disks": null,
"plan": null,
"zones": null,
"vm_identity": null,
"winrm": null,
"boot_diagnostics": null,
"ephemeral_os_disk": null,
"windows_config": null,
"linux_config": null,
"security_profile": null
}
},
"_ansible_no_log": null,
"attempts": 1,
"changed": false,
"item": {
"failed": 0,
"started": 1,
"finished": 0,
"ansible_job_id": "j380889400388.25999",
"results_file": "/root/.ansible_async/j380889400388.25999",
"changed": true,
"item": {
"AvailSetNumber": 1,
"Name": "test-vm"
},
"ansible_loop_var": "item"
},
"ansible_loop_var": "item",
"_ansible_item_label": {
"failed": 0,
"started": 1,
"finished": 0,
"ansible_job_id": "j380889400388.25999",
"results_file": "/root/.ansible_async/j380889400388.25999",
"changed": true,
"item": {
"AvailSetNumber": 1,
"Name": "test-vm"
},
"ansible_loop_var": "item"
}
}

@Fred-sun Fred-sun added medium_priority Medium priority work in In trying to solve, or in working with contributors labels Dec 11, 2023

Fred-sun commented Jan 8, 2024

@dgiesselbach Sorry for the late reply! From your error log, this error occurred because the VM's PowerState was not yet available when the module gathered the VM information after creation. I have tested repeatedly in my local environment and could not reproduce your problem. In addition, PR #1354 has been submitted: after a VM is created, its state is now checked, to avoid problems caused by the VM state not yet being settled. Could you try again and tell me whether the problem persists? Thank you!

@Fred-sun Fred-sun added the question Further information is requested label Jan 8, 2024

dgiesselbach commented Jan 10, 2024

@Fred-sun
I tested with azure.azcollection version 2.1.1 and tried to create 250 VMs. With this version the task now automatically enters the retry part when we check job completion with async. That's good; from now on we don't need a failed_when condition anymore!

Unfortunately it is still stuck in a retry loop. It looks like no further API request is made against Azure during the retries to capture the correct PowerState. Can that be? All retry attempts say "Failed to determine PowerState of virtual machine".


Here is an example output of the 206th retry attempt:

{
  "started": 1,
  "finished": 1,
  "stdout": "",
  "stderr": "",
  "stdout_lines": [],
  "stderr_lines": [],
  "ansible_job_id": "j110059934240.24505",
  "results_file": "/root/.ansible_async/j110059934240.24505",
  "msg": "Failed to determine PowerState of virtual machine vm-test140",
  "invocation": {
    "module_args": {
      "resource_group": "rg-test",
      "name": "vm-test140",
      "short_hostname": "test140",
      "vm_size": "Standard_D4s_v5",
      "admin_username": "admin",
      "admin_password": "VALUE_SPECIFIED_IN_NO_LOG_PARAMETER",
      "availability_set": "avail-test140",
      "network_interfaces": "nic-test140",
      "managed_disk_type": "Premium_LRS",
      "os_disk_name": "disk-test140-os",
      "os_disk_caching": "ReadWrite",
      "os_type": "Windows",
      "license_type": "Windows_Client",
      "image": {
        "name": "img-test",
        "resource_group": "rg-test"
      },
      "started": true,
      "network_interface_names": [
        "nic-test140"
      ],
      "auth_source": "auto",
      "cloud_environment": "AzureCloud",
      "api_profile": "latest",
      "append_tags": true,
      "state": "present",
      "max_price": -1,
      "ssh_password_enabled": true,
      "storage_container_name": "vhds",
      "public_ip_allocation_method": "Static",
      "remove_on_absent": [
        "all"
      ],
      "created_nsg": true,
      "allocated": true,
      "restarted": false,
      "force": false,
      "generalized": false,
      "accept_terms": false,
      "profile": null,
      "subscription_id": null,
      "client_id": null,
      "secret": null,
      "tenant": null,
      "ad_user": null,
      "password": null,
      "cert_validation_mode": null,
      "adfs_authority_url": null,
      "log_mode": null,
      "log_path": null,
      "x509_certificate_path": null,
      "thumbprint": null,
      "custom_data": null,
      "location": null,
      "priority": null,
      "eviction_policy": null,
      "ssh_public_keys": null,
      "storage_account_name": null,
      "storage_blob_name": null,
      "os_disk_size_gb": null,
      "os_disk_encryption_set": null,
      "proximity_placement_group": null,
      "open_ports": null,
      "virtual_network_resource_group": null,
      "virtual_network_name": null,
      "subnet_name": null,
      "data_disks": null,
      "plan": null,
      "zones": null,
      "vm_identity": null,
      "winrm": null,
      "boot_diagnostics": null,
      "ephemeral_os_disk": null,
      "windows_config": null,
      "linux_config": null,
      "security_profile": null
    }
  },
  "_ansible_no_log": null,
  "attempts": 206,
  "changed": false,
  "retries": 301
}

@Fred-sun

@dgiesselbach Can you share the steps you followed and the errors you got, so that I can try to reproduce this? Thank you!

@dgiesselbach

Hi @Fred-sun
What information do you still need? I have actually described everything above:

Using azure.azcollection version 2.1.1 to create 250 virtual machines in Azure in a loop:

The Ansible code is the same as shown in the issue description above.

Around 30 VMs (the exact number varies each run) are successfully deployed, but then several fail because the PowerState cannot be determined.

My guess is that once the module has failed to receive the PowerState through the API request and has entered the retry part, it no longer makes any further API requests against Azure and therefore never receives the "correct" PowerState. Hence my question: could you check in the code whether new API requests are being made against Azure during the retries?
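That guess is consistent with how async_status works: it only reads the job's results file on the target host (the results_file shown above), so once the background job has finished, retrying async_status can never produce a different result. One possible workaround is to re-run the create task for the affected VMs; this is only a sketch, the task name and filter are hypothetical, and the remaining parameters are assumed identical to the original task:

```yaml
# Hypothetical workaround: re-run azure_rm_virtualmachine for every job
# whose result contains the PowerState error message. Retrying
# async_status alone never issues a new Azure API request.
- name: Retry VM creation for jobs that reported the PowerState error
  azure.azcollection.azure_rm_virtualmachine:
    resource_group: "rg-test"
    name: "{{ item.invocation.module_args.name }}"
    # ... remaining parameters as in the original "Create VMs" task ...
  loop: >-
    {{ job_result.results
       | selectattr('msg', 'defined')
       | selectattr('msg', 'search', 'Failed to determine PowerState')
       | list }}
```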

Here is an example output of the 206th retry attempt (identical to the output posted in my previous comment).

@Fred-sun

@dgiesselbach This problem is very tricky, and I cannot reproduce it for the time being, so I have added a test for the PowerState in PR #1412. Could you update to the branch of #1412 and check whether you still encounter this problem? Thank you!


dgiesselbach commented Jan 18, 2024

@Fred-sun with branch "modify_vm_test" everything works fine :)

It goes straight to the correct PowerState without entering the retry loop.

Could the fix be integrated into 2.1.1 so that we can use the official version directly?

Nice one!

@Fred-sun

@dgiesselbach Thanks for your feedback! I will push to get this PR merged as soon as possible and publish it in the next release. Thank you!

@Fred-sun Fred-sun added has_pr PR fixes have been made and removed work in In trying to solve, or in working with contributors labels Jan 19, 2024