Add dynamic timeout for Azure operations in machinery module #2233

Open
wants to merge 2 commits into
base: master
from

Conversation

leoiancu21
Contributor

Add dynamic timeout and VM status check for Azure operations in machinery module

Overview

This pull request introduces two main improvements to the Azure machinery module in CAPE:

  1. A dynamic timeout mechanism for specific Azure operations (VMSS creation and reimaging).
  2. A VM status check to ensure machines are fully running before initialization.

These enhancements aim to improve the reliability and efficiency of Azure-related tasks, particularly during scale set operations and long-running processes.

Key Changes

1. Dynamic Timeout Implementation

  • Implemented a new _handle_poller_result method that dynamically adjusts timeout durations based on the state of Azure operations.
  • Added an _are_machines_still_upgrading method to check the status of machines in a scale set.
  • The dynamic timeout is specifically applied to VMSS creation and reimaging operations.
  • Retained the fixed AZURE_TIMEOUT constant for other Azure operations where dynamic timing was not observed to be necessary.

2. VM Status Check During Initialization

  • Added functionality to ensure CAPE only initializes Azure VMs when they are in the 'Running' state.
  • Implemented a waiting mechanism during initialization to check all VMs' status.
  • Modified the machine addition process to only add VMs that are fully running.
  • This change prevents issues arising from initializing machines that are still in the "Updating(running)" state.
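
As a rough illustration of the status check described above, the sketch below keeps only instances whose power state reports running. The `VmInfo` type, its `power_state` field, and the helper names are assumptions made for this example and are not the PR's actual code; `"PowerState/running"` is the status code Azure's instance view uses for a running VM.

```python
from dataclasses import dataclass

RUNNING = "PowerState/running"  # status code treated as "fully running" in this sketch


@dataclass
class VmInfo:
    # Hypothetical, simplified view of a scale set instance.
    name: str
    power_state: str


def running_vms(vms):
    """Return only the VMs that report the running power state."""
    return [vm for vm in vms if vm.power_state == RUNNING]


def all_running(vms):
    """True when there is at least one VM and every VM is running."""
    return bool(vms) and len(running_vms(vms)) == len(vms)
```

With a filter like this, only the result of `running_vms(...)` would be handed to the step that adds machines to the database, and initialization would wait until `all_running(...)` holds.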

Benefits

  • Improved handling of long-running Azure operations for VMSS creation and reimaging, reducing unnecessary timeouts.
  • Better adaptation to varying Azure response times and network conditions for critical VMSS operations.
  • Enhanced reliability for scale set operations, particularly during high-load scenarios.
  • Ensures that analysis tasks are only assigned to fully operational VMs.
  • Maintains compatibility with existing CAPE infrastructure and Azure module usage.
  • Preserves existing timeout behavior for Azure operations not related to VMSS creation and reimaging.

Implementation Details

  • The new timeout logic uses a base timeout of 280 seconds (same as the previous fixed timeout) and a maximum timeout of 30 minutes for VMSS creation and reimaging.
  • It periodically checks if machines are still upgrading and adjusts wait times accordingly for these specific operations.
  • Other Azure operations continue to use the fixed AZURE_TIMEOUT.
  • The implementation respects existing code structure and maintains the use of Azure._azure_api_call() for consistency.
  • New methods _wait_for_vms_running, _are_all_vms_running, and _is_vm_running have been added to manage VM status checks.
  • The _add_machines_to_db method has been updated to only add machines in the 'Running' state.
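
The timeout behaviour described in this list can be sketched roughly as follows. This is a simplified illustration and not the code in this PR: `wait_with_dynamic_timeout`, `is_done`, and `still_upgrading` are hypothetical names, and the injectable `clock` and `sleep` parameters exist only to make the loop easy to test. The 280-second base and 30-minute maximum come from the description above.

```python
import time

BASE_TIMEOUT = 280     # seconds, same as the previous fixed timeout
MAX_TIMEOUT = 30 * 60  # seconds, upper bound for VMSS creation and reimaging


def wait_with_dynamic_timeout(is_done, still_upgrading, poll_interval=5.0,
                              clock=time.monotonic, sleep=time.sleep):
    """Wait until is_done() returns True.

    Gives up after BASE_TIMEOUT unless still_upgrading() reports that
    machines are still being updated, in which case waiting continues
    up to MAX_TIMEOUT. Returns True on completion, False on timeout.
    """
    start = clock()
    while True:
        if is_done():
            return True
        elapsed = clock() - start
        if elapsed >= MAX_TIMEOUT:
            return False  # hard upper bound reached
        if elapsed >= BASE_TIMEOUT and not still_upgrading():
            return False  # past the base timeout and nothing is progressing
        sleep(poll_interval)
```

The design choice here is that the base timeout still fails fast when nothing is happening, while an operation that is visibly progressing gets more time.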

Testing

  • Tested VMSS creation and reimaging operations (but not the scaling feature, since I don't use it in my environment).
  • Verified improved handling of operations during high-load and network latency conditions.
  • Confirmed that CAPE correctly waits for VMs to be in the 'Running' state before initialization.
  • Ensured that other Azure operations maintain their existing timeout behavior.

Reviewer Notes

Reviewers, please pay particular attention to:

  • The logic in _handle_poller_result and _are_machines_still_upgrading methods, especially their application to VMSS creation and reimaging.
  • The new VM status checking methods and their integration with the initialization process.
  • Any potential impact on existing Azure operations or CAPE workflows.
  • The continued use of AZURE_TIMEOUT for non-VMSS operations and whether this approach is appropriate.
  • Edge cases that might need additional handling.

Conclusion

These changes aim to enhance the robustness of CAPE's Azure integration without introducing significant complexity or changing the overall architecture of the machinery module. They address important issues with timeout handling for specific VMSS operations and VM initialization, leading to a more stable and reliable Azure integration for CAPE, while maintaining existing behavior for other operations.

Your feedback and suggestions are greatly appreciated!

This pull request introduces a dynamic timeout mechanism for Azure operations in the CAPE machinery module. The new feature enhances the reliability and efficiency of Azure-related tasks, particularly during scale set operations and long-running processes.
@ChrisThibodeaux
Contributor

Thank you for this work. I am going to run this in my setup now. I use the scaling feature, so I will be able to test that portion for you.

@leoiancu21
Contributor Author

Perfect, let me know if you have any issues. I will be more than happy to help.

@ChrisThibodeaux
Contributor

ChrisThibodeaux commented Jul 17, 2024

@leoiancu21 I had hoped that your new timeout design would help with the issue I am dealing with in the auto-scaling, but it has the same end behavior of locking up the analysis machines. When I force CAPE to remain at a single instance, your timeouts work well for me.

I have found where the issues are with auto-scaling and am working on a clean fix. When I get that done, I will revisit these changes to give better feedback. Sorry I could not help out more.

@leoiancu21
Contributor Author

I was having the same issue today (I also noticed that sometimes, when the instances have the Update (Running) status, the analysis starts breaking the azsniffer). Check this parameter on your VMSS:
image
This broke my analyses too

@ChrisThibodeaux
Contributor

ChrisThibodeaux commented Jul 17, 2024

The scheduler attempts to give a task to the machines before they have the agent up and running. This is an Azure-specific issue, I believe, because of how we have to start our agent on each reimaging. You will see that the same thing happens with new instances spun up. The self.initializing is going to be false, so they don't hit the _thr_wait_for_ready_machine on startup. At least, that is my current assumption.

I think the scaling options there may be misleading, too. We are technically using a "manual scale" as far as Azure is concerned. We send a signal to update the vmss.sku.capacity, forcing it to go up or down, which changes the number of instances. It is auto scaling inside of CAPE, just not from Azure's perspective.

@ChrisThibodeaux
Contributor

ChrisThibodeaux commented Jul 17, 2024

@leoiancu21 Another thing to note, as it is not in any documentation: if you do not append your pool_tag to the end of your scale set name in az.conf, it will never scale correctly. The instances will be brought up but will never count as relevant machines. The offending line of code is this, where tag is the value from pool_tag in the config:

return [machine for machine in self.db.list_machines() if tag in machine.label]

You can see the problem: machine.label is set to the same value as machine.name, so the tag is never there unless you include it.

Bad naming: cape-vmss
Good naming: cape-vmss-win10x64

I am working on a backwards compatible PR for this now that will work for everyone.
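
To make the naming problem concrete, here is a minimal sketch of the same substring test. The `Machine` dataclass and `relevant_machines` helper are stand-ins invented for this example, and the `<scale set name>_<instance id>` label format is an assumption; only the `tag in machine.label` check is taken from the line quoted above.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    # Hypothetical stand-in for a database machine row.
    label: str


def relevant_machines(machines, tag):
    # Same substring test as the quoted line: the tag must appear in the label.
    return [m for m in machines if tag in m.label]


bad = [Machine("cape-vmss_0"), Machine("cape-vmss_1")]
good = [Machine("cape-vmss-win10x64_0"), Machine("cape-vmss-win10x64_1")]

print(len(relevant_machines(bad, "win10x64")))   # 0: the tag never appears in the label
print(len(relevant_machines(good, "win10x64")))  # 2: the tag is part of the scale set name
```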

@leoiancu21
Contributor Author

Thank you very much for that. This problem was discussed in a previous issue, but I never dared to touch that part of the code.

@leoiancu21
Contributor Author

@ChrisThibodeaux I took a look at the Azure logs and can confirm that after a VMSS instance is deleted (when CAPE deletes it from the DB too), the SKU instance count gets edited as well. From what I've seen it works in reverse too, so maybe using the same function to raise the instance count with more machines could be our answer. It seems the overprovisioning declared in the configs is not fully respected with this build, and since I didn't see this behaviour before my changes, I think it has something to do with what I've added.

image

What do you think about it? Did this issue start after using my code, or did you have previous experience with it? Does it still happen when using the right tag naming convention? Any feedback would be more than helpful.

@doomedraven
Collaborator

Thank you for the PR, can you please fix:

modules/machinery/az.py:1155:17: F841 [*] Local variable `time_taken` is assigned to but never used
modules/machinery/az.py:1157:33: F841 [*] Local variable `e` is assigned to but never used

@leoiancu21
Contributor Author

Always a pleasure. After I fix the scale set capacity issue described in the previous comments, I will commit the code with those unused variables removed too.

@ChrisThibodeaux
Contributor

ChrisThibodeaux commented Jul 18, 2024

@leoiancu21 For the increase/decrease in size of the scale set, look at the try block at this line. This will handle both scaling up and scaling down. What is important is the vmss dict being passed as an argument. This line right before it sets the size to change the scale set to: vmss.sku.capacity = number_of_relevant_machines_required.

It is my take that this is performing perfectly, and as expected.

If you haven't already, you can change this line to be this: return [machine for machine in self.db.list_machines(tags=[tag])]. That fixes any issue with our code not recognizing existing scale set VMs as relevant to tasks with the pool_tag we have. Be warned though, I have not tested this change much. It may cause SQLAlchemy issues somehow.

The real issue with all of this is the freezing that happens with added instances (beyond the first one) and reimaged instances. They are handed tasks before their agent is ready. I am hitting issues with my reworking of that part, but I may have a solution using locked=True as an arg when adding new machines to the database, like at this line.

The hard part is going to be getting that concept to work when reimaging.

@leoiancu21
Contributor Author

@ChrisThibodeaux Love it, thanks, I saw that there is a procedure for scaling down the scaleset that checks for the SKU cores

if usage_to_look_for:
    usage = next((item for item in usages if item.name.value == usage_to_look_for), None)

    if usage:
        number_of_new_cpus_required = self.instance_type_cpus * (
            number_of_relevant_machines_required - number_of_machines
        )
        # Leaving at least five spaces in the usage quota for a spot VM, let's not push it!
        number_of_new_cpus_available = int(usage.limit) - usage.current_value - int(self.instance_type_cpus * 5)
        if number_of_new_cpus_available < 0:
            number_of_relevant_machines_required = machine_pools[vmss_name]["size"]
        elif number_of_new_cpus_required > number_of_new_cpus_available:
            old_number_of_relevant_machines_required = number_of_relevant_machines_required
            number_of_relevant_machines_required = (
                number_of_relevant_machines + number_of_new_cpus_available / self.instance_type_cpus
            )
            log.debug(
                f"Quota could be exceeded with projected number of machines ({old_number_of_relevant_machines_required}). Setting new limit to {number_of_relevant_machines_required}"
            )

This could also be related to our issue, since vmss.sku.capacity = number_of_relevant_machines_required takes the altered number_of_relevant_machines_required value. The SKU cores value takes some time to be updated by Azure, which leads to the drop in our instance count. I will definitely try your fix until I find a better way of handling that limit.
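
For readers following the arithmetic, the quota logic in the snippet above can be restated as a small standalone function. This is a simplified sketch, not the code in az.py: `cap_machines_by_quota` and its parameter names are invented for illustration, the current pool size is used in both capped branches, and floor division is an assumption made here so the result stays an integer (the quoted snippet uses `/`).

```python
def cap_machines_by_quota(required, current, cpus_per_instance,
                          quota_limit, quota_used, reserve_instances=5):
    """Return the machine count allowed by the CPU quota.

    required: machines the scheduler would like to have
    current:  machines currently in the scale set
    """
    new_cpus_required = cpus_per_instance * (required - current)
    # Keep room for a few instances' worth of CPUs, as in the quoted snippet.
    new_cpus_available = quota_limit - quota_used - cpus_per_instance * reserve_instances
    if new_cpus_available < 0:
        # No headroom at all: stay at the current size.
        return current
    if new_cpus_required > new_cpus_available:
        # Partial headroom: grow only by as many instances as fit.
        return current + new_cpus_available // cpus_per_instance
    return required
```

For example, with 2-CPU instances, a limit of 40 CPUs, and 20 in use, asking for 10 machines from a pool of 4 yields 9. If the reported usage is stale or too high, the result falls below `required`, which is one way the capacity could end up lower than expected.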

@ChrisThibodeaux
Contributor

@leoiancu21 Are you using spot instances? I am not currently, as I am too unfamiliar with what they are. That would be the only way to enter that conditional.

@leoiancu21
Contributor Author

@ChrisThibodeaux Yep. Leaving out the price, the only difference between spot instances and normal ones is that Microsoft can kill your VM if they need more computing resources. The same logic applies to the time needed to create a new VM, which is why I started working on a dynamic timeout; the price difference is too big not to use them. I'll find a way to make it wait for the updated SKU cores value that Azure provides, I just have to run some tests first.

@ChrisThibodeaux
Contributor

ChrisThibodeaux commented Jul 18, 2024

@leoiancu21 Don't make the change to return [machine for machine in self.db.list_machines(tags=[tag])]. I have NO idea why, but this breaks SQLAlchemy and always gives this error:

Edit: My initial comment here was wrong. There is nothing breaking anymore, I believe it had to have been other changes I made.

@ChrisThibodeaux
Contributor

@leoiancu21 Update from my end. My assumption that the agent was not in a ready state before a job was handed to its machine was wrong. In reality, the ScalingBoundedSemaphore was not being released soon enough after a VMSS instance was reimaged. Once it got to the machine lock, it was getting permanently stuck.

I am putting together a PR today with, hopefully, the fix to this. I have been able to run 30+ tasks in a row with 5 machines up, so there is progress.

@ChrisThibodeaux
Contributor

@leoiancu21 Apologies for taking so long on figuring out the last of the bugs. I have a PR with the changes that allow me to use auto-scaling without issue now, here. I placed in a fix for the machines key error, based entirely on the solution you provided me. If there is a cleaner way to do that, please let me know.

@leoiancu21
Contributor Author

Thank you so much, I'll try this in the next few days, I'll proceed to close this pull request as soon as it works. Thanks again for your work

@doomedraven
Collaborator

btw can you fix these 2?

modules/machinery/az.py:1155:17: F841 [*] Local variable `time_taken` is assigned to but never used
modules/machinery/az.py:1157:33: F841 [*] Local variable `e` is assigned to but never used

@doomedraven
Collaborator

so is this ready?

@ChrisThibodeaux
Contributor

@leoiancu21 I can pull this branch and test it out with the new changes I made in the other PR. Do you want me to send you a diff of any updates I make?

@leoiancu21
Contributor Author

Sorry for my absence, I was out of office this week, I'll pull the new changes and fix my code asap

@leoiancu21
Contributor Author

@ChrisThibodeaux
I've tested the merged build and I still see the SKU capacity decreasing. Since you don't use spot instances, could you confirm that you don't have the same problem, so I can write a patch only for spot instances?

@ChrisThibodeaux
Contributor

@leoiancu21 Sorry, I won't be able to test that for a few more days. Our little one came earlier than expected, so I am out of office for a while. I will try to find time to test this out though.

Is there any way you could push a commit with the merges?

@leoiancu21
Contributor Author

leoiancu21 commented Jul 31, 2024

@ChrisThibodeaux I'd wait for it. From what I've seen so far, when the scale set is created it loads the machines into the DB, assigns tasks, and processes them. For now I have also forced the scale set SKU capacity with Azure rules:

image

The main issue now concerns pending tasks that are assigned to old machine IDs (when an analysis completes, a reimage is triggered that causes the machine to change its ID). Apparently Azure treats a reimaged machine as a new one, and our DB is not updated at that point:

2024-07-31 09:04:05,810 [modules.machinery.az] DEBUG: Trying <bound method VirtualMachineScaleSetsOperations.begin_reimage_all of <azure.mgmt.compute.v2024_03_01.operations._operations.VirtualMachineScaleSetsOperations object at 0x7fab7483dae0>>(('CSS-Sandbox-Cape', 'CSS-Sandbox-Cape-VMSS-1', <azure.mgmt.compute.v2024_03_01.models._models_py3.VirtualMachineScaleSetVMInstanceIDs object at 0x7fab747e6e60>),{'polling_interval': 1})
2024-07-31 09:04:06,088 [modules.machinery.az] WARNING: Failed to <bound method VirtualMachineScaleSetsOperations.begin_reimage_all of <azure.mgmt.compute.v2024_03_01.operations._operations.VirtualMachineScaleSetsOperations object at 0x7fab7483dae0>>(('CSS-Sandbox-Cape', 'CSS-Sandbox-Cape-VMSS-1', <azure.mgmt.compute.v2024_03_01.models._models_py3.VirtualMachineScaleSetVMInstanceIDs object at 0x7fab747e6e60>),{'polling_interval': 1}) due to the Azure error '(InvalidParameter) The provided instanceId 44 is not an active Virtual Machine Scale Set VM instanceId.
Code: InvalidParameter
Message: The provided instanceId 44 is not an active Virtual Machine Scale Set VM instanceId.
Target: instanceIds': '(InvalidParameter) The provided instanceId 44 is not an active Virtual Machine Scale Set VM instanceId.
Code: InvalidParameter
Message: The provided instanceId 44 is not an active Virtual Machine Scale Set VM instanceId.
Target: instanceIds'.
2024-07-31 09:04:06,088 [modules.machinery.az] ERROR: CuckooMachineError('(InvalidParameter) The provided instanceId 44 is not an active Virtual Machine Scale Set VM instanceId.\nCode: InvalidParameter\nMessage: The provided instanceId 44 is not an active Virtual Machine Scale Set VM instanceId.\nTarget: instanceIds:(InvalidParameter) The provided instanceId 44 is not an active Virtual Machine Scale Set VM instanceId.\nCode: InvalidParameter\nMessage: The provided instanceId 44 is not an active Virtual Machine Scale Set VM instanceId.\nTarget: instanceIds')
Traceback (most recent call last):
  File "/opt/CAPEv2/modules/machinery/az.py", line 770, in _azure_api_call
    results = operation(*args, **kwargs)
  File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/azure/core/tracing/decorator.py", line 78, in wrapper_use_tracer
    return func(*args, **kwargs)
  File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/azure/mgmt/compute/v2024_03_01/operations/_operations.py", line 9602, in begin_reimage_all
    raw_result = self._reimage_all_initial(  # type: ignore
  File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/azure/mgmt/compute/v2024_03_01/operations/_operations.py", line 9507, in _reimage_all_initial
    raise HttpResponseError(response=response, error_format=ARMErrorFormat)
azure.core.exceptions.HttpResponseError: (InvalidParameter) The provided instanceId 44 is not an active Virtual Machine Scale Set VM instanceId.
Code: InvalidParameter
Message: The provided instanceId 44 is not an active Virtual Machine Scale Set VM instanceId.
Target: instanceIds

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/CAPEv2/modules/machinery/az.py", line 1330, in _thr_reimage_list_reader
    async_reimage_some_machines = Azure._azure_api_call(
  File "/opt/CAPEv2/modules/machinery/az.py", line 782, in _azure_api_call
    raise CuckooMachineError(f"{error}:{exc.message if hasattr(exc, 'message') else repr(exc)}")
lib.cuckoo.common.exceptions.CuckooMachineError: (InvalidParameter) The provided instanceId 44 is not an active Virtual Machine Scale Set VM instanceId.
Code: InvalidParameter
Message: The provided instanceId 44 is not an active Virtual Machine Scale Set VM instanceId.
Target: instanceIds:(InvalidParameter) The provided instanceId 44 is not an active Virtual Machine Scale Set VM instanceId.
Code: InvalidParameter
Message: The provided instanceId 44 is not an active Virtual Machine Scale Set VM instanceId.
Target: instanceIds
2024-07-31 09:04:06,089 [modules.machinery.az] WARNING: Machine CSS-Sandbox-Cape-VMSS-1_44 does not exist anymore. Deleting from database.
2024-07-31 09:04:06,093 [lib.cuckoo.core.database] WARNING: CSS-Sandbox-Cape-VMSS-1_44 does not exist in the database.
2024-07-31 09:04:10,856 [modules.machinery.az] DEBUG: Trying <bound method VirtualMachineScaleSetsOperations.begin_reimage_all of <azure.mgmt.compute.v2024_03_01.operations._operations.VirtualMachineScaleSetsOperations object at 0x7fab817e4070>>(('CSS-Sandbox-Cape', 'CSS-Sandbox-Cape-VMSS-1', <azure.mgmt.compute.v2024_03_01.models._models_py3.VirtualMachineScaleSetVMInstanceIDs object at 0x7fabae1c66b0>),{'polling_interval': 1})
2024-07-31 09:04:11,240 [modules.machinery.az] WARNING: Failed to <bound method VirtualMachineScaleSetsOperations.begin_reimage_all of <azure.mgmt.compute.v2024_03_01.operations._operations.VirtualMachineScaleSetsOperations object at 0x7fab817e4070>>(('CSS-Sandbox-Cape', 'CSS-Sandbox-Cape-VMSS-1', <azure.mgmt.compute.v2024_03_01.models._models_py3.VirtualMachineScaleSetVMInstanceIDs object at 0x7fabae1c66b0>),{'polling_interval': 1}) due to the Azure error '(InvalidParameter) The provided instanceId 42 is not an active Virtual Machine Scale Set VM instanceId.
Code: InvalidParameter
Message: The provided instanceId 42 is not an active Virtual Machine Scale Set VM instanceId.
Target: instanceIds': '(InvalidParameter) The provided instanceId 42 is not an active Virtual Machine Scale Set VM instanceId.
Code: InvalidParameter
Message: The provided instanceId 42 is not an active Virtual Machine Scale Set VM instanceId.
Target: instanceIds'.
2024-07-31 09:04:11,241 [modules.machinery.az] ERROR: CuckooMachineError('(InvalidParameter) The provided instanceId 42 is not an active Virtual Machine Scale Set VM instanceId.\nCode: InvalidParameter\nMessage: The provided instanceId 42 is not an active Virtual Machine Scale Set VM instanceId.\nTarget: instanceIds:(InvalidParameter) The provided instanceId 42 is not an active Virtual Machine Scale Set VM instanceId.\nCode: InvalidParameter\nMessage: The provided instanceId 42 is not an active Virtual Machine Scale Set VM instanceId.\nTarget: instanceIds')
Traceback (most recent call last):
  File "/opt/CAPEv2/modules/machinery/az.py", line 770, in _azure_api_call
    results = operation(*args, **kwargs)
  File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/azure/core/tracing/decorator.py", line 78, in wrapper_use_tracer
    return func(*args, **kwargs)
  File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/azure/mgmt/compute/v2024_03_01/operations/_operations.py", line 9602, in begin_reimage_all
    raw_result = self._reimage_all_initial(  # type: ignore
  File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/azure/mgmt/compute/v2024_03_01/operations/_operations.py", line 9507, in _reimage_all_initial
    raise HttpResponseError(response=response, error_format=ARMErrorFormat)
azure.core.exceptions.HttpResponseError: (InvalidParameter) The provided instanceId 42 is not an active Virtual Machine Scale Set VM instanceId.
Code: InvalidParameter
Message: The provided instanceId 42 is not an active Virtual Machine Scale Set VM instanceId.
Target: instanceIds

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/CAPEv2/modules/machinery/az.py", line 1330, in _thr_reimage_list_reader
    async_reimage_some_machines = Azure._azure_api_call(
  File "/opt/CAPEv2/modules/machinery/az.py", line 782, in _azure_api_call
    raise CuckooMachineError(f"{error}:{exc.message if hasattr(exc, 'message') else repr(exc)}")
lib.cuckoo.common.exceptions.CuckooMachineError: (InvalidParameter) The provided instanceId 42 is not an active Virtual Machine Scale Set VM instanceId.
Code: InvalidParameter
Message: The provided instanceId 42 is not an active Virtual Machine Scale Set VM instanceId.
Target: instanceIds:(InvalidParameter) The provided instanceId 42 is not an active Virtual Machine Scale Set VM instanceId.
Code: InvalidParameter
Message: The provided instanceId 42 is not an active Virtual Machine Scale Set VM instanceId.
Target: instanceIds
2024-07-31 09:04:11,241 [modules.machinery.az] WARNING: Machine CSS-Sandbox-Cape-VMSS-1_42 does not exist anymore. Deleting from database.
2024-07-31 09:05:53,750 [modules.machinery.az] DEBUG: Connecting to Azure for the region 'northeurope'.
2024-07-31 09:05:53,755 [modules.machinery.az] DEBUG: Monitoring the machine pools...
2024-07-31 09:05:53,759 [modules.machinery.az] DEBUG: Trying <bound method UsageOperations.list of <azure.mgmt.compute.v2024_03_01.operations._operations.UsageOperations object at 0x7fabae1df010>>(('northeurope',),{})
2024-07-31 09:05:53,760 [msal.authority] INFO: Initializing with Entra authority: https://login.microsoftonline.com/[redacted]
2024-07-31 09:05:54,562 [modules.machinery.az] DEBUG: Deleting machines from database if they do not exist in the VMSS CSS-Sandbox-Cape-VMSS-1.
2024-07-31 09:05:54,563 [modules.machinery.az] DEBUG: Trying <bound method VirtualMachineScaleSetVMsOperations.list of <azure.mgmt.compute.v2024_03_01.operations._operations.VirtualMachineScaleSetVMsOperations object at 0x7fabae17f9d0>>(('CSS-Sandbox-Cape', 'CSS-Sandbox-Cape-VMSS-1'),{})
2024-07-31 09:10:53,756 [modules.machinery.az] DEBUG: Monitoring the machine pools...
2024-07-31 09:15:53,756 [modules.machinery.az] DEBUG: Monitoring the machine pools...
2024-07-31 09:20:53,757 [modules.machinery.az] DEBUG: Monitoring the machine pools...
2024-07-31 09:25:53,758 [modules.machinery.az] DEBUG: Monitoring the machine pools...
2024-07-31 09:30:53,759 [modules.machinery.az] DEBUG: Monitoring the machine pools...
2024-07-31 09:35:53,752 [modules.machinery.az] DEBUG: Connecting to Azure for the region 'northeurope'.
2024-07-31 09:35:53,759 [modules.machinery.az] DEBUG: Monitoring the machine pools...
2024-07-31 09:40:53,760 [modules.machinery.az] DEBUG: Monitoring the machine pools...

As you can see, it deletes machines that don't match the current scale set machine IDs only after it fails to bind the machine ID to the new task. This causes CAPE to go into a loop where every 5 minutes it logs [modules.machinery.az] DEBUG: Monitoring the machine pools....
Even though the logs say the machines have been deleted from the DB, when uploading a new sample we still get the same machines we provided at startup:

image

As shown in the debug logs above, these machines don't match the hosts currently running in the scale set. I believe something is missing in the DB update function, both when machines are deleted after an analysis and when they are deleted after a failed bind for a new task:

image

This causes new tasks to be assigned wrongly and either fail the analysis or get stuck inside the pending queue.

image

image

I'm still investigating which function is not working properly, and I'll publish a commit in this thread as soon as I can make it work. For now I don't think publishing the broken merge would be useful in any way.
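
One way to reason about the mismatch described above is as a set difference between the labels stored in the DB and the instance IDs currently active in the scale set. The sketch below is only an illustration: `reconcile` is a hypothetical helper, not a function in az.py, and it assumes labels follow the `<scale set name>_<instance id>` pattern seen in the logs (for example `CSS-Sandbox-Cape-VMSS-1_44`).

```python
def reconcile(db_labels, vmss_name, active_instance_ids):
    """Compare DB machine labels with the live scale set.

    Returns (stale, missing):
      stale   - labels in the DB whose instance no longer exists
      missing - labels of live instances that are not in the DB yet
    """
    prefix = f"{vmss_name}_"
    live = {f"{prefix}{instance_id}" for instance_id in active_instance_ids}
    known = {label for label in db_labels if label.startswith(prefix)}
    stale = sorted(known - live)
    missing = sorted(live - known)
    return stale, missing
```

Running a comparison like this right after each reimage, rather than only after a failed bind, would surface stale labels before new tasks are assigned to them.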

So take your time and congrats on your newborn baby

@ChrisThibodeaux
Contributor

@leoiancu21 Compare these two sections from your branch and master's az.py

  1. Yours: https://github.com/leoiancu21/CAPEv2/blob/Azure-Dynamic-Timeout-/modules/machinery/az.py#L494-L520
  2. Master's (has my PR changes): https://github.com/kevoreilly/CAPEv2/blob/master/modules/machinery/az.py#L447-L478

I am nearly positive you are experiencing the same issue I was, which is what led me to create that PR in the first place. I can't remember the exact details, but I had instances being deleted after jobs just like you describe.

Rebasing your branch onto master (or at least this commit) will almost certainly fix the issue you are seeing.

@leoiancu21
Contributor Author

Sorry for the long wait, got assigned to other projects after the summer holidays, I'm going to try rebasing az.py right now, I'll let you know what comes out of it

@doomedraven
Collaborator

Any progress here?

@leoiancu21
Contributor Author

At the moment I have switched to normal VMs instead of spot due to too many inconsistencies, and I'm working on fixing other small issues. Once I'm back on my test environment I will provide some updates.

3 participants