Cgroups api refactor for v2 #3096

Merged
merged 15 commits into from
Apr 5, 2024


Contributor

@maddieford maddieford commented Mar 16, 2024

Description

This PR refactors the SystemdCgroupApi to be an abstract base class for the different versions of the cgroup api. There are two implementations (SystemdCgroupApiv1 and SystemdCgroupApiv2), which inherit from _SystemdCgroupApi. With these changes, the SystemdCgroupApi class should not be instantiated; get_cgroup_api() should be used as a factory to get the correct api version instead.

There should be no changes to existing v1 behavior with this PR (aside from some small improvements to logging and telemetry). If the agent detects legacy or a hybrid cgroup hierarchy, then the agent will use v1 api. If the agent detects unified cgroup hierarchy, then the agent will use v2. If v2 is chosen, the agent will NOT enable any usage of cgroups for the time being.
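
The selection rule described above can be sketched roughly as follows. This is a minimal illustration, not the agent's actual code: the class bodies are empty placeholders, and the `mode` argument and the `CGroupsException` definition shown here are assumptions made for the example. How the hierarchy mode is detected is left out.

```python
class CGroupsException(Exception):
    """Placeholder exception type for this sketch."""


class _SystemdCgroupApi(object):
    """Shared behavior; not meant to be instantiated directly."""


class SystemdCgroupApiv1(_SystemdCgroupApi):
    """Used for legacy and hybrid hierarchies."""


class SystemdCgroupApiv2(_SystemdCgroupApi):
    """Used for the unified hierarchy (cgroup usage stays disabled for now)."""


def get_cgroup_api(mode):
    """Factory: map a detected hierarchy mode to an api implementation.

    `mode` is a hypothetical string: "legacy", "hybrid" or "unified".
    """
    if mode in ("legacy", "hybrid"):
        return SystemdCgroupApiv1()
    if mode == "unified":
        return SystemdCgroupApiv2()
    raise CGroupsException("Unsupported cgroup hierarchy: {0}".format(mode))
```

Callers would then depend only on the factory, for example `api = get_cgroup_api("hybrid")`, and never name a concrete class.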

Prior to the changes in this PR, the agent supported any v1 cgroup controller mountpoint. With these changes, the agent will only support cgroup mountpoints which are created by systemd ('/sys/fs/cgroup'). If v1 cgroups are mounted elsewhere, the agent will not enable cgroups and will clean up any agent drop-in files, in case this is a VM which previously supported agent cgroups.

This PR adds an e2e test on images which use v2 by default to verify that the agent does not enable cgroup usage on these VMs. This test will be removed once the agent supports v2.

Issue #


PR information

  • The title of the PR is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For information on cleaning up the commits in your pull request, see this page.
  • If applicable, the PR references the bug/issue that it fixes in the description.
  • New Unit tests were added for the changes made

Quality of Code and Contribution Guidelines

* Initial changes for log collector cgroups v2 support

* Fix pylint issues

* Fix pylint issues

* Fix pylint issues

* Check that both controllers are mounted in the chosen cgroups version for log collector

* Fix regex

* Update test_agent unit tests

* Fix unit tests

* Update format strings

* Fix broken cgroupconfigurator unit tests

* pyling

* Fix cgroups api unit tests

* Ignore unused args

* Ignore unused args

* Add cgroup configurator tests

* v2 required check in parent cgroup

* unit tests is_controller_enabled

* Fix test failure and pylint:

* pylint

* Update agent checks

* Fix controller enable logic and unit tests

* Remove changes to collect logs

* Fix pylint

* Add e2e test for v2

codecov bot commented Mar 16, 2024

Codecov Report

Attention: Patch coverage is 82.12766% with 42 lines in your changes are missing coverage. Please review.

Project coverage is 72.02%. Comparing base (ee6eb7d) to head (9501707).

Files Patch % Lines
azurelinuxagent/ga/cgroupconfigurator.py 61.42% 27 Missing ⚠️
azurelinuxagent/ga/cgroupapi.py 89.85% 11 Missing and 3 partials ⚠️
azurelinuxagent/agent.py 94.44% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3096      +/-   ##
===========================================
+ Coverage    71.89%   72.02%   +0.12%     
===========================================
  Files          110      110              
  Lines        16395    16495     +100     
  Branches      2342     2372      +30     
===========================================
+ Hits         11788    11880      +92     
- Misses        4055     4063       +8     
  Partials       552      552              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

return v2
else:
log_cgroup_warning("CPU and Memory controllers are not mounted in cgroups v1 or v2")
return None
Contributor Author

Also considering raising Exception here instead. Let me know what you think

Member

exception seems appropriate

Cgroup version specific. Returns a tuple with the path of the cpu and memory cgroups for the given unit.
The values returned can be None if the controller is not mounted or enabled.
"""
pass # pylint: disable=W0107
Contributor Author

pylint disable unnecessary pass statement

Member

should raise NotImplementedError

@@ -120,11 +142,11 @@ def get_daemon_pid():

class SystemdCgroupsApi(CGroupsApi):
Contributor Author

SystemdCgroupsApi shouldn't be instantiated directly anymore. get_cgroup_api() should be used instead to get the correct api.

I don't see any implementations of abstract classes (ABC) in the agent. Is there a reason for that?

Member

SystemdCgroupsApi shouldn't be instantiated directly anymore.

We should mark it as "private"

I don't see any implementations of abstract classes (ABC) in the agent. Is there a reason for that?

ABC was added on Python 3

# cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,name=systemd)
# cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
# cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
# $ findmnt -t cgroup --noheadings
Contributor Author

Updated to use findmnt instead of mount

return cpu_cgroup_path, memory_cgroup_path

@staticmethod
def get_cgroup2_controllers():
Contributor Author

No longer needed, this is now implemented by SystemdCgroupsApiv2

Contributor Author

I added this scenario to ensure v2 isn't enabled for agent and extensions unexpectedly

@@ -33,17 +33,3 @@ def create_legacy_agent_cgroup(cgroups_file_system_root, controller, daemon_pid)
fileutil.append_file(os.path.join(legacy_cgroup, "cgroup.procs"), daemon_pid + "\n")
return legacy_cgroup

@staticmethod
Contributor Author

this wasn't being used anywhere

def mock_cgroup_paths(*args, **kwargs):
if args and args[0] == "self":
relative_path = "{0}/{1}".format(cgroupconfigurator.LOGCOLLECTOR_SLICE, logcollector.CGROUPS_UNIT)
return (cgroupconfigurator.LOGCOLLECTOR_SLICE, relative_path)
Contributor Author

This was incorrectly mocking the relative paths before

CollectLogsHandler.disable_monitor_cgroups_check()

@patch("azurelinuxagent.agent.LogCollector")
def test_doesnt_call_collect_logs_when_controllers_mounted_in_different_hierarchies(self, mock_log_collector):
Contributor Author

log collector should only run when both cpu and memory mounted in v1


def assert_cgroups_created(self, extension_cgroups):
Contributor Author

this wasn't being used anywhere

@maddieford maddieford marked this pull request as ready for review March 18, 2024 19:48

codecov-commenter commented Mar 18, 2024

Codecov Report

Attention: Patch coverage is 80.00000% with 53 lines in your changes are missing coverage. Please review.

Project coverage is 71.97%. Comparing base (782a165) to head (35ca335).

Files Patch % Lines
azurelinuxagent/ga/cgroupconfigurator.py 63.52% 30 Missing and 1 partial ⚠️
azurelinuxagent/ga/cgroupapi.py 86.70% 12 Missing and 9 partials ⚠️
azurelinuxagent/agent.py 95.45% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3096      +/-   ##
===========================================
+ Coverage    71.87%   71.97%   +0.09%     
===========================================
  Files          110      110              
  Lines        16425    16513      +88     
  Branches      2348     2369      +21     
===========================================
+ Hits         11806    11885      +79     
- Misses        4067     4073       +6     
- Partials       552      555       +3     


@@ -68,7 +90,7 @@ def track_cgroups(extension_cgroups):
for cgroup in extension_cgroups:
CGroupsTelemetry.track_cgroup(cgroup)
except Exception as exception:
-logger.warn("Cannot add cgroup '{0}' to tracking list; resource usage will not be tracked. "
+logger.warn("[CGW] Cannot add cgroup '{0}' to tracking list; resource usage will not be tracked. "
Member

"[CGW]" was added only to INFO messages to mark them for promotion to WARN when the feature is stable

@@ -120,11 +142,11 @@ def get_daemon_pid():

class SystemdCgroupsApi(CGroupsApi):
"""
-Cgroups interface via systemd
+Cgroups interface via systemd. Contains common api implementations between cgroups v1 and v2.
Member

"cgroup" (singular)


return False

def get_cgroup_mount_points(self):
Member

I think we need to rename this method and/or rewrite usages of it, since on v2 the mountpoint is just /sys/fs/cgroup

Contributor Author

maybe rename to get_cgroup_controller_mount_points

In the v2 api, get_cgroup_controller_mount_points would just call something like get_cgroup_root_mount_point

Member

"mount point" was appropriate for v1, since the controllers are mounted independently of each other. This is basically the path of the root cgroup, maybe we should change to that?

get_root_cgroups, etc ?

Member

Also, the code for v1 is overcomplicated because it keeps track of the paths for each controller separately. This causes a lot of code repetition in the code, logs & telemetry, making the code and logs harder to read.

v2 has a simpler model.

Instead of making v2 as complicated as v1, shouldn't we make v1 as simple as v2? Can you explore this option?

Initially the code was using the paths for each controller separately, since it was using the file system API directly and, for example, it needed to write the PID of the Agent to both the cpu and memory paths. Now we are using the systemd API and we work in terms of slices and scopes, which are 1 single entity (instead of 1 per controller). I think we use the actual filesystem paths only to collect the metrics.

It has been proposed that we should also monitor io and maybe other metrics. If we keep on the path of tracking each controller separately, this is going to get even more complicated than it already is. v2 is a good chance to simplify this.


return cpu_path, memory_path

def start_extension_command(self, extension_name, command, cmd_name, timeout, shell, cwd, env, stdout, stderr, error_code=ExtensionErrorCodes.PluginUnknownFailure): # pylint: disable=W0613
Member

we should change CGroupConfigurator to just avoid the call altogether to avoid those warnings

@@ -166,58 +155,67 @@ def initialize(self):
agent_drop_in_file_memory_accounting, agent_drop_in_file_cpu_quota])
self.__cleanup_all_files(files_to_cleanup)
self.__reload_systemd_config()
-logger.info("Agent reset the quotas if distro: {0} goes from supported to unsupported list", get_distro())
+log_cgroup_info("Agent reset the quotas if distro: {0} goes from supported to unsupported list".format(get_distro()), send_event=False)
Member

we should send this event

except Exception as err:
-logger.warn("Unable to delete Agent drop-in files while resetting the quotas: {0}".format(err))
+logger.warn("[CGW] Unable to delete Agent drop-in files while resetting the quotas: {0}".format(err))
Member

let's not mark it as CGW (cgroup warning), since it is already a warning

from azurelinuxagent.ga.cgroup import CpuCgroup
from azurelinuxagent.common.future import ustr


def log_cgroup_info(formatted_string, op=WALAEventOperation.CGroupsInfo, send_event=True):
Member

this should probably be in CgroupsApi. CgroupsTelemetry handles collecting the resource metrics

Member

@narrieta narrieta left a comment

Sorry, I clicked Approve by mistake.

I added comments to the Agent's code. I will review the tests on a second pass


1. For non-leaf cgroups, the cgroup.subtree_control shows space separated list of the controllers which are
enabled to control resource distribution from the cgroup to its children. All non-root "cgroup.subtree_control"
files can only contain controllers which are enabled in the parent's "cgroup.subtree_control" file.
Contributor

As you stated, for a given cgroup path the information about enabled controllers is in the parent's cgroup.subtree_control file, but you are checking the file in the current path. Also, why do we need to check every cgroup path? Isn't this check needed only for the root?

The documentation states this:

Top-down Constraint (https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#top-down-constraint): Resources are distributed top-down and a cgroup can further distribute a resource only if the resource has been distributed to it from the parent. This means that all non-root "cgroup.subtree_control" files can only contain controllers which are enabled in the parent's "cgroup.subtree_control" file. A controller can be enabled only if the parent has the controller enabled and a controller can't be disabled if one or more children have it enabled.

It looks like the root determines which controllers are enabled; unless a controller is enabled there, you can't enable it in non-root cgroups.

Contributor Author

If controller is in cgroup.subtree_control file then we know it is enabled in the current cgroup and any of its children due to top down constraint. That's why I had that check. If this is unclear then maybe cgroup.controllers is a better check for non-root cgroups, and this method would be better divided by root vs non-root instead of leaf vs non-leaf.

Answer to the second question in your other comments

Contributor

For example, if subtree_control is empty in the agent service cgroup directory but the controllers are enabled in azure.slice, then this fails and reports that the controllers are not enabled when they are.

Contributor Author

It won't fail since it checks for existence of .* interface files, which only exist if controller is enabled

Contributor Author

@maddieford maddieford Mar 20, 2024

I agree cgroup.controllers would be clearer check though for non-root cgroup

Contributor

@nagworld9 nagworld9 Mar 20, 2024

I just want to make sure what I say here is correct. When I check the documentation for cgroup.controllers, it only says which controllers are available to enable; it does not say which ones are enabled.

Each cgroup has a "cgroup.controllers" file which lists all controllers available for the cgroup to enable:
If we want to use this, we need to check: does this file contain all the controllers minus what's mounted in v1, or does it show everything regardless of v1 status?

Contributor Author

If there is a controller (cpu for example) mounted in v1, then it will not be mounted in v2. Controllers cannot be mounted in both hierarchies simultaneously, so it wouldn't appear in the list of available controllers for v2.

cgroup.controllers file lists all controllers available for the cgroup to enable to its children. A cgroup can only enable a controller to its children if it is enabled in the current cgroup itself, due to top down constraint.

Member

I am not sure we need this check other than for the root cgroup. I think we need to check whether cpu & memory are enabled (cgroup.subtree_control) on root. If they are enabled, we should use them. If not, we should not enable them, at least in the initial releases -- we can reconsider this later.

We own the non-root cgroups (walinuxagent.service, azure-walinuxagent-logcollector.slice, azure-vmextensions.slice, azure-vmextensions-Extension.Foo-1.0.0.0.slice) and we should enable the controllers we want to use on those cgroups.
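
To make the distinction in this thread concrete, here is a minimal sketch that reads both interface files for a cgroup v2 directory: "cgroup.controllers" (controllers available to the cgroup) and "cgroup.subtree_control" (controllers enabled for its children). The function names are hypothetical, and error handling is reduced to returning an empty list.

```python
import os


def _read_controller_list(path):
    """Return the space-separated controller names in `path`, or [] if unreadable."""
    try:
        with open(path) as controller_file:
            return controller_file.read().split()
    except (IOError, OSError):
        return []


def get_controllers(cgroup_path):
    """Return (available, enabled_for_children) for a cgroup v2 directory."""
    available = _read_controller_list(os.path.join(cgroup_path, "cgroup.controllers"))
    enabled = _read_controller_list(os.path.join(cgroup_path, "cgroup.subtree_control"))
    return available, enabled
```

For the root-only check suggested above, a caller would pass the root path (for example "/sys/fs/cgroup") and look for "cpu" and "memory" in the second list.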

azurelinuxagent/ga/cgroupapi.py (outdated)
cpu_cgroup_path = None
if cpu_mount_point is not None and cpu_cgroup_relative_path is not None:
cgroup_path = os.path.join(cpu_mount_point, cpu_cgroup_relative_path)
if self.is_controller_enabled('cpu', cgroup_path):
Contributor

same thing here


return cpu_path, memory_path

def start_extension_command(self, extension_name, command, cmd_name, timeout, shell, cwd, env, stdout, stderr, error_code=ExtensionErrorCodes.PluginUnknownFailure): # pylint: disable=W0613
Contributor

This call is protected and is only made if cgroups are enabled. Since we don't enable cgroups for v2 for now, it won't be called.


# check whether cgroup monitoring is supported on the current distro
self._cgroups_supported = CGroupsApi.cgroups_supported()
if not self._cgroups_supported:
-logger.info("Cgroup monitoring is not supported on {0}", get_distro())
+log_cgroup_info("Cgroup monitoring is not supported on {0}".format(get_distro()), send_event=False)
Contributor

@nagworld9 nagworld9 Mar 19, 2024

I feel this is also important info and we should send it. It is just a one-time log (on service start).

azurelinuxagent/ga/cgroupconfigurator.py (outdated)
@@ -836,6 +819,12 @@ def start_extension_command(self, extension_name, command, cmd_name, timeout, sh
extension_name, ustr(exception))
self.disable(reason, DisableCgroups.ALL)
# fall-through and re-invoke the extension
except CGroupsException as exception:
Contributor

@nagworld9 nagworld9 Mar 19, 2024

Not needed, protected with enable flag


# It is possible for different controllers to be simultaneously mounted under v1 and v2. If any are mounted under
# v1, use v1.
if v1.is_cpu_or_memory_mounted():
Contributor

@nagworld9 nagworld9 Mar 20, 2024

Isn't it supposed to be an "and" condition? Do we use v1 only if both cpu and memory are mounted?

I just thought of one scenario: we use v1 for cpu, and the customer has enabled memory in v2 for their application. If we then start using v1 for memory, that will create discrepancies.

Contributor Author

we should use v1 if cpu or memory is mounted in v1. If cpu is mounted in v1 and memory is mounted in v2, then we choose the SystemdCgroupsApiv1() api. When we get the mount points from the v1 api, memory will be None, because memory is not mounted in v1. Since memory mount point is none, agent will not track or enforce memory.
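
The "memory will be None" behavior can be illustrated with a small lookup over v1 mount targets. This is only a sketch under the assumption that the list of mount targets has already been obtained elsewhere; the function name is hypothetical.

```python
import os


def find_controller_mount_point(controller, mount_targets):
    """Return the mount target whose last path component names `controller`, else None."""
    for target in mount_targets:
        # A v1 hierarchy can carry several co-mounted controllers, e.g. "cpu,cpuacct".
        if controller in os.path.basename(target).split(","):
            return target
    return None
```

With only the cpu hierarchy mounted in v1, the lookup for "memory" yields None, so nothing would be tracked or enforced for memory.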

Contributor

Initially I was thinking one could mount both controllers in v1 and still use them in v2, since v2 does not require mounting controllers separately and they can simply be enabled through the subtree_control file. After reading more and experimenting, that is not possible: to use the same controller in v2, it must first be unmounted in v1. So your check looks ok now. If they unmount the controller in v1, we get None, so we don't use that controller.

A cgroup v2 controller is available only if it is not currently in use via a mount against a cgroup v1 hierarchy. Or, to put things another way, it is not possible to employ the same controller against both a v1 hierarchy and the unified v2 hierarchy. This means that it may be necessary first to unmount a v1 controller (as described above) before that controller is available in v2. Since [systemd(1)](https://www.man7.org/linux/man-pages/man1/systemd.1.html) makes heavy use of some v1 controllers by default, it can in some cases be simpler to boot the system with selected v1 controllers disabled. To do this, specify the cgroup_no_v1=list option on the kernel boot command line; list is a comma-separated list of the names of the controllers to disable, or the word all to disable all v1 controllers. (This situation is correctly handled by [systemd(1)](https://www.man7.org/linux/man-pages/man1/systemd.1.html), which falls back to operating without the specified controllers.)
https://www.man7.org/linux/man-pages/man7/cgroups.7.html

Member

Seems like we can simplify this by asking for the file system type

If you wonder how to detect which of these three modes is currently used, use statfs() on /sys/fs/cgroup/. If it reports CGROUP2_SUPER_MAGIC in its .f_type field, then you are in unified mode. If it reports TMPFS_MAGIC then you are either in legacy or hybrid mode. To distinguish these two cases, run statfs() again on /sys/fs/cgroup/unified/. If that succeeds and reports CGROUP2_SUPER_MAGIC you are in hybrid mode, otherwise not. From a shell, you can check the Type in stat -f /sys/fs/cgroup and stat -f /sys/fs/cgroup/unified.

Then if unified use v2, else v1.

Results from ubuntu 20 (hybrid) and 22 (unified)

root@nam-u20:/home/nam# stat -f --format=%T /sys/fs/cgroup
tmpfs

root@nam-ubuntu22:/home/nam# stat -f --format=%T /sys/fs/cgroup
cgroup2fs
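
The detection described in the quoted documentation could be sketched as below, using the same `stat -f --format=%T` command shown in the shell output. The function names and the mode strings are assumptions for this example; the classification itself follows the quoted rule (cgroup2fs at the root means unified, tmpfs means legacy or hybrid depending on /sys/fs/cgroup/unified).

```python
import subprocess

CGROUP_ROOT = "/sys/fs/cgroup"


def classify_cgroup_mode(root_fs_type, unified_fs_type=None):
    """Map file system types to "unified", "hybrid", "legacy" or "unknown"."""
    if root_fs_type == "cgroup2fs":
        return "unified"
    if root_fs_type == "tmpfs":
        return "hybrid" if unified_fs_type == "cgroup2fs" else "legacy"
    return "unknown"


def get_fs_type(path):
    """Return the output of `stat -f --format=%T <path>`, or None on failure."""
    try:
        output = subprocess.check_output(["stat", "-f", "--format=%T", path],
                                         stderr=subprocess.STDOUT)
    except (OSError, subprocess.CalledProcessError):
        return None
    return output.decode("utf-8").strip()


def detect_cgroup_mode():
    return classify_cgroup_mode(get_fs_type(CGROUP_ROOT),
                                get_fs_type(CGROUP_ROOT + "/unified"))
```

Keeping the classification separate from the `stat` call makes the rule easy to unit test without mocking the file system.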


# check whether cgroup monitoring is supported on the current distro
self._cgroups_supported = CGroupsApi.cgroups_supported()
if not self._cgroups_supported:
-logger.info("Cgroup monitoring is not supported on {0}", get_distro())
+log_cgroup_info("Cgroup monitoring is not supported on {0}".format(get_distro()), send_event=False)
Member

There are 2 checks for CGroupsApi.cgroups_supported(); one here and one on line 143. Can we merge them?

log_cgroup_info("Cgroup monitoring is not supported on {0}".format(get_distro()), send_event=False)
return

# Determine which version of the Cgroup API should be used. If the correct version can't be determined,
Member

the check for v1 vs v2 should come after the check for systemd on line 175 below

maddieford and others added 4 commits March 26, 2024 10:22
* get_cgroup_api should raise exception when controllers not mounted

* Combine cgroups_supported() check

* Combine SystemdCgroupsApi and CGroupApi classes

* fix pylint and tests with sudo

* Rename SystemdCgroupsApi to SystemdCgroupApi

* Cgroup should be singular when referring to the APi

* Unimpleneted methods should raise NotImplementederror

* Check for cpu,cpuacct

* v2 start extension command should not be implemented

* log_cgorup_info and log_cgroup_warning should be in cgroupapi

* Systemd check should come before api

* Explicitly check for empty dict

* Only check if controllers are enabled at root for v2

* Remove unnecessary mocked paths in mock cgroup env

* V2 does not have concept of mounting controllers

* Fix super call for python 2

* get_cgroup_api should be function

* Move logging functions up

* Use stat -f to get cgroup mode

* Mock hybrid path

* Fix unit tests:

* Debug tests

* Debug tests

* Debug unit tests

* Fix unit tests

* Fix pylint

* Fix e2e test for v2

* Fix e2e test

* Fix e2e test

* Fix e2e test

* Combine common implementations
def __init__(self):
super(SystemdCgroupApiv2, self).__init__()
self._root_cgroup_path = None
self._controllers_enabled_at_root = []
Member

@narrieta narrieta Mar 29, 2024

Seems like these two properties are initialized as a side effect of calling get_controller_root_paths(). That does not seem correct. For example, if somebody calls is_controller_enabled_at_root() before calling get_controller_root_paths() then self._root_cgroup_path won't be initialized.
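
One possible way to address this concern is to populate both attributes in the constructor, so that no method depends on another having been called first. The class name, the default root path and the helper below are hypothetical; this is a sketch of the idea, not the change that was made in the PR.

```python
import os


class CgroupV2RootState(object):  # hypothetical name, for illustration only
    def __init__(self, root_cgroup_path="/sys/fs/cgroup"):
        # Both attributes are fully initialized here, so call order does not matter.
        self._root_cgroup_path = root_cgroup_path
        self._controllers_enabled_at_root = self._read_enabled_controllers()

    def _read_enabled_controllers(self):
        path = os.path.join(self._root_cgroup_path, "cgroup.subtree_control")
        try:
            with open(path) as subtree_control:
                return subtree_control.read().split()
        except (IOError, OSError):
            return []

    def is_controller_enabled_at_root(self, controller):
        return controller in self._controllers_enabled_at_root
```

The cost is that the file is read once at construction time; if the enabled controllers can change while the agent runs, a refresh method would be needed.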

self._root_cgroup_path = None
self._controllers_enabled_at_root = []

def is_controller_enabled_at_root(self, controller):
Member

Looks like this can be marked as private


class CGroupsApi(object):

class CGroupUtil(object):
Contributor Author

There are several static methods shared between cgroupconfigurator and systemdcgroupapi, which are not related to systemd cgroup api implementation. I added them to this CGroupUtil

azurelinuxagent/ga/cgroupapi.py
@@ -63,18 +69,18 @@ def cgroups_supported():
(distro_name.lower() in ('centos', 'redhat') and 8 <= distro_version.major < 9)

@staticmethod
def track_cgroups(extension_cgroups):
Contributor Author

This was not being used

unified_hierarchy_path = os.path.join(CGROUP_FILE_SYSTEM_ROOT, "unified")
if os.path.exists(unified_hierarchy_path) and shellutil.run_command(["stat", "-f", "--format=%T", unified_hierarchy_path]).rstrip() == "cgroup2fs":
# Hybrid mode is being used. Check if any controllers are available to be enabled in the unified hierarchy.
available_unified_controllers_file = os.path.join(unified_hierarchy_path, "cgroup.controllers")
Contributor Author

From systemd documentation:

Hybrid — this is a hybrid between the unified and legacy mode. It’s set up mostly like legacy, except that there’s also an additional hierarchy /sys/fs/cgroup/unified/ that contains the cgroup v2 hierarchy. (Note that in this mode the unified hierarchy won’t have controllers attached, the controllers are all mounted as separate hierarchies as in legacy mode, i.e. /sys/fs/cgroup/unified/ is purely and exclusively about core cgroup v2 functionality and not about resource management.) In this mode compatibility with cgroup v1 is retained while some cgroup v2 features are available too. This mode is a stopgap. Don’t bother with this too much unless you have too much free time

According to documentation, controllers shouldn't be added in this mode. I added this check in case customer is doing something weird to attach controllers to unified hierarchy in hybrid mode, but let me know if you think I should remove it.

def __init__(self):
self._cgroup_mountpoints = None
Contributor Author

Removing concept of mountpoints in base class since it is not applicable to v2.



class SystemdCgroupsApiTestCase(AgentTestCase):
def test_get_systemd_version_should_return_a_version_number(self):
Contributor Author

moved to line 115

found = re.search(r"systemd \d+", version_info) is not None
self.assertTrue(found, "Could not determine the systemd version: {0}".format(version_info))

def test_get_cpu_and_memory_mount_points_should_return_the_cgroup_mount_points(self):
Contributor Author

moved to SystemdCgroupv1Api and SystemdCgroupv2Api test classes

self.assertEqual(cpu, '/sys/fs/cgroup/cpu,cpuacct', "The mount point for the CPU controller is incorrect")
self.assertEqual(memory, '/sys/fs/cgroup/memory', "The mount point for the memory controller is incorrect")

def test_get_service_cgroup_paths_should_return_the_cgroup_mount_points(self):
Contributor Author

This was also moved to each API implementation's test class.


def test_get_cpu_and_memory_cgroup_relative_paths_for_process_should_return_the_cgroup_relative_paths(self):
Contributor Author

Moved to each API implementation's test class.

self.assertEqual(cpu, "system.slice/walinuxagent.service", "The relative path for the CPU cgroup is incorrect")
self.assertEqual(memory, "system.slice/walinuxagent.service", "The relative memory for the CPU cgroup is incorrect")

def test_get_cgroup2_controllers_should_return_the_v2_cgroup_controllers(self):
Contributor Author

no longer necessary since get_cgroup2_controllers was removed

azurelinuxagent/ga/cgroupapi.py
cpu_controller_root,
memory_controller_root)

if self.cgroup_v2_enabled():
Contributor

@nagworld9 nagworld9 Apr 1, 2024

Previously we did not have this kind of implementation, and we did not enable cgroups v2 based on the controllers check. Do we still need to run 193 now? Can we move this check above?

* Run unit tests

* Clean up drop in files if cgroups are disabled

* Init values for cgroup apis

* Revert test change
if not os.path.exists(CGROUP_FILE_SYSTEM_ROOT):
v1_mount_point = shellutil.run_command(['findmnt', '-t', 'cgroup', '--noheadings'])
v2_mount_point = shellutil.run_command(['findmnt', '-t', 'cgroup2', '--noheadings'])
raise CGroupsException("Expected cgroup filesystem to be mounted at '/sys/fs/cgroup', but it is not.\n v1 mount point: {0}\n v2 mount point: {1}".format(v1_mount_point, v2_mount_point))
Member

Minor: for v1 it will be a list of mountpoints, let's add a newline v1 mount point: \n{0}\n. Maybe the same for v2 for consistency.
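
A minimal sketch of the suggested message layout. The helper name `describe_unexpected_mounts` is hypothetical, and the inputs stand in for the `findmnt` output captured in the code under review.

```python
CGROUP_FILE_SYSTEM_ROOT = "/sys/fs/cgroup"  # path named in the PR description


def describe_unexpected_mounts(v1_mount_points, v2_mount_points):
    # Start each findmnt listing on its own line so multi-line output stays readable.
    return ("Expected cgroup filesystem to be mounted at '{0}', but it is not.\n"
            " v1 mount point: \n{1}\n"
            " v2 mount point: \n{2}").format(CGROUP_FILE_SYSTEM_ROOT, v1_mount_points, v2_mount_points)
```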

maddieford and others added 3 commits April 4, 2024 10:43
* Fix merge issues

* Fix unit tests
* get_cgroup_api can raise InvalidCgroupMountpointException

* Add unit test for agent
log_cgroup_info("Using cgroup v1 for resource enforcement and monitoring")
return cgroup_api

raise CGroupsException("Detected unknown cgroup mode: {0}".format(root_hierarchy_mode))
Member

I'd suggest "{0} has an unexpected file type: {1}".format(CGROUP_FILE_SYSTEM_ROOT, root_hierarchy_mode)
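
A minimal sketch of version selection using the suggested wording. It assumes `root_hierarchy_mode` holds the output of `stat -f --format=%T /sys/fs/cgroup`, and that `tmpfs` indicates a legacy or hybrid hierarchy while `cgroup2fs` indicates a unified one, as the systemd documentation describes. `select_cgroup_version` is a hypothetical name, and `ValueError` stands in for the agent's `CGroupsException`.

```python
CGROUP_FILE_SYSTEM_ROOT = "/sys/fs/cgroup"


def select_cgroup_version(root_hierarchy_mode):
    if root_hierarchy_mode == "cgroup2fs":
        return "v2"  # unified hierarchy
    if root_hierarchy_mode == "tmpfs":
        return "v1"  # legacy or hybrid hierarchy
    # Report the value for what it is: the file type of the cgroup root.
    raise ValueError("{0} has an unexpected file type: {1}".format(CGROUP_FILE_SYSTEM_ROOT, root_hierarchy_mode))
```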

"""
cpu_mountpoint = self._cgroup_mountpoints.get('cpu,cpuacct')
memory_mountpoint = self._cgroup_mountpoints.get('memory')
if cpu_mountpoint is not None and cpu_mountpoint != '/sys/fs/cgroup/cpu,cpuacct':
Member

Use CGROUP_FILE_SYSTEM_ROOT instead of hardcoded path, same for memory below
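
A minimal sketch of the comparison built from the constant instead of a literal path. The helper name `is_expected_mountpoint` is hypothetical.

```python
import os

CGROUP_FILE_SYSTEM_ROOT = "/sys/fs/cgroup"


def is_expected_mountpoint(controller, mountpoint):
    # A controller is expected directly under the systemd-created root.
    return mountpoint == os.path.join(CGROUP_FILE_SYSTEM_ROOT, controller)
```

The same helper would cover both the `cpu,cpuacct` and `memory` checks.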

azurelinuxagent/ga/cgroupapi.py
return root_cgroup_path
return None

def _get_controllers_enabled_at_root(self):
Member

This depends on _get_root_cgroup_path being called first, in order for self._root_cgroup_path to be initialized. I'd make that dependency explicit by making this a static method and passing the path as an argument, instead of referencing self._root_cgroup_path.
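
A minimal sketch of that suggestion, assuming plain file I/O in place of the agent's `fileutil`. The class name `CgroupApiSketch` is hypothetical.

```python
import os


class CgroupApiSketch(object):
    @staticmethod
    def _get_controllers_enabled_at_root(root_cgroup_path):
        # The path arrives as an argument, so no call ordering is implied.
        if root_cgroup_path is None:
            return []
        enabled_controllers_file = os.path.join(root_cgroup_path, "cgroup.subtree_control")
        if not os.path.exists(enabled_controllers_file):
            return []
        with open(enabled_controllers_file) as f:
            return f.read().split()
```

A caller would then pass the result of the root-path lookup directly, which makes the dependency visible at the call site.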

cpu_path = None
memory_path = None
for line in fileutil.read_file("/proc/{0}/cgroup".format(process_id)).splitlines():
match = re.match(r'\d+::(?P<path>\S+)', line)
Member

you should probably match 0 rather than \d+
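
A minimal sketch of the stricter match. In `/proc/<pid>/cgroup`, the cgroup v2 entry always has the form `0::<path>`, so matching a literal `0` avoids picking up any other line. The function name is hypothetical, and stripping the leading slash is an assumption made to produce a relative path.

```python
import re


def get_v2_cgroup_relative_path(proc_cgroup_contents):
    for line in proc_cgroup_contents.splitlines():
        # cgroup v2 entries use hierarchy ID 0 and an empty controller list.
        match = re.match(r'0::(?P<path>\S+)', line)
        if match is not None:
            return match.group('path').lstrip('/')
    return None
```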

@@ -478,6 +469,9 @@ def agent_enabled(self):
def extensions_enabled(self):
return self._extensions_cgroups_enabled

def cgroup_v2_enabled(self):
Member

rename to using_cgroup_v2 or similar?

currently cgroup_v2_enabled() can return True when enabled() returns False. That is just weird.

return

self.__setup_azure_slice()

cpu_controller_root, memory_controller_root = self.__get_cgroup_controllers()
self._agent_cpu_cgroup_path, self._agent_memory_cgroup_path = self.__get_agent_cgroups(agent_slice,
if self.cgroup_v2_enabled():
Contributor

BTW with this PR, the azure slice/drop in files are not created until after we check if v2 is in use. So on new machines with this update, drop-in files won't be created if v2 is in use.

@maddieford you made this comment. Do you plan to move this condition above self.__setup_azure_slice()? As of now the code does not match what you said.

Contributor Author

You're right, I thought I moved it. I'll move it in next commit
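
A minimal sketch of the ordering being agreed on here: check for v2 before any slice or drop-in files are written. The class, method, and action strings are all hypothetical and only illustrate the control flow, not the agent's actual configurator.

```python
class ConfiguratorSketch(object):
    def __init__(self, using_cgroup_v2):
        self._using_cgroup_v2 = using_cgroup_v2
        self.actions = []  # records what initialize() did, for illustration

    def _setup_azure_slice(self):
        self.actions.append("created azure slice drop-in files")

    def initialize(self):
        # Check the cgroup version first so nothing is written on v2 machines.
        if self._using_cgroup_v2:
            self.actions.append("cgroup v2 detected; cgroups not enabled")
            return
        self._setup_azure_slice()
        self.actions.append("enabled cgroup v1 monitoring")
```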

if self._root_cgroup_path is not None:
enabled_controllers_file = os.path.join(self._root_cgroup_path, 'cgroup.subtree_control')
if os.path.exists(enabled_controllers_file):
controllers_enabled_at_root = fileutil.read_file(enabled_controllers_file).rstrip().split(" ")
Contributor

This only handles a single space on split. Should we consider multiple whitespace characters and just call split without a delimiter?

Contributor Author

systemd mentions it is a space separated list. Anything else would be unexpected.

I'll remove delimiter to be safe
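
A minimal sketch of the difference, using a hypothetical helper name. Calling `split()` with no argument splits on any run of whitespace and ignores leading and trailing whitespace, so the separate `rstrip()` is no longer needed.

```python
def parse_controllers(contents):
    # No delimiter: runs of spaces, tabs and newlines all act as one separator.
    return contents.split()
```

With an explicit `" "` delimiter, a doubled space produces an empty entry and an empty file produces `['']`; without a delimiter, both cases yield clean lists.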

@maddieford maddieford merged commit f84cde2 into Azure:develop Apr 5, 2024
11 checks passed
@maddieford maddieford deleted the cgroups_api_v2_refactor branch August 26, 2024 23:09
nagworld9 added a commit that referenced this pull request Nov 13, 2024
* Add support for Azure Clouds (#2795)

* Add support for Azure Clouds
---------

Co-authored-by: narrieta <narrieta>

* Check certificates only if certificates are included in goal state and update test-requirements to remove codecov (#2803)

* Update version to dummy 1.0.0.0

* Revert version change

* Only check certificates if goal state includes certs

* Fix code coverage deprecated issue

* Move condition to function call

* Add tests for no outbound connectivity (#2804)

* Add tests for no outbound connectivity

---------

Co-authored-by: narrieta <narrieta>

* Use cloud when validating test location (#2806)

* Use cloud when validating test location
---------

Co-authored-by: narrieta <narrieta>

* Redact access tokens from extension's output (#2811)

* Redact access tokens from extension's output

* python 2.6

---------

Co-authored-by: narrieta <narrieta>

* Add @GabstaMSFT as code owner (#2813)

Co-authored-by: narrieta <narrieta>

* Fix name of single IB device when provisioning RDMA (#2814)

The current code assumes the ipoib interface name is ib0 when single IB
interface is provisioned. This is not always true when udev rules are used
to rename to other names like ibPxxxxx.

Fix this by searching any interface name starting with "ib".

* Allow tests to run on random images (#2817)

* Allow tests to run on random images

* PR feedback

---------

Co-authored-by: narrieta <narrieta>

* Bug fixes for end-to-end tests (#2820)

Co-authored-by: narrieta <narrieta>

* Enable all Azure clouds on end-to-end tests (#2821)

Co-authored-by: narrieta <narrieta>

* Add Azure CLI to container image (#2822)

Co-authored-by: narrieta <narrieta>

* Fixes for Azure clouds (#2823)

* Fixes for Azure clouds

* add debug info

---------

Co-authored-by: narrieta <narrieta>

* Add test for extensions disabled; refactor VirtualMachine and VmExtension (#2824)

* Add test for extensions disabled; refactor VirtualMachine and VmExtension
---------

Co-authored-by: narrieta <narrieta>

* Fixes for end-to-end tests (#2827)

Co-authored-by: narrieta <narrieta>

* Add test for osProfile.linuxConfiguration.provisionVMAgent (#2826)

* Add test for osProfile.linuxConfiguration.provisionVMAgent

* add files

* pylint

* added messages

* ssh issue

---------

Co-authored-by: narrieta <narrieta>

* Enable suppression rules for waagent.log (#2829)

Co-authored-by: narrieta <narrieta>

* Wait for service start when setting up test VMs; collect VM logs when setup fails (#2830)

Co-authored-by: narrieta <narrieta>

* Add vm arch to heartbeat telemetry (#2818) (#2838)

* Add VM Arch to heartbeat telemetry

* Remove outdated vmsize heartbeat test

* Remove unused import

* Use platform to get vmarch

(cherry picked from commit 66e8b3d)

* Add regular expression to match logs from very old agents (#2839)

Co-authored-by: narrieta <narrieta>

* Increase concurrency level for end-to-end tests (#2841)

Co-authored-by: narrieta <narrieta>

* Agent update refactor supports GA versioning (#2810)

* agent update refactor (#2706)

* agent update refactor

* address PR comments

* updated available agents

* fix pylint warn

* updated test case warning

* added kill switch flag

* fix pylint warning

* move last update attempt variables

* report GA versioning supported feature. (#2752)

* control agent updates in e2e tests and fix uts (#2743)

* disable agent updates in dcr and fix uts

* address comments

* fix uts

* report GA versioning feature

* Don't report SF flag if auto update is disabled (#2754)

* fix uts (#2759)

* agent versioning test_suite (#2770)

* agent versioning test_suite

* address PR comments

* fix pylint warning

* fix update assertion

* fix pylint error

* logging manifest type and don't log same error until next period in agent update. (#2778)

* improve logging and don't log same error until next period

* address comments

* update comment

* update comment

* Added self-update time window. (#2794)

* Added self-update time window

* address comment

* Wait and retry for rsm goal state (#2801)

* wait for rsm goal state

* address comments

* Not sharing agent update tests vms and added scenario to daily run (#2809)

* add own vm property

* add agent_update to daily run

* merge conflicts

* address comments

* address comments

* additional comments addressed

* fix pylint warning

* Add test for FIPS (#2842)

* Add test for FIPS

* add test

* increase sleep

* remove unused file

* added comment

* check uptime

---------

Co-authored-by: narrieta <narrieta>

* Eliminate duplicate list of test suites to run (#2844)

* Eliminate duplicate list of test suites to run

* fix paths

* add agent update

---------

Co-authored-by: narrieta <narrieta>

* Port NSBSD system to the latest version of waagent (#2828)

* nsbsd: adapt to recent dns.resolver

* osutil: Provide a get_root_username function for systems where its not 'root' (like in nsbsd)

* nsbsd: tune the configuration filepath

* nsbsd: fix lib installation path

---------

Co-authored-by: Norberto Arrieta <narrieta@users.noreply.github.com>

* Fix method name in update test (#2845)

Co-authored-by: narrieta <narrieta>

* Expose run name as a runbook variable (#2846)

Co-authored-by: narrieta <narrieta>

* Collect test artifacts as a separate step in the test pipeline (#2848)

* Collect test artifacts as a separate step in the test pipeline
---------

Co-authored-by: narrieta <narrieta>

* remove agent update test and py27 version from build (#2853)

* Fix infinite retry loop in end to end tests (#2855)

* Fix infinite retry loop

* fix message

---------

Co-authored-by: narrieta <narrieta>

* Remove empty "distro" module (#2854)

Co-authored-by: narrieta <narrieta>

* Enable Python 2.7 for unit tests (#2856)

* Enable Python 2.7 for unit tests

---------

Co-authored-by: narrieta <narrieta>

* Skip downgrade if requested version below daemon version (#2850)

* skip downgrade for agent update

* add test

* report it in status

* address comments

* revert change

* improved error msg

* address comment

* update location schema and added skip clouds in suite yml (#2852)

* update location schema in suite yml

* address comments

* .

* pylint warn

* comment

* Do not collect LISA logs by default (#2857)

Co-authored-by: narrieta <narrieta>

* Add check for noexec on Permission denied errors (#2859)

* Add check for noexec on Permission denied errors

* remove type annotation

---------

Co-authored-by: narrieta <narrieta>

* Wait for log message in AgentNotProvisioned test (#2861)

* Wait for log message in AgentNotProvisioned test

* hardcoded value

---------

Co-authored-by: narrieta <narrieta>

* Always collect logs on end-to-end tests (#2863)

* Always collect logs

* cleanup

---------

Co-authored-by: narrieta <narrieta>

* agent publish scenario (#2847)

* agent publish

* remove vm size

* address comments

* daemon version fallback

* daemon version fix

* address comments

* fix pylint error

* address comment

* added error handling

* add time window for agent manifest download (#2860)

* add time window for agent manifest download

* address comments

* address comments

* ignore 75-persistent-net-generator.rules in e2e tests (#2862)

* ignore 75-persistent-net-generator.rules in e2e tests

* address comment

* remove

* Always publish artifacts and test results (#2865)

Co-authored-by: narrieta <narrieta>

* Add tests for extension workflow (#2843)

* Update version to dummy 1.0.0.0

* Revert version change

* Basic structure

* Test must run in SCUS for test ext

* Add GuestAgentDCRTest Extension id

* Test structure

* Update test file name

* test no location

* Test location as southcentralus

* Assert ext is installed

* Try changing version for dcr test ext

* Update expected message in instance view

* try changing message to string

* Limit images for ext workflow

* Update classes after refactor

* Update class name

* Refactor tests

* Rename extension_install to extension_workflow

* Assert ext status

* Assert operation sequence is expected

* Remove logger reference

* Pass ssh client

* Update ssh

* Add permission to run script

* Correct permissions

* Add execute permissions for helper script

* Make scripts executable

* Change args to string

* Add required parameter

* Add shebang for restart_agent

* Fix arg format

* Use restart utility

* Run restart with sudo

* Add enable scenario

* Attempt to remove start_time

* Only assert enable

* Add delete scenario

* Fix uninstall scenario

* Add extension update scenario

* Run assert scenario on update scenario

* Fix reference to ext

* Format args as str instead of arr

* Update test args

* Add test case for update without install

* Fix delete

* Keep changes

* Save changes

* Add special chars test case

* Fix dcr_ext issue

* Add validate no lag scenario

* Fix testguid reference

* Add additional log statements for debugging

* Fix message to check before encoding

* Encode setting name

* Correctly check data

* Make check data executable

* Fix command args for special char test

* Fix no lag time

* Fix ssh client reference

* Try message instead of text

* Remove unused method

* Start clean up

* Continue code cleanup

* Fix pylint errors

* Fix pylint errors

* Start refactor

* Debug agent lag

* Update lag logging

* Fix assert_that for lag

* Remove typo

* Add readme for extension_workflow scenario

* Reformat comment

* Improve logging

* Refactor assert scenario

* Remove unused constants

* Remove unused parameter in assert scenario

* Add logging

* Improve logging

* Improve logging

* Fix soft assertions issue

* Remove todo for delete polling

* Remove unnecessary new line

* removed unnecessary function

* Make special chars log more readable

* remove unnecessary log

* Add version to add or update log

* Remove unnecessary assert instance view

* Add empty log line

* Add update back to restart args to debug

* Add update back to restart args to debug

* Remove unused init

* Remove test_suites from pipeline yml

* Update location in test suite yml

* Add comment for location restriction

* Remove unused init and fix comments

* Improve method header

* Rename scripts

* Remove print_function

* Rename is_data_in_waagent_log

* Add comments describing assert operation sequence script

* add comments to scripts and type annotate assert operation sequence

* Add GuestAgentDcrExtension source code to repo

* Fix typing.dict error

* Fix typing issue

* Remove outdated comment

* Add comments to extension_workflow.py

* rename scripts to match test suite name

* Ignore pylint warnings on test ext

* Update pylint rc to ignore tests_e2e/GuestAgentDcrTestExtension

* Update pylint rc to ignore tests_e2e/GuestAgentDcrTestExtension

* disable all errors/warnings dcr test ext

* disable all errors/warnings dcr test ext

* Run workflow on debian

* Revert to dcr config distros

* Move enable increment to beginning of function

* Fix gs completed regex

* Remove unnecessary files from dcr test ext dir

* Update agent_ext_workflow.yml to skip China and Gov clouds (#2872)

* Update agent_ext_workflow.yml to skip China and Gov clouds

* Update tests_e2e/test_suites/agent_ext_workflow.yml

* fix daemon version (#2874)

* Wait for extension goal state processing before checking for lag in log (#2873)

* Update version to dummy 1.0.0.0

* Revert version change

* Add sleep time to allow goal state processing to complete before lag check

* Add retry logic to gs processing lag check

* Clean up retry logic

* Add back empty line

* Fix timestamp parsing issue

* Fix timestamp parsing issue

* Fix timestamp parsing issue

* Do 3 retries

* Extract tarball with xvf during setup (#2880)

In a pipeline run we saw the following error when extracting the tarball on the test node:

Adding v to extract the contents with verbose

* enable agent update in daily run (#2878)

* Create Network Security Group for test VMs (#2882)

* Create Network Security Group for test VMs

* error handling

---------

Co-authored-by: narrieta <narrieta>

* don't allow downgrades for self-update (#2881)

* don't allow downgrades for self-update

* address comments

* update comment

* add logger

* Suppress telemetry failures from check agent log (#2887)

Co-authored-by: narrieta <narrieta>

* Install assertpy on test VMs (#2886)

* Install assertpy on test VMs

* set versions

---------

Co-authored-by: narrieta <narrieta>

* Add sample remote tests (#2888)

* Add sample remote tests

* add pass

* review feedback

---------

Co-authored-by: narrieta <narrieta>

* Enable Extensions.Enabled in tests (#2892)

* enable Extensions.Enabled

* address comment

* address comment

* use script

* improve msg

* improve msg

* Reorganize file structure of unit tests (#2894)

* Reorganize file structure of unit tests

* remove duplicate

* add init

* mocks

---------

Co-authored-by: narrieta <narrieta>

* Report useful message when extension processing is disabled (#2895)

* Update version to dummy 1.0.0.0

* Revert version change

* Fail GS fast in case of extensions disabled

* Update extensions_disabled scenario to look for GS failed instead of timeout when extensions are disabled

* Update to separate onHold and extensions enabled

* Report ext disabled error in handler status

* Try using GoalStateUnknownFailure

* Fix indentation error

* Try failing ext handler and checking logs

* Report ext processing error

* Attempt to fail fast

* Fix param name

* Init error

* Try to reuse current code

* Try to reuse current code

* Clean code

* Update scenario tests

* Add ext status file to fail fast

* Fail fast test

* Report error when ext disabled

* Update timeout to 20 mins

* Re enable ext for debugging

* Re enable ext for debugging

* Log agent status update

* Create ext status file with error code

* Create ext status file with error code

* We should report handler status even if not installed in case of extensions disabled

* Clean up code change

* Update tests for extensions disabled

* Update test comment

* Update test

* Remove unused line

* Remove unused timeout

* Test failing case

* Remove old case

* Remove unused import

* Test multiconfig ext

* Add multi-config test case

* Clean up test

* Improve logging

* Fix dir for testfile

* Remove ignore error rules

* Remove unused imports

* Set handler status to not ready explicitly

* Use OS Util to get agent conf path

* Retry tar operations after 'Unexpected EOF in archive' during node setup (#2891)

* Update version to dummy 1.0.0.0

* Revert version change

* Capture output of the copy commands during setup

* Add verbose to copy command

* Update typing for copy to node methods

* Print contents of tar before extracting

* Print contents of tar before extracting

* Print contents of tar before extracting

* Print contents of tar before extracting

* Retry copying tarball if contents on test node do not match

* Revert copy method def

* Revert copy method def

* Catch EOF error

* Retry tar operations if we see failure

* Revert target_path

* Remove accidental copy of exception

* Remove blank line

* tar cvf and copy commands overwrite

* Add log and telemetry event for extension disabled (#2897)

* Update version to dummy 1.0.0.0

* Revert version change

* Add logs and telemetry for processing extensions when extensions disabled

* Reformat string

* Agent status scenario (#2875)

* Update version to dummy 1.0.0.0

* Revert version change

* Create files for agent status scenario

* Add agent status test logic

* fix pylint error

* Add comment for retry

* Mark failures as exceptions

* Improve messages in logs

* Improve comments

* Update comments

* Check that agent status updates without processing additional goal states 3 times

* Remove unused agent status exception

* Update comment

* Clean up comments, logs, and imports

* Exception should inherit from baseexception

* Import datetime

* Import datetime

* Import timedelta

* instance view time is already formatted

* Increase status update time

* Increase status update time

* Increase status update time

* Increase timeout

* Update comments and timeouts

* Allow retry if agent status timestamp isn't updated after 30s

* Remove unused import

* Update time value in comment

* address PR comments

* Check if properties are None

* Make types & errors more readable

* Re-use vm_agent variable

* Add comment for dot operator

* multi config scenario (#2898)

* Update version to dummy 1.0.0.0

* Revert version change

* multi config scenario bare bones

* multi config scenario bare bones

* Stash

* Add multi config test

* Run on arm64

* RCv2 is not supported on arm64

* Test should own VM

* Add single config ext to test

* Add single config ext to test

* Do not fail test if there are unexpected extensions on the vm

* Update comment for accuracy

* Make resource name parameter optional

* Clean up code

* agent and ext cgroups scenario (#2866)

* agent-cgroups scenario

* address comments

* address comments

* fix-pylint

* pylint warn

* address comments

* improved logging

* improved ext cgroups scenario

* new changes

* pylint fix

* updated

* address comments

* pylint warn

* address comment

* merge conflicts

* agent firewall scenario (#2879)

* agent firewall scenario

* address comments

* improved logging

* pylint warn

* address comments

* updated

* address comments

* pylint warning

* pylint warning

* address comment

* merge conflicts

* Add retry and improve the log messages in agent update test (#2890)

* add retry

* improve log messages

* merge conflicts

* Cleanup common directory (#2902)

Co-authored-by: narrieta <narrieta>

* improved logging (#2893)

* skip test in mooncake and usgov (#2904)

* extension telemetry pipeline scenario (#2901)

* Update version to dummy 1.0.0.0

* Revert version change

* Barebones for etp

* Scenario should own VM because of conf change

* Add extension telemetry pipeline test

* Clean up code

* Improve log messages

* Fix pylint errors

* Improve logging

* Improve code comments

* VmAccess is not supported on flatcar

* Address PR comments

* Add support_distros in VmExtensionIdentifier

* Fix logic for support_distros in VmExtensionIdentifier

* Use run_remote_test for remote script

* Ignore logcollector fetch failure if it recovers (#2906)

* download_fail unit test should use agent version in common instead of 9.9.9.9 (#2908) (#2912)

(cherry picked from commit ed80388)

* Download certs on FT GS after check_certificates only when missing from disk (#2907) (#2913)

* Download certs on FT GS only when missing from disk

* Improve telemetry for inconsistent GS

* Fix string format

(cherry picked from commit c13f750)

* Update pipeline.yml to increase timeout to 90 minutes (#2910)

Runs have been timing out after 60 minutes due to multiple scenarios sharing VMs

* Fix agent memory usage check (#2903)

* fix memory usage check

* add test

* added comment

* fix test

* disable ga versioning changes (#2917)

* Disable ga versioning changes (#2909)

* disable rsm changes

* add flag

(cherry picked from commit 5a4fae8)

* merge conflicts

* fix the ignore rule in agent update test (#2915) (#2918)

* ignore the agent installed version

* address comments

* address comments

* fixes

(cherry picked from commit 8985a42)

* Use Mariner 2 in FIPS test (#2916)

* Use Mariner 2 in FIPS test
---------

Co-authored-by: narrieta <narrieta>

* Change pipeline timeout to 90 minutes (#2925)

* fix version checking (#2920)

Co-authored-by: Norberto Arrieta <narrieta@users.noreply.github.com>

* mariner container image (#2926)

* mariner container image

* added packages repo

* addressed comments

* addressed comments

* Fix for "local variable _COLLECT_NOEXEC_ERRORS referenced before assignment" (#2935)

* Fix for "local variable _COLLECT_NOEXEC_ERRORS referenced before assignment"

* pylint

---------

Co-authored-by: narrieta <narrieta>

* fix agent manifest call frequency (#2923) (#2932)

* fix agent manifest call frequency

* new approach

(cherry picked from commit 6554032)

* enable rhel/centos cgroups (#2922)

* Add support for EC certificates (#2936)

* Add support for EC certificates

* pylint

* pylint

* typo

---------

Co-authored-by: narrieta <narrieta>

* Add Cpu Arch in local logs and telemetry events (#2938)

* Add cpu arch to telem and local logs

* Change get_vm_arch to static method

* update unit tests

* Remove e2e pipeline file

* Remove arch from heartbeat

* Move get_vm_arch to osutil

* fix syntax issue

* Fix unit test

* skip cgroup monitor (#2939)

* Clarify support status of installing from source. (#2941)

Co-authored-by: narrieta <narrieta>

* agent cpu quota scenario (#2937)

* agent_cpu_quota scenario

* addressed comments

* addressed comments

* skip test version install (#2950)

* skip test install

* address comments

* pylint

* local run stuff

* undo

* Add support for VM Scale Sets to end-to-end tests (#2954)

---------

Co-authored-by: narrieta <narrieta>

* Ignore dependencies when the extension does not have any settings (#2957) (#2962)

* Ignore dependencies when the extension does not have any settings

* Remove message

---------

Co-authored-by: narrieta <narrieta>
(cherry picked from commit 79bc12c)

* Cache daemon version (#2942) (#2963)

* cache daemon version

* address comments

* test update

(cherry picked from commit 279d557)

* update warning message (#2946) (#2964)

(cherry picked from commit 33552ee)

* fix self-update frequency to spread over 24 hrs for regular type and 4 hrs for hotfix  (#2948) (#2965)

* update self-update frequency

* address comment

* mark with comment

* addressed comment

(cherry picked from commit f15e6ef)

* Reduce the firewall check period in agent firewall tests (#2966)

* reduce firewall check period

* reduce firewall check period

* undo get daemon version change (#2951) (#2967)

* undo daemon change

* pylint

(cherry picked from commit fabe7e5)

* disable agent update (#2953) (#2968)

(cherry picked from commit 9b15b04)

* Change agent_cgroups to own Vm (#2972)

* Change cgroups to own Vm

* Agent cgroups should own vm

* Check SSH connectivity during end-to-end tests (#2970)

Co-authored-by: narrieta <narrieta>

* Gathering Guest ProxyAgent Log Files (#2975)

* Remove debug info from waagent.status.json (#2971)

* Remove debug info from waagent.status.json

* pylint warnings

* pylint

---------

Co-authored-by: narrieta <narrieta>

* Extension sequencing scenario (#2969)

* update tests

* cleanup

* .

* .

* .

* .

* .

* .

* .

* .

* .

* Add new test cases

* Update scenario to support new tests

* Scenario should support failing extensions and extensions with no settings

* Clean up test

* Remove locations from test suite yml

* Fix deployment issue

* Support creating multiple resource groups for vmss in one run

* AzureMonitorLinuxAgent is not supported on flatcar

* AzureMonitor is not supported on flatcar

* remove agent update

* Address PR comments

* Fix issue with getting random ssh client

* Address PR Comments

* Address PR Comments

* Address PR comments

* Do not keep rg count in runbook

* Use try/finally with lock

* only check logs after scenario starts

* Change to instance member

---------

Co-authored-by: narrieta <narrieta>

* rename log file for agent publish scenario (#2956)

* rename log file

* add param

* address comment

* Fix name collisions on resource groups created by AgentTestSuite (#2981)

Co-authored-by: narrieta <narrieta>

* Save goal state history explicitly (#2977)

* Save goal state explicitly

* typo

* remove default value in internal method

---------

Co-authored-by: narrieta <narrieta>

* Handle errors when adding logs to the archive (#2982)

Co-authored-by: narrieta <narrieta>

* Timing issue while checking cpu quota (#2976)

* timing issue

* fix pylint

* undo

* Use case-insensitive match when cleaning up test resource groups (#2986)

Co-authored-by: narrieta <narrieta>

* Update supported Ubuntu versions (#2980)

* Fix pylint warning (#2988)

Co-authored-by: narrieta <narrieta>

* Add information about HTTP proxies (#2985)

* Add information about HTTP proxies

* no_proxy

---------

Co-authored-by: narrieta <narrieta>

* agent persist firewall scenario (#2983)

* agent persist firewall scenario

* address comments

* new comments

* GA versioning refactor plus fetch new rsm properties. (#2974)

* GA versioning refactor

* added comment

* added abstract decorator

* undo abstract change

* update names

* addressed comments

* pylint

* agent family

* state name

* address comments

* conf change

* Run remote date command to get test case start time (#2993)

* Run remote date command to get test case start time

* Remove unused import

* ext_sequencing scenario: get enable time from extension status files (#2992)

* Get enable time from extension status files

* Check for empty array

* add status example in comments

* ssh connection retry on restarts (#3001)

* Add e2e test scenario for hostname monitoring (#3003)

* Validate hostname is published

* Run on distro without known issues

* Add comment about debugging network down

* Create e2e scenario for hostname monitoring

* Remove unused import

* Increase timeout for hostname change

* Add password to VM and check for agent status if ssh fails

* run scenario on all endorsed distros

* Use getdistro() to check distro

* Add comment to get_distro

* Add publish_hostname to runbook

* Make get_distro.py executable

* Address first round of PR comments

* Do not enable hostname monitoring on distros where it is disabled

* Skip test on ubuntu

* Update get-waagent-conf-value to remove unused variable

* AMA is not supported on cbl-mariner 1.0 (#3002)

* Cbl-mariner 1.0 is not supported by AMA

* Use get distro to check distro

* Add comment to get_distro

* log update time for self updater (#3004)

* add update time log

* log new agent update time

* fix tests

* Fix publish hostname in china and gov clouds (#3005)

* Fix regex to parse china/gov domain names

* Improve regex

* Improve regex

* Self update e2e test (#3000)

* self-update test

* addressed comments

* fix tests

* log

* added comment

* merge conflicts

* Lisa should not cleanup failed environment if keep_environment=failed (#3006)

* Throw exception for test suite if a test failure occurs

* Remove unused import

* Clean up

* Add comment

* fix(ubuntu): Point to correct dhcp lease files (#2979)

From Ubuntu 18.04, the default dhcp client was systemd-networkd.
However, WALA has been checking for the dhclient lease files.
This PR seeks to correct this bug. Interestingly, it was already
configuring systemd-networkd but checking for dhclient lease files.

Co-authored-by: Norberto Arrieta <narrieta@users.noreply.github.com>

* Use self-hosted pool for automation runs (#3007)

Co-authored-by: narrieta <narrieta>

* Add distros which use Python 2.6 (for reference only) (#3009)

Co-authored-by: narrieta <narrieta>

* Move cleanup pipeline to self-hosted pool (#3010)

Co-authored-by: narrieta <narrieta>

* NM should not be restarted during hostname publish if NM_CONTROLLED=y (#3008)

* Only restart NM if NM_controlled=n

* Clean up code

* Clean up code

* improve logging

* Make check on NM_CONTROLLED value strict

* Install missing dependency (jq) on Azure Pipeline Agents (#3013)

* Install missing dependency (jq) on Azure Pipeline Agents

* use if statement

* remove if statement

---------

Co-authored-by: narrieta <narrieta>

* Do not reset the mode of an extension's log directory (#3014)

Co-authored-by: narrieta <narrieta>

* Daemon should remove stale published_hostname file and log useful warning (#3016)

* Daemon should remove published_hostname file and log useful warning

* Clean up fast track file if vm id has changed

* Clean up initial_goal_state file if vm id has changed

* Clean up rsm_update file if vm id has changed

* Do not report TestFailedException in test results (#3019)

Co-authored-by: narrieta <narrieta>

* skip agent update run on arm64 distros (#3018)

* Clean test VMs older than 12 hours (#3021)

Co-authored-by: narrieta <narrieta>

* honor rsm update with no time when agent receives new GS (#3015)

* honor rsm update immediately

* pylint

* improve msg

* address comments

* address comments

* address comments

* added verbose logging

* Don't check Agent log from the top after each test suite (#3022)

* Don't check Agent log from the top after each test suite

* fix initialization of override

---------

Co-authored-by: narrieta <narrieta>

* update the proxy agent log folder for logcollector (#3028)

* Log instance view before asserting (#3029)

* Add config parameter to wait for cloud-init (Extensions.WaitForCloudInit) (#3031)

* Add config parameter to wait for cloud-init (Extensions.WaitForCloudInit)

---------

Co-authored-by: narrieta <narrieta>

* Revert changes to publish_hostname in RedhatOSModernUtil (#3032)

* Revert changes to publish_hostname in RedhatOSModernUtil

* Fix pylint bad-super-call

* Remove agent_wait_for_cloud_init from automated runs (#3034)

Co-authored-by: narrieta <narrieta>

* Adding AutoUpdate.UpdateToLatestVersion new flag support (#3020)

* support new flag

* address comments

* added more info

* updated

* address comments

* resolving comment

* updated

* Retry get instance view if only name property is present (#3036)

* Retry get instance view if incomplete during assertions

* Retry getting instance view if only name property is present

* Fix regex in agent extension workflow (#3035)

* Recover primary nic if down after publishing hostname in RedhatOSUtil (#3024)

* Check nic state and recover if down:

* Fix typo

* Fix state comparison

* Fix pylint errors

* Fix string comparison

* Report publish hostname failure in calling thread

* Add todo to check nic state for all distros where we reset network

* Update detection to check connection state and separate recover from publish

* Pylint unused argument

* refactor recover_nic argument

* Network interface e2e test

* e2e test for recovering the network interface on redhat distros

* Only run scenario on distros which use RedhatOSUtil

* Fix call to parent publish_hostname to include recover_nic arg

* Update comments in default os util

* Remove comment

* Fix comment

* Do not do detection/recover on RedhatOSModernUtil

* Resolve PR comments

* Make script executable

* Revert pypy change

* Fix publish hostname parameters

* Add recover_network_interface scenario to runbook (#3037)

* Implementation of new conf flag AutoUpdate.UpdateToLatestVersion support (#3027)

* GA update to latest version flag

* address comments

* resolving comments

* added TODO

* ignore warning

* resolving comment

* address comments

* config present check

* added a comment

* Fix daily pipeline failures for recover_network_interface (#3039)

* Fix daily pipeline failures for recover_network_interface

* Clear any unused settings properties when enabling cse

---------

Co-authored-by: Norberto Arrieta <narrieta@users.noreply.github.com>

* Keep failed VMs by default on pipeline runs (#3040)

* enable RSM e2e tests (#3030)

* enable RSM tests

* merge conflicts

* Check for 'Access denied' errors when testing SSH connectivity (#3042)

Co-authored-by: narrieta <narrieta>

* Add Ubuntu 24 to end-to-end tests (#3041)

* Add Ubuntu 24 to end-to-end tests

* disable AzureMonitorLinuxAgent

---------

Co-authored-by: narrieta <narrieta>

* Skip capture of VM information on test runs (#3043)

Co-authored-by: narrieta <narrieta>

* Create symlink for waagent.com on Flatcar (#3045)

Co-authored-by: narrieta <narrieta>

* don't allow agent update if attempts reached max limit (#3033)

* set max update attempts

* download refactor

* pylint

* disable RSM updates (#3044)

* Skip test on alma and rocky until we investigate (#3047)

* Ext_sequencing scenario should check agent log for extension enable order (#3049)

* Ext_sequencing scenario should check agent log for extension enable order

* Format timestamp for ignore errors before timestamp

* If test is skipped, scenario start will be datetime min

* Remove unnecessary log

* Make none check explicit

* update canary region (#3056)

* Add Python 3.10 to the pylint matrix (#3057)

Co-authored-by: narrieta <narrieta>

* reset network service unit file if python version changes (#3058)

* Ignore network unreachable errors in publish hostname (#3060)

* Address pylint warning deprecated-method (#3059)

Co-authored-by: narrieta <narrieta>

* fix agent update UT (#3051) (#3054)

(cherry picked from commit d9f7ed4)

* modify agent update flag (#3053) (#3055)

(cherry picked from commit 049de5c)

* skip run on flatcar (#3061)

* retry on agent cgroups tracking check (#3062)

* retry on agentcgroups check

* address comments

* Recognize SLE-Micro as a SLE based distribution (#3048)

Using the agent with SLE-Micro forces the agent to fallback to the common
default implementation for nominally distribution specific behavior. This
misses the SUSE specific implementations.

Co-authored-by: Nageswara Nandigam <84482346+nagworld9@users.noreply.github.com>

* Retry ssh check if connection reset (#3065)

* Add distutils/version.py to azurelinuxagent (#3063)

* Add distutils/version.py to azurelinuxagent

---------

Co-authored-by: narrieta <narrieta>

* Run pylint on Python 3.11 (#3067)

* Run pylint on Python 3.11

---------

Co-authored-by: narrieta <narrieta>

* Fix pylint warnings (#3069)

* Fix pylint warnings

* Update .github/workflows/ci_pr.yml

Co-authored-by: maddieford <93676569+maddieford@users.noreply.github.com>

---------

Co-authored-by: narrieta <narrieta>
Co-authored-by: maddieford <93676569+maddieford@users.noreply.github.com>

* reset uphold setting for agent service in flatcar distro (#3066)

* reset uphold settings for flatcar images

* updated comment

* stop the reboot service

* address comments

* retry on quota reset check (#3068)

* Use legacycrypt instead of crypt on Python >= 3.13 (#3070)

* Use legacycrypt instead of crypt on Python >= 3.13

* remove ModuleNotFound

---------

Co-authored-by: narrieta <narrieta>

* Skip network unreachable error in publish hostname test (#3071)

Co-authored-by: narrieta <narrieta>

* Fix osutil/default route_add to pass string array. (#3072)

Co-authored-by: narrieta <narrieta>

* Fix argument to GoalState.__init__ (#3073)

Co-authored-by: narrieta <narrieta>

* Ignore network unreachable error in hostname test (#3074)

* Ignore network unreachable error in hostname test

---------

Co-authored-by: narrieta <narrieta>

* Add lock around access to fast_track.json (#3076)

Co-authored-by: narrieta <narrieta>

* added retries for agent cgroups test (#3075)

* retries for agent cgroups test

* pylint warn

* addressed comment

* cron job script (#3077)

* Fix mock for cgroup unit test (#3079)

* Fix mock for cgroup unit test

---------

Co-authored-by: narrieta <narrieta>

* Add DistroVersion class to compare distro versions (#3078)

* Add DistroVersion class to compare distro versions

* comment

* python 2

---------

Co-authored-by: narrieta <narrieta>

* enable GA versioning (#3082)

* Run unit tests with pytest on Python >= 3.10 (#3081)

* Run unit tests with pytest on Python >= 3.10
---------

Co-authored-by: narrieta <narrieta>

* Fix pytest warnings (#3084)

Co-authored-by: narrieta <narrieta>

* update setup (#3088)

* Add keyvault test to daily run + Specify tests suite as a list (#3089)

Co-authored-by: narrieta <narrieta>

* ignore case (#3093)

* Add retry on keyvault test (#3095)

* Add retry on keyvault test

* newline

---------

Co-authored-by: narrieta <narrieta>

* Reboot VM if CSE times out so logs are collected (#3097)

* LogCollector should skip and log warning for files that don't exist (#3098)

* Skip collection on files that do not exist

* Fix pylint

* Separate error handling

* log file to collect

* wait for provision to complete before install test agent (#3094)

* wait for provision to complete

* address comments

* agent publish refactor (#3091)

* agent publish refactor

* support arm 64vm

* convert dict to str

* address comments

* pylint

* new comments

* updated comment

* Add EnableFirewall to README (#3100)

* Add EnableFirewall to README

* change phrasing

---------

Co-authored-by: narrieta <narrieta>

* Add Ubuntu minimal to test run (#3102)

* Add ubuntu minimal to test run

* typo

* suppress warnings

---------

Co-authored-by: narrieta <narrieta>

* check for unexpected process in agent cgroups before cgroups enabled (#3103)

* check for unexpected process in cgroup before enable

* agent restart

* move the process check

* fix unit tests

* address comments

* pylint

* Cgroups api refactor for v2 (#3096)

* Cgroups api refactor (#6)

* Initial changes for log collector cgroups v2 support

* Fix pylint issues

* Fix pylint issues

* Fix pylint issues

* Check that both controllers are mounted in the chosen cgroups version for log collector

* Fix regex

* Update test_agent unit tests

* Fix unit tests

* Update format strings

* Fix broken cgroupconfigurator unit tests

* pylint

* Fix cgroups api unit tests

* Ignore unused args

* Ignore unused args

* Add cgroup configurator tests

* v2 required check in parent cgroup

* unit tests is_controller_enabled

* Fix test failure and pylint:

* pylint

* Update agent checks

* Fix controller enable logic and unit tests

* Remove changes to collect logs

* Fix pylint

* Add e2e test for v2

* Fix log warnings

* Add cgroups v2 disabled scenario to daily runbook

* Address PR comments (#7)

* get_cgroup_api should raise exception when controllers not mounted

* Combine cgroups_supported() check

* Combine SystemdCgroupsApi and CGroupApi classes

* fix pylint and tests with sudo

* Rename SystemdCgroupsApi to SystemdCgroupApi

* Cgroup should be singular when referring to the API

* Unimplemented methods should raise NotImplementedError

* Check for cpu,cpuacct

* v2 start extension command should not be implemented

* log_cgroup_info and log_cgroup_warning should be in cgroupapi

* Systemd check should come before api

* Explicitly check for empty dict

* Only check if controllers are enabled at root for v2

* Remove unnecessary mocked paths in mock cgroup env

* V2 does not have concept of mounting controllers

* Fix super call for python 2

* get_cgroup_api should be function

* Move logging functions up

* Use stat -f to get cgroup mode

* Mock hybrid path

* Fix unit tests:

* Debug tests

* Debug tests

* Debug unit tests

* Fix unit tests

* Fix pylint

* Fix e2e test for v2

* Fix e2e test

* Fix e2e test

* Fix e2e test

* Combine common implementations

* Improve comments

* Pylint

* Address PR comments (#8)

* Run unit tests

* Clean up drop in files if cgroups are disabled

* Init values for cgroup apis

* Revert test change

* get_cgroup_api should check if mountpoints are correct (#9)

* Fix conflict after merge

* Merge issues (#10)

* Fix merge issues

* Fix unit tests

* get_cgroup_api raises InvalidCgroupMountpointException (#11)

* get_cgroup_api can raise InvalidCgroupMountpointException

* Add unit test for agent

* Address PR comments (#12)
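
The hierarchy detection referred to by "Use stat -f to get cgroup mode", together with the version choice in the PR description (legacy or hybrid uses v1, unified uses v2), can be sketched roughly as follows. This is an illustration only and not the agent's code: the function names are hypothetical, and it assumes the conventional systemd layout, where `/sys/fs/cgroup` reports `cgroup2fs` in unified mode and `tmpfs` in legacy or hybrid mode, with hybrid additionally mounting a `cgroup2fs` at `/sys/fs/cgroup/unified`.

```python
import subprocess

CGROUP_ROOT = "/sys/fs/cgroup"  # mountpoint created by systemd


def get_fs_type(path):
    """Return the filesystem type name printed by `stat -f --format=%T`."""
    output = subprocess.check_output(["stat", "-f", "--format=%T", path])
    return output.decode("utf-8").strip()


def classify_cgroup_mode(root_fs_type, unified_fs_type=None):
    """Map filesystem types to 'unified', 'hybrid', 'legacy' or 'unknown'.

    root_fs_type: type reported for /sys/fs/cgroup
    unified_fs_type: type reported for /sys/fs/cgroup/unified, if it exists
    """
    if root_fs_type == "cgroup2fs":
        return "unified"
    if root_fs_type == "tmpfs":
        if unified_fs_type == "cgroup2fs":
            return "hybrid"
        return "legacy"
    return "unknown"


def choose_api_version(mode):
    """Hypothetical helper: legacy and hybrid select v1, unified selects v2."""
    if mode in ("legacy", "hybrid"):
        return "v1"
    if mode == "unified":
        return "v2"
    return None  # unrecognized layout: caller should not enable cgroups
```

On a real VM the classifier would be fed with `get_fs_type(CGROUP_ROOT)`; keeping the classification separate from the `stat` call makes it straightforward to unit test with mocked filesystem types.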

* Increase timeout for agent to start and provisioning to complete (#3105)

* Keep whole goal state in log (#3104)

* Log cgroup if process found in unexpected slice (#3107)

* Allow retries for ifdown and add comments (#3106)

* Collect telemetry for firewall settings changed (#3110) (#3112)

Co-authored-by: narrieta <narrieta>
(cherry picked from commit 468cf81)

* Update agent_publish test to check for new agent update pattern (#3114) (#3119)

* Add new agent update pattern

* Use record message

* Need to update log record timestamp

(cherry picked from commit 1d91c14)

* remove secret and use cert for aad app in e2e pipeline (#3116)

* remove secret and use cert

* address comment

* wait for rg creation in e2e tests (#3117)

* wait for rg creation

* update param

* check for rg existence

* input rg name

* Reduce the frequency of firewall telemetry (#3124) (#3127)

* Reduce the frequency of firewall telemetry

* python 2: timespan.total_seconds() does not exist

* fix unit test

---------

Co-authored-by: narrieta <narrieta>
(cherry picked from commit 5302651)

* suppress pylint warn contextmanager-generator-missing-cleanup (#3138)

* suppress pylint warn

* addressed comments

* Switching to SNI based authentication for aad app (#3137)

* SNI auth

* new env var

* pylint

* new namespace (#3139)

* support dict/list resources type for lisa template (#3140)

* support dict/list for resources schema

* addressed comment

* Fix multi config (#16) (#3145)

* Use runcommand api for runcommand multiconfig operations

* remove rc

* Fix comments

* Remove comment

* Fix rc

* pylint

* Add line

* refactor cgroup controllers (#3135)

* refactor cgroup controllers (#13)

* Refactor Cgroup, CpuCgroup, MemoryCgroup to ControllerMetrics, CpuMetrics, MemoryMetrics

* Create methods to get unit/process cgroup representation

* Refactoring changes

* Refactoring changes

* Fix e2e test

* Fix unintentional comment change

* Remove unneeded comments

* Clean up comments and make code more readable

* Simplify get controller metrics

* Clean up cgroupapi

* Cleanup cgroup -> controllermetrics changes

* Clean up cgroup configurator

* Fix unit tests for agent.py

* Fix cgroupapi tests

* Fix cgroupconfigurator and tests

* Rename controller metrics tests

* Ignore pylint issues

* Improve test coverage for cgroupapi

* Rename cgroup to metrics

* Update cgroup.procs to accurately represent file

* Do not track metrics if controller is not mounted

* We should set cpu quota before tracking cpu metrics

* Pylint

* address pr comments (#14)

* Address Nag's comments

* pylint

* pylint

* remove lambda (#15)

* updated PR template (#3144)

* fixing custom image test run (#3147)

* Avoiding mocked exception from being lost on test (#3149)

If another exception arises (as happens with Python 3.12 because of the changes in shutil.rmtree), the mocked exception is lost because it is incomplete: neither errno nor strerror is set, so the message ends up only in args.
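
A minimal illustration of the "incomplete" exception described above, using only standard Python 3 behaviour. The message strings are made up for the example and this is not the test code from the PR.

```python
import errno

# An OSError built from a single message: errno and strerror stay unset,
# and the message is only available through args.
incomplete = OSError("mocked failure")
print(incomplete.errno, incomplete.strerror, incomplete.args)

# An OSError built from (errno, strerror): both attributes are populated,
# so code that inspects them keeps working.
complete = OSError(errno.EACCES, "Permission denied")
print(complete.errno == errno.EACCES, complete.strerror)
```

A mock that raises the two-argument form is therefore more robust when the code under test, or the standard library, reads `errno` or `strerror` while handling the error.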

* Add more useful logging for agent unit properties (#3154)

* Remove wireserver fallback for imds calls (#3152)

* Remove wireserver fallback for imds calls

* remove unused method

* remove obsolete unit test

* remove unused import

---------

Co-authored-by: narrieta@microsoft <narrieta>

* Remove unused import (#3155)

Co-authored-by: narrieta@microsoft <narrieta>

* Expand support for backend ethernet (#3150)

IBManager will continue to be used for a new ethernet-backend offering from AzureHPC. While the key name remains the same (IPoIB_data), the interfaces will be of the format ethXX. This change removes the check that skips anything that isn't ibXX. There is no risk of proceeding for any other NICs: IPoIB_data only contains the backend RDMA ones and, although the loop reads interfaces from the system, each one is matched against the array parsed from the IPoIB_data KVP. IB interfaces have padded virtual MACs; non-IB interfaces do not. An if-else now applies the padded-octet check only to IB; everything else uses the standard 6-octet pattern.

* Allow use of node 16 (#3160)

Co-authored-by: narrieta@microsoft <narrieta>

* Fix Ubuntu version codename for 24.04 (#3159)

24.04 is noble, not focal

Co-authored-by: Norberto Arrieta <narrieta@users.noreply.github.com>

* Fix regex pattern for ext seq scenario (#3162)

* Update test certificate data (#3166)

Co-authored-by: narrieta@microsoft <narrieta>

* Remove extension status only on extension delete (#3167)

* Remove extension status only on extension delete

* .

* .

---------

Co-authored-by: narrieta@microsoft <narrieta>

* Add support for Azure Linux 3 (#3183)

* .

* Add Azure Linux 3 to test runs

* .

* .

* .

* .

* Update setup.py

---------

Co-authored-by: narrieta@microsoft <narrieta>

* Use self-update for initial update (#3184)

* use self-update for initial update

* addressing comments

* cleanup files

* state files

* remove comment

* send updatemode in heartbeat and don't send RSM supported feature flag if versioning disabled in agent (#3189)

* rsm changes

(cherry picked from commit d73cef5)

* addressed comment

(cherry picked from commit 1ab9122)

* updated comment

* addressed comments

* added semicolon

* Disable multi-config test in AzureCloud (#3192)

Co-authored-by: narrieta@microsoft <narrieta>

* Add cgroupv2 support for log collector (#3188)

* Lc v2 implementation branch (#18)

* memory experimentation changes

* Initial changes

* obvious issues

* Fix e2e test

* First round of unit test fixes

* Fix existing unit tests

* Remove unneeded cpu files

* Get memory usage should return tuple

* Fix log for tracking cgroup

* Add unit tests

* Add unit tests

* Address pylint comments

* Clean up code

* clean up code

* Fix unit tests (#19)

* Fix unit tests

* Fix unit tests

* Revisions (#20)

* Respond to comments

* Test failures

* Fix type issue

* Revisions

* Additional revisions (#21)

* Revisions

* Remove unit test for sending telem

* final fixes

* add config flag

* Fix e2e tests

* workaround for python3.5 UTs build setup and replace assert_called_once mock method (#3191)

* python3.5 workaround

* replace assert_called_once

* addressing comment

* Fix log collector unit tests on 3.5 (#3193)

* Fix unit tests 3.5

* Fix ut

* Fix JIT for FIPS 140-3 (#3190)

* .

* .

* .,

---------

Co-authored-by: narrieta@microsoft <narrieta>

* Capture logcollector pattern only once (#3194)

* Capture logcollector pattern only once

* Add comment

* Check agent Slice unit property before setting up azure.slice (#3196) (#3198)

(cherry picked from commit bdd4a4b)

* version update to 2.12.0.0 (#3195)

* fixing attribute error (#3202)

* version update to 2.12.0.1 (#3203)

* suppress too-many-positional-args pylint warn (#3224) (#3225)

(cherry picked from commit 4dcf95c)

* move setupslice after cgroupsv2 check, remove unit file for log collector and remove firewall daemon-reload (#3223) (#3226)

* move daemon reload

* test fix

* UT test

* firewall daemon-reload

* address comments

* address comments

(cherry picked from commit 47e969a)

* Ubuntu 24 image (#25) (#3229) (#3230)

* Update ubuntu 24

* Add ubuntu 24 to nat clouds

* Add arm64 ubuntu 24

* Update all ubuntu images

* Skip arm64 in nat clouds

* Fix syntax issues

(cherry picked from commit 31adf25)

* Add controller/cgroup path telemetry (#3231)

* version update to 2.12.0.2 (#3233)

---------

Co-authored-by: Norberto Arrieta <narrieta@users.noreply.github.com>
Co-authored-by: maddieford <93676569+maddieford@users.noreply.github.com>
Co-authored-by: Long Li <longli@microsoft.com>
Co-authored-by: sebastienb-stormshield <sebastien.bini@stormshield.eu>
Co-authored-by: Zheyu Shen <arsdragonfly@gmail.com>
Co-authored-by: Zhidong Peng <zpeng@microsoft.com>
Co-authored-by: d1r3ct0r <mwadimemakokha@gmail.com>
Co-authored-by: Robert Schweikert <rjschwei@suse.com>
Co-authored-by: Miriam España Acebal <miriam.espana@canonical.com>
Co-authored-by: Anam Ahmad <Anam9@users.noreply.github.com>

4 participants