CloudWatch agent resulting in high CPU when using wildcard #807
Comments
What happens if you run the agent in debug mode? You are using an agent version that is new enough to include the number of monitored log files in debug logs #506. There is measurable overhead for the CloudWatch agent to tail a log file - from my own testing, this seems to be more pronounced on Windows hosts than Linux hosts, but I'm not entirely sure why. CPU usage seems to scale linearly with the number of log files that the agent is monitoring concurrently.
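For reference, debug logging can be enabled through the "debug" flag in the "agent" section of the agent's JSON config. A minimal sketch, with the logfile path copied from the config attached to this issue and everything else omitted:

    {
      "agent": {
        "debug": true,
        "logfile": "c:\\ProgramData\\Amazon\\AmazonCloudWatchAgent\\Logs\\amazon-cloudwatch-agent.log"
      }
    }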
I appreciate the response! I enabled debug and the first thing I noticed was that CWagent was having issues accessing a few files defined in the config:
I confirmed that those files were not available, so I removed them from the config and restarted the agent, both to clear out the clutter and to see if that had any impact on performance. After removing those entries, the same high CPU behavior occurred, though the errors went away. I suspected this would be the outcome anyway, since those three files were already in the config without the wildcard and CWagent was running at low CPU. With those files removed, this is what I am seeing on startup: At one point, while just letting things run a bit, I do see this. The open file count jumps to 3 after reading a file from the wildcard search:
Before making any modifications, CWagent was running at less than 5% CPU. The other interesting message I see is this one, which seems to be just a size indicator of what is being pushed. Since there was the error attempting to write to the "C:\ProgramData\Amazon\AmazonCloudWatchAgent\Logs\state\C__ProgramData_Sophos_File_Integrity_Monitoring_Export_2023-08-28_19-04_5145_2_2_databatch.xml" file, I tried purging the entire "C:\ProgramData\Amazon\AmazonCloudWatchAgent\Logs\state" directory and all files within it to see if that made any difference, and it did not. After restarting CWagent with the "state" directory purged, I see a bunch of messages in the debug output indicating log events were getting pushed again. I am guessing the state directory is a method for tracking what has been sent and what has not. It was a bit of a long shot, thinking that maybe something had gotten queued up or there was something quirky in that directory, but there was no change in behavior. CWagent is still running at 25-30% CPU. I do see that after purging the state directory and restarting, the open file count has dropped back down to 2. I also see that there are four files in the "C:\ProgramData\Amazon\AmazonCloudWatchAgent\Logs\state" directory at this point, which correspond to what is being monitored. The only other thing to note is this message, which typically says "no stats is available": I do notice that the CWAgent has since upgraded as well and is now:
Yes, the /state/ folder is used internally by the agent to track file byte offsets for the log files it is monitoring. I don't think that errors writing there should impact CPU enough to be measurable, so I'd ignore those for now. There shouldn't be a performance drop across versions - if there is measurable performance degradation, please call it out. The profiler dump is benign. The agent keeps a "profiler" map in memory for internal runtime debugging stats, similar to the raw size stat you saw. Unfortunately, right now I don't think we have good benchmarking metrics around how resource intensive the agent is under certain workloads. From my time here, I do think that what you observe is what I personally would expect performance-wise on Windows, but like I said earlier, I'm still not entirely sure why it is so bad on Windows compared to Linux just for monitoring log files.
So then it sounds like our only option is to increase the EC2 instance size where we see this behavior. Likely one step up should be enough to offset the usage from the agent, I would think. Am I understanding that correctly? I could attempt to install an older version of CWagent to see if the behavior remains the same on that older version. Would there be a recommended version to try for that? I had a few other tests in mind to see if they made any difference, though I am not expecting much of one.
Yeah, unfortunately at this time, without adequate automated benchmarking, I think the simplest and safest recommendation is to increase your instance size. I don't know how far back you would have to go, honestly, to validate that this is a new regression. As long as I've been here, I think this has been an established pain point for running the agent on Windows. AFAIK it was observed well before 1.247359. From what I have seen people try, you should be able to measure and observe the agent's resource usage increase linearly if you (assuming no wildcard configuration):
The reason I said to assume no wildcard configuration is that pointing to an exact file on the host would be the least resource intensive, and you could use that as a baseline for expectations.
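As a rough baseline, a single-file collect_list entry (this one is just copied from the config attached to this issue) might look like:

    "files": {
      "collect_list": [
        {
          "file_path": "C:\\ProgramData\\Sophos\\Sophos Anti-Virus\\logs\\SAV.txt",
          "log_group_name": "/Sophos/SAV",
          "log_stream_name": "{instance_id}"
        }
      ]
    }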
I ran a few additional tests as well, just to confirm some thoughts. It appears to be either the combination of a wildcard and the number of files involved, or purely the number of files involved regardless of the wildcard, as has been indicated previously in this thread. My first test was to clear out everything but the wildcarded directory to be monitored:
Upon restarting the agent, CPU spiked to 25-30%. This involved 662 files in total. I then specified just one specific file within that directory:
CPU then dropped to less than 5%, with just one file monitored in total. I then specified a subset of the files within that directory:
That bumped the CPU to around 5% CPU time consistently, rather than being below 5%. The above search covered approximately 70 files in total. I replicated the same test but increased the wildcard match to about 200 files (2023-08*databatch.xml), and the CPU jumped to consistently between 5-10%. In all of the above tests, the agent only reported one file as being open: As you have indicated, I think the only way to rule out wildcard versus not would be to just keep adding files, restarting, and seeing how the agent performs. But unfortunately, with the naming of these files, I do not think we can readily do away with the wildcards. To the best of my knowledge, the naming of those files is solely managed by Sophos. So from that perspective, I think the best we could do is limit the retention policy to a smaller set of files, which may not necessarily be possible here either. Decisions, decisions...... Thanks for the follow up. I'll take this away for what it is and we will go from there.
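For reference, the narrower wildcard test described above would correspond to a collect_list entry roughly like the following (the pattern is approximated from the description; group and stream names are taken from the config attached to this issue):

    {
      "file_path": "C:\\ProgramData\\Sophos\\File Integrity Monitoring\\Export\\2023-08*databatch.xml",
      "log_group_name": "/Sophos/FIM",
      "log_stream_name": "{instance_id}"
    }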
Describe the bug
On a Windows Server 2016 Datacenter bastion EC2 host, we have the CloudWatch agent and Sophos installed. Upon implementing some additional log monitoring for the Sophos FIM export logs, we saw a huge spike in CWagent CPU usage, from around <5% up to a consistent 15-20% CPU time.
The CWAgent config has the below defined:
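This is the Sophos FIM wildcard entry, also visible in the full config further down:

    {
      "file_path": "C:\\ProgramData\\Sophos\\File Integrity Monitoring\\Export\\*.xml",
      "log_group_name": "/Sophos/FIM",
      "log_stream_name": "{instance_id}"
    }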
With that configured, the CloudWatch agent sits consistently at 15-20% CPU time on a t3.small instance and goes higher when the system is otherwise in use.
With the above set of logs removed from the config, the agent sits at <5% CPU time.
At any given time there are 670 or fewer files within that directory, with no subdirectories. The total size of all the files is less than 2 MB.
Steps to reproduce
Add the above entry to the config and restart the CloudWatch agent service. CPU will spike and continue to run at 15-20% when the system is idle and upwards of 40-50% when other things are running.
What did you expect to see?
A much smaller increase to the overall CPU % being used by the agent.
What did you see instead?
See steps to reproduce
What version did you use?
The issue started being seen on CWAgent 1.247359.1 and has carried over to 1.247360.0b252689.
What config did you use?
{ "agent": { "metrics_collection_interval": 300, "logfile": "c:\\ProgramData\\Amazon\\AmazonCloudWatchAgent\\Logs\\amazon-cloudwatch-agent.log" }, "metrics": { "metrics_collected": { "Memory": { "measurement": ["% Committed Bytes In Use"] }, "LogicalDisk": { "measurement": ["% Free Space"], "resources": ["*"] }, "System": { "measurement": ["*"] } }, "append_dimensions": { "ImageId": "{aws:ImageId}", "InstanceId": "{aws:InstanceId}", "InstanceType": "{aws:InstanceType}" } }, "logs": { "logs_collected": { "windows_events": { "collect_list": [{ "event_format": "xml", "event_levels": ["VERBOSE", "INFORMATION", "WARNING", "ERROR", "CRITICAL"], "event_name": "Security", "log_group_name": "Windows-Security", "log_stream_name": "{instance_id}" }, { "event_format": "xml", "event_levels": ["VERBOSE", "INFORMATION", "WARNING", "ERROR", "CRITICAL"], "event_name": "Microsoft-Windows-Windows Firewall With Advanced Security/Firewall", "log_group_name": "Windows-Firewall", "log_stream_name": "{instance_id}" }] }, "files": { "collect_list": [{ "file_path": "C:\\ProgramData\\Sophos\\Sophos Anti-Virus\\logs\\SAV.txt", "log_group_name": "/Sophos/SAV", "log_stream_name": "{instance_id}" }, { "file_path": "C:\\ProgramData\\Sophos\\Sophos Anti-Virus\\logs\\SAV-Trace.txt", "log_group_name": "/Sophos/SAV-Trace", "log_stream_name": "{instance_id}" }, { "file_path": "C:\\ProgramData\\Sophos\\Sophos Anti-Virus\\logs\\Sophos Cloud Scheduled Scan.txt", "log_group_name": "/Sophos/Sophos-Cloud-Scheduled-Scan", "log_stream_name": "{instance_id}" }, { "file_path": "C:\\ProgramData\\Sophos\\File Integrity Monitoring\\Export\\*.xml", "log_group_name": "/Sophos/FIM", "log_stream_name": "{instance_id}" }, { "file_path": "C:\\ProgramData\\Sophos\\Sophos Anti-Virus\\logs\\oaScannerWatchdog.txt", "log_group_name": "/Sophos/oaScannerWatchdog", "log_stream_name": "{instance_id}" }] } } } }
Environment
MS Windows Server 2016 Datacenter
Sophos Endpoint management
(File Integrity Monitoring 1.0.3.449)
Additional context
While I understand that there are some limitations to wildcard searches and that these should be as specific as possible, "*.xml" is the closest we can get. The filename format is:
YYYY-MM-DD_HH-MM_????_??_??_databatch.xml
Actual filenames:
I am not entirely certain what the "?" fields represent, but they are all digits.
Since the first five fields change in inconsistent patterns, I cannot really be more specific in the wildcard searches.
Sophos handles the rotation of these files and there is a 90 day retention policy.