CloudWatch agent resulting in high CPU when using wildcard #807
Comments
What happens if you run the agent in debug mode? You are using an agent version that is new enough to include the number of monitored log files in debug logs #506. There is measurable overhead for the CloudWatch agent to tail a log file - from my own testing, this seems to be more pronounced on Windows hosts than Linux hosts, but I'm not entirely sure why. CPU usage seems to scale linearly with the number of log files that the agent is monitoring concurrently.
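For reference, debug logging can be enabled through the "debug" flag in the "agent" section of the agent's JSON config. A minimal sketch, with the logfile path copied from the config attached to this issue and everything else omitted:

    {
      "agent": {
        "debug": true,
        "logfile": "c:\\ProgramData\\Amazon\\AmazonCloudWatchAgent\\Logs\\amazon-cloudwatch-agent.log"
      }
    }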
I appreciate the response! I enabled debug and the first thing I noticed was that CWagent was having issues accessing a few files defined in the config:
I confirmed that those files were not available, so I removed them from the config and restarted the agent, both to clear out the clutter and to see if that had any impact on performance. After removing those entries, the same high CPU behavior occurred, though the errors went away. I suspected this would be the outcome anyway, since those three files were already in the config without the wildcard and CWagent was running at low CPU. With those files removed, this is what I am seeing on startup: At one point, while just letting things run a bit, I do see this. The open file count jumps to 3 after reading a file from the wildcard search:
Before making any modifications, CWagent was running at less than 5% CPU. The other interesting message I see is this one, which seems to be just a size indicator of what is being pushed. Since there was the error attempting to write to the "C:\ProgramData\Amazon\AmazonCloudWatchAgent\Logs\state\C__ProgramData_Sophos_File_Integrity_Monitoring_Export_2023-08-28_19-04_5145_2_2_databatch.xml" file, I tried purging the entire "C:\ProgramData\Amazon\AmazonCloudWatchAgent\Logs\state" directory and all files within it to see if that made any difference, and it did not. After restarting CWagent with the "state" directory purged, I see a bunch of messages in the debug output indicating log events were getting pushed again. I am guessing the state directory is a method for tracking what has been sent and what has not. It was a bit of a long shot, thinking that maybe something had gotten queued up or there was something quirky in that directory, but there was no change in behavior. CWagent is still running at 25-30% CPU. I do see that after purging the state directory and restarting, the open file count has dropped back down to 2. I also see that there are four files in the "C:\ProgramData\Amazon\AmazonCloudWatchAgent\Logs\state" directory at this point, which correspond to what is being monitored. The only other thing to note is this message, which typically says "no stats is available": I do notice that the CWAgent has since upgraded as well and is now:
Yes, the /state/ folder is used internally by the agent to track file byte offsets for the log files it is monitoring. I don't think that errors writing there should impact CPU enough to be measurable, so I'd ignore those for now. There shouldn't be a performance drop across versions - if there is measurable performance degradation, please call it out. The profiler dump is benign. The agent keeps a "profiler" map in memory for internal runtime debugging stats, similar to the raw size stat you saw. Unfortunately, right now I don't think we have good benchmarking metrics around how resource intensive the agent is under certain workloads. From my time here, I do think that what you observe is what I personally would expect performance-wise on Windows, but like I said earlier, I'm still not entirely sure why it is so bad on Windows compared to Linux just for monitoring log files.
So then it sounds like our only option is to increase the EC2 instance size where we see this behavior. Likely one step up should be enough to offset the usage from the agent, I would think. Am I understanding that correctly? I could attempt to install an older version of CWagent to see if the behavior remains the same on that older version. Would there be a recommended version to try for that? I had a few other tests in mind to see if they made any difference, though I am not expecting much of one.
Yeah, unfortunately at this time, without adequate automated benchmarking, I think the simplest and safest recommendation is to increase your instance size. I don't know how far back you would have to go, honestly, to validate that this is a new regression. As long as I've been here, I think this has been an established pain point for running the agent on Windows. AFAIK it was observed well before 1.247359. From what I have seen people try, you should be able to measure and observe the agent's resource usage increase linearly if you (assuming no wildcard configuration):
The reason I said to assume no wildcard configuration is that pointing to an exact file on the host would be the least resource intensive, and you could use that as a baseline for expectations.
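As a rough baseline, a single-file collect_list entry (this one is just copied from the config attached to this issue) might look like:

    "files": {
      "collect_list": [
        {
          "file_path": "C:\\ProgramData\\Sophos\\Sophos Anti-Virus\\logs\\SAV.txt",
          "log_group_name": "/Sophos/SAV",
          "log_stream_name": "{instance_id}"
        }
      ]
    }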
I ran a few additional tests as well, just to confirm some thoughts. It appears to be either the combination of a wildcard and the number of files involved, or purely the number of files involved regardless of the wildcard, as has been indicated previously in this thread. My first test was to clear out everything but the wildcarded directory to be monitored:
Upon restarting the agent, CPU spiked to 25-30%. This involved 662 files in total. I then specified just one specific file within that directory:
CPU then dropped to less than 5%, with just one file monitored in total. I then specified a subset of the files within that directory:
That bumped the CPU to around 5% CPU time consistently, rather than being below 5%. The above search covered approximately 70 files in total. I replicated the same test but increased the wildcard match to about 200 files (2023-08*databatch.xml), and the CPU jumped to consistently between 5-10%. In all of the above tests, the agent only reported one file as being open: As you have indicated, I think the only way to rule out wildcard versus not would be to just keep adding files, restarting, and seeing how the agent performs. But unfortunately, with the naming of these files, I do not think we can readily do away with the wildcards. To the best of my knowledge, the naming of those files is solely managed by Sophos. So from that perspective, I think the best we could do is limit the retention policy to a smaller set of files, which may not necessarily be possible here either. Decisions, decisions...... Thanks for the follow up. I'll take this away for what it is and we will go from there.
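For reference, the narrower wildcard test described above would correspond to a collect_list entry roughly like the following (the pattern is approximated from the description; group and stream names are taken from the config attached to this issue):

    {
      "file_path": "C:\\ProgramData\\Sophos\\File Integrity Monitoring\\Export\\2023-08*databatch.xml",
      "log_group_name": "/Sophos/FIM",
      "log_stream_name": "{instance_id}"
    }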
Describe the bug
On a Windows Server 2016 Datacenter bastion EC2 host, we have the CloudWatch agent and Sophos installed. Upon implementing some additional log monitoring for the Sophos FIM export logs, we saw a huge spike in CWagent CPU usage, from around <5% up to a consistent 15-20% CPU time.
The CWAgent config has the below defined:
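This is the Sophos FIM wildcard entry, also visible in the full config further down:

    {
      "file_path": "C:\\ProgramData\\Sophos\\File Integrity Monitoring\\Export\\*.xml",
      "log_group_name": "/Sophos/FIM",
      "log_stream_name": "{instance_id}"
    }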
With that configured, the CloudWatch agent sits consistently at 15-20% CPU time on a t3.small instance and goes higher when the system is otherwise in use.
With the above set of logs removed from the config, the agent sits at <5% CPU time.
At any given time there are 670 or fewer files within that directory, with no subdirectories. The total size of all the files is less than 2 MB.
Steps to reproduce
Add the above entry to the config and restart the CloudWatch agent service. CPU will spike and continue to run at 15-20% when the system is idle and upwards of 40-50% when other things are running.
What did you expect to see?
A much smaller increase to the overall CPU % being used by the agent.
What did you see instead?
See steps to reproduce
What version did you use?
The issue started being seen on CWAgent 1.247359.1 and has carried over to 1.247360.0b252689.
What config did you use?
{ "agent": { "metrics_collection_interval": 300, "logfile": "c:\\ProgramData\\Amazon\\AmazonCloudWatchAgent\\Logs\\amazon-cloudwatch-agent.log" }, "metrics": { "metrics_collected": { "Memory": { "measurement": ["% Committed Bytes In Use"] }, "LogicalDisk": { "measurement": ["% Free Space"], "resources": ["*"] }, "System": { "measurement": ["*"] } }, "append_dimensions": { "ImageId": "{aws:ImageId}", "InstanceId": "{aws:InstanceId}", "InstanceType": "{aws:InstanceType}" } }, "logs": { "logs_collected": { "windows_events": { "collect_list": [{ "event_format": "xml", "event_levels": ["VERBOSE", "INFORMATION", "WARNING", "ERROR", "CRITICAL"], "event_name": "Security", "log_group_name": "Windows-Security", "log_stream_name": "{instance_id}" }, { "event_format": "xml", "event_levels": ["VERBOSE", "INFORMATION", "WARNING", "ERROR", "CRITICAL"], "event_name": "Microsoft-Windows-Windows Firewall With Advanced Security/Firewall", "log_group_name": "Windows-Firewall", "log_stream_name": "{instance_id}" }] }, "files": { "collect_list": [{ "file_path": "C:\\ProgramData\\Sophos\\Sophos Anti-Virus\\logs\\SAV.txt", "log_group_name": "/Sophos/SAV", "log_stream_name": "{instance_id}" }, { "file_path": "C:\\ProgramData\\Sophos\\Sophos Anti-Virus\\logs\\SAV-Trace.txt", "log_group_name": "/Sophos/SAV-Trace", "log_stream_name": "{instance_id}" }, { "file_path": "C:\\ProgramData\\Sophos\\Sophos Anti-Virus\\logs\\Sophos Cloud Scheduled Scan.txt", "log_group_name": "/Sophos/Sophos-Cloud-Scheduled-Scan", "log_stream_name": "{instance_id}" }, { "file_path": "C:\\ProgramData\\Sophos\\File Integrity Monitoring\\Export\\*.xml", "log_group_name": "/Sophos/FIM", "log_stream_name": "{instance_id}" }, { "file_path": "C:\\ProgramData\\Sophos\\Sophos Anti-Virus\\logs\\oaScannerWatchdog.txt", "log_group_name": "/Sophos/oaScannerWatchdog", "log_stream_name": "{instance_id}" }] } } } }
Environment
MS Windows Server 2016 Datacenter
Sophos Endpoint management
(File Integrity Monitoring 1.0.3.449)
Additional context
While I understand that there are some limitations to wildcard searches and that these should be as specific as possible, "*.xml" is the closest we can get. The filename format is:
YYYY-MM-DD_HH-MM_????_??_??_databatch.xml
Actual filenames:
I am not entirely certain what the "?" fields represent, but they are all digits.
Since the first five fields change in inconsistent patterns, I cannot really be more specific in the wildcard searches.
Sophos handles the rotation of these files and there is a 90 day retention policy.