Maxing out memory and swap space #419

Open
briri opened this issue Jan 6, 2023 · 5 comments

briri commented Jan 6, 2023

The instances have been periodically experiencing high memory usage that eventually maxes out our swap space.

This screenshot shows memory/swap usage and IOPS during a recent incident:
[Screenshot: Screen Shot 2023-01-06 at 7.34.59 AM]

Swap usage begins escalating dramatically between 10 AM and 11 AM on 12/29 and maxes out around 8 PM on 12/30.

The Apache and Rails server logs show no unusual traffic; the Apache access logs show only 152 requests from 8 AM to 1 PM on 12/29.

I suspect an issue with the rack_attack gem we use for rate limiting and throttling malicious activity. These issues coincide with the introduction of the gem, but the correlation may be spurious.
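
For context, our throttling is built on rules along these lines (an illustrative sketch, not our exact configuration). Each matching request increments a counter in Rack::Attack.cache.store, which defaults to Rails.cache, so the gem's activity and cache growth are directly linked.

```ruby
# Illustrative Rack::Attack throttle rule (not the exact DMPTool configuration).
# Every matching request increments a counter in Rack::Attack.cache.store,
# which defaults to Rails.cache.
Rack::Attack.throttle("requests by ip", limit: 300, period: 5.minutes) do |req|
  req.ip
end
```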

briri commented Jan 6, 2023

Our plan:

  • Set up a Nagios alert on swap usage to give us an early warning before it maxes out and impacts site performance
  • Introduce an AWS WAF in front of the ALB (log/monitor only at first so we can make sure it doesn't block legitimate traffic)
  • Investigate the Rails cache configuration and adjust it
  • Investigate the rack_attack gem's use of the Rails cache and adjust its configuration (see the sketch below)
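
As a starting point for that last item, rack_attack's counters can be pointed at a small, bounded in-memory store instead of the shared Rails cache. A minimal sketch, assuming the store choice and size cap are still up for discussion:

```ruby
# config/initializers/rack_attack.rb
# Sketch only: give rack_attack its own bounded in-memory store so throttle
# counters no longer flow through the shared Rails cache. The 32 MB cap is an
# arbitrary illustrative value.
Rack::Attack.cache.store = ActiveSupport::Cache::MemoryStore.new(size: 32.megabytes)
```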

briri mentioned this issue Jan 9, 2023

briri commented Jan 24, 2023

We removed the rack_attack gem from the stage environment and are still seeing the same behavior. Memory usage steadily increases, so we suspect there is a memory leak somewhere.

We're using the default Rails cache store, FileStore, so it should be using disk IO to read and write its cache entries under [project_root]/tmp.
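
For reference, the cache store is set per environment; a minimal sketch of what we'd be looking at in config/environments/production.rb (the memory_store alternative and its size are assumptions, not a decision):

```ruby
# config/environments/production.rb

# Current behavior (the Rails default when no cache store is configured):
config.cache_store = :file_store, Rails.root.join("tmp", "cache").to_s

# One possible adjustment to help rule the cache out as the leak: a bounded
# in-process memory store (the size below is illustrative only).
# config.cache_store = :memory_store, { size: 64.megabytes }
```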

I am going to inspect the Apache logs in the stage environment (since traffic there is low) to see what requests it's actually handling and whether we can drill in from there.

I'll also diff our Gemfile and package.json against what's in the core DMPRoadmap codebase, since the other installations are not seeing this type of behavior (although they are not yet running on Rails 6).

briri commented Oct 9, 2023

We're going to introduce ActiveStorage and DelayedJob in early November to auto-generate narrative PDFs for public plans in the background. This should mitigate some of the 500-level errors we see when bots harvest these PDF files.

We will also offload all communication with the DMPHub to delayed_job so it processes in the background. While implementing this, we discovered a small loop in the callback logic that was causing DMPTool to send updates to the DMPHub four times instead of once. We're not sure whether this was contributing to the memory issues, but fixing it should at least help.
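
Roughly what the background PDF generation will look like, sketched with ActiveJob on top of delayed_job. The class, template, and attachment names are placeholders, not the final implementation:

```ruby
# Sketch only: generate a plan's narrative PDF in the background and attach it
# via ActiveStorage. Assumes config.active_job.queue_adapter = :delayed_job,
# wicked_pdf/wkhtmltopdf for rendering, and `has_one_attached :narrative` on Plan.
class NarrativePdfJob < ApplicationJob
  queue_as :default

  def perform(plan_id)
    plan = Plan.find(plan_id)

    # Render the HTML narrative (template path is illustrative; it reads @plan).
    html = ApplicationController.render(template: "shared/export/plan",
                                         assigns: { plan: plan })
    pdf  = WickedPdf.new.pdf_from_string(html)

    plan.narrative.attach(io: StringIO.new(pdf),
                          filename: "plan_#{plan.id}_narrative.pdf",
                          content_type: "application/pdf")
  end
end

# Enqueued from a callback or controller, e.g.:
#   NarrativePdfJob.perform_later(plan.id)
```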

briri commented Apr 4, 2024

We put a cron job in place to restart Puma on a schedule as a band-aid for this.
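
The actual band-aid is a plain cron entry; the whenever-gem DSL below is just to show the equivalent schedule, and the time and service name are placeholders:

```ruby
# config/schedule.rb (whenever gem) -- illustration only, equivalent to a plain
# crontab line. The restart time and service name are placeholders.
every 1.day, at: "4:30 am" do
  command "sudo systemctl restart puma.service"
end
```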

briri closed this as completed Apr 4, 2024
briri reopened this May 7, 2024

briri commented May 7, 2024

  1. Take 02 out from behind the ELB (and monitor to see if the leak is traffic-related). Also turn off delayed_job on 01 and restart both instances.
  2. Create a branch that removes components like the wkhtmltopdf gem and run it on a single instance to see whether the leak persists (see the sketch after this list).
  3. Send logs to OpenSearch.
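
While testing instances in isolation, per-request memory logging could help correlate memory growth with traffic. A hypothetical Rack middleware sketch (Linux-only, since it reads /proc/self/status; the class name and log format are made up):

```ruby
# Hypothetical middleware: log the process RSS after each request so we can see
# whether particular requests correlate with memory growth. Linux-only.
class MemoryLogger
  def initialize(app)
    @app = app
  end

  def call(env)
    response = @app.call(env)
    rss_kb = File.read("/proc/self/status")[/VmRSS:\s+(\d+)/, 1].to_i
    Rails.logger.info("RSS after #{env['REQUEST_METHOD']} #{env['PATH_INFO']}: #{rss_kb / 1024} MB")
    response
  end
end

# Enabled in config/application.rb with:
#   config.middleware.use MemoryLogger
```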
