Maxing out memory and swap space #419

Open
briri opened this issue Jan 6, 2023 · 5 comments

briri commented Jan 6, 2023

The instances have been periodically experiencing high memory usage that eventually maxes out our swap space.

This screenshot shows memory/swap usage and IOPS during a recent incident:
[Screenshot: Screen Shot 2023-01-06 at 7.34.59 AM]

Swap usage begins escalating dramatically between 10 AM and 11 AM on 12/29 and maxes out around 8 PM on 12/30.

The Apache and Rails server logs show no unusual traffic; the Apache access logs show only 152 requests from 8 AM to 1 PM on 12/29.

I suspect an issue with the rack_attack gem we use for rate limiting and throttling malicious activity. These issues coincide with the introduction of the gem, but the correlation may be spurious.
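
For context, our throttling is built on rules along these lines (an illustrative sketch, not our exact configuration). Each matching request increments a counter in Rack::Attack.cache.store, which defaults to Rails.cache, so the gem's activity and cache growth are directly linked.

```ruby
# Illustrative Rack::Attack throttle rule (not the exact DMPTool configuration).
# Every matching request increments a counter in Rack::Attack.cache.store,
# which defaults to Rails.cache.
Rack::Attack.throttle("requests by ip", limit: 300, period: 5.minutes) do |req|
  req.ip
end
```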

briri commented Jan 6, 2023

Our plan:

  • Set up a Nagios alert on swap usage to give us an early warning before it maxes out and impacts site performance
  • Introduce an AWS WAF in front of the ALB (log/monitor only at first so we can make sure it doesn't block legitimate traffic)
  • Investigate the Rails cache configuration and adjust it
  • Investigate the rack_attack gem's use of the Rails cache and adjust its configuration (see the sketch below)
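
As a starting point for that last item, rack_attack's counters can be pointed at a small, bounded in-memory store instead of the shared Rails cache. A minimal sketch, assuming the store choice and size cap are still up for discussion:

```ruby
# config/initializers/rack_attack.rb
# Sketch only: give rack_attack its own bounded in-memory store so throttle
# counters no longer flow through the shared Rails cache. The 32 MB cap is an
# arbitrary illustrative value.
Rack::Attack.cache.store = ActiveSupport::Cache::MemoryStore.new(size: 32.megabytes)
```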

briri mentioned this issue Jan 9, 2023

briri commented Jan 24, 2023

We removed the rack_attack gem from the stage environment and are still seeing the same behavior. Memory usage steadily increases, so we suspect there is a memory leak somewhere.

We're using the default Rails cache store, FileStore, so it should be using disk IO to read and write its cache entries under [project_root]/tmp.
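
For reference, the cache store is set per environment; a minimal sketch of what we'd be looking at in config/environments/production.rb (the memory_store alternative and its size are assumptions, not a decision):

```ruby
# config/environments/production.rb

# Current behavior (the Rails default when no cache store is configured):
config.cache_store = :file_store, Rails.root.join("tmp", "cache").to_s

# One possible adjustment to help rule the cache out as the leak: a bounded
# in-process memory store (the size below is illustrative only).
# config.cache_store = :memory_store, { size: 64.megabytes }
```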

I am going to inspect the Apache logs in the stage environment (since traffic there is low) to see what requests it's actually handling and whether we can drill in from there.

I'll also diff our Gemfile and package.json against what's in the core DMPRoadmap codebase, since the other installations are not seeing this type of behavior (although they are not yet running on Rails 6).

briri commented Oct 9, 2023

We're going to introduce ActiveStorage and DelayedJob in early November to auto-generate narrative PDFs for public plans in the background. This should mitigate some of the 500-level errors we see when bots harvest these PDF files.

We will also offload all communication with the DMPHub to delayed_job so it processes in the background. While implementing this, we discovered a small loop in the callback logic that was causing DMPTool to send updates to the DMPHub four times instead of once. We're not sure whether this was contributing to the memory issues, but fixing it should at least help.
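
Roughly what the background PDF generation will look like, sketched with ActiveJob on top of delayed_job. The class, template, and attachment names are placeholders, not the final implementation:

```ruby
# Sketch only: generate a plan's narrative PDF in the background and attach it
# via ActiveStorage. Assumes config.active_job.queue_adapter = :delayed_job,
# wicked_pdf/wkhtmltopdf for rendering, and `has_one_attached :narrative` on Plan.
class NarrativePdfJob < ApplicationJob
  queue_as :default

  def perform(plan_id)
    plan = Plan.find(plan_id)

    # Render the HTML narrative (template path is illustrative; it reads @plan).
    html = ApplicationController.render(template: "shared/export/plan",
                                         assigns: { plan: plan })
    pdf  = WickedPdf.new.pdf_from_string(html)

    plan.narrative.attach(io: StringIO.new(pdf),
                          filename: "plan_#{plan.id}_narrative.pdf",
                          content_type: "application/pdf")
  end
end

# Enqueued from a callback or controller, e.g.:
#   NarrativePdfJob.perform_later(plan.id)
```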

briri commented Apr 4, 2024

We put a cron job in place to restart Puma on a schedule as a band-aid for this.
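
The actual band-aid is a plain cron entry; the whenever-gem DSL below is just to show the equivalent schedule, and the time and service name are placeholders:

```ruby
# config/schedule.rb (whenever gem) -- illustration only, equivalent to a plain
# crontab line. The restart time and service name are placeholders.
every 1.day, at: "4:30 am" do
  command "sudo systemctl restart puma.service"
end
```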

briri closed this as completed Apr 4, 2024
briri reopened this May 7, 2024

briri commented May 7, 2024

  1. Take 02 out from behind the ELB (and monitor to see if the leak is traffic-related). Also turn off delayed_job on 01 and restart both instances.
  2. Create a branch that removes components like the wkhtmltopdf gem and run it on a single instance to see whether the leak persists (see the sketch after this list).
  3. Send logs to OpenSearch.
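
While testing instances in isolation, per-request memory logging could help correlate memory growth with traffic. A hypothetical Rack middleware sketch (Linux-only, since it reads /proc/self/status; the class name and log format are made up):

```ruby
# Hypothetical middleware: log the process RSS after each request so we can see
# whether particular requests correlate with memory growth. Linux-only.
class MemoryLogger
  def initialize(app)
    @app = app
  end

  def call(env)
    response = @app.call(env)
    rss_kb = File.read("/proc/self/status")[/VmRSS:\s+(\d+)/, 1].to_i
    Rails.logger.info("RSS after #{env['REQUEST_METHOD']} #{env['PATH_INFO']}: #{rss_kb / 1024} MB")
    response
  end
end

# Enabled in config/application.rb with:
#   config.middleware.use MemoryLogger
```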
