SEV postmortem: outage due to elevated CPU and memory usage #129
Comments
Resource spikes appear to be continuing the following day, leading to occasional outages. Overall, traffic has increased dramatically over the past day. However, the entire application shouldn't be able to fall over just because there are too many people using it. The user code should be constrained instead. I will take some time to investigate this weekend.
Resolving #82 would likely also improve the situation.
Upgrading from t3.medium to t3.large to remediate the current load, since that's still within our budget due to recent donations. I will keep the existing instance around separately, so that I can test with it this weekend.
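For the record, the resize itself is a stop/modify/start cycle; a minimal sketch with the AWS CLI, where the instance ID is a placeholder and this assumes a plain EBS-backed instance (not something managed by an auto scaling group):

```sh
# Changing the instance type requires the instance to be stopped first.
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0

# Switch from t3.medium to t3.large.
aws ec2 modify-instance-attribute \
  --instance-id i-0123456789abcdef0 \
  --instance-type "{\"Value\": \"t3.large\"}"

# Bring it back up.
aws ec2 start-instances --instance-ids i-0123456789abcdef0
```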
This should help by giving 3GB of headroom for the server and operating system instead of 1GB. Empirically, it looks like the OOM killer is operating properly and killing user code rather than system processes, but the small amount of headroom could have been a problem. Extensive use of swap could also have been a problem, so I disabled swap for user code. I also reduced the CPU quota so that user code no longer has access to bursting, and bumped the pid quota because we had a lot of headroom there.
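For concreteness, here is a minimal sketch of how those limits could be applied to the user-code slice with `systemctl set-property`; the property names are standard systemd resource-control settings, but the specific values below are illustrative assumptions rather than the exact production numbers:

```sh
# Cap total memory for user containers, leaving roughly 3GB of headroom
# for the webserver and OS on the 8GB t3.large (value is an assumption).
sudo systemctl set-property riju.slice MemoryMax=5G

# Disable swap for user code entirely.
sudo systemctl set-property riju.slice MemorySwapMax=0

# Fixed CPU allotment so user code cannot draw on the instance's burst
# credits (exact value is an assumption).
sudo systemctl set-property riju.slice CPUQuota=60%

# Generous pid limit, since there was plenty of headroom there.
sudo systemctl set-property riju.slice TasksMax=4096
```

`set-property` persists the values as unit drop-ins, and `systemctl show riju.slice` confirms the effective limits.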
It appears that Riju was unavailable from about 10:15pm to 10:20pm PT on Thu, Oct 21, 2021.
Supervisor logs during the incident
Webserver logs during the incident
Kernel logs during the incident
CPU and memory usage during the incident:
It appears that service was restored by the kernel OOM killer doing its job and killing some user processes (Racket code) that were consuming too much memory.
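For anyone reviewing the attached logs, the OOM kills are straightforward to pull out of the kernel log; a minimal sketch (the time window assumes the server clock matches the PT times above, and the grep patterns are the usual OOM-killer message phrases, not quotes from these specific logs):

```sh
# OOM-killer messages from the kernel around the incident window.
journalctl -k --since "2021-10-21 22:10" --until "2021-10-21 22:25" \
  | grep -iE "out of memory|oom-kill|killed process"

# Equivalent check without journald, using human-readable timestamps.
dmesg -T | grep -iE "out of memory|oom-kill"
```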
Now, it should be impossible for CPU and memory usage to spike to 100% no matter what code people are running, because the total resources allocated to user containers are constrained by the `riju.slice` systemd cgroup. Something may be wrong with that, or there may be some way of forcing the webserver container itself to consume a lot of CPU and memory; perhaps programs that generate a lot of output do this. But given that the OOM killer got involved and identified Racket code as problematic, it seems likely that this was the fault of user code that was not properly resource-constrained by its cgroup.

One way of getting more data to understand incidents like this would be #81, which would help to see whether someone had executed a pathological program that caused an outage, thus allowing the issue to be reproduced and addressed outside of production.
Other suggestions are certainly appreciated. I'm still learning how to manage production systems effectively.