Find the cause for the nightly latency spikes #344
Comments
I like the idea of making the visualization pages admin-only, since then we could put the new routes under the admin section. It might even make sense to make profiling the default admin page, because the list of users isn't really that interesting.
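A minimal sketch of that idea, assuming a hypothetical `admin_required` decorator and profile directory (neither is necessarily what SpaceDock already has):

```python
# Hypothetical sketch: an admin-only /admin/profiling route that lists whatever
# profiling output has been generated. "admin_required" and PROFILE_DIR are
# placeholders, not SpaceDock's actual helpers or paths.
from functools import wraps
from pathlib import Path
from flask import Blueprint, abort
from flask_login import current_user

profiling = Blueprint("profiling", __name__)
PROFILE_DIR = Path("/storage/profiles")  # assumed location of .prof/.svg output


def admin_required(f):
    @wraps(f)
    def wrapper(*args, **kwargs):
        # SpaceDock would plug in its existing admin check here.
        if not (current_user.is_authenticated and getattr(current_user, "admin", False)):
            abort(403)
        return f(*args, **kwargs)
    return wrapper


@profiling.route("/admin/profiling")
@admin_required
def profiling_index():
    # List whatever .prof/.svg files have been collected so far.
    files = sorted(p.name for p in PROFILE_DIR.glob("*") if p.suffix in (".prof", ".svg"))
    return "<br>".join(files) or "No profiles collected yet."
```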
On closer inspection, I am not sure that
Also looked at
This one mentions both Flask and cProfile.
More chances to get my hopes up:
This one turns a .prof file from cProfile into an .svg; maybe we could run that in a scheduled task to pre-render all the data we've accumulated, then list the .svg files in the profiling admin route. It even has a mode where it outputs SVGs directly into the profiling directory:
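A rough sketch of that scheduled pre-rendering idea, assuming the .prof dumps accumulate in one directory; it shells out to gprof2dot plus Graphviz's dot as one possible converter, not necessarily the tool linked above:

```python
# Sketch of a scheduled task that renders accumulated cProfile dumps to SVG.
# gprof2dot + Graphviz "dot" are just one possible converter; paths are assumptions.
import subprocess
from pathlib import Path

PROFILE_DIR = Path("/storage/profiles")  # wherever the .prof files accumulate


def render_new_profiles():
    for prof in sorted(PROFILE_DIR.glob("*.prof")):
        svg = prof.with_suffix(".svg")
        if svg.exists():
            continue  # already rendered on a previous run
        dot_graph = subprocess.run(
            ["gprof2dot", "-f", "pstats", str(prof)],
            check=True, capture_output=True,
        ).stdout
        subprocess.run(["dot", "-Tsvg", "-o", str(svg)], input=dot_graph, check=True)


if __name__ == "__main__":
    render_new_profiles()
```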
For the record on GitHub: profiling works great, and we've already done some huge optimizations with more still in the pipeline (#345, #370), but the nightly slowdowns are not solved; instead they seem to be even worse. They very much look like they're caused by the storage for the mod zips, backgrounds and thumbnails misbehaving (an SMB/CIFS mount on the host, then bind-mounted into the container, I think).
Small summary and next steps: we still have the issue with the nightly slowdowns, and it got worse and worse over the last few weeks. We have basically two ways to make AWS load the files from disk itself:
So now we tried 1) in another attempt to combat the nightly slowdowns, but for whatever reason it serves some files correctly, while other files return either a 502 or time out after exactly 30s, with cURL reporting:
The cause is unknown; it might or might not be a bug in AWS, or a config error.
(Easier to get back to this if it's open...)
This sounds related:
I've seen these weird kernel stack traces in alpha's syslog as well:
Workaround/fix proposed in the comments:
I know from previous research that mmap with CIFS can be problematic.
Collecting together what I surmise the optimal config would be for prod, for easy pasting:

ProxyPass /content/ !
Alias /content/ /storage/sdmods/
# CIFS and MMAP don't get along
EnableMMAP off
@V1TA5 reports that some of this load may be due to an anti-virus scan. If so, rescheduling it so it doesn't overlap US prime-time activity may help.
With the recent production upgrade (#398), even the remaining slowdowns have disappeared completely. All those performance improvements (including reducing storage/disk access, caching costly operations, reducing database queries, etc.) really paid off! I'm going to close this issue now, as this problem is resolved. @HebaruSan opened some issues to track further improvement opportunities, and also for discussion about the virus scanning, since we do want to get it back in some form or another (with less performance impact).
Description (What went wrong?):
Every day/night between 23:00 UTC and 05:00 UTC, SpaceDock's server-side processing times spike from around 500ms to 2-4s, sometimes even up to 8 seconds.
This is measured from a server with a very good connection to SpaceDock, against the /kerbal-space-program/browse/top endpoint, using the Prometheus Blackbox Exporter. The measurement is split up into the different request steps (see the first picture):
Here's a second graph of the processing time only, divided into buckets and on a linear scale.
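For reference, roughly the same connect/tls/processing/transfer split can be reproduced ad hoc with pycurl; a sketch of the idea, not the actual Blackbox Exporter setup:

```python
# Rough reproduction of the connect/tls/processing/transfer split from libcurl timings.
import io
import pycurl

def probe(url):
    buf = io.BytesIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.WRITEDATA, buf)
    c.perform()
    connect = c.getinfo(pycurl.CONNECT_TIME)          # TCP connect done
    tls = c.getinfo(pycurl.APPCONNECT_TIME)           # TLS handshake done
    pretransfer = c.getinfo(pycurl.PRETRANSFER_TIME)  # request about to be sent
    ttfb = c.getinfo(pycurl.STARTTRANSFER_TIME)       # first response byte received
    total = c.getinfo(pycurl.TOTAL_TIME)
    c.close()
    return {
        "connect": connect,
        "tls": max(tls - connect, 0.0),
        "processing": ttfb - pretransfer,   # server-side "think" time
        "transfer": total - ttfb,
    }

print(probe("https://spacedock.info/kerbal-space-program/browse/top"))
```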
This does not only affect /kerbal-space-program/browse/top; if you visit any other page on SpaceDock during these times, you'll experience equally slow response times / long loading times.

Fixing approach
The latency spikes happen during the evening in the US, likely when the request load is highest on SpaceDock.
There are one or more bottlenecks we need to find and fix to avoid such high processing times.
We can probably rule out network bandwidth, since otherwise we would see the spikes in "transfer", "tls" and "connect", not in "processing". Also, alpha is totally fine during these times.
It could be the database being overloaded, lock contention or whatever.
It could be the template rendering taking so long.
It could be gunicorn having too few or too many workers (or the number of gunicorn instances per se; we're still at 6 instances with 8 workers each).
It could be memory pressure / exhaustion (maybe coupled with too many gunicorn workers).
It could be some expensive code path.
...
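Independent of full profiling, a cheap first step could be coarse per-request timing, so slow endpoints can be correlated with the 23:00-05:00 UTC window; a minimal sketch (threshold and logger name are placeholders, not SpaceDock code):

```python
# Minimal sketch: log any request that takes longer than a threshold.
import logging
import time
from flask import Flask, g, request

app = Flask(__name__)
SLOW_THRESHOLD = 1.0  # seconds, placeholder


@app.before_request
def _start_timer():
    g._start = time.monotonic()


@app.after_request
def _log_slow(response):
    elapsed = time.monotonic() - getattr(g, "_start", time.monotonic())
    if elapsed > SLOW_THRESHOLD:
        logging.getLogger("slow-requests").warning(
            "%s %s took %.2fs (status %s)",
            request.method, request.path, elapsed, response.status_code,
        )
    return response
```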
Profiling
To get us closer to the cause, it would be nice if we could do some profiling of SpaceDock's performance.
There are basically three different ways to do this:
Profile individual requests on local development server
This can (and did) reveal some duplicated database calls, repeatedly called functions that are expensive and could/should be cached, and some other stuff.
The data we can get out of this is very limited though, and very far from real production performance data.
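One low-effort way to do this locally (a sketch for a development setup, not SpaceDock's current config) is Werkzeug's bundled ProfilerMiddleware, which runs each request under cProfile and can write one .prof file per request:

```python
# Development-only sketch: dump a cProfile .prof file per request via Werkzeug.
from flask import Flask
from werkzeug.middleware.profiler import ProfilerMiddleware

app = Flask(__name__)

# profile_dir makes it write .prof files (inspectable with snakeviz)
# instead of printing stats to stdout.
app.wsgi_app = ProfilerMiddleware(app.wsgi_app, profile_dir="/tmp/profiles")

if __name__ == "__main__":
    app.run(debug=True)
```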
Profile alpha/beta using load tests
We could enable some profiler in alpha/beta, hit it hard, and try to find the problem.
Could give some hints about where it fails under load, and has a lower risk of affecting production.
Just hammering a single endpoint won't give us accurate real-world data, and trying to simulate real-world traffic will be hard to impossible (at least without knowing what the real-world traffic looks like).
Alpha+beta can be accessed by the two primary code maintainers of SpaceDock (me and @HebaruSan), so getting data in and out of there would be very easy and can be done without having to coordinate too much with @V1TA5.
Profile production by sampling x real-world requests per time period (or a set %)
This would give us the most accurate data, matching what the production server actually experiences.
Tracing every request would cause performance to drop even more, so we'd need to restrict the profiling to only a few requests every now and then (how often can be discussed once we've found a way to do it). We can also leave this running in the background indefinitely.
Since only @V1TA5 (and Darklight) have access to production, changing profiling settings and getting profile data out of it will be difficult. This one would probably require us to make the profiling data available directly as part of SpaceDock, e.g. by either running a web visualizer, or by making it downloadable somehow. Should be admin-locked.
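A rough stdlib-only sketch of the sampling idea: wrap the WSGI app and profile only a small fraction of requests, dumping .prof files to a directory the admin page could expose. The rate and path are placeholders for the discussion above, and streamed/iterated responses aren't fully covered:

```python
# Sketch only: profile roughly 1% of requests with cProfile and dump the result to disk.
import cProfile
import random
import time

SAMPLE_RATE = 0.01                 # placeholder sampling rate
PROFILE_DIR = "/storage/profiles"  # placeholder dump location


class SamplingProfilerMiddleware:
    """Wrap a WSGI app and profile a random fraction of requests."""

    def __init__(self, wsgi_app, sample_rate=SAMPLE_RATE, profile_dir=PROFILE_DIR):
        self.wsgi_app = wsgi_app
        self.sample_rate = sample_rate
        self.profile_dir = profile_dir

    def __call__(self, environ, start_response):
        if random.random() >= self.sample_rate:
            return self.wsgi_app(environ, start_response)
        profiler = cProfile.Profile()
        try:
            return profiler.runcall(self.wsgi_app, environ, start_response)
        finally:
            out = f"{self.profile_dir}/{int(time.time() * 1000)}.prof"
            profiler.dump_stats(out)


# Usage (e.g. in the app factory): app.wsgi_app = SamplingProfilerMiddleware(app.wsgi_app)
```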
In the end we probably get the most out of it if we have all three possibilities.
(Local profiling is basically already possible, but we could make it simpler to set up.)
Profiling Tools
We already found the following:
flask-profiler: https://github.com/muatik/flask-profiler#sampling
snakeviz: https://jiffyclub.github.io/snakeviz/
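Since the flask-profiler README linked above has a sampling section, here is a rough configuration sketch following it; the keys and values are examples only and should be double-checked against the installed version:

```python
# Rough sketch following the flask-profiler README; config values are examples only.
import random
from flask import Flask
import flask_profiler

app = Flask(__name__)
app.config["flask_profiler"] = {
    "enabled": True,
    "storage": {"engine": "sqlite"},
    # Only profile ~1% of requests to keep overhead low in production.
    "sampling_function": lambda: random.random() < 0.01,
    "basicAuth": {"enabled": True, "username": "admin", "password": "changeme"},
    "ignore": ["^/static/.*"],
}

# Register routes first, then initialize; flask-profiler wraps the endpoints
# that exist at init time and serves its dashboard under /flask-profiler/.
flask_profiler.init_app(app)
```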