Find the cause for the nightly latency spikes #344

Closed
DasSkelett opened this issue Apr 28, 2021 · 14 comments · Fixed by #356
Labels
Area: Backend (Related to the Python code that runs inside gunicorn) · Priority: Normal · Type: Improvement · Type: Performance

Comments

@DasSkelett
Member

Description (What went wrong?):

Every day/night between 23:00 UTC and 05:00 UTC, SpaceDock's server-side processing times spike from around 500 ms to 2–4 s, sometimes even up to 8 seconds.
This is measured with the Prometheus Blackbox Exporter from a server with a very good connection to SpaceDock's, probing the /kerbal-space-program/browse/top endpoint.

The measurement is split up in different request steps (see the first picture):

  • connect: TCP connection setup, pretty fast, always stable – the lowest line in the graph, the legend row for it is not visible in the picture
  • tls: setting up and negotiating the TLS session, handled by ATS – third line from the bottom, equally stable
  • processing: the time the client waits for the server to answer the request, i.e. server-side processing, going all the way through to gunicorn/Flask – top line, the problem child, with latency spikes and outages
  • transfer: the time from the beginning of the response until it is fully received – second line from the bottom, fails when processing fails

Here's a second graph of the processing time only, divided into buckets and on a linear scale.

This does not only affect /kerbal-space-program/browse/top; if you visit any other page on SpaceDock during these times, you'll experience equally slow response and loading times.

Fixing approach

The latency spikes happen during the evening in the US, likely when the request load is highest on SpaceDock.
There are one or more bottlenecks that we need to find and fix to avoid such high processing times.

We can probably rule out network bandwidth; otherwise we would see the spikes in "transfer", "tls" and "connect", not in "processing". Also, alpha is totally fine during these times.
It could be the database being overloaded, lock contention, or something similar.
It could be the template rendering taking so long.
It could be gunicorn having too few or too many workers (or gunicorn instances per se, we're still at 6 instances à 8 workers).
It could be memory pressure / exhaustion (maybe coupled with too many gunicorn workers).
It could be some expensive code path.
...

Profiling

To get us closer to the cause, it would be nice if we could do some profiling of SpaceDock's performance.

There are basically three different ways to do this:

Profile individual requests on a local development server

This can (and did) reveal some duplicated database calls, repeatedly called functions that are expensive and could/should be cached, and some other stuff.
The data we can get out of this is very limited though, and very far from real production performance data.
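
For illustration, a single request can already be profiled locally against the Flask test client with cProfile. A minimal sketch, assuming a hypothetical spacedock.app module that exposes the Flask app (the import path and route are placeholders, not our real module layout):

    # Minimal local-profiling sketch; the import path and route are placeholders.
    import cProfile
    import pstats

    from spacedock.app import app  # hypothetical import of the Flask app object

    def profile_single_request(path="/kerbal-space-program/browse/top"):
        client = app.test_client()
        profiler = cProfile.Profile()
        profiler.enable()
        client.get(path)  # the request we want to inspect
        profiler.disable()
        stats = pstats.Stats(profiler)
        stats.sort_stats("cumulative").print_stats(25)  # top 25 offenders
        stats.dump_stats("browse_top.prof")  # dump for snakeviz or similar viewers

    if __name__ == "__main__":
        profile_single_request()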

Profile alpha/beta using load tests

We could enable some profiler in alpha/beta, hit it hard, and try to find the problem.
Could give some hints about where it fails under load, has a lower risk of affecting production.
Just hammering a single endpoint won't give us accurate real-world data, and trying to simulate real-world traffic will be hard to impossible (at least without knowing what the real-world traffic looks like).
Alpha+beta can be accessed by the two primary code maintainers of SpaceDock (me and @HebaruSan), so getting data in and out of there would be very easy and can be done without having to coordinate too much with @V1TA5.

Profile production by sampling x real-world requests per time period (or a set %)

This would give us the most accurate data, matching what the production server actually experiences.
Tracing every request would cause performance to drop even more, so we'd need to restrict the profiling to only a few requests every now and then (how often can be discussed once we've found a way to do it). We could also leave this running in the background indefinitely.
Since only @V1TA5 (and Darklight) have access to production, changing profiling settings and getting profile data out of it will be difficult. This one would probably require us to make the profiling data available directly as part of SpaceDock, e.g. by either running a web visualizer, or by making it downloadable somehow. Should be admin-locked.
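
A minimal sketch of what such sampling could look like, assuming a plain WSGI wrapper around the Flask app (the sample rate, output directory, and class name are all placeholders, not a decided design):

    # Hedged sketch: profile only a small fraction of requests in production.
    import cProfile
    import os
    import random
    import time

    SAMPLE_RATE = 0.01                       # profile roughly 1% of requests (placeholder)
    PROFILE_DIR = "/tmp/spacedock-profiles"  # placeholder output directory

    class SamplingProfilerMiddleware:
        """WSGI middleware that dumps a cProfile file for a random subset of requests."""

        def __init__(self, wsgi_app):
            self.wsgi_app = wsgi_app
            os.makedirs(PROFILE_DIR, exist_ok=True)

        def __call__(self, environ, start_response):
            if random.random() >= SAMPLE_RATE:
                return self.wsgi_app(environ, start_response)
            profiler = cProfile.Profile()
            try:
                return profiler.runcall(self.wsgi_app, environ, start_response)
            finally:
                path = environ.get("PATH_INFO", "/").strip("/").replace("/", ".") or "root"
                profiler.dump_stats(os.path.join(PROFILE_DIR, f"{int(time.time())}_{path}.prof"))

    # app.wsgi_app = SamplingProfilerMiddleware(app.wsgi_app)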

In the end we probably get the most out of it if we have all three possibilities.
(Local profiling is basically already possible, but we could make it simpler to set up.)

Profiling Tools

We already found the following:
flask-profiler: https://github.com/muatik/flask-profiler#sampling

  • Has a very customizable sample rate feature built-in

snakeviz: https://jiffyclub.github.io/snakeviz/

  • Has inverted flame graphs called icicles
  • can also list functions sorted by total time consumption or number of calls, regardless of call source (i.e. from different places)
  • runs as a web app; could be run as a separate application in the production container, or maybe even from inside our backend as a separate route, potentially using our existing admin authorization system
@DasSkelett DasSkelett added the Type: Improvement, Area: Backend, and Priority: Normal labels on Apr 28, 2021
@HebaruSan
Contributor

HebaruSan commented Apr 28, 2021

I like the idea of making the visualization pages admin-only, since then we could put the new routes at /admin/profiling and /admin/profiling/<int:id>, which would help to keep them out of the way of the normal site routes.

It might even make sense to make profiling the default admin page, because the list of users isn't really that interesting.

@HebaruSan
Contributor

HebaruSan commented Apr 28, 2021

On closer inspection, I am not sure that flask-profiler and snakeviz are compatible.

flask-profiler seems to only collect HTTP headers, start time, and elapsed time, which could be useful to a project that wishes to find out which of their routes are slow, but does not provide the kind of stack trace sampling that would be needed for flame/icicle graphs.

snakeviz says that it parses data files generated by cProfile, which is not what flask-profiler seems to be doing. Maybe there is another plugin that can do it...

Also looked at Flask-Profile / flask.ext.profile, which seems about the same; it just collects route run times rather than stack trace samples.

@HebaruSan
Contributor

This one mentions both Flask and cProfile.
https://werkzeug.palletsprojects.com/en/1.0.x/middleware/profiler/
Possibly we could dump data into a directory and then get a listing of those files for the visualization page. But so far I don't see how to turn it on and off for different requests; it seems that it would profile everything once you enable it.
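
For reference, enabling that middleware looks roughly like this (the app import and profile directory are placeholders; as noted, once enabled it profiles every request and writes one cProfile .prof file per request into the directory):

    # Rough sketch of the werkzeug profiler middleware mentioned above.
    from werkzeug.middleware.profiler import ProfilerMiddleware

    from spacedock.app import app  # hypothetical import of the Flask app object

    app.wsgi_app = ProfilerMiddleware(
        app.wsgi_app,
        profile_dir="/tmp/spacedock-profiles",  # one .prof file per request
        restrictions=[30],                      # limit the printed summary to 30 entries
    )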

@HebaruSan
Contributor

snakeviz seems to depend on IPython / Jupyter, so it may not be able to generate visualizations in a Flask process (even though it says it is "a browser based graphical viewer").

@HebaruSan
Contributor

HebaruSan commented Apr 28, 2021

More chances to get my hopes up:

This one turns a .prof file from cProfile into an .svg; maybe we could run that in a scheduled task to pre-render all the data we've accumulated, then list the .svg files in the profiling admin route. It even has a mode where it outputs SVGs directly into the profiling directory!

@HebaruSan
Contributor

HebaruSan commented Apr 29, 2021

@DasSkelett
Member Author

DasSkelett commented Jun 15, 2021

For the record on GitHub: profiling works great, and we've already done some huge optimizations, with more still in the pipeline (#345, #370). However, the nightly slowdowns are not solved; instead they seem to be even worse.

They very heavily look like they're caused by the storage for the mod zips, backgrounds and thumbnails misbehaving (an SMB/CIFS mount on the host, then bind-mounted into the container, I think).
This results in very high server load, slowing down other applications, but it also freezes disk access itself:
(The attached screenshot shows a request that took 14.5 seconds to complete.)

@DasSkelett
Member Author

DasSkelett commented Jun 23, 2021

Small summary and next steps:

We have the issue with the nightly slowdowns, which has gotten worse and worse over the last few weeks.
Earlier today I discovered that we aren't serving user content (zips, background images, thumbnails) directly from the Apache Web Server [AWS] (which sits inside the SpaceDock container, between ATS and gunicorn); instead we pass every such request through to gunicorn, which gets overwhelmed trying to serve all these big files. It really isn't made for that: it blocks workers for a long time, and it seems to load the whole file into memory before responding to the request instead of streaming it in chunks.

We have basically two ways to make AWS load the files from disk itself:

  1. Prohibit proxy-passing the requests to gunicorn, and alias the /content/ route to the storage directory:
    ProxyPass /content/ !
    Alias /content/ /storage/sd-alpha/
    
  2. Use the XSendFile feature. This basically makes gunicorn send an HTTP header (X-SendFile) with a file path in its response, which tells AWS to read the file from disk and serve it itself. I think there's some kernel magic involved as well: in kernel-world, sendfile means copying a file from one file descriptor to another in kernel space instead of sending it to user space and moving it there, saving a few CPU cycles. See https://linux.die.net/man/2/sendfile. AWS makes use of this feature, I think. (See the sketch after this list.)
    To use XSendFile, the cdn-domain config option needs to be unset due to how our code works right now. It then also no longer redirects to a /content URL, so caching from ATS would be disabled (maybe, I don't know how it's configured).
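
As a rough illustration of option 2 (a sketch under assumptions, not our actual code: the route and storage path are placeholders, and AWS would additionally need X-Sendfile support, e.g. mod_xsendfile, allowed for the storage directory):

    # Hedged sketch of the X-Sendfile approach; route and storage path are illustrative.
    from flask import Flask, send_file

    app = Flask(__name__)
    app.config["USE_X_SENDFILE"] = True  # send_file() sets an X-Sendfile header
                                         # instead of streaming the file through gunicorn

    @app.route("/content/<path:filename>")
    def serve_content(filename):
        # AWS reads the file from disk and serves it itself,
        # so gunicorn workers aren't blocked on big downloads.
        return send_file(f"/storage/sd-alpha/{filename}", as_attachment=True)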

So we now tried 1) in another attempt to combat the nightly slowdowns, but for whatever reason it serves some files correctly, while others return either a 502 or time out after exactly 30 s, with cURL reporting:

curl: (92) HTTP/2 stream 0 was closed cleanly, but before getting  all response header fields, treated as error

The cause is unknown, might or might not be a bug in AWS, or a config error.

@HebaruSan
Contributor

(Easier to get back to this if it's open...)

@HebaruSan HebaruSan reopened this Jun 23, 2021
@DasSkelett DasSkelett pinned this issue Jun 23, 2021
@DasSkelett DasSkelett changed the title from "Profile the backend / find the cause for the daily latency spikes" to "Find the cause for the nightly latency spikes" on Jun 23, 2021
@DasSkelett
Member Author

This sounds related:

apache reads wrong data over cifs filesystems served by samba
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=900821

I've seen these weird kernel stack traces in alpha's syslog as well:
Jun 23 14:27:28 sd-alpha kernel: [1113982.280769] Call Trace:
Jun 23 14:27:28 sd-alpha kernel: [1113982.281730]  pagecache_get_page+0xff/0x2e0
Jun 23 14:27:28 sd-alpha kernel: [1113982.282812] Code: Bad RIP value.
Jun 23 14:27:28 sd-alpha kernel: [1113982.286344] [  11974]     0 11974     9508       32    65536       88             0 master
Jun 23 14:27:28 sd-alpha kernel: [1113982.287381] [  24532]    33 24532   469445    54876  3878912   393876             0 gunicorn
Jun 23 14:27:28 sd-alpha kernel: [1113982.288443] [  22956]    33 22956    54373    32213   532480     1474             0 gunicorn
Jun 23 14:27:28 sd-alpha kernel: [1113982.291749] [  16111]    33 16111    56617    20435   548864    15527             0 gunicorn
Jun 23 14:27:28 sd-alpha kernel: [1113982.295979] [  11638]    33 11638    35034     1425   307200    12891             0 celery
Jun 23 01:20:04 sd-alpha kernel: [1066737.502715] CIFS VFS: Close unmatched open
Jun 23 01:28:32 sd-alpha kernel: [1067245.511085] CPU: 9 PID: 22260 Comm: apache2 Tainted: P           O      5.4.119-1-pve #1
Jun 23 01:28:32 sd-alpha kernel: [1067245.511870]  try_charge+0x76b/0x7e0
Jun 23 01:28:32 sd-alpha kernel: [1067245.512791]  ? filemap_map_pages+0x28d/0x3b0
Jun 23 01:28:32 sd-alpha kernel: [1067245.513699] RAX: 0000000000000001 RBX: 00007f965c247398 RCX: 00007f965c77212b
Jun 23 01:32:25 sd-alpha kernel: [1067478.244693] Call Trace:
Jun 23 01:32:25 sd-alpha kernel: [1067478.245776]  ? unlock_page_memcg+0x12/0x20
Jun 23 01:32:25 sd-alpha kernel: [1067478.246919] RSP: 002b:00007f965bfc2df8 EFLAGS: 00010246
Jun 23 01:32:25 sd-alpha kernel: [1067478.251938] [  14162]    33 14162   187028   164910  1589248     1377             0 gunicorn
Jun 23 01:32:25 sd-alpha kernel: [1067478.257485] [  11609]     0 11609    58159      113    90112       94             0 accounts-daemon
Jun 23 01:32:25 sd-alpha kernel: [1067478.259886] [  11626]   112 11626   157310        0   143360      232             0 prometheus-apac
Jun 23 01:34:09 sd-alpha kernel: [1067583.003212] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Jun 23 01:34:09 sd-alpha kernel: [1067583.009798] [  31938]    33 31938    58478     4721   544768    33082             0 gunicorn
Jun 23 01:34:09 sd-alpha kernel: [1067583.010122] [  17399]    33 17399    50885    25731   479232     4498             0 gunicorn
Jun 23 01:34:09 sd-alpha kernel: [1067583.011602] [  11510]     0 11510    48202      222   401408      127          -250 systemd-journal
Jun 23 01:34:09 sd-alpha kernel: [1067583.068167] oom_reaper: reaped process 23806 (gunicorn), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Jun 23 01:34:56 sd-alpha kernel: [1067629.879105] CIFS VFS: Close unmatched open
Jun 23 01:34:57 sd-alpha kernel: [1067630.662049] CPU: 14 PID: 32080 Comm: apache2 Tainted: P           O      5.4.119-1-pve #1
Jun 23 01:34:57 sd-alpha kernel: [1067630.662963]  __add_to_page_cache_locked+0x265/0x340
Jun 23 01:34:57 sd-alpha kernel: [1067630.663878]  __do_fault+0x3c/0x130
Jun 23 01:34:57 sd-alpha kernel: [1067630.665833] R10: 0000000000000000 R11: 0000000000000000 R12: 00007f965bfc2e6c
Jun 23 01:34:57 sd-alpha kernel: [1067630.669566] [  14162]    33 14162   273894   251817  2285568     1377             0 gunicorn
Jun 23 01:34:57 sd-alpha kernel: [1067630.670871] [   4935]    33  4935    53750    31736   536576     1357             0 gunicorn
Jun 23 01:34:57 sd-alpha kernel: [1067630.672182] [  16956]    33 16956   233781   211414  1957888     1669             0 gunicorn
Jun 23 01:34:57 sd-alpha kernel: [1067630.678776] [  12041]    33 12041    34710        0   294912    13924             0 celery
Jun 23 01:34:57 sd-alpha kernel: [1067630.679878] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=ns,mems_allowed=0,oom_memcg=/lxc/101,task_memcg=/lxc/101/ns/system.slice/system-spacedock.slice/spacedock@8009.service,task=gunicorn,pid=17289,uid=33
Jun 23 01:35:31 sd-alpha kernel: [1067664.380327] [   9113]    33  9113   175832   113258  1499136    41823             0 gunicorn
Jun 23 01:35:31 sd-alpha kernel: [1067664.385528] [   6118]    33  6118    48619    21245   491520     6719             0 gunicorn
Jun 23 01:35:31 sd-alpha kernel: [1067664.386797] [  12177]    33 12177     6703      191    90112     3213             0 gunicorn
Jun 23 01:35:31 sd-alpha kernel: [1067664.388184] [   5706]    33  5706   502606       35   299008     2226             0 apache2
Jun 23 01:35:31 sd-alpha kernel: [1067664.389079] [  32674]    33 32674   502858     1025   294912      191             0 apache2
Jun 23 01:35:31 sd-alpha kernel: [1067664.389869] [  11635]   101 11635    56087      203    86016       51             0 rsyslogd
Jun 23 01:35:31 sd-alpha kernel: [1067664.391123] [  11652]     0 11652     4119       82    73728      122             0 systemd-logind
Jun 23 01:37:51 sd-alpha kernel: [1067804.264258] [  27038]    33 27038    43890    14676   450560     8557             0 gunicorn
Jun 23 01:37:51 sd-alpha kernel: [1067804.269497] [  22026]    33 22026   535860     1497   364544     1112             0 apache2
Jun 23 01:37:51 sd-alpha kernel: [1067804.270306] [  11621]     0 11621     6574       47    94208     1859             0 networkd-dispat
Jun 23 01:37:51 sd-alpha kernel: [1067804.271272] [  12041]    33 12041    34710        0   294912    13924             0 celery
Jun 23 01:37:51 sd-alpha kernel: [1067804.272189] [  11676]     0 11676      658       28    40960        0             0 agetty
Jun 23 01:38:52 sd-alpha kernel: [1067865.728116]  add_to_page_cache_lru+0x4f/0xd0
Jun 23 01:38:52 sd-alpha kernel: [1067865.729242]  page_fault+0x34/0x40
Jun 23 01:38:52 sd-alpha kernel: [1067865.735235] [  29446]    33 29446    50648    14055   487424    15938             0 gunicorn
Jun 23 01:38:52 sd-alpha kernel: [1067865.736748] [   8786]    33  8786   259890   236216  2158592     2969             0 gunicorn
Jun 23 01:38:52 sd-alpha kernel: [1067865.739326] [  12716]    33 12716    54471    32389   520192     1388             0 gunicorn
Jun 23 01:38:52 sd-alpha kernel: [1067865.740787] [  31553]    33 31553   502996     1534   299008      311             0 apache2
Jun 23 01:38:52 sd-alpha kernel: [1067865.818355] oom_reaper: reaped process 18953 (gunicorn), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Jun 23 01:40:25 sd-alpha kernel: [1067958.672277] [   6118]    33  6118    48299    25648   491520     1996             0 gunicorn
Jun 23 01:40:25 sd-alpha kernel: [1067958.675414] [  31553]    33 31553   502888     1832   299008      298             0 apache2
Jun 23 01:40:25 sd-alpha kernel: [1067958.676665] [  12022]    33 12022    34708        0   294912    13923             0 celery

Workaround/fix proposed in the comments:

Hi,

by default, apache uses mmap, so probably mmap is broken on cifs. An
alternate workaround should be to set EnableMMAP off in the apache
config.

Cheers,
Stefan

I know from previous research that mmap with CIFS can be problematic.

@HebaruSan
Contributor

Collecting together what I surmise the optimal config would be for prod for easy pasting:

ProxyPass /content/ !
Alias /content/ /storage/sdmods/
# CIFS and MMAP don't get along
EnableMMAP off

@HebaruSan
Contributor

@V1TA5 reports that some of this load may be due to an anti-virus scan. If so, rescheduling it to not overlap US activity primetime may help.


@DasSkelett
Member Author

DasSkelett commented Aug 7, 2021

The virus scan is pretty much confirmed as cause.
After disabling it yesterday, the big slowdown was gone, with only small, individual spikes in processing time that could be attributed to general user-generated load:

(Tonight in yellow, average Friday->Saturday night in light blue, all times UTC+2)

I'd leave this issue open for a bit, to make sure it wasn't a coincidence and to check how it behaves over the next few days.

We also need to think about how to bring the scan back in a less disruptive way. SpaceDock's mods should definitely be scanned, as we're essentially an open file hosting service accessible to everyone.


And a small follow-up to the EnableMMAP story:
We set EnableMMAP off and the proxy bypass for /content/ on production, and it works great: no more corrupt zips. It may or may not have reduced CPU and memory consumption; I don't have data on this (and any effect might've been negligible compared to the virus scan, but it should be substantial without it).

@DasSkelett
Member Author

With the recent production upgrade (#398) even the remaining slowdowns have disappeared completely:
(Grafana screenshot showing processing time for monitoring web requests)

All those performance improvements (including reducing storage/disk access, caching costly operations, reducing database queries etc.) really paid off!

I'm going to close this issue now as this problem is resolved. @HebaruSan opened some issues to track further improvement opportunities, and also for discussion about the virus scanning, as we do want to get it back in some form or another (with less performance impact).
