Cache results of common database queries #349

DasSkelett · 2021-05-08T21:38:03Z

Motivation

Thanks to #346 we now have very valuable profiling data for our backend. After fixing the previous bottleneck, repeated template bytecode compilation due to disabled caching, the major part of most requests is now spent on database queries.
The answer: moar caching!

This PR does an initial step at caching the most common database queries (some of them executed on every request), that tend to not change at all or only very infrequently, or don't hurt if they are a bit out of date.
It's the low hanging fruit, there's still a lot more we should eventually cache, but those need some refactoring and other bigger changes, or careful planning. I preferred to get the easy stuff in first.

Changes

We are caching all of the following requests in a TTLCache, that is s LRU cache with timeouts, using the cachetools library:

It allows caching the return values of a function by simply slapping a decorator on it, which makes it pretty easy to use.

Cached lookups:

The announcement blog post query done in inject() (executed for every request) cached for 30 minutes. If we create a new announcement or delete/edit an existing one, which happens very rarely, it doesn't hurt if it takes some time to be visible to all users.
Caveat: it can happen that the workers get "out of sync" for up to 30 minutes, and the banner might change with every request depending on which worker your request goes to.
All the mod lookups for /browse, /browse/*.rss, /<gameshort>/browse, /<gameshort>, /<gameshort>/browse/*.rss cached for 30 minutes. They are factored out into get_*_mods() functions in common.py, one for each of those categories (featured/new/updated/top). What changes most there are the new and updated mods, but they should be fine being 30 minutes behind as well.
/browse/all, /browse/top, /<gameshort>/browse/all, /<gameshort>/browse/top, /api/browse/top, as well as /search and /<gameshort>/search through search_mods() cached for 10 minutes. I decided to give them a lower timeout, since mod searches also go through this function, and we don't really need to cache them. Maybe we should split the functions called by */top and */search at some point (and merge it with get_top_mods()).
search_mods() now takes a Game.id instead of the whole Game object, since that's all it needed anyway, and the Game objects change every time they're looked up from the database (i.e. with every request, they're never the same), which bypassed the cache.
The queries for referrals, download stats and follow events in /mod/<mod_id>, /mod/<mod_id>/<mod_name>/stats/(downloads|referrals|followers) cached for 30 minutes.
There could be some issues due slightly mismatching 30 day time frames of the downloads/follows and thirty_days_ago which is then used by the graph workers. I might do some database manipulation to test it. But most likely we are fine there.

The cache times are a blind guess at what could be acceptable as lag behind current data and effectiveness of the cache.
The derivation of the cache sizes is documented in the code. I basically did a generous calculation of the amount of possible function argument combinations (since each combination gets its own key in the cache), then added or removed a bit depending on the situation.
Every cache size gets multiplied by 48 for the amoutnt of worker (and the actual size in bytes of the returned object), which needs to be considered.

I'm open for suggestions for better values of these two fields.

Additional changes

Renamed get_mods() to get_paginated_mods() and paginate_mods() to paginate_query() to help distinguishing them from each other (and search_mods()). Also paginate_query() could be used for other objects as well theoretically, not only mods.

Some numbers

Path	Before	After	Comment
`/<gameshort>`	115ms	25ms	I think part of the reason this one and the one below (and `/browse`) were so bad is that we sorted by `Mod.download_count`, which is not indexed.
`/<gameshort>/browse`	50ms	12ms
`/<gameshort>/browse/top`	65ms	20ms
`/mod/<mod_id>`	38ms	35ms	Interestingly no relevant measurable improvement, maybe because we're still doing a ton of other SELECTs and UPDATEs
`/`	10ms	8ms

KerbalStuff/blueprints/anonymous.py

KerbalStuff/common.py

HebaruSan

Even aside from the performance improvements, it will be great to have so much duplicated code factored out into properly shared functions!

HebaruSan

To alpha it goes!

DasSkelett · 2021-05-10T11:19:49Z

Works fine, until someone triggers a 404 response:

2021-05-10 10:19:20,552 ERROR Backend:6761 Exception on /mod/7/What about Chrome/download [G
ET]
Traceback (most recent call last):
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/lib/python3.8/site-packages/flask/a
pp.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/lib/python3.8/site-packages/flask/a
pp.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/KerbalStuff/common.py", line 65, in
 go
    ret = f(*args, **kwargs)
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/KerbalStuff/blueprints/mods.py", li
ne 529, in download
    abort(404)
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/lib/python3.8/site-packages/werkzeu
g/exceptions.py", line 772, in abort
    return _aborter(status, *args, **kwargs)
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/lib/python3.8/site-packages/werkzeu
g/exceptions.py", line 753, in __call__
    raise self.mapping[code](*args, **kwargs)
werkzeug.exceptions.NotFound: 404 Not Found: Requested page not found. Looks like this was deleted, or maybe was never here.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/lib/python3.8/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/lib/python3.8/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/lib/python3.8/site-packages/flask/app.py", line 1816, in handle_user_exception
    return self.handle_http_exception(e)
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/lib/python3.8/site-packages/flask/app.py", line 1744, in handle_http_exception
    return handler(e)
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/KerbalStuff/app.py", line 137, in handle_404
    return render_template("error_404.html", error=e), 404
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/lib/python3.8/site-packages/flask/templating.py", line 137, in render_template
    return _render(
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/lib/python3.8/site-packages/flask/templating.py", line 120, in _render
    rv = template.render(context)
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/lib/python3.8/site-packages/jinja2/environment.py", line 1090, in render
    self.environment.handle_exception()
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/lib/python3.8/site-packages/jinja2/environment.py", line 832, in handle_exception
    reraise(*rewrite_traceback_stack(source=source))
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/lib/python3.8/site-packages/jinja2/_compat.py", line 28, in reraise
    raise value.with_traceback(tb)
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/KerbalStuff/../templates/error_404.html", line 1, in top-level template code
    {% extends "layout.html" %}
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/KerbalStuff/../templates/layout.html", line 164, in top-level template code
    <h1>{{ announcement.title }} <small>{{ announcement.created.strftime("%Y-%m-%d %H:%M") }}</small></h1>
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/lib/python3.8/site-packages/jinja2/environment.py", line 471, in getattr
    return getattr(obj, attribute)
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/lib/python3.8/site-packages/sqlalchemy/orm/attributes.py", line 480, in __get__
    return self.impl.get(state, dict_)
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/lib/python3.8/site-packages/sqlalchemy/orm/attributes.py", line 922, in get
    value = state._load_expired(state, passive)
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/lib/python3.8/site-packages/sqlalchemy/orm/state.py", line 675, in _load_expired
    self.manager.expired_attribute_loader(self, toload, passive)
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/lib/python3.8/site-packages/sqlalchemy/orm/loading.py", line 1343, in load_scalar_attributes
    raise orm_exc.DetachedInstanceError(
sqlalchemy.orm.exc.DetachedInstanceError: Instance <BlogPost at 0x7f61d37a1550> is not bound to a Session; attribute refresh operation cannot proceed (Background on this error at: http://sqlalche.me/e/14/bhk3)

Looks like SqlAlchemy automatically expires all ORM objects when an exception is raised (because we don't for 404 errors, only for real 500s), maybe through a roolback or something.
When objects are expired, an access to their attributes triggers a new db querie to make sure they are still up to date.
However, for our cached announcements (and maybe the other cached things too), the session in which they were initially fetched is probably long dead, thus SqlAlchemy fails to refresh these objects.

Preferably we'd find a way to completely decouple these objects from SqlAlchemy/the ORM//the database/the session; we don't need any of those features after they are cached. I kind of hoped .all() would do this, but apparently not.

Some useful links:

https://docs.sqlalchemy.org/en/14/errors.html#error-bhk3
https://docs.sqlalchemy.org/en/14/orm/session_state_management.html#session-expire
https://docs.sqlalchemy.org/en/14/orm/session_state_management.html#when-to-expire-or-refresh

The Session uses the expiration feature automatically whenever the transaction referred to by the session ends. Meaning, whenever Session.commit() or Session.rollback() is called, all objects within the Session are expired
https://docs.sqlalchemy.org/en/14/orm/session_api.html?highlight=expire#sqlalchemy.orm.Session.expunge
https://stackoverflow.com/a/7717565

It sounds like session.expunge(obj) is what we want, going to try that out.

Edit: Having trouble reproducing this bug locally. Something behaves differently.

DasSkelett · 2021-05-10T23:30:15Z

Okay, I managed to somewhat reproduce it by creating a new session inside get_announcement_posts and calling .rollbak() and .close() on it after fetching the posts.
Looping through the returned posts and calling session.expunge(post) indeed fixes this error.

However, I fear that this is also affection mods returned in get_*_mods, and all the other cached queries.
The problem with mod objects is, that they have a lot of relationships to other tables. Due to SqlAlchemy's lazy loading, these relations aren't fetched before they're accessed the first time, and thus even calling expunge() for mods doesn't get rid of this exception when accessing relationship columns.
I tried forcing eager-loading for all attributes (https://docs.sqlalchemy.org/en/14/orm/loading_relationships.html#wildcard-loading-strategies), but this kills the database because it seems to fetch all and everything remotely connected to a mod (and I think before filtering/limiting them, so for all x thousand mod rows). Eager-loading just a few columns isn't a good idea either, since we don't know what all the calling functions need, and it would be a maintenance nightmare.

I'm gonna look into it a bit further, but I might end up removing the cache decorators (but keep the refactoring) until we have figured out the correct way to do this (if it's even possible).

HebaruSan · 2021-05-11T01:36:09Z

I'm gonna look into it a bit further, but I might end up removing the cache decorators (but keep the refactoring) until we have figured out the correct way to do this (if it's even possible).

Sounds good to me. We should keep alpha in a working state as much as possible so we can deploy smoothly whenever our site owner finds it convenient.

Cache results of common database queries

94e28b9

DasSkelett added Type: Improvement Area: Backend Related to the Python code that runs inside gunicorn Priority: Normal Status: Ready labels May 8, 2021

DasSkelett requested a review from HebaruSan May 8, 2021 21:38

HebaruSan reviewed May 9, 2021

View reviewed changes

KerbalStuff/blueprints/anonymous.py Show resolved Hide resolved

HebaruSan reviewed May 9, 2021

View reviewed changes

KerbalStuff/common.py Show resolved Hide resolved

HebaruSan requested changes May 9, 2021

View reviewed changes

HebaruSan approved these changes May 9, 2021

View reviewed changes

HebaruSan merged commit a0dea25 into KSP-SpaceDock:alpha May 9, 2021

DasSkelett deleted the enh/cache-frequent-queries branch May 9, 2021 13:46

DasSkelett mentioned this pull request May 13, 2021

Revert caching of database queries #352

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cache results of common database queries #349

Cache results of common database queries #349

DasSkelett commented May 8, 2021

HebaruSan left a comment

HebaruSan left a comment

DasSkelett commented May 10, 2021 •

edited

Loading

DasSkelett commented May 10, 2021

HebaruSan commented May 11, 2021

Cache results of common database queries #349

Cache results of common database queries #349

Conversation

DasSkelett commented May 8, 2021

Motivation

Changes

Cached lookups:

Additional changes

Some numbers

HebaruSan left a comment

Choose a reason for hiding this comment

HebaruSan left a comment

Choose a reason for hiding this comment

DasSkelett commented May 10, 2021 • edited Loading

DasSkelett commented May 10, 2021

HebaruSan commented May 11, 2021

DasSkelett commented May 10, 2021 •

edited

Loading