Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache results of common database queries #349

Merged

Conversation

DasSkelett
Copy link
Member

Motivation

Thanks to #346 we now have very valuable profiling data for our backend. After fixing the previous bottleneck, repeated template bytecode compilation due to disabled caching, the major part of most requests is now spent on database queries.
The answer: moar caching!

This PR does an initial step at caching the most common database queries (some of them executed on every request), that tend to not change at all or only very infrequently, or don't hurt if they are a bit out of date.
It's the low hanging fruit, there's still a lot more we should eventually cache, but those need some refactoring and other bigger changes, or careful planning. I preferred to get the easy stuff in first.

Changes

We are caching all of the following requests in a TTLCache, that is s LRU cache with timeouts, using the cachetools library:

It allows caching the return values of a function by simply slapping a decorator on it, which makes it pretty easy to use.

Cached lookups:

  • The announcement blog post query done in inject() (executed for every request) cached for 30 minutes. If we create a new announcement or delete/edit an existing one, which happens very rarely, it doesn't hurt if it takes some time to be visible to all users.
    Caveat: it can happen that the workers get "out of sync" for up to 30 minutes, and the banner might change with every request depending on which worker your request goes to.
  • All the mod lookups for /browse, /browse/*.rss, /<gameshort>/browse, /<gameshort>, /<gameshort>/browse/*.rss cached for 30 minutes. They are factored out into get_*_mods() functions in common.py, one for each of those categories (featured/new/updated/top). What changes most there are the new and updated mods, but they should be fine being 30 minutes behind as well.
  • /browse/all, /browse/top, /<gameshort>/browse/all, /<gameshort>/browse/top, /api/browse/top, as well as /search and /<gameshort>/search through search_mods() cached for 10 minutes. I decided to give them a lower timeout, since mod searches also go through this function, and we don't really need to cache them. Maybe we should split the functions called by */top and */search at some point (and merge it with get_top_mods()).
    search_mods() now takes a Game.id instead of the whole Game object, since that's all it needed anyway, and the Game objects change every time they're looked up from the database (i.e. with every request, they're never the same), which bypassed the cache.
  • The queries for referrals, download stats and follow events in /mod/<mod_id>, /mod/<mod_id>/<mod_name>/stats/(downloads|referrals|followers) cached for 30 minutes.
    There could be some issues due slightly mismatching 30 day time frames of the downloads/follows and thirty_days_ago which is then used by the graph workers. I might do some database manipulation to test it. But most likely we are fine there.

The cache times are a blind guess at what could be acceptable as lag behind current data and effectiveness of the cache.
The derivation of the cache sizes is documented in the code. I basically did a generous calculation of the amount of possible function argument combinations (since each combination gets its own key in the cache), then added or removed a bit depending on the situation.
Every cache size gets multiplied by 48 for the amoutnt of worker (and the actual size in bytes of the returned object), which needs to be considered.

I'm open for suggestions for better values of these two fields.

Additional changes

  • Renamed get_mods() to get_paginated_mods() and paginate_mods() to paginate_query() to help distinguishing them from each other (and search_mods()). Also paginate_query() could be used for other objects as well theoretically, not only mods.

Some numbers

Path Before After Comment
/<gameshort> 115ms 25ms I think part of the reason this one and the one below (and /browse) were so bad is that we sorted by Mod.download_count, which is not indexed.
/<gameshort>/browse 50ms 12ms
/<gameshort>/browse/top 65ms 20ms
/mod/<mod_id> 38ms 35ms Interestingly no relevant measurable improvement, maybe because we're still doing a ton of other SELECTs and UPDATEs
/ 10ms 8ms

@DasSkelett DasSkelett added Type: Improvement Area: Backend Related to the Python code that runs inside gunicorn Priority: Normal Status: Ready labels May 8, 2021
@DasSkelett DasSkelett requested a review from HebaruSan May 8, 2021 21:38
Copy link
Contributor

@HebaruSan HebaruSan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even aside from the performance improvements, it will be great to have so much duplicated code factored out into properly shared functions!

Copy link
Contributor

@HebaruSan HebaruSan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To alpha it goes!

@HebaruSan HebaruSan merged commit a0dea25 into KSP-SpaceDock:alpha May 9, 2021
@DasSkelett DasSkelett deleted the enh/cache-frequent-queries branch May 9, 2021 13:46
@DasSkelett
Copy link
Member Author

DasSkelett commented May 10, 2021

Works fine, until someone triggers a 404 response:

2021-05-10 10:19:20,552 ERROR Backend:6761 Exception on /mod/7/What about Chrome/download [G
ET]
Traceback (most recent call last):
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/lib/python3.8/site-packages/flask/a
pp.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/lib/python3.8/site-packages/flask/a
pp.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/KerbalStuff/common.py", line 65, in
 go
    ret = f(*args, **kwargs)
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/KerbalStuff/blueprints/mods.py", li
ne 529, in download
    abort(404)
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/lib/python3.8/site-packages/werkzeu
g/exceptions.py", line 772, in abort
    return _aborter(status, *args, **kwargs)
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/lib/python3.8/site-packages/werkzeu
g/exceptions.py", line 753, in __call__
    raise self.mapping[code](*args, **kwargs)
werkzeug.exceptions.NotFound: 404 Not Found: Requested page not found. Looks like this was deleted, or maybe was never here.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/lib/python3.8/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/lib/python3.8/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/lib/python3.8/site-packages/flask/app.py", line 1816, in handle_user_exception
    return self.handle_http_exception(e)
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/lib/python3.8/site-packages/flask/app.py", line 1744, in handle_http_exception
    return handler(e)
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/KerbalStuff/app.py", line 137, in handle_404
    return render_template("error_404.html", error=e), 404
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/lib/python3.8/site-packages/flask/templating.py", line 137, in render_template
    return _render(
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/lib/python3.8/site-packages/flask/templating.py", line 120, in _render
    rv = template.render(context)
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/lib/python3.8/site-packages/jinja2/environment.py", line 1090, in render
    self.environment.handle_exception()
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/lib/python3.8/site-packages/jinja2/environment.py", line 832, in handle_exception
    reraise(*rewrite_traceback_stack(source=source))
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/lib/python3.8/site-packages/jinja2/_compat.py", line 28, in reraise
    raise value.with_traceback(tb)
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/KerbalStuff/../templates/error_404.html", line 1, in top-level template code
    {% extends "layout.html" %}
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/KerbalStuff/../templates/layout.html", line 164, in top-level template code
    <h1>{{ announcement.title }} <small>{{ announcement.created.strftime("%Y-%m-%d %H:%M") }}</small></h1>
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/lib/python3.8/site-packages/jinja2/environment.py", line 471, in getattr
    return getattr(obj, attribute)
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/lib/python3.8/site-packages/sqlalchemy/orm/attributes.py", line 480, in __get__
    return self.impl.get(state, dict_)
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/lib/python3.8/site-packages/sqlalchemy/orm/attributes.py", line 922, in get
    value = state._load_expired(state, passive)
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/lib/python3.8/site-packages/sqlalchemy/orm/state.py", line 675, in _load_expired
    self.manager.expired_attribute_loader(self, toload, passive)
  File "/var/www/virtual/spacedock.info/htdocs/SpaceDock/lib/python3.8/site-packages/sqlalchemy/orm/loading.py", line 1343, in load_scalar_attributes
    raise orm_exc.DetachedInstanceError(
sqlalchemy.orm.exc.DetachedInstanceError: Instance <BlogPost at 0x7f61d37a1550> is not bound to a Session; attribute refresh operation cannot proceed (Background on this error at: http://sqlalche.me/e/14/bhk3)

Looks like SqlAlchemy automatically expires all ORM objects when an exception is raised (because we don't for 404 errors, only for real 500s), maybe through a roolback or something.
When objects are expired, an access to their attributes triggers a new db querie to make sure they are still up to date.
However, for our cached announcements (and maybe the other cached things too), the session in which they were initially fetched is probably long dead, thus SqlAlchemy fails to refresh these objects.

Preferably we'd find a way to completely decouple these objects from SqlAlchemy/the ORM//the database/the session; we don't need any of those features after they are cached. I kind of hoped .all() would do this, but apparently not.

Some useful links:

It sounds like session.expunge(obj) is what we want, going to try that out.

Edit: Having trouble reproducing this bug locally. Something behaves differently.

@DasSkelett
Copy link
Member Author

Okay, I managed to somewhat reproduce it by creating a new session inside get_announcement_posts and calling .rollbak() and .close() on it after fetching the posts.
Looping through the returned posts and calling session.expunge(post) indeed fixes this error.

However, I fear that this is also affection mods returned in get_*_mods, and all the other cached queries.
The problem with mod objects is, that they have a lot of relationships to other tables. Due to SqlAlchemy's lazy loading, these relations aren't fetched before they're accessed the first time, and thus even calling expunge() for mods doesn't get rid of this exception when accessing relationship columns.
I tried forcing eager-loading for all attributes (https://docs.sqlalchemy.org/en/14/orm/loading_relationships.html#wildcard-loading-strategies), but this kills the database because it seems to fetch all and everything remotely connected to a mod (and I think before filtering/limiting them, so for all x thousand mod rows). Eager-loading just a few columns isn't a good idea either, since we don't know what all the calling functions need, and it would be a maintenance nightmare.

I'm gonna look into it a bit further, but I might end up removing the cache decorators (but keep the refactoring) until we have figured out the correct way to do this (if it's even possible).

@HebaruSan
Copy link
Contributor

I'm gonna look into it a bit further, but I might end up removing the cache decorators (but keep the refactoring) until we have figured out the correct way to do this (if it's even possible).

Sounds good to me. We should keep alpha in a working state as much as possible so we can deploy smoothly whenever our site owner finds it convenient.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: Backend Related to the Python code that runs inside gunicorn Priority: Normal Status: Ready Type: Improvement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants