Use exists subquery to find deleted model instance pks #791

jeremy-engel · 2019-05-30T04:09:31Z

This is an idea that may improve the performance of get_deleted for some of the users in #748.

The changes shift the subquery on the model on get_deleted to use the built-in Exists (only django versions 1.11 and up are listed as supported as of now, and Exists was added in this version). The subquery should work across supported backends, the test suite passes, and I've used locally in MySQL and Postgres successfully.

As far as performance goes, things get a bit tricky, and I've been able to look into query times only in Postgres at the moment. But, in Postgres, I've experienced performance changes similar similar to the following when running get_deleted on the entire Version queryset:

For smaller models (in my testing, models with 10,000 or fewer instances), the original query takes 20% of the time of the new query (though on reasonable hardware for models of this size we are talking only a few milliseconds)
For larger models (in my testing, models with 100,000 or more instances), I've found that the new query takes 20% of the time of the original query (from roughly 10 seconds to 2 seconds in a relatively large model)

But there is another use case in which I've found more significant performance improvements (see jedie/django-reversion-compare#95). In the linked issue, the performance issues are from the following line (https://github.com/jedie/django-reversion-compare/blob/0bfe214e40933e38a5e5a94f3c6d0f56de9051be/reversion_compare/compare.py#L206):

    deleted = list(Version.objects.filter(revision=old_revision).get_deleted(related_model))

Here, the original query still must iterate over the pks the related model, but using Exists results in a significant performance improvement (in a model with a few million instances and many deletions, the query went from about 50 seconds to 1.5 milliseconds).

That being said, there are a couple of issues:

The performance improvement for larger models and filtered querysets may not be worth the cost for other cases
Should work with django's natively supported database backends, but performance may vary

In the general case, performance should be similar to the previous subquery. But when the queryset is filtered, a potentially costly scan of the model pks is avoided, greatly improving performance in models with many instances and revisions. Such a situation occurs in the `django-reversion-compare` package in https://github.com/jedie/django-reversion-compare/blob/0bfe214e40933e38a5e5a94f3c6d0f56de9051be/reversion_compare/compare.py#L206:

etianen · 2019-05-30T09:15:50Z

I like losing the custom SQL, and I like the idea of some performance improvements.

Can you make some benchmarks for the affected queries and views please? I only care about PostgreSQL, MySQL and SQLite. I'm happy to accept a small performance degradation for small datasets if the improvement for large datasets is as significant as you say.

jeremy-engel · 2019-05-30T17:19:29Z

To get some cross-database benchmarks, I used TestModel from test_app. To populate, the objects were created in a revision, saved in another revision, and then every other object was deleted using the following:

from test_app.models import TestModel
import reversion
from reversion.models import Version


def create_test_objects(num_objects):
    for obj_num in range(num_objects):
        with reversion.create_revision():
            obj = TestModel.objects.create()
        with reversion.create_revision():
            obj.save()
        # Delete half of created objects
        if obj_num % 2 == 0:
            obj.delete()

I ran the original query vs. the version in this pull request for postgres, sqlite, and mysql with 100, 1,000, and 10,000 objects (and 100,000 for postgres) for Version.objects.get_deleted(TestModel):

Postgres 11.3

Number of objects	Original query	New query
100	1.5 ms	1.7 ms
1000	4.8 ms	5.8 ms
10000	92.3 ms	35.0 ms
100000	933.7 ms	257.7 ms

MySQL 8.0

Number of objects	Original query	New query
100	2.1 ms	7.7 ms
1000	12.8 ms	237.7 ms
10000	85.8 ms	22313.4 ms

SQLite 3.24

Number of objects	Original query	New query
100	167 ms	166 ms
1000	1422 ms	1398 ms
10000	25430 ms	22557 ms

So MySQL should get the old fallback subquery again, but the postgres results are improved.

etianen · 2019-06-03T08:25:53Z

Thanks for a great PR, and a solid set of benchmarks! Merged.

Re-add fallback subquery for MySQL

f01740f

etianen merged commit 5a5b653 into etianen:master Jun 3, 2019

jeremy-engel deleted the get-deleted-speed branch June 5, 2019 03:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use exists subquery to find deleted model instance pks #791

Use exists subquery to find deleted model instance pks #791

jeremy-engel commented May 30, 2019

etianen commented May 30, 2019

jeremy-engel commented May 30, 2019

etianen commented Jun 3, 2019

Use exists subquery to find deleted model instance pks #791

Use exists subquery to find deleted model instance pks #791

Conversation

jeremy-engel commented May 30, 2019

etianen commented May 30, 2019

jeremy-engel commented May 30, 2019

etianen commented Jun 3, 2019