Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate using subqueries to filter across relationships #99

Open
rpkilby opened this issue Jul 31, 2016 · 2 comments
Open

Investigate using subqueries to filter across relationships #99

rpkilby opened this issue Jul 31, 2016 · 2 comments

Comments

@rpkilby
Copy link
Collaborator

rpkilby commented Jul 31, 2016

Just to rehash, the goal of this issue is to investigate the differences between the following two queries:

Blog.objects.filter(entry__in=Entry.objects.filter(headline__contains="Lennon"))

Blog.objects.filter(
    pk__in=Entry.objects \
                .filter(headline__contains="Lennon") \
                .values('blog')
)

In the latter query, by filtering on the pk of the blogs with the blog_ids of the filtered entries, we avoid a having to JOIN and having to specify DISTINCT.


As detailed here, there are nuances to filtering across to-many relationships. In general, most users would want to find blogs with entries about Lennon during 2008, not blogs that have entries about Lennon and that also have entries during 2008.

# combined join statements, generally the desired case
Blog.objects.filter(entry__headline__contains='Lennon', entry__pub_date__year=2008)

# chained join statements, the latter, less desirable case
Blog.objects.filter(entry__headline__contains='Lennon').filter(entry__pub_date__year=2008)

# Nested join is equivalent to the first case
Blog.objects.filter(
    entry=Entry.objects.filter(
        headline__contains='Lennon',
        pub_date__year=2008,
    ),
)

# It's also possible to use a nested subquery-only approach
Blog.objects.filter(
    id__in=Entry.objects.filter(
        headline__contains='Lennon',
        pub_date__year=2008,
    ).values('blog'),
)

In general, the combined join statements syntax (1st case above) is not possible to construct with the current implementation of django-filter. To make this possible, the filter API would need to return Q objects instead of invoking filter() and exclude() directly. Additionally, exclude() does not follow the same rules as filter(), and combined joins will behave like chained excludes.

  • Need to compare the sql/results of each approach. What queries are interchangeable?
  • Test matrix:
    • filter vs exclude
    • chained vs combined
    • join statements vs nested join vs nested subquery

Either way, it will be necessary to use either the nested join or nested subquery approach. This makes it possible to filter the default queryset (permissions handling).

Todo:

  • are chained and combined subqueries equivalent?

    Blog.objects.filter(
        entry__in=Entry.objects.filter(
            headline__contains='Lennon',
            pub_date__year=2008,
        ),
    )
    
    Blog.objects.filter(
        entry=Entry.objects \
            .filter(headline__contains='Lennon') \
            .filter(pub_date__year=2008)
    )
@rpkilby rpkilby modified the milestone: v1.0.0 Aug 2, 2016
@rpkilby rpkilby changed the title Investigate using subqueries to filter across relationships Investigate using nested queries to filter across relationships Nov 12, 2016
@rpkilby
Copy link
Collaborator Author

rpkilby commented Nov 13, 2016

Results:

q1 is the current behavior, but generally incorrect and has poor performance (n table joins for n filter statements). This also requires a distinct() call in to deduplicate the results.

q2 is unfeasible to generate, but correct.

q3/q4 and q5/q6 should be equivalent, with the exception that q3 & q4 rely on a join. That said, q3 & q4 are easier to generate than the subquery approach. The subquery approach is more difficult since it doesn't automatically account for to_fields.


filter() queries:

join statements nested join nested subquery
chained q1 q3 q5
combined q2 q4 q6
q1 = Blog.objects \
    .filter(entry__headline__contains='Lennon') \
    .filter(entry__pub_date__year=2008)

q2 = Blog.objects.filter(
    entry__headline__contains='Lennon',
    entry__pub_date__year=2008,
)

q3 = Blog.objects.filter(
    entry__in=Entry.objects \
        .filter(headline__contains='Lennon') \
        .filter(pub_date__year=2008)
)

q4 = Blog.objects.filter(
    entry__in=Entry.objects.filter(
        headline__contains='Lennon',
        pub_date__year=2008,
    ),
)

q5 = Blog.objects.filter(
    id__in=Entry.objects \
        .filter(headline__contains='Lennon') \
        .filter(pub_date__year=2008) \
        .values('blog')
)

q6 = Blog.objects.filter(
    id__in=Entry.objects.filter(
        headline__contains='Lennon',
        pub_date__year=2008,
    ).values('blog'),
)

q1

SELECT "testapp_blog"."id",
       "testapp_blog"."name"
FROM "testapp_blog"
INNER JOIN "testapp_entry" ON ("testapp_blog"."id" = "testapp_entry"."blog_id")
INNER JOIN "testapp_entry" T3 ON ("testapp_blog"."id" = T3."blog_id")
WHERE ("testapp_entry"."headline" LIKE %Lennon% ESCAPE '\'
       AND T3."pub_date" BETWEEN 2008-01-01 AND 2008-12-31) 

q2

SELECT "testapp_blog"."id",
       "testapp_blog"."name"
FROM "testapp_blog"
INNER JOIN "testapp_entry" ON ("testapp_blog"."id" = "testapp_entry"."blog_id")
WHERE ("testapp_entry"."headline" LIKE %Lennon% ESCAPE '\'
       AND "testapp_entry"."pub_date" BETWEEN 2008-01-01 AND 2008-12-31) 

q3 & q4

SELECT "testapp_blog"."id",
       "testapp_blog"."name"
FROM "testapp_blog"
INNER JOIN "testapp_entry" ON ("testapp_blog"."id" = "testapp_entry"."blog_id")
WHERE "testapp_entry"."id" IN
    (SELECT U0."id"
     FROM "testapp_entry" U0
     WHERE (U0."headline" LIKE %Lennon% ESCAPE '\'
            AND U0."pub_date" BETWEEN 2008-01-01 AND 2008-12-31)) 

q5 & q6

SELECT "testapp_blog"."id",
       "testapp_blog"."name"
FROM "testapp_blog"
WHERE "testapp_blog"."id" IN
    (SELECT U0."blog_id"
     FROM "testapp_entry" U0
     WHERE (U0."headline" LIKE %Lennon% ESCAPE '\'
            AND U0."pub_date" BETWEEN 2008-01-01 AND 2008-12-31)) 

@rpkilby rpkilby modified the milestones: v0.10.0, v1.0.0 Apr 19, 2017
@rpkilby rpkilby changed the title Investigate using nested queries to filter across relationships Investigate using subqueries to filter across relationships Oct 24, 2017
@rpkilby
Copy link
Collaborator Author

rpkilby commented Oct 26, 2017

The upcoming v1 will change how queries are constructed, going from queries like q1 to queries like q3.

However, this issue is now more about the differences q3 and q5, and that will not be resolved in the release. For now, removing from the milestone.

The change from q3 to q5 would also make the distinct argument pointless.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant