Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Further improve the media attached status query for accounts #16106

Conversation

abcang
Copy link
Contributor

@abcang abcang commented Apr 25, 2021

I further improved on #16105. Added a composite index for account_id and status_id to allow sorting using the index while keeping the condition to filter media_attachments by account. With this change, we can expect faster query execution even for users with fewer media_attachments.

From the EXPLAIN result, you can see that media_attachments is narrowed down first.

before:

SELECT 
  "statuses"."id", 
  "statuses"."updated_at" 
FROM 
  "statuses" 
  INNER JOIN "media_attachments" ON "media_attachments"."status_id" = "statuses"."id" 
WHERE 
  "statuses"."account_id" = xxxxxxxx 
  AND "statuses"."visibility" IN (0, 1) 
  AND "statuses"."deleted_at" IS NULL 
  AND (
    statuses.reply = FALSE 
    OR statuses.in_reply_to_account_id = statuses.account_id
  ) 
GROUP BY 
  "statuses"."id" 
ORDER BY 
  "statuses"."id" DESC 
LIMIT 
  20
Limit  (cost=0.99..1773.50 rows=20 width=16) (actual time=0.035..0.179 rows=20 loops=1)
  ->  Group  (cost=0.99..357429.10 rows=4033 width=16) (actual time=0.035..0.177 rows=20 loops=1)
        Group Key: statuses.id
        ->  Nested Loop  (cost=0.99..357419.02 rows=4033 width=16) (actual time=0.032..0.169 rows=22 loops=1)
              ->  Index Scan using index_statuses_20190820 on statuses  (cost=0.56..168563.86 rows=47324 width=16) (actual time=0.021..0.064 rows=29 loops=1)
                    Index Cond: (account_id = xxxxxxxx)
                    Filter: ((visibility = ANY ('{0,1}'::integer[])) AND ((NOT reply) OR (in_reply_to_account_id = account_id)))
              ->  Index Only Scan using index_media_attachments_on_status_id on media_attachments  (cost=0.43..3.97 rows=2 width=8) (actual time=0.003..0.003 rows=1 loops=29)
                    Index Cond: (status_id = statuses.id)
                    Heap Fetches: 22
Planning time: 0.518 ms
Execution time: 0.214 ms

after:

SELECT 
  "statuses"."id", 
  "statuses"."updated_at" 
FROM 
  "statuses" 
  INNER JOIN "media_attachments" ON "media_attachments"."status_id" = "statuses"."id" 
WHERE 
  "statuses"."account_id" = xxxxxxxx 
  AND "statuses"."visibility" IN (0, 1) 
  AND "media_attachments"."account_id" = xxxxxxxx 
  AND "statuses"."deleted_at" IS NULL 
  AND (
    statuses.reply = FALSE 
    OR statuses.in_reply_to_account_id = statuses.account_id
  ) 
GROUP BY 
  "statuses"."id" 
ORDER BY 
  "statuses"."id" DESC 
LIMIT 
  20
Limit  (cost=0.86..57937.92 rows=20 width=16) (actual time=0.031..0.147 rows=20 loops=1)
  ->  Group  (cost=0.86..127462.40 rows=44 width=16) (actual time=0.031..0.146 rows=20 loops=1)
        Group Key: statuses.id
        ->  Nested Loop  (cost=0.86..127462.29 rows=44 width=16) (actual time=0.030..0.140 rows=21 loops=1)
              ->  Index Only Scan using index_media_attachments_on_account_id_and_status_id on media_attachments  (cost=0.43..31251.58 rows=12193 width=8) (actual time=0.019..0.034 rows=21 loops=1)
                    Index Cond: (account_id = xxxxxxxx)
                    Heap Fetches: 21
              ->  Index Scan using statuses_pkey on statuses  (cost=0.43..7.88 rows=1 width=16) (actual time=0.004..0.004 rows=1 loops=21)
                    Index Cond: (id = media_attachments.status_id)
                    Filter: ((deleted_at IS NULL) AND (visibility = ANY ('{0,1}'::integer[])) AND ((NOT reply) OR (in_reply_to_account_id = account_id)) AND (account_id = xxxxxxxx))
Planning time: 0.377 ms
Execution time: 0.179 ms

@ClearlyClaire
Copy link
Contributor

Is there any benefit to this besides re-adding the explicit filtering of media_attachments by account id? If not I'm not sure about this, since media attachments only get attached to statuses if they are from the same account anyway.

@abcang
Copy link
Contributor Author

abcang commented Apr 25, 2021

This change will improve query speed for accounts with few media_attachments or accounts with a low percentage of posts with media_attachments.

It is possible to search more efficiently by using the table with the least amount of data as the driving table. In general, the number of media_attachments in an account is less than the number of statuses, but if media_attachments are not filtered, the statuses table will certainly be the driving table. Therefore, if an account has fewer media_attachments, searching from the statuses table first will be less efficient.

If media_attachments is filtered, the query planner will notice that the media_attachments table has less data and is easier to narrow down. And by adding a composite index, sorting without using indexes will not occur as it happened before.

@ClearlyClaire
Copy link
Contributor

It seems like you forgot to add the db migration script!

@abcang abcang force-pushed the improve_account_media_attached_status_query_further branch from 4d1e987 to 23327d3 Compare April 25, 2021 10:45
@abcang
Copy link
Contributor Author

abcang commented Apr 25, 2021

Thanks for telling me! I'd forgotten about that!

@Gargron Gargron merged commit d0fc69d into mastodon:main Apr 26, 2021
chrisguida pushed a commit to Start9Labs/mastodon that referenced this pull request Feb 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants