Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve media attached status query #16105

Merged
merged 1 commit into from
Apr 25, 2021

Conversation

abcang
Copy link
Contributor

@abcang abcang commented Apr 24, 2021

Improve queries by using JOIN and GROUP BY instead of IN operator. (Refer to #14675)

Also, I stopped using the condition of media_attachments because it could cause an inefficient join method to be selected, resulting in a large number of row fetches and sorting without index. With this change, a NESTED LOOPS JOIN will be selected, which will sort using index and minimize the amount of fetching. Users who have a lot of media_attachments will see more improvement.

As an example, here is the result of EXPLAIN on the public page of the account's media. (The actual time is short because the DB cache is working, pay attention to the cost.)

before:

SELECT 
  "statuses"."id", 
  "statuses"."updated_at" 
FROM 
  "statuses" 
WHERE 
  "statuses"."account_id" = xxxxxxxx 
  AND "statuses"."visibility" IN (0, 1) 
  AND "statuses"."id" IN (
    SELECT 
      "media_attachments"."status_id" 
    FROM 
      "media_attachments" 
    WHERE 
      "media_attachments"."account_id" = xxxxxxxx 
      AND (
        "media_attachments"."status_id" IS NOT NULL 
        OR "media_attachments"."scheduled_status_id" IS NOT NULL
      ) 
    GROUP BY 
      "media_attachments"."status_id"
  ) 
  AND "statuses"."deleted_at" IS NULL 
  AND (
    statuses.reply = FALSE 
    OR statuses.in_reply_to_account_id = statuses.account_id
  ) 
ORDER BY 
  "statuses"."id" DESC 
LIMIT 
  20
Limit  (cost=30917.03..31059.64 rows=20 width=16) (actual time=23.690..23.744 rows=20 loops=1)
  ->  Merge Semi Join  (cost=30917.03..199633.00 rows=23662 width=16) (actual time=23.689..23.738 rows=20 loops=1)
        Merge Cond: (statuses.id = media_attachments.status_id)
        ->  Index Scan using index_statuses_20190820 on statuses  (cost=0.56..168563.86 rows=47324 width=16) (actual time=0.021..0.058 rows=29 loops=1)
              Index Cond: (account_id = xxxxxxxx)
              Filter: ((visibility = ANY ('{0,1}'::integer[])) AND ((NOT reply) OR (in_reply_to_account_id = account_id)))
        ->  Sort  (cost=30916.47..30933.53 rows=6822 width=8) (actual time=23.666..23.669 rows=20 loops=1)
              Sort Key: media_attachments.status_id DESC
              Sort Method: quicksort  Memory: 908kB
              ->  HashAggregate  (cost=30345.61..30413.83 rows=6822 width=8) (actual time=18.604..20.623 rows=11172 loops=1)
                    Group Key: media_attachments.status_id
                    ->  Bitmap Heap Scan on media_attachments  (cost=374.56..30315.24 rows=12146 width=8) (actual time=3.823..14.611 rows=12403 loops=1)
                          Recheck Cond: (account_id = xxxxxxxx)
                          Filter: ((status_id IS NOT NULL) OR (scheduled_status_id IS NOT NULL))
                          Heap Blocks: exact=8737
                          ->  Bitmap Index Scan on index_media_attachments_on_account_id  (cost=0.00..371.52 rows=12146 width=0) (actual time=1.961..1.961 rows=14550 loops=1)
                                Index Cond: (account_id = xxxxxxxx)
Planning time: 0.383 ms
Execution time: 23.808 ms

after:

SELECT 
  "statuses"."id", 
  "statuses"."updated_at" 
FROM 
  "statuses" 
  INNER JOIN "media_attachments" ON "media_attachments"."status_id" = "statuses"."id" 
WHERE 
  "statuses"."account_id" = xxxxxxxx 
  AND "statuses"."visibility" IN (0, 1) 
  AND "statuses"."deleted_at" IS NULL 
  AND (
    statuses.reply = FALSE 
    OR statuses.in_reply_to_account_id = statuses.account_id
  ) 
GROUP BY 
  "statuses"."id" 
ORDER BY 
  "statuses"."id" DESC 
LIMIT 
  20
Limit  (cost=0.99..1773.50 rows=20 width=16) (actual time=0.035..0.179 rows=20 loops=1)
  ->  Group  (cost=0.99..357429.10 rows=4033 width=16) (actual time=0.035..0.177 rows=20 loops=1)
        Group Key: statuses.id
        ->  Nested Loop  (cost=0.99..357419.02 rows=4033 width=16) (actual time=0.032..0.169 rows=22 loops=1)
              ->  Index Scan using index_statuses_20190820 on statuses  (cost=0.56..168563.86 rows=47324 width=16) (actual time=0.021..0.064 rows=29 loops=1)
                    Index Cond: (account_id = xxxxxxxx)
                    Filter: ((visibility = ANY ('{0,1}'::integer[])) AND ((NOT reply) OR (in_reply_to_account_id = account_id)))
              ->  Index Only Scan using index_media_attachments_on_status_id on media_attachments  (cost=0.43..3.97 rows=2 width=8) (actual time=0.003..0.003 rows=1 loops=29)
                    Index Cond: (status_id = statuses.id)
                    Heap Fetches: 22
Planning time: 0.518 ms
Execution time: 0.214 ms

@ClearlyClaire
Copy link
Contributor

Also, I stopped using the condition of media_attachments because it could cause an inefficient join method to be selected, resulting in a large number of row fetches and sorting without index.

What do you mean by that?

@Gargron Gargron merged commit 1f47511 into mastodon:main Apr 25, 2021
@abcang
Copy link
Contributor Author

abcang commented Apr 25, 2021

The following is the EXPLAIN of the public page of the account's media with the media_attachments condition. In this case too, the sort is being performed without using index.

def only_media_scope
  Status.joins(:media_attachments).merge(@account.media_attachments.reorder(nil)).group(:id)
end
SELECT 
  "statuses"."id", 
  "statuses"."updated_at" 
FROM 
  "statuses" 
  INNER JOIN "media_attachments" ON "media_attachments"."status_id" = "statuses"."id" 
WHERE 
  "statuses"."account_id" = xxxxxxxx 
  AND "statuses"."visibility" IN (0, 1) 
  AND "media_attachments"."account_id" = xxxxxxxx 
  AND "statuses"."deleted_at" IS NULL 
  AND (
    statuses.reply = FALSE 
    OR statuses.in_reply_to_account_id = statuses.account_id
  ) 
GROUP BY 
  "statuses"."id" 
ORDER BY 
  "statuses"."id" DESC 
LIMIT 
  20
Limit  (cost=31142.13..107882.19 rows=20 width=16) (actual time=30.262..30.347 rows=20 loops=1)
  ->  Group  (cost=31142.13..199970.25 rows=44 width=16) (actual time=30.260..30.342 rows=20 loops=1)
        Group Key: statuses.id
        ->  Merge Join  (cost=31142.13..199970.14 rows=44 width=16) (actual time=30.258..30.334 rows=21 loops=1)
              Merge Cond: (statuses.id = media_attachments.status_id)
              ->  Index Scan using index_statuses_20190820 on statuses  (cost=0.56..168651.37 rows=47348 width=16) (actual time=0.022..0.113 rows=75 loops=1)
                    Index Cond: (account_id = xxxxxxxx)
                    Filter: ((visibility = ANY ('{0,1}'::integer[])) AND ((NOT reply) OR (in_reply_to_account_id = account_id)))
              ->  Sort  (cost=31139.24..31169.60 rows=12146 width=8) (actual time=30.203..30.205 rows=21 loops=1)
                    Sort Key: media_attachments.status_id DESC
                    Sort Method: quicksort  Memory: 967kB
                    ->  Bitmap Heap Scan on media_attachments  (cost=374.56..30315.24 rows=12146 width=8) (actual time=9.830..26.585 rows=12431 loops=1)
                          Recheck Cond: (account_id = xxxxxxxx)
                          Heap Blocks: exact=8750
                          ->  Bitmap Index Scan on index_media_attachments_on_account_id  (cost=0.00..371.52 rows=12146 width=0) (actual time=2.294..2.294 rows=14610 loops=1)
                                Index Cond: (account_id = xxxxxxxx)
Planning time: 0.441 ms
Execution time: 30.505 ms

This change may slow down users with fewer media_attachements a bit, so maybe we should keep the media_attachements condition and add an appropriate index...

@abcang abcang deleted the improve_media_attached_status_query branch April 25, 2021 04:56
@abcang
Copy link
Contributor Author

abcang commented Apr 25, 2021

I have also tried using index to improve things, and changing the index seems to be more effective.

#16106

chrisguida pushed a commit to Start9Labs/mastodon that referenced this pull request Feb 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants