
SQL Paging and filtering with multiple types #2444

Closed
dessalines opened this issue Sep 19, 2022 · 25 comments

@dessalines
Member

dessalines commented Sep 19, 2022

#2441 #2349

Making this issue to discuss solutions to the SQL paging problem.

There are a few places in the UI where a list shown to the user can contain different types of data. A few:

  • The inbox (replies, mentions (either post or comment), and private messages).
  • The Profile page (overview with posts and comments).
  • A Saved page (saved posts and comments).
  • The modlog (all its different types).
  • Reports (comments / posts / private messages).
  • Search (all the different types).

The problem is that, since different types of data live in different tables, doing a SQL select ... order by ... LIMIT X query can have data from widely different time periods both on pages 1 and 2.

Solving this problem could also let us return generic lists into single API endpoints, rather than having multiple ones like we do now, addressing #2441 .

The only solutions I can think of:

  • Time based queries. IE return all results within a week or a month, where page 1 is < 1 week, page 2 > 1 week and < 2 weeks, etc.
  • Select the entire tables in code, sort by time, and throw out results above the LIMIT. Probably incredibly slow, and taxing to the DB.
  • SQL union the tables on the id and published field, then do order by limit. Could be slow (since union would work on the whole table), and would take a ton of back-end work to convert these ids into views.

cc @Nutomic

@dessalines dessalines added the enhancement New feature or request label Sep 19, 2022
@kartikynwa
Contributor

kartikynwa commented Sep 21, 2022

If you go for option 1 then it will be pretty difficult to do proper pagination (limit/offset style), because different users can have highly varied numbers of items in a given timespan. Another problem is that you could get incorrect results in rare cases where, despite the high granularity, certain rows end up having the same timestamp. Postgres ordering for ties is unstable, so you can end up getting differently ordered results for the same query.

If you don't want to implement a variety of sorting types and do it just on the basis of id (assuming it is a stable field like SERIAL that can be compared), you can use a query like this:

SELECT id, somecolumn, type FROM (

  SELECT id, somecolumn, 0 AS type
  FROM issue
  WHERE id > last_greatest_table1_id

  UNION ALL

  SELECT id, somecolumn, 1 AS type
  FROM pull
  WHERE id > last_greatest_table2_id

) t
ORDER BY id
LIMIT 100;

This ends up being decently fast because there will always be an index on the id field. Postgres will just end up doing a parallel index scan, which is not a heavy operation. I don't know if this can be done in diesel though. The downside is that you will need some way of remembering what the last greatest id for each type was. Also, the query will be different if the newest items don't have the greatest id.

An example EXPLAIN ANALYZE
coolhostname=# explain analyze SELECT id, timestamp, type FROM (
  SELECT id, timestamp, 0 AS type
  FROM table1
  WHERE timestamp < '2022-09-20 21:35:43.230754'
  UNION ALL
  SELECT id, timestamp, 1 AS type
  FROM table2
  WHERE timestamp < '2022-09-20 21:35:40.866767'
) t
ORDER BY timestamp DESC
LIMIT 100;
                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=42700.87..42712.53 rows=100 width=33) (actual time=257.127..258.189 rows=100 loops=1)
   ->  Gather Merge  (cost=42700.87..60773.56 rows=154898 width=33) (actual time=257.124..258.177 rows=100 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=41700.84..41894.47 rows=77449 width=33) (actual time=244.665..244.677 rows=82 loops=3)
               Sort Key: table1.timestamp DESC
               Sort Method: top-N heapsort  Memory: 46kB
               Worker 0:  Sort Method: top-N heapsort  Memory: 42kB
               Worker 1:  Sort Method: top-N heapsort  Memory: 40kB
               ->  Parallel Append  (cost=0.00..39764.62 rows=77449 width=33) (actual time=0.061..229.558 rows=61918 loops=3)
                     ->  Parallel Seq Scan on table1  (cost=0.00..36680.94 rows=74116 width=33) (actual time=0.024..192.779 rows=59238 loops=3)
                           Filter: (timestamp < '2022-09-20 21:35:43.230754'::timestamp without time zone)
                           Rows Removed by Filter: 29
                     ->  Parallel Seq Scan on table2  (cost=0.00..1908.22 rows=4705 width=40) (actual time=0.071..44.772 rows=4020 loops=2)
                           Filter: (timestamp < '2022-09-20 21:35:40.866767'::timestamp without time zone)
                           Rows Removed by Filter: 7
 Planning Time: 1.362 ms
 Execution Time: 258.353 ms
(18 rows)

@dessalines
Member Author

The time conflict probably wouldn't be an issue, because our time fields are postgres timestamp columns, which have microsecond precision.

We couldn't use IDs, because the IDs in different tables have no temporal correlation with each other. IE the 10 highest-ID comments for a user might go back a day, while the 10 highest-ID posts go back a year.

If we did time-based queries, we could do unions (probably would be optimal to avoid doing multiple queries), but converting them back to views would be a lot of work.

With your query, that seems close to the best way to do it, except the query would use where timestamp > '1 week ago'. Also make sure you have indexes on your timestamp columns.

@dessalines
Member Author

I'm now convinced that using option 3, IE unions, then sorting by published, is the only way to do this correctly. The problem is that it requires some massive diesel work in building the views, and it might potentially be very slow.

Take for example the search page; it has to be able to return users, comments, posts, etc. So it needs to union all those tables, sort and limit, build the views from a very long list of columns, then create the Vec<SearchResultType> .

@Nutomic
Member

Nutomic commented Apr 17, 2023

Maybe it would be possible to do sort and limit only on a subset of columns like id, and then select the remaining data in a separate step?

@dessalines
Member Author

dessalines commented Apr 17, 2023

Yep, getting the collection of ids, then doing an in query, for the already set up views, is the only way I could see this not being a huge headache. I don't really like it, because one of the main performance points of SQL is minimizing the number of round-trip queries, but I don't see any other good way of doing it.

I also have to verify how fast the union + sort + limit works. If it unions the entire tables first, it could be extremely slow. Imagine unioning the entire post and comment tables just to do a search on them.

@kartikynwa
Contributor

@dessalines make sure you use UNION ALL rather than UNION wherever possible. It should be possible for the search view from my understanding. As you can see in the EXPLAIN ANALYZE output in #2444 (comment), postgres will work on the tables in parallel with it.

@dessalines dessalines pinned this issue Apr 19, 2023
@dessalines
Member Author

dessalines commented May 31, 2023

The simplest place to start this, to get the hang of it, would be combining the routes post/report/list and comment/report/list into report/list .

  • There would be a new querybuilder, instead of PostReportQuery and CommentReportQuery, just ReportQuery.
  • The new querybuilder needs to copy the inner_joins for PostReportQuery and CommentReportQuery.
    • One caveat: I really tried to find a way to abstract all of the inner_join....left_join chains into common functions (if you look at PostReport::read and PostReportQuery::list, they have to do the same joins), but the return types on them are nightmares, and it's unfortunately not very doable. So we'll just have to copy-paste all those joins.
  • It then does a diesel union_all with the CommentReportQuery right after it, then sort by published asc, and limit, at the very end.
  • The querybuilder then returns a Result<Vec<ReportType>>, a tagged enum type of:
    enum ReportType {
      Comment(CommentReportView),
      Post(PostReportView),
    }
    • I'm not exactly sure how building that tuple will work, but it can be figured out.

cc @Nutomic

@tbe

tbe commented Jun 22, 2023

Paging in the database always requires an order by and limit, as PostgreSQL does not guarantee any ordering on a select otherwise. This is expensive, and a real issue.

I have seen many approaches to this issue in the past. I think the one that may work here is something like the following:

  • A unique, sequential ID is required for all datasets.
  • On a paging request, select the next IDs (and only the IDs) from the database.
    • Ordering on the DB side should be fine, as this should be done in memory, or fetched directly from an index.
  • Select from the database using these IDs.

This can either be done on the client side, or using a CTE in PostgreSQL.

I'm willing to help here, bring my knowledge to the table, and if I find time, get my feet wet with lemmy and rust.

EDIT: PS: Do not use a timestamp for the paging! Depending on the precision used, entries could be missed in some cases!
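The two-step approach above (page through IDs only, then fetch the full rows) can be sketched roughly as follows. This is a minimal illustration using SQLite and a hypothetical `comment` table, not Lemmy's actual schema; the point is that step 1 can be served from the primary-key index alone, and step 2 fetches only the rows on the current page.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE comment (id INTEGER PRIMARY KEY, body TEXT);
    INSERT INTO comment (body) VALUES ('a'), ('b'), ('c'), ('d'), ('e');
""")

def fetch_page(last_id, page_size):
    # Step 1: ID-only query; this can be answered from the primary-key index.
    ids = [row[0] for row in conn.execute(
        "SELECT id FROM comment WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, page_size))]
    if not ids:
        return []
    # Step 2: fetch the full rows for exactly those IDs.
    placeholders = ",".join("?" * len(ids))
    return conn.execute(
        f"SELECT id, body FROM comment WHERE id IN ({placeholders}) ORDER BY id",
        ids).fetchall()

page1 = fetch_page(0, 2)             # first two rows
page2 = fetch_page(page1[-1][0], 2)  # next two rows, keyed off the last ID seen
```

The same shape works with a CTE in Postgres, with the ID selection as the CTE and the row fetch joining against it, avoiding the second round trip.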

@gaywallet

gaywallet commented Jun 24, 2023

Add a field such as LineNBR to the comment table, incremented each time the user posts. Add an index on commenterid and LineNBR, and you'll fix any performance issues. If it's incremental you can easily instantiate a view with the most recent value; if you have that, then you can increment however many per page with simple math. Since we can slice by all, post, or comment, it may make sense to have the instantiated view hold the most recent values for both posts and comments. It may even make sense to have the nbr increment across these tables (you could even create a person_post_comment table that's just person id, postid, commentid, LineNBR if you wanted).

Ultimately there are a few ways you can fix this, but the question is: what's on the table? Are we looking to optimize existing queries or attempt to improve the DB design? The DB violates 3NF frequently, which is fine, but key tables should be designed with specific use cases in mind, like itemizing posts on the profile page.

@tbe

tbe commented Jun 24, 2023

Add a field such as LineNBR to comment table, increment each time the user posts. Add an index on commenterid and LineNBR, you'll fix any performance issues.

This will work for ordering by new. What about top and hot?

@kartikynwa
Contributor

kartikynwa commented Jun 24, 2023

@tbe Regarding this:

I have seen many approaches to this issue in the past. I think the one that may work here is something like the following:

  • A unique, sequential ID is required for all datasets.
  • On a paging request, select the next IDs (and only the IDs) from the database
    • ordering on DB side should be fine, as this should be done in memory, or fetched directly from an index
  • Select from the database using these IDs

Sequential IDs are not necessary for this. For a simplified example, if you only look at post listings, we can order post listings using ORDER BY hot_rank DESC, post_id ASC LIMIT 100 and then get the next page using WHERE hot_rank <= last_hot_rank_of_prev_page AND post_id > last_post_id_of_prev_page if we keep the same ordering. The "cursor" value used by GraphQL APIs seems to work similarly: the cursor is typically the base64-encoded tuple data of the last row in the page, encoded in a way that the server can decode when it is sent back.

This issue, though, talks about doing this kind of filtering over multiple tables whose output is combined using UNION ALL, where implementing something like this becomes tricky. Something like what I mentioned could be used on materialised views created using UNION ALL on the tables, but I don't know how to use materialised views properly, so I can't really suggest it sincerely.
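The keyset ("cursor") pagination idea above can be sketched concretely. Note that with a composite ordering, two independent inequalities can drop rows; the predicate needs to be the row-wise comparison, with ties on the sort column broken by the ID. This is a minimal illustration using SQLite and a hypothetical `post` table with made-up hot_rank values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE post (post_id INTEGER PRIMARY KEY, hot_rank REAL)")
conn.executemany("INSERT INTO post VALUES (?, ?)",
                 [(1, 9.0), (2, 7.0), (3, 9.0), (4, 7.0), (5, 8.0)])

# Ordering is (hot_rank DESC, post_id ASC). The WHERE clause must accept rows
# that are strictly below the cursor's rank, OR tied on rank but past its ID.
PAGE = """
    SELECT post_id, hot_rank FROM post
    WHERE hot_rank < :rank OR (hot_rank = :rank AND post_id > :id)
    ORDER BY hot_rank DESC, post_id ASC
    LIMIT :n
"""

def next_page(cursor, n):
    rank, last_id = cursor
    return conn.execute(PAGE, {"rank": rank, "id": last_id, "n": n}).fetchall()

page1 = next_page((1e18, 0), 2)                      # sentinel cursor: start
page2 = next_page((page1[-1][1], page1[-1][0]), 2)   # cursor = last row of page1
```

The (hot_rank, post_id) pair of a page's last row is exactly what a base64 cursor would encode.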

@tbe

tbe commented Jun 24, 2023

Materialized views are not a good idea for rapidly changing data, like comments and up votes. But I outlined another solution here: #3312

@dessalines
Member Author

Adding a since=UNIX_SECONDS parameter for these multi-views could be a good way to convert them to time-constrained queries and use SQL unions.

@dessalines
Member Author

@dullbananas when you get a chance, I'd love to hear your input about the best approach to solving this one. It's one of our more critical issues that needs a SQL solution that's at least somewhat performant.

@dullbananas
Collaborator

This enum can be represented in SQL as something similar to (Option<CommentReportView>, Option<PostReportView>), with one of the items being not null. A derive macro should be made to generate the Queryable implementation and an SQL expression constructor for each variant.

enum ReportType {
  Comment(CommentReportView),
  Post(PostReportView),
}

Then the UNION ALL approach can be implemented in the i_love_jesus library, which would allow combining multiple queries that each have a selection for ReportType.

@dessalines
Member Author

dessalines commented Apr 29, 2024

My main worry is that UNION ALL will have terrible performance, because it might have to combine and fetch the entire tables, before being able to sort and limit.

One other way I thought about solving this issue.

  • For all the views that show multiple types, create a combined table, that is filled by triggers, with the sort column (IE published). So for reports it'd be:
  • reports_combined (id, published, post_id, comment_id)
  • Create a migration that fills that table, and then also add triggers to comment_report and post_report that insert their IDs into that combined table.
  • Either join reports_combined to all the tables necessary (which might not be too clean to build all the data the views require), or if we're feeling lazy, get the ids, then create helper functions for CommentReportView to filter by those ids (SQL IN query).
  • Finally build the combined enum you described above.

It could work, although I don't really like it because it's sort of a cache of data that already exists. But it could be performant.

@dullbananas
Collaborator

Order and limit could be applied to each query in the union, then a second time for the whole union

@dessalines
Member Author

The issue is both sorting and paging: the first 10 results of post_report might be 2 months old, whereas the first 50 results of comment_report a day old.

@dessalines dessalines added this to the 0.20.0 milestone May 4, 2024
@Nutomic
Member

Nutomic commented May 7, 2024

If you select limit items from each table, combine them, and then limit again, you would get the correct results. It shouldn't have too much overhead either.

@dessalines
Member Author

I don't understand how just limit would work for different ranges.

the first 10 results of post_report might be 2 months old, whereas the first 50 results of comment_report a day old.

Let's say you want 4 items total, ordered by most recent. If you limit each of those fetches to 2 posts, for example, you'll get:

  • 2 results from comment_report a day old
  • 2 results from post_report a few months old.

When you should just get 4 results all from comment_report.

@dullbananas
Collaborator

Each would have 4 results instead of 2, and the second sort would cause the recent stuff to go to the top before the second limit is applied

@dessalines
Member Author

I suppose that'd work, but it'd be moving the 2nd sort and limit from postgres into code.

@dullbananas
Collaborator

It's possible to specify LIMIT and ORDER BY for both the inner statements and the whole union, so the rust code doesn't need to do it

@dessalines
Member Author

dessalines commented Nov 26, 2024

The double limit / ordering solution is not going to work, for at least two reasons:

  • Logically it's incorrect, and I thought of a simpler example to prove this.
    • Let's say you're combining post and comment reports, ordered by recent (IE published desc).
    • You have 100 recent comment reports and 1 post report from a year ago, and are selecting in batches of 10.
    • If you select the first 10 of each (11 total), then order and limit a 2nd time, you kick off the post report.
    • If you select the next batch of 10 (page 2), you'll get the next 10 comment reports, but ZERO post reports, because there are 0 post reports in page 2.
    • So using this method you will never see any post reports.
  • Diesel doesn't support left joins to unioned tables: Support joining with union expressions diesel-rs/diesel#4074. And they only recently added limit to unioned tables, which isn't released yet: Union + positional_order_by + limit diesel-rs/diesel#4336. Both would be vital if we wanted to use the union approach (which again, is logically incorrect anyway).
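The counterexample can be checked mechanically. A small plain-Python simulation (no database; made-up integer timestamps standing in for published dates) of per-branch limit followed by a second sort-and-limit shows the lone post report never surfaces on any page, while a correct global ordering does return it:

```python
# 10 recent comment reports and 1 old post report; page size 4.
comment_reports = [("comment", ts) for ts in range(100, 90, -1)]  # newest first
post_reports = [("post", 1)]  # a single report from long ago

def page(n, size=4):
    # Double limit / ordering: take this page's slice of `size` newest rows
    # from EACH branch, union them, then sort and limit a second time.
    inner = comment_reports[n * size:(n + 1) * size] + \
            post_reports[n * size:(n + 1) * size]
    return sorted(inner, key=lambda r: r[1], reverse=True)[:size]

pages = [page(0), page(1), page(2)]
seen = [row for p in pages for row in p]

# The post report is kicked off page 1 by the second limit, and later pages
# draw from empty slices of post_reports, so it is never shown at all.
assert ("post", 1) not in seen

# A single global sort over both tables would return it as the final row.
correct = sorted(comment_reports + post_reports, key=lambda r: r[1], reverse=True)
assert correct[-1] == ("post", 1)
```

This is the same failure regardless of whether the second sort/limit runs in Postgres or in Rust.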

The solution I'm going to move forward with then, is combined tables built on triggers outlined in this comment. It involves:

  • Creating combined tables that look like (published: DateTime<Utc>, post_report_id: Option<PostReportId>, comment_report_id: Option<CommentReportId>)
  • Creating triggers that insert into this combined table when either a post or comment row is created, as well as an initial history update in the DB migration.
  • Diesel supports left joins to these defined tables easily. There might be some minimal rust remapping to get the views into Vec<PostOrCommentReportEnum>, but most importantly, the data will come back in one fetch.
  • This should also allow using cursor pagination, not page-based.
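The combined-table-with-triggers shape above can be sketched end to end. This is an illustrative sketch only, using SQLite trigger syntax as a stand-in for the Postgres migration; the table and column names mirror the comment but are not Lemmy's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE post_report (id INTEGER PRIMARY KEY, published INTEGER);
    CREATE TABLE comment_report (id INTEGER PRIMARY KEY, published INTEGER);

    -- Combined table: one row per report, exactly one FK column set.
    CREATE TABLE report_combined (
        id INTEGER PRIMARY KEY,
        published INTEGER NOT NULL,
        post_report_id INTEGER,
        comment_report_id INTEGER
    );

    -- Triggers keep the combined table in sync on insert.
    CREATE TRIGGER post_report_combined AFTER INSERT ON post_report BEGIN
        INSERT INTO report_combined (published, post_report_id)
        VALUES (new.published, new.id);
    END;
    CREATE TRIGGER comment_report_combined AFTER INSERT ON comment_report BEGIN
        INSERT INTO report_combined (published, comment_report_id)
        VALUES (new.published, new.id);
    END;
""")
conn.execute("INSERT INTO post_report (published) VALUES (10)")
conn.execute("INSERT INTO comment_report (published) VALUES (20)")
conn.execute("INSERT INTO comment_report (published) VALUES (30)")

# One ordered, pageable query over both report types; the application can then
# left join (or re-fetch) the concrete rows to build the per-type views.
rows = conn.execute("""
    SELECT published, post_report_id, comment_report_id
    FROM report_combined ORDER BY published DESC LIMIT 2
""").fetchall()
```

Because `report_combined` is a single indexed table, the keyset/cursor pagination discussed earlier applies to it directly, with no union in the query path.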

@dessalines
Member Author

All the PRs for this have been merged now.
