
SQL Paging and filtering with multiple types #2444

Closed
dessalines opened this issue Sep 19, 2022 · 25 comments

@dessalines
Member

dessalines commented Sep 19, 2022

#2441 #2349

Making this issue to discuss solutions to the SQL paging problem.

There are a few places in the UI where a list shown to the user can contain different types of data. A few:

  • The inbox (replies, mentions (either post or comment), and private messages).
  • The Profile page (overview with posts and comments).
  • A Saved page (saved posts and comments).
  • The modlog (all its different types).
  • Reports (comments / posts / private messages).
  • Search (all the different types).

The problem is that, since different types of data live in different tables, doing a SQL select ... order by ... LIMIT X query can have data from widely different time periods both on pages 1 and 2.

Solving this problem could also let us return generic lists into single API endpoints, rather than having multiple ones like we do now, addressing #2441 .

The only solutions I can think of:

  • Time based queries. IE return all results within a week or a month, where page 1 is < 1 week, page 2 > 1 week and < 2 weeks, etc.
  • Select the entire tables in code, sort by time, and throw out results above the LIMIT. Probably incredibly slow, and taxing to the DB.
  • SQL union the tables on the id and published field, then do order by limit. Could be slow (since union would work on the whole table), and would take a ton of back-end work to convert these ids into views.

cc @Nutomic

@dessalines dessalines added the enhancement New feature or request label Sep 19, 2022
@kartikynwa
Contributor

kartikynwa commented Sep 21, 2022

If you go for option 1 then it will be pretty difficult to do proper pagination (limit/offset style), because different users can have highly varied numbers of items in a given timespan. Another problem is that you could get incorrect results in rare cases where, despite the high granularity, certain rows end up having the same timestamp. Postgres ordering for ties is unstable, so you can end up getting differently ordered results for the same query.

If you don't want to implement a variety of sorting types and do it just on the basis of id (assuming it is a stable field like SERIAL that can be compared), you can use a query like this:

SELECT id, somecolumn, type FROM (

  SELECT id, somecolumn, 0 AS type
  FROM issue
  WHERE id > last_greatest_table1_id

  UNION ALL

  SELECT id, somecolumn, 1 AS type
  FROM pull
  WHERE id > last_greatest_table2_id

) t
ORDER BY id
LIMIT 100;

This ends up being decently fast because there will always be an index on the id field. Postgres will just end up doing a parallel index scan, which is not a heavy operation. I don't know if this can be done in diesel though. The downside is that you will need some way of remembering what the last greatest id for each type was. Also, the query will be different if the newest items don't have the greatest id.

An example EXPLAIN ANALYZE
coolhostname=# explain analyze SELECT id, timestamp, type FROM (
  SELECT id, timestamp, 0 AS type
  FROM table1
  WHERE timestamp < '2022-09-20 21:35:43.230754'
  UNION ALL
  SELECT id, timestamp, 1 AS type
  FROM table2
  WHERE timestamp < '2022-09-20 21:35:40.866767'
) t
ORDER BY timestamp DESC
LIMIT 100;
                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=42700.87..42712.53 rows=100 width=33) (actual time=257.127..258.189 rows=100 loops=1)
   ->  Gather Merge  (cost=42700.87..60773.56 rows=154898 width=33) (actual time=257.124..258.177 rows=100 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=41700.84..41894.47 rows=77449 width=33) (actual time=244.665..244.677 rows=82 loops=3)
               Sort Key: table1.timestamp DESC
               Sort Method: top-N heapsort  Memory: 46kB
               Worker 0:  Sort Method: top-N heapsort  Memory: 42kB
               Worker 1:  Sort Method: top-N heapsort  Memory: 40kB
               ->  Parallel Append  (cost=0.00..39764.62 rows=77449 width=33) (actual time=0.061..229.558 rows=61918 loops=3)
                     ->  Parallel Seq Scan on table1  (cost=0.00..36680.94 rows=74116 width=33) (actual time=0.024..192.779 rows=59238 loops=3)
                           Filter: (timestamp < '2022-09-20 21:35:43.230754'::timestamp without time zone)
                           Rows Removed by Filter: 29
                     ->  Parallel Seq Scan on table2  (cost=0.00..1908.22 rows=4705 width=40) (actual time=0.071..44.772 rows=4020 loops=2)
                           Filter: (timestamp < '2022-09-20 21:35:40.866767'::timestamp without time zone)
                           Rows Removed by Filter: 7
 Planning Time: 1.362 ms
 Execution Time: 258.353 ms
(18 rows)

@dessalines
Member Author

The time conflict probably wouldn't be an issue, because our time fields are postgres timestamp columns, which have microsecond precision.

We couldn't use IDs, because the IDs in different tables have no temporal correlation with each other. IE the 10 highest-ID comments for a user might go back a day, while the 10 highest-ID posts go back a year.

If we did time-based queries, we could do unions (probably would be optimal to avoid doing multiple queries), but converting them back to views would be a lot of work.

With your query, that seems close to the best way to do it, except the query would use where timestamp > '1 week ago'. Also make sure you have indexes on your timestamp columns.

@dessalines
Member Author

I'm now convinced that using option 3, IE unions, then sorting by published, is the only way to do this correctly. The problem is that it requires some massive diesel work in building the views, and it might potentially be very slow.

Take for example the search page; it has to be able to return users, comments, posts, etc. So it needs to union all those tables, sort and limit, build the views from a very long list of columns, then create the Vec<SearchResultType> .

@Nutomic
Member

Nutomic commented Apr 17, 2023

Maybe it would be possible to do sort and limit only on a subset of columns like id, and then select the remaining data in a separate step?

@dessalines
Member Author

dessalines commented Apr 17, 2023

Yep, getting the collection of ids, then doing an in query, for the already set up views, is the only way I could see this not being a huge headache. I don't really like it, because one of the main performance points of SQL is minimizing the number of round-trip queries, but I don't see any other good way of doing it.

I also have to verify how fast the union + sort + limit works. If it unions the entire tables first, it could be extremely slow. Imagine unioning the entire post and comment tables just to do a search on them.

@kartikynwa
Contributor

@dessalines make sure you use UNION ALL rather than UNION wherever possible. It should be possible for the search view from my understanding. As you can see in the EXPLAIN ANALYZE output in #2444 (comment), postgres will work on the tables in parallel with it.

@dessalines dessalines pinned this issue Apr 19, 2023
@dessalines
Member Author

dessalines commented May 31, 2023

The simplest place to start this, to get the hang of it, would be combining the routes post/report/list and comment/report/list into report/list .

  • There would be a new querybuilder, instead of PostReportQuery and CommentReportQuery, just ReportQuery.
  • The new querybuilder needs to copy the inner_joins for PostReportQuery and CommentReportQuery.
    • One caveat: I really tried to find a way to abstract all of the inner_join....left_join chains into common functions (if you look at PostReport::read and PostReportQuery::list, they have to do the same joins), but the return types on them are nightmares, and it's unfortunately not very doable. So we'll just have to copy-paste all those joins.
  • It then does a diesel union_all with the CommentReportQuery right after it, then sort by published asc, and limit, at the very end.
  • The querybuilder then returns a Result<Vec<ReportType>>, a tagged enum type of:
    enum ReportType {
      Comment(CommentReportView),
      Post(PostReportView),
    }
    • I'm not exactly sure how building that tuple will work, but it can be figured out.

cc @Nutomic

@tbe

tbe commented Jun 22, 2023

Paging in the database always requires an order by and limit, as PostgreSQL does not guarantee any ordering on a select otherwise. This is expensive, and a real issue.

I have seen many approaches to this issue in the past. I think the one that may work here is something like the following:

  • A unique, sequential ID is required for all datasets.
  • On a paging request, select the next IDs (and only the IDs) from the database.
    • Ordering on the DB side should be fine, as this should be done in memory, or fetched directly from an index.
  • Select from the database using these IDs.

This can either be done on the client side, or using a CTE in PostgreSQL.

I'm willing to help here, bring my knowledge to the table, and if I find time, get my feet wet with lemmy and rust.

EDIT: PS: Do not use a timestamp for the paging! Depending on the precision used, entries could be missed in some cases!
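The two-step approach above (page through IDs only, then fetch the full rows) can be sketched roughly as follows. This is a minimal illustration using SQLite and a hypothetical `comment` table, not Lemmy's actual schema; the point is that step 1 can be served from the primary-key index alone, and step 2 fetches only the rows on the current page.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE comment (id INTEGER PRIMARY KEY, body TEXT);
    INSERT INTO comment (body) VALUES ('a'), ('b'), ('c'), ('d'), ('e');
""")

def fetch_page(last_id, page_size):
    # Step 1: ID-only query; this can be answered from the primary-key index.
    ids = [row[0] for row in conn.execute(
        "SELECT id FROM comment WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, page_size))]
    if not ids:
        return []
    # Step 2: fetch the full rows for exactly those IDs.
    placeholders = ",".join("?" * len(ids))
    return conn.execute(
        f"SELECT id, body FROM comment WHERE id IN ({placeholders}) ORDER BY id",
        ids).fetchall()

page1 = fetch_page(0, 2)             # first two rows
page2 = fetch_page(page1[-1][0], 2)  # next two rows, keyed off the last ID seen
```

The same shape works with a CTE in Postgres, with the ID selection as the CTE and the row fetch joining against it, avoiding the second round trip.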

@gaywallet

gaywallet commented Jun 24, 2023

Add a field such as LineNBR to the comment table, incremented each time the user posts. Add an index on commenterid and LineNBR, and you'll fix any performance issues. If it's incremental you can easily instantiate a view with the most recent value; if you have that, then you can increment however many per page with simple math. Since we can slice by all, post, or comment, it may make sense to have the instantiated view hold the most recent values for both posts and comments. It may even make sense to have the nbr increment across these tables (you could even create a person_post_comment table that's just person id, postid, commentid, LineNBR if you wanted).

Ultimately there are a few ways you can fix this, but the question is: what's on the table? Are we looking to optimize existing queries or attempt to improve the DB design? The DB violates 3NF frequently, which is fine, but key tables should be designed with specific use cases in mind, like itemizing posts on the profile page.

@tbe

tbe commented Jun 24, 2023

Add a field such as LineNBR to comment table, increment each time the user posts. Add an index on commenterid and LineNBR, you'll fix any performance issues.

This will work for ordering by new. What about top and hot?

@kartikynwa
Contributor

kartikynwa commented Jun 24, 2023

@tbe Regarding this:

I have seen many approaches to this issue in the past. I think the one that may work here is something like the following:

  • A unique, sequential ID is required for all datasets.
  • On a paging request, select the next IDs (and only the IDs) from the database
    • ordering on DB side should be fine, as this should be done in memory, or fetched directly from an index
  • Select from the database using these IDs

Sequential IDs are not necessary for this. For a simplified example, if you only look at post listings, we can order post listings using ORDER BY hot_rank DESC, post_id ASC LIMIT 100 and then get the next page using WHERE hot_rank <= last_hot_rank_of_prev_page AND post_id > last_post_id_of_prev_page if we keep the same ordering. The "cursor" value used by GraphQL APIs seems to work similarly: the cursor is typically the base64-encoded tuple data of the last row in the page, encoded in a way that the server can decode when it is sent back.

This issue, though, talks about doing this kind of filtering over multiple tables whose output is combined using UNION ALL, where implementing something like this becomes tricky. Something like what I mentioned could be used on materialised views created using UNION ALL on the tables, but I don't know how to use materialised views properly, so I can't really suggest it sincerely.
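The keyset ("cursor") pagination idea above can be sketched concretely. Note that with a composite ordering, two independent inequalities can drop rows; the predicate needs to be the row-wise comparison, with ties on the sort column broken by the ID. This is a minimal illustration using SQLite and a hypothetical `post` table with made-up hot_rank values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE post (post_id INTEGER PRIMARY KEY, hot_rank REAL)")
conn.executemany("INSERT INTO post VALUES (?, ?)",
                 [(1, 9.0), (2, 7.0), (3, 9.0), (4, 7.0), (5, 8.0)])

# Ordering is (hot_rank DESC, post_id ASC). The WHERE clause must accept rows
# that are strictly below the cursor's rank, OR tied on rank but past its ID.
PAGE = """
    SELECT post_id, hot_rank FROM post
    WHERE hot_rank < :rank OR (hot_rank = :rank AND post_id > :id)
    ORDER BY hot_rank DESC, post_id ASC
    LIMIT :n
"""

def next_page(cursor, n):
    rank, last_id = cursor
    return conn.execute(PAGE, {"rank": rank, "id": last_id, "n": n}).fetchall()

page1 = next_page((1e18, 0), 2)                      # sentinel cursor: start
page2 = next_page((page1[-1][1], page1[-1][0]), 2)   # cursor = last row of page1
```

The (hot_rank, post_id) pair of a page's last row is exactly what a base64 cursor would encode.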

@tbe

tbe commented Jun 24, 2023

Materialized views are not a good idea for rapidly changing data, like comments and up votes. But I outlined another solution here: #3312

@dessalines
Member Author

Adding a since=UNIX_SECONDS parameter for these multi-views could be a good way to convert them to time-constrained queries and use SQL unions.

@dessalines
Member Author

@dullbananas when you get a chance, I'd love to hear your input about the best approach to solving this one. It's one of our more critical issues that needs a SQL solution that's at least somewhat performant.

@dullbananas
Collaborator

This enum can be represented in SQL as something similar to (Option<CommentReportView>, Option<PostReportView>), with one of the items being not null. A derive macro should be made to generate the Queryable implementation and an SQL expression constructor for each variant.

enum ReportType {
  Comment(CommentReportView),
  Post(PostReportView),
}

Then the UNION ALL approach can be implemented in the i_love_jesus library, which would allow combining multiple queries that each have a selection for ReportType.

@dessalines
Member Author

dessalines commented Apr 29, 2024

My main worry is that UNION ALL will have terrible performance, because it might have to combine and fetch the entire tables, before being able to sort and limit.

One other way I thought about solving this issue.

  • For all the views that show multiple types, create a combined table, that is filled by triggers, with the sort column (IE published). So for reports it'd be:
  • reports_combined (id, published, post_id, comment_id)
  • Create a migration that fills that table, and then also add triggers to comment_report and post_report that insert their IDs into that combined table.
  • Either join reports_combined to all the tables necessary (which might not be too clean to build all the data the views require), or if we're feeling lazy, get the ids, then create helper functions for CommentReportView to filter by those ids (SQL IN query).
  • Finally build the combined enum you described above.

It could work, although I don't really like it because it's sort of a cache of data that already exists. But it could be performant.

@dullbananas
Collaborator

Order and limit could be applied to each query in the union, then a second time for the whole union

@dessalines
Member Author

The issue is both sorting and paging: the first 10 results of post_report might be 2 months old, whereas the first 50 results of comment_report a day old.

@dessalines dessalines added this to the 0.20.0 milestone May 4, 2024
@Nutomic
Member

Nutomic commented May 7, 2024

If you select limit items from each table, combine them, and then limit again, you would get the correct results. It shouldn't have too much overhead either.

@dessalines
Member Author

I don't understand how just limit would work for different ranges.

the first 10 results of post_report might be 2 months old, whereas the first 50 results of comment_report a day old.

Let's say you want 4 items total, ordered by most recent. If you limit each of those fetches to 2 posts, for example, you'll get:

  • 2 results from comment_report a day old
  • 2 results from post_report a few months old.

When you should just get 4 results all from comment_report.

@dullbananas
Collaborator

Each would have 4 results instead of 2, and the second sort would cause the recent stuff to go to the top before the second limit is applied

@dessalines
Member Author

I suppose that'd work, but it'd be moving the 2nd sort and limit from postgres into code.

@dullbananas
Collaborator

It's possible to specify LIMIT and ORDER BY for both the inner statements and the whole union, so the rust code doesn't need to do it

@dessalines
Member Author

dessalines commented Nov 26, 2024

The double limit / ordering solution is not going to work, for at least two reasons:

  • Logically it's incorrect, and I thought of a simpler example to prove this.
    • Let's say you're combining post and comment reports, ordered by recent (IE published desc).
    • You have 100 recent comment reports and 1 post report from a year ago, and are selecting in batches of 10.
    • If you select the first 10 of each (11 total), then order and limit a 2nd time, you kick off the post report.
    • If you select the next batch of 10 (page 2), you'll get the next 10 comment reports, but ZERO post reports, because there are 0 post reports in page 2.
    • So using this method you will never see any post reports.
  • Diesel doesn't support left joins to unioned tables: Support joining with union expressions diesel-rs/diesel#4074. And they only recently added limit to unioned tables, which isn't released yet: Union + positional_order_by + limit diesel-rs/diesel#4336. Both would be vital if we wanted to use the union approach (which again, is logically incorrect anyway).
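The counterexample can be checked mechanically. A small plain-Python simulation (no database; made-up integer timestamps standing in for published dates) of per-branch limit followed by a second sort-and-limit shows the lone post report never surfaces on any page, while a correct global ordering does return it:

```python
# 10 recent comment reports and 1 old post report; page size 4.
comment_reports = [("comment", ts) for ts in range(100, 90, -1)]  # newest first
post_reports = [("post", 1)]  # a single report from long ago

def page(n, size=4):
    # Double limit / ordering: take this page's slice of `size` newest rows
    # from EACH branch, union them, then sort and limit a second time.
    inner = comment_reports[n * size:(n + 1) * size] + \
            post_reports[n * size:(n + 1) * size]
    return sorted(inner, key=lambda r: r[1], reverse=True)[:size]

pages = [page(0), page(1), page(2)]
seen = [row for p in pages for row in p]

# The post report is kicked off page 1 by the second limit, and later pages
# draw from empty slices of post_reports, so it is never shown at all.
assert ("post", 1) not in seen

# A single global sort over both tables would return it as the final row.
correct = sorted(comment_reports + post_reports, key=lambda r: r[1], reverse=True)
assert correct[-1] == ("post", 1)
```

This is the same failure regardless of whether the second sort/limit runs in Postgres or in Rust.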

The solution I'm going to move forward with then, is combined tables built on triggers outlined in this comment. It involves:

  • Creating combined tables that look like (published: DateTime<Utc>, post_report_id: Option<PostReportId>, comment_report_id: Option<CommentReportId>)
  • Creating triggers that insert into this combined table when either a post or comment row is created, as well as an initial history update in the DB migration.
  • Diesel supports left joins to these defined tables easily. There might be some minimal rust remapping to get the views into Vec<PostOrCommentReportEnum>, but most importantly, the data will come back in one fetch.
  • This should also allow using cursor pagination, not page-based.
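The combined-table-with-triggers shape above can be sketched end to end. This is an illustrative sketch only, using SQLite trigger syntax as a stand-in for the Postgres migration; the table and column names mirror the comment but are not Lemmy's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE post_report (id INTEGER PRIMARY KEY, published INTEGER);
    CREATE TABLE comment_report (id INTEGER PRIMARY KEY, published INTEGER);

    -- Combined table: one row per report, exactly one FK column set.
    CREATE TABLE report_combined (
        id INTEGER PRIMARY KEY,
        published INTEGER NOT NULL,
        post_report_id INTEGER,
        comment_report_id INTEGER
    );

    -- Triggers keep the combined table in sync on insert.
    CREATE TRIGGER post_report_combined AFTER INSERT ON post_report BEGIN
        INSERT INTO report_combined (published, post_report_id)
        VALUES (new.published, new.id);
    END;
    CREATE TRIGGER comment_report_combined AFTER INSERT ON comment_report BEGIN
        INSERT INTO report_combined (published, comment_report_id)
        VALUES (new.published, new.id);
    END;
""")
conn.execute("INSERT INTO post_report (published) VALUES (10)")
conn.execute("INSERT INTO comment_report (published) VALUES (20)")
conn.execute("INSERT INTO comment_report (published) VALUES (30)")

# One ordered, pageable query over both report types; the application can then
# left join (or re-fetch) the concrete rows to build the per-type views.
rows = conn.execute("""
    SELECT published, post_report_id, comment_report_id
    FROM report_combined ORDER BY published DESC LIMIT 2
""").fetchall()
```

Because `report_combined` is a single indexed table, the keyset/cursor pagination discussed earlier applies to it directly, with no union in the query path.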

@dessalines
Member Author

All the PRs for this have been merged now.
