SQL Paging and filtering with multiple types #2444
Comments
If you go for option 1, then it will be pretty difficult to do proper pagination (limit/offset style), because different users can have a highly varied number of items in a given timespan. Another problem is that you could get incorrect results in rare cases where, despite high granularity, certain rows end up having the same timestamp. Postgres ordering for ties is unstable, so you can end up getting differently ordered results for the same query. If you don't want to implement a variety of sorting types and do it just on the basis of id (assuming it is a stable field like SERIAL that can be compared), you can use a query like this:

```sql
SELECT id, somecolumn, type FROM (
    SELECT id, somecolumn, 0 AS type
    FROM issue
    WHERE id > last_greatest_table1_id
    UNION ALL
    SELECT id, somecolumn, 1 AS type
    FROM pull
    WHERE id > last_greatest_table2_id
) t
ORDER BY id
LIMIT 100;
```

This ends up being decently fast, because there will always be an index on the id field. Postgres will just end up doing a parallel index scan, which is not a heavy operation. I don't know if this can be done in diesel though. The downside is that you will need some way of remembering what the last greatest id for each type was. Also, the query will be different if the newest items don't have the greatest id. An example EXPLAIN ANALYZE:
The time conflict probably wouldn't be an issue, because our time fields are postgres timestamp columns, which go down to microseconds. We couldn't use IDs, because the ids in different tables have no time or any other correlation to each other. IE the 10 highest-ID comments for a user might go back a day, while the 10 highest-ID posts go back a year. If we did time-based queries, we could do unions (probably would be optimal to avoid doing multiple queries), but converting them back to views would be a lot of work. Your query seems close to the best way to do it, except the filter would be time-based (`where timestamp > '1 week ago'`). Also make sure you have indexes on your timestamp columns.
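A minimal sketch of that time-filtered shape, assuming `post`/`comment` tables with an indexed `published` timestamp column (names here are illustrative, not necessarily Lemmy's actual schema):

```sql
-- Union only the rows from the chosen time window, then sort and page.
-- Assumes indexes on post(published) and comment(published).
SELECT id, published, type FROM (
    SELECT id, published, 0 AS type
    FROM post
    WHERE published > now() - interval '1 week'
    UNION ALL
    SELECT id, published, 1 AS type
    FROM comment
    WHERE published > now() - interval '1 week'
) t
ORDER BY published DESC
LIMIT 100;
```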
I'm now convinced that using option 3, IE unions, then sorting, is the way to go. Take for example the search page; it has to be able to return users, comments, posts, etc. So it needs to union all those tables, sort and limit, build the views from a very long list of columns, then create the combined result.
Maybe it would be possible to do the sort and limit only on a subset of columns, like the ids and timestamps?
Yep, getting the collection of ids, then doing an `IN` query to build the views, makes sense. I also have to verify how fast the union + sort + limit works. If it unions the entire tables first, it could be extremely slow. Imagine unioning the entire post and comment tables just to do a search on them.
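A sketch of that two-step idea, under the same assumed table names: first a narrow union to pick the winning ids, then per-type `IN` lookups to build the full views.

```sql
-- Step 1: sort and limit on narrow columns only, so the union stays cheap.
SELECT id, type FROM (
    SELECT id, published, 0 AS type FROM post
    UNION ALL
    SELECT id, published, 1 AS type FROM comment
) t
ORDER BY published DESC
LIMIT 100;

-- Step 2: fetch full rows per type for just the surviving ids.
SELECT * FROM post    WHERE id IN (/* step-1 ids where type = 0 */);
SELECT * FROM comment WHERE id IN (/* step-1 ids where type = 1 */);
```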
@dessalines make sure you use UNION ALL rather than UNION wherever possible. It should be possible for the search view, from my understanding. As you can see in the EXPLAIN ANALYZE output in #2444 (comment), Postgres will work on the tables in parallel with it.
The simplest place to start this, to get the hang of it, would be combining the routes …
cc @Nutomic |
Paging in the database always requires an ORDER BY and LIMIT, as PostgreSQL does not guarantee any ordering on a SELECT otherwise. This is expensive, and a real issue. I have seen many approaches to this in the past. I think the one that may work here is something like the following:

…

This can either be done on the client side, or using a CTE in PostgreSQL. I'm willing to help here and bring my knowledge to the table, and if I find time, get my feet wet with lemmy and rust.

EDIT: PS: Do not use a timestamp for the paging! Depending on the used precision, entries could be missed in some cases!
Add a field such as LineNBR to the comment table, incremented each time the user posts. Add an index on commenterid and LineNBR and you'll fix any performance issues. If it's incremental, you can easily instantiate a view with the most recent value; if you have that, then you can increment however many per page with simple math (see the sketch below).

Since we can slice by all, post, or comment, it may make sense to have the instantiated view hold the most recent values for both posts and comments. It may even make sense to have the number increment across these tables (you could even create a person_post_comment table that's just person_id, post_id, comment_id, LineNBR if you wanted).

Ultimately, there are a few ways you can fix this, but the question is: what's on the table? Are we looking to optimize existing queries or attempt to improve the DB design? The DB violates 3NF frequently, which is fine, but key tables should be designed with specific use cases in mind, like itemizing posts on the profile page.
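A rough sketch of that scheme, with assumed names (`creator_id` standing in for commenterid, `:params` as query placeholders):

```sql
-- Per-user running number, incremented by the application on each new comment.
ALTER TABLE comment ADD COLUMN line_nbr integer NOT NULL DEFAULT 0;
CREATE INDEX idx_comment_creator_line ON comment (creator_id, line_nbr);

-- With the user's current max line_nbr in hand, page :page of size :size
-- becomes a pure index range scan, with no OFFSET:
SELECT *
FROM comment
WHERE creator_id = :user_id
  AND line_nbr >  :max_nbr - :page * :size
  AND line_nbr <= :max_nbr - (:page - 1) * :size
ORDER BY line_nbr DESC;
```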
This will work for ordering by new. What about top and hot?
@tbe Regarding this:
Sequential IDs are not necessary for this. For a simplified example, if you only look at post listings, we can order post listings using a keyset cursor, as sketched below. This issue, though, talks about filtering like this over multiple tables whose output is combined using UNION ALL, where implementing something like this becomes tricky. Something like what I mentioned can be used on materialised views created using UNION ALL on tables, but I don't know how to use materialised views properly, so I can't really suggest it sincerely.
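For that single-table simplified example, the usual keyset shape might look like this (a sketch; the exact expression in the comment above was lost, and `(published, id)` as the cursor is an assumption):

```sql
-- Keyset paging on post alone: stable even when published values tie,
-- because id breaks the tie. The cursor values come from the last row
-- of the previous page. Backed by an index on (published DESC, id DESC).
SELECT id, published
FROM post
WHERE (published, id) < (:last_published, :last_id)
ORDER BY published DESC, id DESC
LIMIT 20;
```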
Materialized views are not a good idea for rapidly changing data, like comments and up votes. But I outlined another solution here: #3312 |
Adding a …
@dullbananas when you get a chance, I'd love to hear your input about the best approach to solving this one. It's one of our more critical issues that needs a SQL solution that's at least somewhat performant. |
This enum can be represented in SQL as something similar to …

Then the …
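One plausible encoding, since the original snippet here was elided: tag each UNION ALL branch with a literal, using the report tables mentioned later in this thread (an assumption, not necessarily what was meant above):

```sql
-- A discriminator column tells the application which table each row came from.
SELECT id, published, 'comment_report' AS type_
FROM comment_report
UNION ALL
SELECT id, published, 'post_report' AS type_
FROM post_report;
```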
My main worry is that … One other way I thought about solving this issue:

…

It could work, although I don't really like it, because it's sort of a cache of data that already exists. But it could be performant.
Order and limit could be applied to each query in the union, then a second time for the whole union.
The issue is both sorting and paging: the first 10 results of one table might all be older than the first 10 results of the other …
If you select …
I don't understand how just limit would work for different ranges.

Let's say you want 4 items total, ordered by most recent. If you limit each of those fetches to 2 posts for example, you'll get 2 results from comment_report and 2 from post_report, when you should just get 4 results, all from comment_report.
Each would have 4 results instead of 2, and the second sort would cause the recent stuff to go to the top before the second limit is applied.
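In SQL, that double order/limit proposal reads roughly like this (names assumed; each branch over-fetches the full page size, so the outer cut can take all 4 rows from one table if that's where the newest rows live):

```sql
-- Inner ORDER BY/LIMIT per branch keeps each index scan small;
-- the outer ORDER BY/LIMIT produces the actual page.
SELECT * FROM (
    (SELECT id, published, 0 AS type FROM comment_report
     ORDER BY published DESC LIMIT 4)
    UNION ALL
    (SELECT id, published, 1 AS type FROM post_report
     ORDER BY published DESC LIMIT 4)
) t
ORDER BY published DESC
LIMIT 4;
```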
I suppose that'd work, but it'd be moving the 2nd sort and limit from postgres to code.
It's possible to specify …
The double limit / ordering solution is not going to work for at least two reasons:
The solution I'm going to move forward with, then, is …
All the PRs for this have been merged now. |
#2441 #2349
Making this issue to discuss solutions to the SQL paging problem.
There are a few places in the UI where different types of data can be shown to the user in a list. A few:
The problem is that, since different types of data live in different tables, doing a SQL `select ... order by ... LIMIT X` query can return data from widely different time periods on pages 1 and 2.

Solving this problem could also let us return generic lists from single API endpoints, rather than having multiple ones like we do now, addressing #2441.
The only solutions I can think of:

- Doing a union to get the ids, with an `order by ... limit`. Could be slow (since the union would work on the whole table), and would take a ton of back-end work to convert these ids into views.

cc @Nutomic