SQL pagination #1067

Closed
jseldess opened this issue Feb 3, 2017 · 18 comments

@jseldess
Contributor

jseldess commented Feb 3, 2017

From @bdarnell, overheard in a thread with a user:

limit/offset is a really inefficient way of doing pagination. unless you really need to be able to jump to arbitrary pages, it's better to have each key include the index of the last entry returned (or the first entry not returned) so you can pick up the query from there on the next request
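
A keyset sketch of the approach described above, assuming a `tasks` table whose primary key `task_id` is sortable (all names here are illustrative):

```sql
-- First page: plain LIMIT, no OFFSET.
SELECT task_id, payload
FROM tasks
ORDER BY task_id
LIMIT 25;

-- Next page: resume from the last task_id the client saw,
-- instead of skipping rows with OFFSET.
SELECT task_id, payload
FROM tasks
WHERE task_id > 'last-task-id-from-previous-page'
ORDER BY task_id
LIMIT 25;
```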

if you do need to use limit/offset, then it's often better to split it into two queries: one query that uses limit/offset and just reads the task_id (so it can use the index alone without joining with the primary table), and a second query that uses where task_id in (...) to get the data (assuming task_id is the primary key)
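
A sketch of that two-query split, again assuming `task_id` is the primary key and is readable from an index alone:

```sql
-- Query 1: paginate over the index alone; only task_id is read,
-- so no join against the primary table is needed.
SELECT task_id
FROM tasks
ORDER BY task_id
LIMIT 25 OFFSET 500;

-- Query 2: fetch the full rows for just those ids.
SELECT *
FROM tasks
WHERE task_id IN ('id-1', 'id-2', 'id-3' /* ... ids from query 1 */);
```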

Also:

limit/offset pagination gets a lot more efficient if your index includes enough columns to satisfy the group by/order by clauses without joining against the primary data
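
For example, an index that covers the ORDER BY column and stores the selected columns (CockroachDB's STORING clause; table and column names are illustrative) lets the scan avoid joining back to the primary data:

```sql
-- Covers the sort column and stores the selected payload,
-- so the query below is satisfied from the index alone.
CREATE INDEX tasks_created_idx ON tasks (created_at DESC) STORING (payload);

SELECT created_at, payload
FROM tasks
ORDER BY created_at DESC
LIMIT 25 OFFSET 100;
```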

And then from @RaduBerinde:

yes, if the index is sorted like the order by clause, we won't need to sort, which means we will just read (limit + offset) rows. This applies even if we are doing an index-join because other columns are needed by the select.

@sploiselle
Contributor

Potential reference point: https://www.citusdata.com/blog/2016/03/30/five-ways-to-paginate/

@jseldess
Contributor Author

Comment from @danhhz, from #657:

The usual way to do this is one SELECT query per page that uses LIMIT/OFFSET, but since it doesn't make sense to run this in a sql transaction, any insertions/deletions could make this an inconsistent view of the table.

CockroachDB has AS OF SYSTEM TIME implemented, which means we can let the user iterate over a consistent version of their database without a transaction. This would work by selecting cluster_logical_timestamp() in the first SELECT, passing it through the page (like limit and offset are) and using it in subsequent queries. @mjibson has already implemented paging exactly like this for the ./cockroach dump command.
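
A sketch of that pattern, assuming the timestamp captured on the first page is passed back with each subsequent request (the literal timestamp below is a placeholder):

```sql
-- First page: capture the cluster timestamp alongside the results.
SELECT cluster_logical_timestamp() AS page_ts, task_id, payload
FROM tasks
ORDER BY task_id
LIMIT 25;

-- Subsequent pages: read at that same timestamp, so every page
-- sees one consistent version of the table, without a transaction.
SELECT task_id, payload
FROM tasks AS OF SYSTEM TIME '2017-02-03 12:00:00'
ORDER BY task_id
LIMIT 25 OFFSET 25;
```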

See the discussion in cockroachdb/cockroach#9227 (comment)

@wzrdtales
Contributor

@jseldess That might sound like a good idea, but outside of dump scenarios it mostly isn't. AS OF SYSTEM TIME will also deliver outdated results; for pagination you only want to avoid strange pagination behavior, you don't want to serve stale data.

@wzrdtales
Contributor

The most practical approach would be to replace the offset with a way to return everything after a specified id within the current sort order.
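
In SQL terms, assuming `id` is sortable and indexed (names are illustrative), that would look something like:

```sql
-- "Give me the next page after this id", within the current sort order.
SELECT *
FROM tableName
WHERE id > 'last-id-from-previous-page'
ORDER BY id
LIMIT 5;
```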

@wzrdtales
Contributor

wzrdtales commented Mar 31, 2018

Since time is the only serial factor here, the affected tables should not allow inserting anything with the same timestamp, or anything in that window at a later time.

@wzrdtales
Contributor

wzrdtales commented Mar 31, 2018

A very nasty way to paginate would be:

```sql
SELECT * FROM tableName ORDER BY "createdDate" DESC LIMIT 5 OFFSET (
  SELECT rn FROM (
    SELECT id, row_number() OVER (ORDER BY "createdDate" DESC) AS rn
    FROM tableName
  ) AS numbered
  WHERE id = 'd25b7651-b553-4857-9380-41027342b74e'
);
```

which is far from fast.

@antoniomo

antoniomo commented Jul 18, 2018

Hi,

Why not just:

```sql
CREATE INDEX ON tableName ("createdDate" DESC, id DESC);

SELECT * FROM tableName
WHERE "createdDate" <= 'latest returned date' AND id < 'latest returned id'
ORDER BY "createdDate" DESC, id DESC
LIMIT 5;
```

I haven't tested it much, but it seems to use the index just fine. Also notice the `<=` on the date column, which makes sure we don't skip values when several rows share the same creation date.

@wzrdtales
Contributor

wzrdtales commented Jul 18, 2018

If you have UUIDs as IDs, that is already broken, and you're back to offset pagination, which is unsafe for any scenario where your application is actually in use and data may change while paginating.

@antoniomo

Even with UUIDs as IDs this would work, since you order on the date first, right? For offset pagination (assuming you aren't going for deep offsets), AS OF SYSTEM TIME is quite reasonable, taking the current_timestamp() of the first page when the client requested it.

@wzrdtales
Contributor

No, not quite: the date sorting works, but you can't apply your id sort anymore, because `AND id <` doesn't work with UUIDs. AS OF SYSTEM TIME is not reasonable at all, actually, since you freeze the system's state; that is not applicable for most applications out there, unless your customers don't mind seeing plain old data just because they started paginating. Sorting on the timestamp is only really usable if the times don't overlap, but most of the time it doesn't matter much if they do, so that can usually be ignored.

However, pagination with offsets is bad for a multitude of reasons anyway.

But seriously, enough people have written about this already, so I won't recall all the reasons why offset pagination is bad and will just link some articles:
https://blog.jooq.org/2016/08/10/why-most-programmers-get-pagination-wrong/
https://coderwall.com/p/lkcaag/pagination-you-re-probably-doing-it-wrong

@antoniomo

antoniomo commented Jul 19, 2018

I'm not recommending OFFSET pagination if you're going to use a big offset, as I said in my comment. Why would `AND id <` not work with UUIDs if you have an index sorting them in a consistent way? It's only there to break ties consistently among items inserted at the exact same time, nothing more; the UUID order doesn't matter as long as it's consistent across queries.

@antoniomo

Actually you are right, with UUIDs the query I put would only work for those with tied insertion time, not as a general pagination, my bad.

@antoniomo

This will work with UUIDs and a TIMESTAMP column in a performant way:

```sql
CREATE INDEX ON tableName (datetime DESC, id DESC) STORING (additional columns for the SELECT);

SELECT * FROM tableName
WHERE (datetime, id) < ('last returned datetime', 'last returned uuid')
ORDER BY datetime DESC, id DESC
LIMIT 5;
```

Is this a good generic solution for your case?

@wzrdtales
Contributor

No, that still won't work. UUIDs do not have a consistent behaviour; they're random. So your comparison operation, smaller or bigger, will behave inconsistently and can change when data gets inserted.

@wzrdtales
Contributor

wzrdtales commented Jul 19, 2018

The only valid operation to compare a UUID with another one is, in fact, (not) equal.

@antoniomo

Right, if you are inserting data "back in time". Assuming new insertions happen at "now", this should work. Still not general for all cases.

@antoniomo

The only valid operation on an UUID to compare it with another one would be (not) equal actually.

Semantically, yeah, but we don't really care that the ordering makes sense, just that it's consistent, so either lexicographic or byte-based ordering would do for pagination with this query (assuming new insertions happen at "now", so clients already paginating to older results aren't affected by them).

@jseldess
Contributor Author

Closing this in favor of a separate issue to recommend client-side pagination in most cases: #3743
