Make client-side pagination a true recommendation #3743
I sent the recommendations to K and he replied: `Hi, terrific answer Andy. Thanks for spending time looking this up; we will add it to our general guidelines/knowledgebase for CRDB. What devs usually do (I think) is use what Spring Data JPA or Spring Data JDBC provides. It has a simple pagination API which does basic limit/offset traversal by default, but you can create your own custom queries.` Let's make sure to recommend the right course of action here.
Another customer needing updated docs on pagination:
Thanks Andy for collating things together. After reading this I concur with Jordan's and Andy K's opinions. We could technically implement cursors, but PK values + LIMIT (and thus not OFFSET) are a better solution. This needs to be combined with AS OF SYSTEM TIME for reliable pagination. I think the product is ready for clients to do effective pagination; however, I also think it's not actually trivial for clients to achieve, so we need ample guidance in the docs. This will be an interesting and worthy doc project.
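For illustration, a minimal sketch of that combination — keyset pagination pinned to a single AS OF SYSTEM TIME snapshot. The `articles` table, the timestamp, and the UUID values are made up for this sketch, not from the thread:

```sql
-- Capture one snapshot timestamp up front and reuse it for every page,
-- so concurrent writes cannot shift rows between pages.

-- First page (floor UUID as the starting cursor):
SELECT id, title
FROM articles AS OF SYSTEM TIME '2020-03-01 10:00:00'
WHERE id > '00000000-0000-0000-0000-000000000000'
ORDER BY id
LIMIT 20;

-- Next page: SAME timestamp, cursor = last id of the previous page.
SELECT id, title
FROM articles AS OF SYSTEM TIME '2020-03-01 10:00:00'
WHERE id > '7f9c24e8-3b12-4fef-91e0-56a2d5a246ec'
ORDER BY id
LIMIT 20;
```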
With "client-side" pagination I assume you actually mean keyset pagination? However what could be a valuable feature for cockroach would be to have AFTER and BEFORE keyword which could be used instead of offset. Cockroachdb could optimize this way better, by using the index on the column and start retrieving records after that, than this very nasty subquery solution. |
@wzrdtales, I don't understand what you mean. If you have a table with
@jordanlewis Nope, they are ordered, but not, well, ordered in the order they're created.
It's true that UUIDs aren't well-ordered by their creation order, but IMO that's entirely separate from the discussion we're having here, which is about the recommendation we give users for how to paginate a table by the order of its primary key.
Well, pagination is about a predictable outcome. That's basically the issue with LIMIT and OFFSET: they're not quite predictable. However
Actually, adding to this: it's especially unpredictable since new records can appear after and before the record you're comparing against, without any deterministic outcome. So yeah, this may still be a different discussion, but I'm adding it here since you closed the other issue.
Thomas, with CockroachDB you can use AS OF SYSTEM TIME to make the scrolling deterministic. I am still confident there is a way to obtain what you desire.
@knz I guess by "Thomas" you meant me? In that case:
So the best option is still doing this: #1067 (comment), but with questionable performance. Which is sad, because keyset pagination usually tends to be faster than the alternatives, but not quite in this case.
This isn't going to get done for 19.1. Moving to 19.2. |
@ericharmeling, @rmloveland, another pagination-related issue from a user, this time on our new community slack:
This suggestion from @wzrdtales would be an extremely welcome addition.
Huh, I don't understand why this can't be written as

```sql
WHERE "somethingDate" >= NOW()
  AND id > ${item}
ORDER BY "createdAt" DESC
LIMIT ${amount}
```

which is arguably simpler?
@knz I don't mind explaining it again :) Your query does not work with CockroachDB's official recommendation of using UUIDs as primary keys: in that case the sorted key doesn't reflect insertion order.
@wzrdtales What about the following (see the sketch after this comment)?
This should result in the correct page, as tuples are ordered lexicographically, and we only care about the UUID order if the timestamps are equal. It's not important what that order is, just that it is an order.
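(The query in this comment was lost in formatting; judging from @Bessonov's quote of it below, it was presumably the tuple-comparison form, roughly:)

```sql
SELECT * FROM table
WHERE ((created_at, id) > (${time}, ${id}))  -- tuple from the last row of the previous page
ORDER BY (created_at, id)
LIMIT ${amount}
```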
Sorry, I'm getting a bit tired of giving the same answer all the time... So I will skip the explanation this time; it doesn't change the outcome and is non-deterministic.
@wzrdtales, I've read your comments and I'm still unclear on exactly what you're looking for. Are you trying to build a "feed", where you are guaranteed to always see the newest values in the order they were committed? Also, can you give a specific example where @brendan-hall's suggestion does not work? It would be very helpful to see a concrete set of values that fail, as that would help me understand exactly what you're trying to do.
I see two problems. The outcome of the query

```sql
SELECT * FROM table
WHERE ((created_at, id) > (${time}, ${id}))
ORDER BY (created_at, id)
LIMIT ${amount}
```

is deterministic because of the ordering. But it's not useful, if the

If you start with no page and $amount = 2, then your page item is (10:01, 60). So the next page would be:

```sql
SELECT * FROM table
WHERE ((created_at, id) > (10:01, 60))
ORDER BY (created_at, id)
LIMIT 2
```

But you get nothing, because

The second problem with the query is when you have a situation like this:

With a page size of two you see (10:01, 50), but never (10:01, 60).
Thank you @Bessonov, I am glad not to be arguing alone into an empty room...
Except that that's not how < on tuples works. Your claim is that both elements must be greater. Instead (in pseudocode), it works like so, assuming tuples of (created_at, id): select the first page, select the second page (after the last tuple of the first page), or select the second page with both comparisons written out explicitly. (A sketch of these follows below.)
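(The original example queries were lost in formatting; a reconstruction, under the assumption that they showed standard SQL row-value comparison, where (a, b) > (c, d) means a > c OR (a = c AND b > d):)

```sql
-- First page: no cursor yet.
SELECT * FROM table
ORDER BY created_at, id
LIMIT ${amount};

-- Second page, tuple form: lexicographic comparison, NOT element-wise.
SELECT * FROM table
WHERE (created_at, id) > (${time}, ${id})
ORDER BY created_at, id
LIMIT ${amount};

-- Second page with both comparisons written out; equivalent to the tuple form.
-- A row with an equal timestamp and a larger id is still returned, and a row
-- with a later timestamp is returned regardless of its id.
SELECT * FROM table
WHERE created_at > ${time}
   OR (created_at = ${time} AND id > ${id})
ORDER BY created_at, id
LIMIT ${amount};
```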
Re: usefulness, my argument is that if your created_at timestamps are equal, then you no longer care, nor have the ability to care, about the order the rows were inserted. So what does it matter that UUIDs aren't monotonic? If you absolutely need to know the order a row was inserted, then you're going to have to either implement your own higher-resolution timestamps, or fall back on auto-incrementing ids or row_number().
Another related thing to understand is that timestamps aren't guaranteed to be in strict commit order. The timestamp is evaluated while the transaction runs, not at commit time. It's possible that transaction B commits first, and a split second later transaction A commits.
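(The original example transactions were lost in formatting; a hypothetical sketch of that interleaving, with a made-up `events` table and timestamps:)

```sql
-- Two concurrent sessions; table and values are made up for illustration.

-- Session 1, transaction A: evaluates its timestamp first.
BEGIN;
INSERT INTO events (id, created_at) VALUES (gen_random_uuid(), now());
-- now() evaluated as 10:00:00.001

-- Session 2, transaction B: evaluates a later timestamp but commits first.
BEGIN;
INSERT INTO events (id, created_at) VALUES (gen_random_uuid(), now());
-- now() evaluated as 10:00:00.002
COMMIT;  -- B's row (10:00:00.002) becomes visible first

-- A split second later, session 1 commits:
COMMIT;  -- A's row appears after B's, yet carries the EARLIER timestamp
-- A reader who paged past 10:00:00.002 between the two commits never sees A's row.
```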
Therefore, if you're trying to use timestamps to implement a "feed", this behavior will cause you to miss items. This is not specific to CRDB; every database I know of would behave this way for current time, and for auto-incrementing ids as well. Auto-incrementing ids also would not follow strict commit order in the case of concurrent transactions.
Assuming that concurrently committed transactions are not a concern for your particular application, here is another trick I've used in the past to further simplify pagination:
I put a retry loop around the
One big caveat: indexing a timestamp column that's populated with the current time creates "hot spots" in your database, because all INSERT transactions will repeatedly write to the same machine (i.e., whatever machine contains the page with the highest timestamp value). This can prevent scale-out of your database beyond a single machine if you try to do this with a high-volume INSERT transaction that is a bottleneck in your application. There are techniques to address this problem, such as using hybrid hash/range partitioning, but it does tend to get complex. We'll be adding support to upcoming versions of CRDB to make this easier.
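(For reference: the support alluded to here appears to have shipped later as hash-sharded indexes, in CockroachDB v20.1. A sketch, with a made-up table:)

```sql
-- Shard the timestamp index so sequential inserts fan out across ranges
-- instead of all landing on the range holding the highest timestamp.
-- (In v20.1 this first requires:
--   SET experimental_enable_hash_sharded_indexes = 'on';)
CREATE TABLE events (
    id UUID DEFAULT gen_random_uuid() PRIMARY KEY,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    INDEX events_created_at_idx (created_at) USING HASH WITH BUCKET_COUNT = 8
);
```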
Summary of changes:
- Explain difference between keyset pagination and LIMIT/OFFSET
- Show examples of the former being fast and the latter being slow
- Show how to use EXPLAIN to check why the difference exists
- Add warning to LIMIT/OFFSET docs recommending keyset pagination

Fixes #3743
Summary of changes:
- Explain difference between keyset pagination and LIMIT/OFFSET
- Show examples of the former being fast and the latter being slow
- Show how to use EXPLAIN to check why the difference exists
- Add warning to LIMIT/OFFSET docs recommending keyset pagination
- ... all of the above for 19.1, 19.2, 20.1 docs

Fixes #3743
Hey all, we have a PR in progress to address the concerns here: #6114. Let us know if you think it is missing something important!
A bit late as a comment, but @awoods187, this didn't touch the topic at all. This is about the problems UUIDs create for pagination, which these docs didn't address at all and didn't present a solution to the user. Even the tuples mentioned here, which also don't solve it completely, were not mentioned there. I have by now gone a different route for critical tables. The database is no longer trusted as a source of truth for such assets, since the implementation falls short in too many corners. I implemented logic that spans across services for each domain that needs absolutely time-critical sorting. Everything else, less important, will in doubt go by the timestamp and lose precision.
Sorry to hear that the recommendations didn't meet your needs, @wzrdtales. Are you aware of any other database that solves this problem in a way that meets your needs? Do you have links to blogs, doc pages, or other resources showing what you want and how some other system solves it in a way that works for you? The only examples I've seen were outlined by @Bessonov above, and @brendan-hall pointed out why those actually work fine. I have yet to see examples of cases where @brendan-hall's method doesn't "solve it completely", except in cases where the paging is racing with newly committed values, as I explained above. Separately, we can look into adding documentation that describes the best keyset pagination pattern to use for current timestamp columns (like
I heard from K yesterday that they thought pagination via OFFSET/LIMIT was slow, and they were looking for recommendations on what to do for large result sets.
I spoke with @jordanlewis and he recommended:
"Doing “client-side pagination” by retrieving a set of records with a limit, and then checking the index key of the last row, using that as an index constraint, and then running the query again
this is normal - not specific to cockroach"
Our current docs: https://www.cockroachlabs.com/docs/stable/selection-queries.html#limiting-row-count-and-pagination actually suggest the wrong solution for pagination.
The reason is that "OFFSET doesn't do anything smart; it has to get the same data as before and just skips the first n. You have to participate as a client by remembering the index key of the last result set you saw."
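A minimal sketch of the contrast (the `accounts` table and the values are made up):

```sql
-- OFFSET pagination: the server reads and throws away the first 10000 rows
-- on every request, so later pages get progressively slower.
SELECT * FROM accounts ORDER BY id LIMIT 20 OFFSET 10000;

-- Keyset ("client-side") pagination: the client remembers the last index key
-- it saw, so the scan can start exactly where the previous page ended.
SELECT * FROM accounts WHERE id > ${last_seen_id} ORDER BY id LIMIT 20;
```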
We should also reference that some databases offer a feature called cursors to do this. @andy-kimball mentioned that "cursors are generally not a great architecture anyway, because they force the server to keep state. Client side is the way to go for pagination; server pagination just doesn't scale well."
Jordan also mentioned that "you run into trouble with a scale-out system like Cockroach: if your load balancer moves you to a different server, for example, the cursor will be lost."
Andy also mentioned that "SQL Server has cursors, and we spent a lot of time recommending customers not use them."
This way we can make it clear that client-side pagination is the way to go, and it's not just because we don't have cursors.