Cleanup Postgres get_all #1622
Conversation
Interestingly, the memory backend and the Postgres backend both fail this test, but in different ways.
Because a wildcard parent_id can match different objects that share the same ID, the old approach of indexing records by ID to build the union of several record sets no longer works. Instead, rely on the fact that the same record, matched multiple times, is the same object in memory, and use its memory location as its unique ID. This is kind of a hack, but I couldn't think of a better solution. Although this is one obvious place where we rely on every object in a given "result set" having a unique ID, there may be others hidden elsewhere. Refs: Kinto#1616
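A minimal standalone sketch of the failure mode (illustrative only, not Kinto's actual code): two distinct records from different parents can share an `id`, so a dict keyed by record ID silently drops one of them, while keying by `id()` keeps both.

```python
# Minimal sketch (not Kinto's code): why keying by record ID breaks
# under wildcard parent_id matches.
record_a = {"id": "abc", "parent_id": "/buckets/b1/collections/c1"}
record_b = {"id": "abc", "parent_id": "/buckets/b2/collections/c1"}
matches = [record_a, record_b]  # both matched by a wildcard parent_id

by_record_id = {r["id"]: r for r in matches}
assert len(by_record_id) == 1  # record_a was silently squashed

by_identity = {id(r): r for r in matches}  # memory address as unique key
assert len(by_identity) == 2  # both records survive
```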
This `DISTINCT id` leads to difficulties when, as with wildcard parent_ids, multiple records share the same ID. This manifests as re-selecting the same record in multiple "pages" because its id matched once. This DISTINCT also costs us performance: we have to stick every record in a hash in order to deduplicate it, and then do an expensive join afterwards to get the record back. None of this is really necessary, so just delete it.
Leaving the sorting criteria at the end of the query causes Postgres to fetch all the records first, count them, and then sort them (which is probably O(n log n)) so that it can select the right page. Moving the sorting criteria closer to the data means that Postgres will try to use an index (if possible/available) to generate records in the right order, which eliminates the cost of doing a sort. However, it more or less forces us to scan records via the index, even if scanning the whole table would be faster and generate comparatively few records (i.e. few enough that sorting would be fast).
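Roughly, the two query shapes under discussion look like this (a hand-written sketch; the real Kinto SQL is more involved):

```python
# Hand-written sketch, not the literal Kinto SQL.
# Before: deduplicate ids in a hash, join back to recover rows, then sort
# and paginate at the very end.
OLD_SHAPE = """
SELECT r.id, r.last_modified, r.data
FROM records r
JOIN (
    SELECT DISTINCT id
    FROM records
    WHERE parent_id LIKE :parent_id AND collection_id = :collection_id
) AS matched ON matched.id = r.id
ORDER BY r.last_modified DESC
LIMIT :pagination_limit;
"""

# After: no DISTINCT, and ORDER BY sits next to the scan, so Postgres can
# walk an index (e.g. on parent_id, collection_id, last_modified) and stop
# after :pagination_limit rows instead of sorting the whole match set.
NEW_SHAPE = """
SELECT id, last_modified, data
FROM records
WHERE parent_id LIKE :parent_id AND collection_id = :collection_id
ORDER BY last_modified DESC
LIMIT :pagination_limit;
"""
```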
Some superficial comments...
kinto/core/storage/testing.py

```python
records = []
pagination = None
while True:
    (page, total_records) = self.storage.get_all(
```
nit: superfluous parentheses
```python
    [Filter(order_field, threshhold_field, utils.COMPARISON.EQ),
     Filter('last_modified', threshhold_lm, utils.COMPARISON.LT)],
    [Filter(order_field, threshhold_field, pagination_direction)]
]
```
Maybe add a comment about what happens in reality:
https://github.com/Kinto/kinto/blob/master/kinto/core/resource/__init__.py#L1055-L1057
kinto/core/storage/testing.py

```python
for order in [('secret_data', 1), ('secret_data', -1)]:
    order_field, order_direction = order
    sort = [Sort(*order), Sort('last_modified', -1)]
    pagination_direction = GT if order_direction == 1 else LT
```
I know it makes no sense to redo that at each loop iteration, but I would have had less trouble reading all this if `order_field` and `pagination_direction` had been defined right alongside the pagination rules below :)
kinto/core/storage/testing.py
Outdated
def test_get_all_parent_id_paginates_correctly(self): | ||
"""Verify that pagination doesn't squash or duplicate some records""" | ||
|
||
for parent in range(10): |
Please add a comment like `# Create records with different parent ids`
kinto/core/storage/memory.py

```diff
@@ -322,7 +322,7 @@ def extract_record_set(records, filters, sorting,
     paginated = {}
     for rule in pagination_rules:
         values = apply_filters(filtered, rule)
-        paginated.update(((x[id_field], x) for x in values))
+        paginated.update((id(x), x) for x in values)
```
index by memory address? I realize I don't understand why we don't just build a list :)
Well, I just tried building a list and none of the tests failed. Maybe the person who wrote it originally was afraid that different pagination rules would re-match a record twice? In that case, the tests probably won't fail, because we don't construct pagination rules in such a way that this could happen (but a pathological user could conceivably find a way if they were motivated). Since none of the tests fail, I'll take it out.
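The list-based version being described presumably amounts to something like this (a simplified sketch; in the real `memory.py`, pagination rules are lists of `Filter` objects consumed by `apply_filters`, not plain predicates):

```python
from typing import Any, Callable, Dict, Iterable, List

Record = Dict[str, Any]
Rule = List[Callable[[Record], bool]]  # a rule: every predicate must match

def apply_rules(filtered: Iterable[Record],
                pagination_rules: List[Rule]) -> List[Record]:
    records = list(filtered)  # may be iterated once per rule
    paginated: List[Record] = []
    for rule in pagination_rules:
        # Keep every record matched by at least one rule; no dedup needed,
        # since callers never build rules that re-match the same record.
        paginated.extend(r for r in records if all(f(r) for f in rule))
    return paginated
```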
10m records across several collections, I would say yes. 1m records in one collection, it should support it, but maybe not in optimal conditions, since synchronization is going to be painful anyway (especially the first time).
I definitely agree that syncing is more important than arbitrary filters (for which we have the ElasticSearch plugin)
That was a naive choice at the beginning of the project. On the server side, the total count is just used to figure out whether we have a next page. Which, as you say, could be managed differently (i.e. paginate until empty page).
One reason why …

We spoke about this online today. Removing the …
I'm not sure if it's worth it but we could skip the count part if the request is for the next page (e.g. one that carries a pagination token). In this case, the count is already known from the first page. Granted, I don't really know how the pagination works in Kinto, but if the number really is only there for helping pagination, then storing the first count in the cache would definitely make it possible to skip it for the next couple of pages. Too complicated, or every-little-helps?
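Something along these lines, perhaps (a sketch with invented names: `count_matching` and `cache` stand in for whatever the actual storage and cache interfaces would be):

```python
# Hedged sketch with invented names: reuse the count computed for page 1
# while the client walks the following pages.
import hashlib
import json

def total_records(count_matching, cache, parent_id, filters,
                  pagination_token=None, ttl=300):
    # Cache key derived from the query; filters assumed JSON-serializable.
    key = "count:" + hashlib.sha256(
        json.dumps([parent_id, filters], sort_keys=True).encode("utf-8")
    ).hexdigest()
    if pagination_token is not None:
        cached = cache.get(key)  # follow-up page: try the cached total
        if cached is not None:
            return cached
    total = count_matching(parent_id=parent_id, filters=filters)
    cache.set(key, total, ttl=ttl)
    return total
```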
Another not-crazy idea is to keep a tally count of records per collection and parent. E.g.:

```python
if not filters:
    cur.execute("""
        select count from count_records where
        collection_id=:collection_id and parent_id=:parent_id""", ...)
else:
    # Crap! Have to count them manually.
    cur.execute("""
        select count(*) from records where
        collection_id=:collection_id and parent_id=:parent_id and
        {filters}
    """, ...)
count, = cur.fetchone()
# Now we have the count; we just need to get the paginated records.
cur.execute("""
    select id, last_modified, data from records
    WHERE collection_id=:collection_id and parent_id=:parent_id and {filters}
""", ...)
records = cur.fetchmany()
```

A trigger function would increment and decrement the `count_records` tally. The obvious caveat (apart from trigger functions being annoyingly subtle) is that if a particular kinto database is only ever queried with complex filters, this is wasted record keeping.
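For concreteness, the trigger half of that idea might look like this plpgsql sketch (it targets the hypothetical `count_records` table from the snippet above, assumes a unique constraint on `(collection_id, parent_id)`, and is not anything Kinto ships):

```python
# Sketch of the tally-maintaining trigger, run once at migration time
# (same cur as in the snippet above).
cur.execute("""
CREATE FUNCTION bump_record_count() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        INSERT INTO count_records (collection_id, parent_id, count)
        VALUES (NEW.collection_id, NEW.parent_id, 1)
        ON CONFLICT (collection_id, parent_id)
        DO UPDATE SET count = count_records.count + 1;
    ELSE  -- DELETE
        UPDATE count_records SET count = count - 1
        WHERE collection_id = OLD.collection_id
          AND parent_id = OLD.parent_id;
    END IF;
    RETURN NULL;  -- return value of an AFTER trigger is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER records_tally
AFTER INSERT OR DELETE ON records
FOR EACH ROW EXECUTE PROCEDURE bump_record_count();
""")
```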
I extracted the query and compared. Best execution times...

With filtering on …

This confirms what @glasserc has been trying to explain to me: sometimes the count is expensive when it has to be a separate step; sometimes it's cheap when it can count the (few) returned records. This also emphasizes your point above about trying to outsmart the query planner in Postgres. Not sure if this helps. Either way, we have choices of the best approach, and they each depend on and fit different environments/contexts.
🎉🎊 \o/ Can't wait to get this into the Kinto that buildhub is using.

Let's prepare a release then 🌹

@glasserc could you add a CHANGELOG entry please?

Oops, yes, this PR took so long I forgot about the unchecked boxes in the description. I've opened #1635 with a CHANGELOG entry.
Add changelog entry for #1622
This is a more conservative approach towards addressing #1507. In particular, it doesn't introduce any race conditions, and probably reduces query times by much less.
The existing query's performance comes down to three factors:

1. scanning the table and applying the filters to find matching records;
2. sorting the matching records;
3. deduplicating records by `DISTINCT id` (and joining back to recover the full rows).

The performance characteristics of these steps vary depending on how many records the filters match and how many they don't match, as well as what proportion of the database is under the given `parent_id`.
Per our discussion in #1616, there are a couple of subtle bugs in the use of wildcard `parent_id`, which I used as the crowbar to try to pry open this query. First, introduce some tests to firm up the behavior we expect for wildcard `parent_id`s. It turns out that the third step above introduces a bug in pagination of results when doing a wildcard `parent_id` match, so just delete it.

Doing the sorting at the end seems wasteful to me; you need to sort anyhow in order to do the pagination, so why not generate records in the right order to begin with? I moved the 2nd step into the 1st step. This encourages Postgres to use indexes when called for. However, it seems that the benefit to the second step of using an index is outweighed by the cost to the first step of doing an index scan when very few records match, so in a Kinto instance with all records under one parent, doing a query with filters that match comparatively few records, this query is slower than the previous one.
I tested this by filling a Kinto instance with about 10m records, all in the same collection, each of the form `{"id": "id-XXXXXX", "my_data": XXXXXX}`, and then PATCHing one record to add a `"foo": true` field to it. I then measured the time of `http GET 'localhost:8888/v1/buckets/a/collections/b/records?_limit=10'`, `http GET 'localhost:8888/v1/buckets/a/collections/b/records?_limit=10&has_foo=true'`, and `http GET 'localhost:8888/v1/buckets/a/collections/b/records?_limit=10&has_foo=false'`. The first and last requests tend to perform the same, because both match the vast majority of the records in the database; on these, the new query takes about 60s (to the old query's ~90s, so about 33% faster). The second query is a little slower than the previous one (4.52s to 2.65s, or about 80% slower).

Perhaps it's possible to get the benefit of both queries by using heuristics to choose the sequential scan or the index scan, but then we're in the situation of trying to beat the Postgres query planner, and I'm not sanguine about our chances. On the other hand, note that the Postgres query planner is hamstrung by the lack of indexing on the `data` column.

My conclusions are:
- It isn't really clear what the scalability characteristics of Kinto are or should be. Is Kinto meant to store 10m records? I would say probably not. What about 1m records, but all in one collection? I'm not certain.
- It isn't really clear (to me, at least) what queries Kinto should support as "first-class" and which are "gravy", i.e. added bonuses that we threw in because they were easy to implement but are not as well-loved, performant, etc. as others. Naïvely I would say syncing collections is more important than supporting arbitrary filters, but then again, if that's what you want, CouchDB is available and may be a better fit for you.
- It isn't really clear to me what the relative priorities are in the Kinto ecosystem. As an example, in order to produce a `Total-Records` header, we have to add a `COUNT` to our queries, which means enumerating all the matching records rather than just the ones that we're going to render in the response. In other distributed systems, you might just paginate until you get an empty page (sketched below). Is the ease-of-use/sanity checking provided by the `Total-Records` header worth the performance cost?
header worth the performance cost?Add documentation.
Add tests.
Add a changelog entry.
(n/a) Add your name in the contributors file.
(n/a) If you changed the HTTP API, update the API_VERSION constant and add an API changelog entry in the docs
(n/a) If you added a new configuration setting, update the
kinto.tpl
file with it.