Cleanup Postgres get_all #1622

Merged
glasserc merged 6 commits into Kinto:master from glasserc:cleanup-postgres-get_all on May 16, 2018
Conversation

Contributor

@glasserc glasserc commented May 4, 2018

This is a more conservative approach towards addressing #1507. In particular, it doesn't introduce any race conditions, though it also probably reduces query times by much less.

The existing query's performance comes down to three factors:

  1. It matches every record in the database that passes the filters, and outputs them in arbitrary order.
  2. It then sorts those records according to the requested sort order.
  3. It does a distinct/join as part of paginating.

The performance characteristics of these steps vary depending on how many records the filters match and how many they don't, as well as what proportion of the database is under the given parent_id.

  1. If most of the database is under this parent, then doing a sequential scan is the fastest way to do this. Otherwise, you want to use an index scan to narrow it down as much as possible before applying filters.
  2. Sorting is O(n log n), so if a lot of records match the filters, doing a sort afterwards is expensive. In those cases, you generally want to do an index scan to produce the records in the right order to begin with.
  3. Doing a distinct/join is probably O(n log n), so it becomes more expensive the more records match.
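
For orientation, here's a rough sketch of that three-step shape, written the way the backend builds it up as a query string. This is illustrative only, not the literal Kinto SQL; the table, placeholder, and format-slot names are made up.

query = """
WITH matching AS (
    -- step 1: match every record under the parent that passes the filters,
    -- in whatever order the scan happens to produce
    SELECT id, last_modified, data
      FROM records
     WHERE parent_id LIKE %(parent_id)s
       AND collection_id = %(collection_id)s
       {filters}
),
paginated AS (
    -- step 3: DISTINCT ids that satisfy the pagination rules
    SELECT DISTINCT id FROM matching {pagination_rules}
)
SELECT m.*
  FROM paginated AS p JOIN matching AS m ON (m.id = p.id)
 {sorting}  -- step 2: sort everything at the very end
 LIMIT %(limit)s;
"""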

Per our discussion in #1616, there are a couple of subtle bugs in the use of wildcard parent_id, which I used as the crowbar to try to pry open this query. First, introduce some tests to firm up the behavior we expect for wildcard parent_ids. It turns out that the third step above introduces a bug in pagination of results when doing a wildcard parent_id match, so just delete it.

Doing the sorting at the end seems wasteful to me; you need to sort anyhow in order to do the pagination, so why not generate records in the right order to begin with? I moved the 2nd step into the first step. This encourages Postgres to use indexes when called for. However, it seems that the benefit to the second step of using an index is outweighed by the cost to the first step of doing an index scan when very few records match, so in a Kinto instance with all records under one parent, a query whose filters match comparatively few records is slower with this change than with the previous query.

I tested this by filling a Kinto instance with about 10m records, all in the same collection, each of the form {"id": "id-XXXXXX", "my_data": XXXXXX}, and then PATCHing one record to add a "foo": true field to it. I then measured the time of http GET 'localhost:8888/v1/buckets/a/collections/b/records?_limit=10', http GET 'localhost:8888/v1/buckets/a/collections/b/records?_limit=10&has_foo=true', and http GET 'localhost:8888/v1/buckets/a/collections/b/records?_limit=10&has_foo=false'. The first and last requests tend to perform the same, because both match the vast majority of the records in the database; on these, the new query takes about 60s (versus the old query's ~90s, so about 33% faster). The second request is somewhat slower than with the previous query (4.52s vs. 2.65s, or about 70% slower).

Perhaps it's possible to get the benefit of both queries by using heuristics to choose the sequential scan or the index scan, but then we're in the situation of trying to beat the Postgres query planner and I'm not sanguine about our chances. On the other hand, note that the Postgres query planner is hamstrung by the lack of indexing on the data column.

My conclusions are:

  • It isn't really clear what the scalability characteristics of Kinto are or should be. Is Kinto meant to store 10m records? I would say probably not. What about 1m records, but all in one collection? I'm not certain.

  • It isn't really clear (to me, at least) what queries Kinto should support as "first-class" and which are "gravy", i.e. added bonuses that we threw in because they were easy to implement but are not as well-loved, performant etc. as others. Naïvely I would say syncing collections is more important than supporting arbitrary filters, but then again, if that's what you want, CouchDB is available and may be a better fit for you.

  • It isn't really clear to me what the relative priorities are in the Kinto ecosystem. As an example, in order to produce a Total-Records header, we have to add a COUNT to our queries, which means enumerating all the matching records rather than just the ones that we're going to render in the response. In other distributed systems, you might just paginate until you get an empty page. Is the ease-of-use/sanity checking provided by the Total-Records header worth the performance cost?

  • Add documentation.

  • Add tests.

  • Add a changelog entry.

  • (n/a) Add your name in the contributors file.

  • (n/a) If you changed the HTTP API, update the API_VERSION constant and add an API changelog entry in the docs

  • (n/a) If you added a new configuration setting, update the kinto.tpl file with it.

glasserc added 5 commits May 4, 2018 14:44
Interestingly, the memory backend and the Postgres backend both fail
this test, but in different ways.
Because parent_id with wildcards can match different objects that have
the same ID, this old use of records indexed by IDs to get the union
of a bunch of records no longer works. Instead, rely on the fact that
the same record, matched multiple times, is the same object in
memory, and use its memory location as its unique ID. This is kind of
a hack, but I couldn't think of a better solution.

Although this is one obvious place where we rely on every object in a
given "result set" having a unique ID, there may be others hidden
elsewhere.

Refs: Kinto#1616
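
(A hedged illustration of the situation this commit describes, using made-up data rather than anything from the test suite: two objects under different parents can share the same record id, so a dict keyed on that id silently drops one of them, while keying on object identity keeps both.)

record_a = {"id": "abc", "parent_id": "/buckets/a"}
record_b = {"id": "abc", "parent_id": "/buckets/b"}
matches = [record_a, record_b]

by_record_id = {r["id"]: r for r in matches}   # len == 1: one record lost
by_identity = {id(r): r for r in matches}      # len == 2: both kept
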
This DISTINCT id leads to difficulties when, as with wildcard
parent_ids, multiple records have the same ID. This manifests as
re-selecting the same record in multiple "pages" because its id
matched once.

This DISTINCT also costs us performance because we have to stick every
record in a hash in order to deduplicate it, and then do an expensive
join afterwards to get the record back. None of this is really
necessary, so just delete it.
Leaving the sorting criteria at the end of the query causes Postgres
to fetch all the records first, count them, and then sort them (which
is probably O(n log n)) so that it can select the right page.

Moving the sorting criteria closest to the data means that Postgres
will try to use an index (if possible/available) to generate records
in the right order. This eliminates the cost of doing a sort. However,
it more or less enforces that we will be scanning records via the
index, even if scanning the whole table would be faster and generate
comparatively few records (i.e. few enough that sorting would be
fast).
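
(A rough before/after sketch of what this commit changes; illustrative only, not the literal query text. The point is where the ORDER BY sits relative to the scan of the records table.)

# Before: collect all matching records first, then sort the whole set at the end.
old_shape = """
    SELECT id, last_modified, data
      FROM matching_records
    ORDER BY last_modified DESC   -- applied after every matching row is materialized
    LIMIT %(limit)s;
"""

# After: the ORDER BY sits next to the scan itself, so Postgres can walk an
# index in the requested order and skip the sort (at the cost of forcing an
# index scan even when a sequential scan would have been cheaper).
new_shape = """
    SELECT id, last_modified, data
      FROM records
     WHERE parent_id = %(parent_id)s
       AND collection_id = %(collection_id)s
       {filters}
    ORDER BY last_modified DESC
    LIMIT %(limit)s;
"""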
Contributor

@leplatrem leplatrem left a comment

Some superficial comments...

records = []
pagination = None
while True:
(page, total_records) = self.storage.get_all(
Contributor

nit: superfluous parentheses

[Filter(order_field, threshhold_field, utils.COMPARISON.EQ),
Filter('last_modified', threshhold_lm, utils.COMPARISON.LT)],
[Filter(order_field, threshhold_field, pagination_direction)]
]

for order in [('secret_data', 1), ('secret_data', -1)]:
order_field, order_direction = order
sort = [Sort(*order), Sort('last_modified', -1)]
pagination_direction = GT if order_direction == 1 else LT
Contributor

I know it makes no sense to redo that at each loop iteration, but I would have had less trouble reading all this if order_field and pagination_direction had been defined right alongside the pagination rules below :)

def test_get_all_parent_id_paginates_correctly(self):
"""Verify that pagination doesn't squash or duplicate some records"""

for parent in range(10):
Contributor

Please add a comment like Create records with different parent ids

@@ -322,7 +322,7 @@ def extract_record_set(records, filters, sorting,
paginated = {}
for rule in pagination_rules:
values = apply_filters(filtered, rule)
paginated.update(((x[id_field], x) for x in values))
paginated.update((id(x), x) for x in values)
Contributor

index by memory address? I realize I don't understand why we don't just build a list :)

Contributor Author

Well, I just tried building a list and none of the tests failed. Maybe the person who wrote it originally was afraid that different pagination rules would re-match a record twice? In that case, the tests probably won't fail because we don't construct pagination rules in such a way that this could happen (though a pathological user could conceivably find a way if they were motivated). Since none of the tests fail, I'll take it out.
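
A toy illustration of the trade-off being discussed (the overlapping rules here are hypothetical, not something Kinto currently constructs): the dict keyed on id(x) deduplicates a record that two pagination rules both match, whereas a plain list keeps the duplicate.

record = {"id": "abc", "last_modified": 42}
values_per_rule = [[record], [record]]   # pretend two rules both matched it

as_list = [x for values in values_per_rule for x in values]
as_dict = {id(x): x for values in values_per_rule for x in values}

print(len(as_list))   # 2 -- duplicate kept
print(len(as_dict))   # 1 -- deduplicated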

@leplatrem
Contributor

> It isn't really clear what the scalability characteristics of Kinto are or should be. Is Kinto meant to store 10m records? I would say probably not. What about 1m records, but all in one collection? I'm not certain.

10m records across several collections, I would say yes.

1m records in one collection: it should support it, but maybe not in optimal conditions, since synchronization is going to be painful anyway (especially the first time).

> It isn't really clear (to me, at least) what queries Kinto should support as "first-class" and which are "gravy", i.e. added bonuses that we threw in because they were easy to implement but are not as well-loved, performant etc. as others. Naïvely I would say syncing collections is more important than supporting arbitrary filters, but then again, if that's what you want, CouchDB is available and may be a better fit for you.

I definitely agree that syncing is more important than arbitrary filters (for which we have the ElasticSearch plugin).

> It isn't really clear to me what the relative priorities are in the Kinto ecosystem. As an example, in order to produce a Total-Records header, we have to add a COUNT to our queries, which means enumerating all the matching records rather than just the ones that we're going to render in the response. In other distributed systems, you might just paginate until you get an empty page. Is the ease-of-use/sanity checking provided by the Total-Records header worth the performance cost?

That was a naive choice at the beginning of the project. The Total-Records header is not even exposed in our clients' API...

On the server side, the total count is just used to figure out whether we have a next page. Which, as you say, could be managed differently (i.e. paginate until an empty page).
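
For reference, one common way to answer "is there a next page?" without any COUNT is to over-fetch by a single row; a minimal sketch, with a hypothetical helper that is not part of Kinto:

def fetch_page(cur, query, params, limit):
    # Ask for one extra row: if we get it, a next page exists.
    cur.execute(query + " LIMIT %(limit)s", dict(params, limit=limit + 1))
    rows = cur.fetchall()
    return rows[:limit], len(rows) > limit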

@glasserc
Contributor Author

glasserc commented May 7, 2018

One reason why Total-Records might be nice is to show pagination controls on the client side (e.g. if Total-Records is 50, we can show pages 1..5). We do expose it, not as part of other requests, but as a getTotalRecords call in kinto-http.js. PostgREST handles this by exposing a Content-Range header, but that header doesn't contain a total count unless you add a Prefer: count=exact header. They even recommend stripping this header in prod: http://postgrest.org/en/stable/admin.html?highlight=Content-Range#count-header-dos.

@glasserc
Contributor Author

glasserc commented May 7, 2018

We spoke about this online today. Removing the Total-Records header is kind of an API-breaking change, even though to our knowledge, our clients don't actually rely on this header.

@peterbe
Contributor

peterbe commented May 7, 2018

I'm not sure if it's worth it, but we could skip the count part if the request is for the next page. E.g.

$ curl localhost:8888/v1/collection/records
< 200 OK
< Total-Records: 1,234,567
< Next-URL: http://localhost:8888/v1/paginate=somelonghashthing
$ curl http://localhost:8888/v1/paginate=somelonghashthing
< 200 OK
< Total-Records: 1,234,567
< Next-URL: http://localhost:8888/v1/paginate=anotherlonghash

In this case, the Total-Records number could come from the cache rather than having to do another query again.

Granted, I don't really know how pagination works in Kinto, but if the number really is only there to help with pagination, then storing the first count in the cache would definitely make it possible to skip it for the next couple of pages.

Too complicated or every-little-helps?
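
A minimal sketch of how "remember the first count" could look if the total were carried inside the pagination token itself (the encoding here is purely hypothetical, not Kinto's actual token format):

import base64
import json

def encode_token(payload):
    # Hypothetical: serialize the pagination state, including the cached total.
    return base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()

def decode_token(token):
    return json.loads(base64.urlsafe_b64decode(token.encode()).decode())

# First page: run the COUNT once, then stash the total alongside the cursor.
token = encode_token({"last_modified": 1526479586000, "total": 1234567})

# Later pages: echo Total-Records from the token instead of re-counting.
state = decode_token(token)
print(state["total"])   # 1234567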

@peterbe
Contributor

peterbe commented May 7, 2018

Another not-crazy idea is to keep a tally count of records per collection and parent. E.g.

if not filters:
    cur.execute("""
      select count from count_records where 
      collection_id=:collection_id and parent_id=:parent_id""", ...)
else:
    # Crap! Have to count them manually.
    cur.execute("""
      select count(*) from records where 
      collection_id=:collection_id and parent_id=:parent_id and
      {filters}
""", ...)
count, = cur.fetchone()  

# Now we have the count; we just need to get the paginated records
cur.execute("""
  select id, last_modified, data from records 
  WHERE collection_id=:collection_id and parent_id=:parent_id and {filters}
""", ...)
records = cur.fetchmany()

A trigger function would increment and decrement the count_records table.
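
A sketch of what such a trigger could look like (the schema and names here are illustrative, not Kinto's actual tables):

cur.execute("""
  CREATE TABLE IF NOT EXISTS count_records (
      parent_id TEXT NOT NULL,
      collection_id TEXT NOT NULL,
      count BIGINT NOT NULL DEFAULT 0,
      PRIMARY KEY (parent_id, collection_id)
  );

  CREATE OR REPLACE FUNCTION bump_count_records() RETURNS trigger AS $$
  BEGIN
    IF TG_OP = 'INSERT' THEN
      INSERT INTO count_records (parent_id, collection_id, count)
      VALUES (NEW.parent_id, NEW.collection_id, 1)
      ON CONFLICT (parent_id, collection_id)
      DO UPDATE SET count = count_records.count + 1;
    ELSIF TG_OP = 'DELETE' THEN
      UPDATE count_records SET count = count - 1
       WHERE parent_id = OLD.parent_id AND collection_id = OLD.collection_id;
    END IF;
    RETURN NULL;
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER records_count_trigger
  AFTER INSERT OR DELETE ON records
  FOR EACH ROW EXECUTE PROCEDURE bump_count_records();
""")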

The obvious caveat (apart from trigger functions being annoyingly subtle) is that if a particular Kinto database is only ever queried with complex filters, this is wasted record keeping.

@peterbe
Contributor

peterbe commented May 7, 2018

I extracted the query and compared.
In this database I have 183K records (almost) all in the same parent_id.

Best execution times...

  1. The query in master: 3477.01ms
  2. This PR: 1313.54ms
  3. Two separate queries (count and select): 118.88ms + 1.69ms

With filtering on &has_foo=true which returns 2 rows

  1. The query in master: 160.06ms
  2. This PR: 356.93ms
  3. Two separate queries (count and select): 104.29ms + 322.52ms

This confirms what @glasserc has been trying to explain to me: sometimes the count is expensive when it has to be done separately, and sometimes it's cheap when it can just count the (few) returned records.

This also reinforces your point above about trying to outsmart the Postgres query planner.

Not sure if this helps. Either way, we have several candidate approaches, and which one fits best depends on the environment/context.

@glasserc glasserc merged commit 1394424 into Kinto:master May 16, 2018
@glasserc glasserc deleted the cleanup-postgres-get_all branch May 16, 2018 13:09
@peterbe
Contributor

peterbe commented May 16, 2018

🎉🎊 \o/

Can't wait to get this into the Kinto that buildhub is using.

@Natim
Member

Natim commented May 16, 2018

Let's prepare a release then 🌹

@leplatrem
Contributor

@glasserc could you add a CHANGELOG entry please?

glasserc added a commit to glasserc/kinto that referenced this pull request May 17, 2018
@glasserc
Contributor Author

Oops, yes, this PR took so long I forgot about the unchecked boxes in the description. I've opened #1635 with a CHANGELOG entry.

glasserc added a commit that referenced this pull request May 18, 2018