Cleanup Postgres get_all #1622

Merged
glasserc merged 6 commits into Kinto:master from glasserc:cleanup-postgres-get_all on May 16, 2018
Conversation

Contributor

@glasserc glasserc commented May 4, 2018

This is a more conservative approach towards addressing #1507. In particular, it doesn't introduce any race conditions, though it also probably reduces query times by much less.

The existing query's performance comes down to three factors:

  1. It matches every record in the database that passes the filters, and outputs them in arbitrary order.
  2. It then sorts those records according to the requested sort order.
  3. It does a distinct/join as part of paginating.

The performance characteristics of these steps vary depending on how many records the filters match and how many they don't, as well as what proportion of the database is under the given parent_id.

  1. If most of the database is under this parent, then doing a sequential scan is the fastest way to do this. Otherwise, you want to use an index scan to narrow it down as much as possible before applying filters.
  2. Sorting is O(n log n), so if a lot of records match the filters, doing a sort afterwards is expensive. In those cases, you generally want to do an index scan to produce the records in the right order to begin with.
  3. Doing a distinct/join is probably O(n log n), so it becomes more expensive the more records match.
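
For orientation, here's a rough sketch of that three-step shape, written the way the backend builds it up as a query string. This is illustrative only, not the literal Kinto SQL; the table, placeholder, and format-slot names are made up.

query = """
WITH matching AS (
    -- step 1: match every record under the parent that passes the filters,
    -- in whatever order the scan happens to produce
    SELECT id, last_modified, data
      FROM records
     WHERE parent_id LIKE %(parent_id)s
       AND collection_id = %(collection_id)s
       {filters}
),
paginated AS (
    -- step 3: DISTINCT ids that satisfy the pagination rules
    SELECT DISTINCT id FROM matching {pagination_rules}
)
SELECT m.*
  FROM paginated AS p JOIN matching AS m ON (m.id = p.id)
 {sorting}  -- step 2: sort everything at the very end
 LIMIT %(limit)s;
"""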

Per our discussion in #1616, there are a couple of subtle bugs in the use of wildcard parent_id, which I used as the crowbar to try to pry open this query. First, introduce some tests to firm up the behavior we expect for wildcard parent_ids. It turns out that the third step above introduces a bug in pagination of results when doing a wildcard parent_id match, so just delete it.

Doing the sorting at the end seems wasteful to me; you need to sort anyhow in order to do the pagination, so why not generate records in the right order to begin with? I moved the 2nd step into the first step. This encourages Postgres to use indexes when called for. However, it seems that the benefit to the second step of using an index is outweighed by the cost to the first step of doing an index scan when very few records match, so in a Kinto instance with all records under one parent, a query whose filters match comparatively few records is slower with this change than with the previous query.

I tested this by filling a Kinto instance with about 10m records, all in the same collection, each of the form {"id": "id-XXXXXX", "my_data": XXXXXX}, and then PATCHing one record to add a "foo": true field to it. I then measured the time of http GET 'localhost:8888/v1/buckets/a/collections/b/records?_limit=10', http GET 'localhost:8888/v1/buckets/a/collections/b/records?_limit=10&has_foo=true', and http GET 'localhost:8888/v1/buckets/a/collections/b/records?_limit=10&has_foo=false'. The first and last requests tend to perform the same, because both match the vast majority of the records in the database; on these, the new query takes about 60s (versus the old query's ~90s, so about 33% faster). The second request is somewhat slower than with the previous query (4.52s vs. 2.65s, or about 70% slower).

Perhaps it's possible to get the benefit of both queries by using heuristics to choose the sequential scan or the index scan, but then we're in the situation of trying to beat the Postgres query planner and I'm not sanguine about our chances. On the other hand, note that the Postgres query planner is hamstrung by the lack of indexing on the data column.

My conclusions are:

  • It isn't really clear what the scalability characteristics of Kinto are or should be. Is Kinto meant to store 10m records? I would say probably not. What about 1m records, but all in one collection? I'm not certain.

  • It isn't really clear (to me, at least) what queries Kinto should support as "first-class" and which are "gravy", i.e. added bonuses that we threw in because they were easy to implement but are not as well-loved, performant etc. as others. Naïvely I would say syncing collections is more important than supporting arbitrary filters, but then again, if that's what you want, CouchDB is available and may be a better fit for you.

  • It isn't really clear to me what the relative priorities are in the Kinto ecosystem. As an example, in order to produce a Total-Records header, we have to add a COUNT to our queries, which means enumerating all the matching records rather than just the ones that we're going to render in the response. In other distributed systems, you might just paginate until you get an empty page. Is the ease-of-use/sanity checking provided by the Total-Records header worth the performance cost?

  • Add documentation.

  • Add tests.

  • Add a changelog entry.

  • (n/a) Add your name in the contributors file.

  • (n/a) If you changed the HTTP API, update the API_VERSION constant and add an API changelog entry in the docs

  • (n/a) If you added a new configuration setting, update the kinto.tpl file with it.

glasserc added 5 commits May 4, 2018 14:44
Interestingly, the memory backend and the Postgres backend both fail
this test, but in different ways.
Because parent_id with wildcards can match different objects that have
the same ID, this old use of records indexed by IDs to get the union
of a bunch of records no longer works. Instead, rely on the fact that
the same record, matched multiple times, is the same object in
memory, and use its memory location as its unique ID. This is kind of
a hack, but I couldn't think of a better solution.

Although this is one obvious place where we rely on every object in a
given "result set" having a unique ID, there may be others hidden
elsewhere.

Refs: Kinto#1616
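
(A hedged illustration of the situation this commit describes, using made-up data rather than anything from the test suite: two objects under different parents can share the same record id, so a dict keyed on that id silently drops one of them, while keying on object identity keeps both.)

record_a = {"id": "abc", "parent_id": "/buckets/a"}
record_b = {"id": "abc", "parent_id": "/buckets/b"}
matches = [record_a, record_b]

by_record_id = {r["id"]: r for r in matches}   # len == 1: one record lost
by_identity = {id(r): r for r in matches}      # len == 2: both kept
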
This DISTINCT id leads to difficulties when, as with wildcard
parent_ids, multiple records have the same ID. This manifests as
re-selecting the same record in multiple "pages" because its id
matched once.

This DISTINCT also costs us performance because we have to stick every
record in a hash in order to deduplicate it, and then do an expensive
join afterwards to get the record back. None of this is really
necessary, so just delete it.
Leaving the sorting criteria at the end of the query causes Postgres
to fetch all the records first, count them, and then sort them (which
is probably O(n log n)) so that it can select the right page.

Moving the sorting criteria closest to the data means that Postgres
will try to use an index (if possible/available) to generate records
in the right order. This eliminates the cost of doing a sort. However,
it more or less enforces that we will be scanning records via the
index, even if scanning the whole table would be faster and generate
comparatively few records (i.e. few enough that sorting would be
fast).
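
(A rough before/after sketch of what this commit changes; illustrative only, not the literal query text. The point is where the ORDER BY sits relative to the scan of the records table.)

# Before: collect all matching records first, then sort the whole set at the end.
old_shape = """
    SELECT id, last_modified, data
      FROM matching_records
    ORDER BY last_modified DESC   -- applied after every matching row is materialized
    LIMIT %(limit)s;
"""

# After: the ORDER BY sits next to the scan itself, so Postgres can walk an
# index in the requested order and skip the sort (at the cost of forcing an
# index scan even when a sequential scan would have been cheaper).
new_shape = """
    SELECT id, last_modified, data
      FROM records
     WHERE parent_id = %(parent_id)s
       AND collection_id = %(collection_id)s
       {filters}
    ORDER BY last_modified DESC
    LIMIT %(limit)s;
"""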
Contributor

@leplatrem leplatrem left a comment

Some superficial comments...

records = []
pagination = None
while True:
(page, total_records) = self.storage.get_all(
Contributor

nit: superfluous parentheses

[Filter(order_field, threshhold_field, utils.COMPARISON.EQ),
Filter('last_modified', threshhold_lm, utils.COMPARISON.LT)],
[Filter(order_field, threshhold_field, pagination_direction)]
]

for order in [('secret_data', 1), ('secret_data', -1)]:
order_field, order_direction = order
sort = [Sort(*order), Sort('last_modified', -1)]
pagination_direction = GT if order_direction == 1 else LT
Contributor

I know it makes no sense to redo that at each loop iteration, but I would have had less trouble reading all this if order_field and pagination_direction had been defined right alongside the pagination rules below :)

def test_get_all_parent_id_paginates_correctly(self):
"""Verify that pagination doesn't squash or duplicate some records"""

for parent in range(10):
Contributor

Please add a comment like Create records with different parent ids

@@ -322,7 +322,7 @@ def extract_record_set(records, filters, sorting,
paginated = {}
for rule in pagination_rules:
values = apply_filters(filtered, rule)
paginated.update(((x[id_field], x) for x in values))
paginated.update((id(x), x) for x in values)
Contributor

index by memory address? I realize I don't understand why we don't just build a list :)

Contributor Author

Well, I just tried building a list and none of the tests failed. Maybe the person who wrote it originally was afraid that different pagination rules would re-match a record twice? In that case, the tests probably won't fail because we don't construct pagination rules in such a way that this could happen (though a pathological user could conceivably find a way if they were motivated). Since none of the tests fail, I'll take it out.
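
A toy illustration of the trade-off being discussed (the overlapping rules here are hypothetical, not something Kinto currently constructs): the dict keyed on id(x) deduplicates a record that two pagination rules both match, whereas a plain list keeps the duplicate.

record = {"id": "abc", "last_modified": 42}
values_per_rule = [[record], [record]]   # pretend two rules both matched it

as_list = [x for values in values_per_rule for x in values]
as_dict = {id(x): x for values in values_per_rule for x in values}

print(len(as_list))   # 2 -- duplicate kept
print(len(as_dict))   # 1 -- deduplicated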

@leplatrem
Contributor

> It isn't really clear what the scalability characteristics of Kinto are or should be. Is Kinto meant to store 10m records? I would say probably not. What about 1m records, but all in one collection? I'm not certain.

10m records across several collections, I would say yes.

1m records in one collection: it should support it, but maybe not in optimal conditions, since synchronization is going to be painful anyway (especially the first time).

> It isn't really clear (to me, at least) what queries Kinto should support as "first-class" and which are "gravy", i.e. added bonuses that we threw in because they were easy to implement but are not as well-loved, performant etc. as others. Naïvely I would say syncing collections is more important than supporting arbitrary filters, but then again, if that's what you want, CouchDB is available and may be a better fit for you.

I definitely agree that syncing is more important than arbitrary filters (for which we have the ElasticSearch plugin).

> It isn't really clear to me what the relative priorities are in the Kinto ecosystem. As an example, in order to produce a Total-Records header, we have to add a COUNT to our queries, which means enumerating all the matching records rather than just the ones that we're going to render in the response. In other distributed systems, you might just paginate until you get an empty page. Is the ease-of-use/sanity checking provided by the Total-Records header worth the performance cost?

That was a naive choice at the beginning of the project. The Total-Records header is not even exposed in our clients' API...

On the server side, the total count is just used to figure out whether we have a next page. Which, as you say, could be managed differently (i.e. paginate until an empty page).
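
For reference, one common way to answer "is there a next page?" without any COUNT is to over-fetch by a single row; a minimal sketch, with a hypothetical helper that is not part of Kinto:

def fetch_page(cur, query, params, limit):
    # Ask for one extra row: if we get it, a next page exists.
    cur.execute(query + " LIMIT %(limit)s", dict(params, limit=limit + 1))
    rows = cur.fetchall()
    return rows[:limit], len(rows) > limit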

@glasserc
Contributor Author

glasserc commented May 7, 2018

One reason why Total-Records might be nice is to show pagination controls on the client side (e.g. if Total-Records is 50, we can show pages 1..5). We do expose it, not as part of other requests, but as a getTotalRecords call in kinto-http.js. PostgREST handles this by exposing a Content-Range header, but that header doesn't contain a total count unless you add a Prefer: count=exact header. They even recommend stripping this header in prod: http://postgrest.org/en/stable/admin.html?highlight=Content-Range#count-header-dos.

@glasserc
Contributor Author

glasserc commented May 7, 2018

We spoke about this online today. Removing the Total-Records header is kind of an API-breaking change, even though to our knowledge, our clients don't actually rely on this header.

@peterbe
Contributor

peterbe commented May 7, 2018

I'm not sure if it's worth it, but we could skip the count part if the request is for the next page. E.g.

$ curl localhost:8888/v1/collection/records
< 200 OK
< Total-Records: 1,234,567
< Next-URL: http://localhost:8888/v1/paginate=somelonghashthing
$ curl http://localhost:8888/v1/paginate=somelonghashthing
< 200 OK
< Total-Records: 1,234,567
< Next-URL: http://localhost:8888/v1/paginate=anotherlonghash

In this case, the Total-Records number could come from the cache rather than having to do another query again.

Granted, I don't really know how pagination works in Kinto, but if the number really is only there to help with pagination, then storing the first count in the cache would definitely make it possible to skip it for the next couple of pages.

Too complicated or every-little-helps?
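
A minimal sketch of how "remember the first count" could look if the total were carried inside the pagination token itself (the encoding here is purely hypothetical, not Kinto's actual token format):

import base64
import json

def encode_token(payload):
    # Hypothetical: serialize the pagination state, including the cached total.
    return base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()

def decode_token(token):
    return json.loads(base64.urlsafe_b64decode(token.encode()).decode())

# First page: run the COUNT once, then stash the total alongside the cursor.
token = encode_token({"last_modified": 1526479586000, "total": 1234567})

# Later pages: echo Total-Records from the token instead of re-counting.
state = decode_token(token)
print(state["total"])   # 1234567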

@peterbe
Contributor

peterbe commented May 7, 2018

Another not-crazy idea is to keep a tally count of records per collection and parent. E.g.

if not filters:
    cur.execute("""
      select count from count_records where 
      collection_id=:collection_id and parent_id=:parent_id""", ...)
else:
    # Crap! Have to count them manually.
    cur.execute("""
      select count(*) from records where 
      collection_id=:collection_id and parent_id=:parent_id and
      {filters}
""", ...)
count, = cur.fetchone()  

# Now we have the count; we just need to get the paginated records
cur.execute("""
  select id, last_modified, data from records 
  WHERE collection_id=:collection_id and parent_id=:parent_id and {filters}
""", ...)
records = cur.fetchmany()

A trigger function would increment and decrement the count_records table.
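
A sketch of what such a trigger could look like (the schema and names here are illustrative, not Kinto's actual tables):

cur.execute("""
  CREATE TABLE IF NOT EXISTS count_records (
      parent_id TEXT NOT NULL,
      collection_id TEXT NOT NULL,
      count BIGINT NOT NULL DEFAULT 0,
      PRIMARY KEY (parent_id, collection_id)
  );

  CREATE OR REPLACE FUNCTION bump_count_records() RETURNS trigger AS $$
  BEGIN
    IF TG_OP = 'INSERT' THEN
      INSERT INTO count_records (parent_id, collection_id, count)
      VALUES (NEW.parent_id, NEW.collection_id, 1)
      ON CONFLICT (parent_id, collection_id)
      DO UPDATE SET count = count_records.count + 1;
    ELSIF TG_OP = 'DELETE' THEN
      UPDATE count_records SET count = count - 1
       WHERE parent_id = OLD.parent_id AND collection_id = OLD.collection_id;
    END IF;
    RETURN NULL;
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER records_count_trigger
  AFTER INSERT OR DELETE ON records
  FOR EACH ROW EXECUTE PROCEDURE bump_count_records();
""")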

The obvious caveat (apart from trigger functions being annoyingly subtle) is that if a particular Kinto database is only ever queried with complex filters, this is wasted record keeping.

@peterbe
Contributor

peterbe commented May 7, 2018

I extracted the query and compared.
In this database I have 183K records (almost) all in the same parent_id.

Best execution times...

  1. The query in master: 3477.01ms
  2. This PR: 1313.54ms
  3. Two separate queries (count and select): 118.88ms + 1.69ms

With filtering on &has_foo=true which returns 2 rows

  1. The query in master: 160.06ms
  2. This PR: 356.93ms
  3. Two separate queries (count and select): 104.29ms + 322.52ms

This confirms what @glasserc has been trying to explain to me: sometimes the count is expensive when it has to be done separately, and sometimes it's cheap when it can just count the (few) returned records.

This also reinforces your point above about trying to outsmart the Postgres query planner.

Not sure if this helps. Either way, we have several candidate approaches, and which one fits best depends on the environment/context.

@glasserc glasserc merged commit 1394424 into Kinto:master May 16, 2018
@glasserc glasserc deleted the cleanup-postgres-get_all branch May 16, 2018 13:09
@peterbe
Contributor

peterbe commented May 16, 2018

🎉🎊 \o/

Can't wait to get this into the Kinto that buildhub is using.

@Natim
Member

Natim commented May 16, 2018

Let's prepare a release then 🌹

@leplatrem
Contributor

@glasserc could you add a CHANGELOG entry please?

glasserc added a commit to glasserc/kinto that referenced this pull request May 17, 2018
@glasserc
Contributor Author

Oops, yes, this PR took so long I forgot about the unchecked boxes in the description. I've opened #1635 with a CHANGELOG entry.

glasserc added a commit that referenced this pull request May 18, 2018