Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: fix missing index on courier list count #3002

Merged
merged 11 commits into from
Jan 10, 2023

Conversation

jonas-jonas
Copy link
Member

Changes the order in which the count of all courier_messages is fetched from the database, to avoid scoping in the query and thus doing full table scans. Also adds missing indexes for the status and recipient filter.

Related issue(s)

Checklist

  • I have read the contributing guidelines.
  • I have referenced an issue containing the design document if my change
    introduces a new feature.
  • I am following the
    contributing code guidelines.
  • I have read the security policy.
  • I confirm that this pull request does not address a security
    vulnerability. If this pull request addresses a security vulnerability, I
    confirm that I got the approval (please contact
    security@ory.sh) from the maintainers to push
    the changes.
  • I have added tests that prove my fix is effective or that my feature
    works.
  • I have added or changed the documentation.

Further Comments

@@ -0,0 +1,3 @@
CREATE INDEX courier_messages_status_idx ON courier_messages (status);

CREATE INDEX courier_messages_status_idx ON courier_messages (recipient);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indices are a way to tell the database which queries it should expect, so that it can optimize the lookup for these. That's why indices should always include all columns for a specific query. Optimally, indices are built in such a way that they can be re-used elsewhere.

Queries can benefit from an index even if they only filter a prefix of its columns. For example, if you create an index of columns (A, B, C), queries filtering (A) or (A, B) can use the index, so you don't need to also index (A).

The more variety a column has, the better it is in the index, because it avoids hot spots. So always put the column with the highest "entropy" first:

Suggested change
CREATE INDEX courier_messages_status_idx ON courier_messages (recipient);
CREATE INDEX courier_messages_recipient_status_idx ON courier_messages (nid, recipient, status);

In the example below, if recipient is not required, you should be able to change the query to

	if filter.Recipient != "" {
 		q = q.Where("recipient=?", filter.Recipient)
 	} else {
 		q = q.Where("recipient != ?", "")
  }

so that you don't need to add another index (please verify this first though).

Furthermore, you're creating the same index name twice, which will result in an error.

This comment was marked as outdated.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have now done some more testing, and benchmarking on a CRDB instance.

The following results have been gathered using EXPLAIN (opt, verbose) on CRDB.

To cover all possible queries in the "ListMessages" method with an index, we would need 4 indexes:

  • CREATE INDEX … (nid, created_at DESC, id) if neither status nor recipient is supplied
  • CREATE INDEX … (nid, status, created_at DESC, id) if only status is supplied
  • CREATE INDEX … (nid, recipient, created_at DESC, id) if only recipient is supplied
  • CREATE INDEX … (nid, status, recipient, created_at DESC, id) if both are supplied

IIUC, this would probably hinder the performance, because it would make writes slower. It is also not scalable, if we want to add a filter for say template_type in the future.

I'd propose to only create CREATE INDEX … (nid, created_at DESC, id) to cover the case where neither status nor recipient is supplied, as that is probably the most common case. Additionally, we can add separate indexes on status and recipient: CREATE INDEX ... (status) (and (recipient) respectively).

I have looked at the query plans for the cases and these are the results:

 filter  cost
 no filter  cost=14.82
 status = 3  cost=18.47
 recipient = 'x@ory.sh'  cost=18.47
 status = 3 AND recipient = 'x@ory.sh'  cost=18.48

(Without the separate indices on status and recipient the cost would be cost=21.37 for the queries filtering for them.)

Another improvement would be to use the STORING clause in the index (CREATE INDEX … (nid, created_at DESC, id) STORING (body, template_data, etc...)), to avoid the index join with the table courier_messages and return the data from the index directly, which would bring the cost for the query with no filters down to 9.37. But this requires us to define the columns in the index definition, which is possible, but IMO out of the scope of this PR. Let me know how you want me to proceed here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And to add, since I didn't it include in the comments yet: Adding recipient != '' to the query and creating the index with CREATE INDEX ... (nid, recipient, status, created_at DESC, id) does have an effect:

filter cost
recipient != '' AND status != -1 cost=17.32
recipient != '' AND status = 3 cost=17.32
recipient = 'x@ory.sh' AND status != -1 cost=18.48
status = 3 AND recipient = 'x@ory.sh' cost=14.87

So in essence, it slows down the query with no filters, and speeds up the others a bit. IMO that's not really what we want, because it's more likely that users want to see just any message, until we have proper filtering, using a search engine. But do let me know what you think!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the expected query pattern here, realistically? The status column has low entropy and, if combined with a search for recipient, a scan through the result set is not a problem. A sequential scan isn't a problem if the set to scan is small. Combined indices beyond these:

  • CREATE INDEX … (nid, created_at DESC, id)
  • CREATE INDEX … (nid, status, created_at DESC, id)
  • CREATE INDEX … (nid, recipient, created_at DESC, id)

probably don't give us anything useful.

I recommend you check if those indices get used in ListMessages (even if more filters are active) and call it a day.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alnr. From my POV, users would initially go to the email delivery view (where we show this data) with no filters active. Then they might filter for Abandoned messages and might search for a specific recipient. So we definitely need an index on the query with no filters, might need an index on the status query, and might need one on the recipient filter query. While possible, a query with both a status and a recipient filter is not that useful, so probably not to be expected.

I have settled on creating CREATE INDEX … (nid, created_at DESC, id) and indices for status (already exists) and recipient (new in this PR).

That way, we get these costs:

 filter  cost
 no filter  cost=14.82
 status = 3  cost=18.47
 recipient = 'x@ory.sh'  cost=18.47
 status = 3 AND recipient = 'x@ory.sh'  cost=18.48

If we additionally create:

  • CREATE INDEX … (nid, status, created_at DESC, id)
  • CREATE INDEX … (nid, recipient, created_at DESC, id)

We get cost=14.82 on the queries that have either of the fields, but not for both, which would be fine, IMO. Question is, what is the overhead for writes, if we have all 3 of these indices active? I'd say we will do many more writes to this table, than this admin query is executed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updating 3 indices shouldn't be a problem. I'd say go for it and revisit if there's a problem.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I added the 2 missing indices to the PR. Should be good to go now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I understand, we always have these query components:

  • nid, created_at

If the user applies filters, they also add recipient, status, or both. Thus, this should handle all query cases:

CREATE INDEX … (nid ASC, created_at ASC, recipient ASC, status ASC)
CREATE INDEX … (nid ASC, created_at ASC, status ASC)

Given that you put recipient first:

	q := p.GetConnection(ctx).Where("nid=?", p.NetworkID(ctx))
	if filter.Recipient != "" {
 		q = q.Where("recipient=?", filter.Recipient)
 	}
	if filter.Status != nil {
		q = q.Where("status=?", *filter.Status)
	}

Please note that created_at DESC has no effect, as we explicitly use ORDER BY created_at DESC, which is not using the index for sorting but instead scans all ranges and orders them. Typically, we sort stuff ASC in indicies

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my testing, I can now say, that we need to put recipient and status first, before created_at in the index, otherwise the DB refuses to use the index in the query. Also see below on desc vs asc in the index.

Please note that created_at DESC has no effect, as we explicitly use ORDER BY created_at DESC, which is not using the index for sorting but instead scans all ranges and orders them. Typically, we sort stuff ASC in indicies

That is not true from my testing on a CRDB, as you can see here:

  • CREATE INDEX ... (nid, created_at); // this index is not used in the query plan and cost = 17.27
  • CREATE INDEX ... (nid, created_at DESC); // index is used in the query plan and cost = 14.82

But indeed, id does not seem to be needed. I'll remove it.

@jonas-jonas jonas-jonas self-assigned this Jan 6, 2023
@jonas-jonas jonas-jonas requested a review from hperl January 6, 2023 13:57
@jonas-jonas
Copy link
Member Author

This PR is now ready for another review. Please have a look at #3002 (comment) where I have written down the analysis of the costs of these indices.


require.NoError(t, err)
require.Len(t, ms, 0)
require.Equal(t, int64(0), tc)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's terrifying.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, luckily this was very unlikely, as there had to have been three preconditions:

  1. two messages with a timestamp collision (microsecond precision)
  2. the first message needed to be a page token
  3. the second message needed to have a UUID that had a higher "string" value (e.g. 1: "00...", 2: "11...")

@codecov
Copy link

codecov bot commented Jan 6, 2023

Codecov Report

Merging #3002 (ca35b45) into master (ca35b45) will not change coverage.
The diff coverage is n/a.

❗ Current head ca35b45 differs from pull request most recent head 31cc37b. Consider uploading reports for the commit 31cc37b to get more accurate results

@@           Coverage Diff           @@
##           master    #3002   +/-   ##
=======================================
  Coverage   77.59%   77.59%           
=======================================
  Files         310      310           
  Lines       19265    19265           
=======================================
  Hits        14948    14948           
  Misses       3180     3180           
  Partials     1137     1137           

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@@ -126,12 +126,60 @@ func TestPersister(ctx context.Context, newNetworkUnlessExisting NetworkWrapper,
assert.Equal(t, messages[len(messages)-1].ID, ms[0].ID)

t.Run("on another network", func(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice test

@aeneasr aeneasr merged commit 3b50711 into master Jan 10, 2023
@aeneasr aeneasr deleted the fix/missingIndexOnCourierList branch January 10, 2023 23:46
CNLHC pushed a commit to seekthought/kratos that referenced this pull request May 16, 2023
peturgeorgievv pushed a commit to senteca/kratos-fork that referenced this pull request Jun 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants