-
Notifications
You must be signed in to change notification settings - Fork 969
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore: fix missing index on courier list count #3002
Conversation
@@ -0,0 +1,3 @@ | |||
CREATE INDEX courier_messages_status_idx ON courier_messages (status); | |||
|
|||
CREATE INDEX courier_messages_status_idx ON courier_messages (recipient); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indices are a way to tell the database which queries it should expect, so that it can optimize the lookup for these. That's why indices should always include all columns for a specific query. Optimally, indices are built in such a way that they can be re-used elsewhere.
- https://www.cockroachlabs.com/docs/v22.2/schema-design-indexes
- https://www.cockroachlabs.com/docs/v22.2/create-index
- https://www.cockroachlabs.com/docs/stable/indexes.html
- https://www.cockroachlabs.com/docs/v22.2/schema-design-indexes#best-practices
- https://www.cockroachlabs.com/docs/v22.2/performance-best-practices-overview#unique-id-best-practices
Queries can benefit from an index even if they only filter a prefix of its columns. For example, if you create an index of columns (A, B, C), queries filtering (A) or (A, B) can use the index, so you don't need to also index (A).
The more variety a column has, the better it is in the index, because it avoids hot spots. So always put the column with the highest "entropy" first:
CREATE INDEX courier_messages_status_idx ON courier_messages (recipient); | |
CREATE INDEX courier_messages_recipient_status_idx ON courier_messages (nid, recipient, status); |
In the example below, if recipient is not required, you should be able to change the query to
if filter.Recipient != "" {
q = q.Where("recipient=?", filter.Recipient)
} else {
q = q.Where("recipient != ?", "")
}
so that you don't need to add another index (please verify this first though).
Furthermore, you're creating the same index name twice, which will result in an error.
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have now done some more testing, and benchmarking on a CRDB instance.
The following results have been gathered using EXPLAIN (opt, verbose)
on CRDB.
To cover all possible queries in the "ListMessages" method with an index, we would need 4 indexes:
CREATE INDEX … (nid, created_at DESC, id)
if neitherstatus
norrecipient
is suppliedCREATE INDEX … (nid, status, created_at DESC, id)
if onlystatus
is suppliedCREATE INDEX … (nid, recipient, created_at DESC, id)
if onlyrecipient
is suppliedCREATE INDEX … (nid, status, recipient, created_at DESC, id)
if both are supplied
IIUC, this would probably hinder the performance, because it would make writes slower. It is also not scalable, if we want to add a filter for say template_type
in the future.
I'd propose to only create CREATE INDEX … (nid, created_at DESC, id)
to cover the case where neither status
nor recipient
is supplied, as that is probably the most common case. Additionally, we can add separate indexes on status
and recipient
: CREATE INDEX ... (status)
(and (recipient)
respectively).
I have looked at the query plans for the cases and these are the results:
filter | cost |
---|---|
no filter |
cost=14.82 |
status = 3 |
cost=18.47 |
recipient = 'x@ory.sh' |
cost=18.47 |
status = 3 AND recipient = 'x@ory.sh' |
cost=18.48 |
(Without the separate indices on status
and recipient
the cost would be cost=21.37
for the queries filtering for them.)
Another improvement would be to use the STORING
clause in the index (CREATE INDEX … (nid, created_at DESC, id) STORING (body, template_data, etc...)
), to avoid the index join with the table courier_messages
and return the data from the index directly, which would bring the cost for the query with no filters down to 9.37
. But this requires us to define the columns in the index definition, which is possible, but IMO out of the scope of this PR. Let me know how you want me to proceed here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And to add, since I didn't it include in the comments yet: Adding recipient != ''
to the query and creating the index with CREATE INDEX ... (nid, recipient, status, created_at DESC, id)
does have an effect:
filter | cost |
---|---|
recipient != '' AND status != -1 |
cost=17.32 |
recipient != '' AND status = 3 |
cost=17.32 |
recipient = 'x@ory.sh' AND status != -1 |
cost=18.48 |
status = 3 AND recipient = 'x@ory.sh' |
cost=14.87 |
So in essence, it slows down the query with no filters, and speeds up the others a bit. IMO that's not really what we want, because it's more likely that users want to see just any message, until we have proper filtering, using a search engine. But do let me know what you think!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the expected query pattern here, realistically? The status
column has low entropy and, if combined with a search for recipient
, a scan through the result set is not a problem. A sequential scan isn't a problem if the set to scan is small. Combined indices beyond these:
CREATE INDEX … (nid, created_at DESC, id)
CREATE INDEX … (nid, status, created_at DESC, id)
CREATE INDEX … (nid, recipient, created_at DESC, id)
probably don't give us anything useful.
I recommend you check if those indices get used in ListMessages
(even if more filters are active) and call it a day.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @alnr. From my POV, users would initially go to the email delivery view (where we show this data) with no filters active. Then they might filter for Abandoned
messages and might search for a specific recipient. So we definitely need an index on the query with no filters, might need an index on the status
query, and might need one on the recipient
filter query. While possible, a query with both a status
and a recipient
filter is not that useful, so probably not to be expected.
I have settled on creating CREATE INDEX … (nid, created_at DESC, id)
and indices for status
(already exists) and recipient
(new in this PR).
That way, we get these costs:
filter | cost |
---|---|
no filter |
cost=14.82 |
status = 3 |
cost=18.47 |
recipient = 'x@ory.sh' |
cost=18.47 |
status = 3 AND recipient = 'x@ory.sh' |
cost=18.48 |
If we additionally create:
CREATE INDEX … (nid, status, created_at DESC, id)
CREATE INDEX … (nid, recipient, created_at DESC, id)
We get cost=14.82
on the queries that have either of the fields, but not for both, which would be fine, IMO. Question is, what is the overhead for writes, if we have all 3 of these indices active? I'd say we will do many more writes to this table, than this admin query is executed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updating 3 indices shouldn't be a problem. I'd say go for it and revisit if there's a problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, I added the 2 missing indices to the PR. Should be good to go now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I understand, we always have these query components:
- nid, created_at
If the user applies filters, they also add recipient
, status
, or both. Thus, this should handle all query cases:
CREATE INDEX … (nid ASC, created_at ASC, recipient ASC, status ASC)
CREATE INDEX … (nid ASC, created_at ASC, status ASC)
Given that you put recipient first:
q := p.GetConnection(ctx).Where("nid=?", p.NetworkID(ctx))
if filter.Recipient != "" {
q = q.Where("recipient=?", filter.Recipient)
}
if filter.Status != nil {
q = q.Where("status=?", *filter.Status)
}
Please note that created_at DESC
has no effect, as we explicitly use ORDER BY created_at DESC
, which is not using the index for sorting but instead scans all ranges and orders them. Typically, we sort stuff ASC in indicies
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my testing, I can now say, that we need to put recipient
and status
first, before created_at
in the index, otherwise the DB refuses to use the index in the query. Also see below on desc
vs asc
in the index.
Please note that created_at DESC has no effect, as we explicitly use ORDER BY created_at DESC, which is not using the index for sorting but instead scans all ranges and orders them. Typically, we sort stuff ASC in indicies
That is not true from my testing on a CRDB, as you can see here:
CREATE INDEX ... (nid, created_at); // this index is not used in the query plan and cost = 17.27
CREATE INDEX ... (nid, created_at DESC); // index is used in the query plan and cost = 14.82
But indeed, id
does not seem to be needed. I'll remove it.
persistence/sql/migrations/sql/20230104193739000000_courier_list_index.up.sql
Outdated
Show resolved
Hide resolved
This PR is now ready for another review. Please have a look at #3002 (comment) where I have written down the analysis of the costs of these indices. |
|
||
require.NoError(t, err) | ||
require.Len(t, ms, 0) | ||
require.Equal(t, int64(0), tc) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what a failure of the test case looks like: https://github.com/ory/kratos/actions/runs/3856498251/jobs/6572792385#step:15:3563
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's terrifying.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, luckily this was very unlikely, as there had to have been three preconditions:
- two messages with a timestamp collision (microsecond precision)
- the first message needed to be a page token
- the second message needed to have a UUID that had a higher "string" value (e.g. 1: "00...", 2: "11...")
Codecov Report
@@ Coverage Diff @@
## master #3002 +/- ##
=======================================
Coverage 77.59% 77.59%
=======================================
Files 310 310
Lines 19265 19265
=======================================
Hits 14948 14948
Misses 3180 3180
Partials 1137 1137 Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
persistence/sql/migrations/sql/20230104193739000000_courier_list_index.up.sql
Outdated
Show resolved
Hide resolved
@@ -126,12 +126,60 @@ func TestPersister(ctx context.Context, newNetworkUnlessExisting NetworkWrapper, | |||
assert.Equal(t, messages[len(messages)-1].ID, ms[0].ID) | |||
|
|||
t.Run("on another network", func(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice test
Changes the order in which the count of all
courier_message
s is fetched from the database, to avoid scoping in the query and thus doing full table scans. Also adds missing indexes for thestatus
andrecipient
filter.Related issue(s)
Checklist
introduces a new feature.
contributing code guidelines.
vulnerability. If this pull request addresses a security vulnerability, I
confirm that I got the approval (please contact
security@ory.sh) from the maintainers to push
the changes.
works.
Further Comments