enhance: remove redundant indices from submission_defs table #805

sadiqkhoja · 2023-03-21T18:57:42Z

Improves submission throughput by 20% without any impact on select queries.

What has been done to verify that this works as intended?

Performed load testing with jmeter.

Used 250_questions form. For every run, I kept only 20K submissions in the database (having 0 existing submission doesn't truly reflect impact of indices).

Parameters:

Threads: 300
Total test duration: 60s
Precise throughput target: 1500 per 10 sec. Release 40 threads in a batch
(I run tests with various parameters to find out what is the maximum throughput that can be achieved with no index on submission_defs table)

Result:
95/sec throughput with 4% failed submissions without this change
115/sec throughput without any failure with this change

Why is this the best possible solution? Were any other approaches considered?

I want this change to be reviewed under the microscope.

Thinking to delete submissions_draft_index, submissions_formid_index and submissions_formid_instanceid_index from submissions table as well

How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?

We need to really make sure that there is no negative impact of this change on submission retrieval (odata, csv, zip)

Does this change require updates to the API documentation? If so, please update docs/api.md as part of this PR.

NA

Before submitting this PR, please make sure you have:

run make test-full and confirmed all checks still pass OR confirm CircleCI build passes
verified that any code from external sources are properly credited in comments or that everything is internally sourced

lognaturel · 2023-04-17T17:41:08Z

Really great to see some throughput numbers and ideas for increasing it.

A couple of high-level thoughts:

It's hard to weigh tradeoffs between submission throughput and export speed and it's too bad that they're linked (but of course they are if we use the same store for both). Export performance is important because that's where the row count can get really significant and it's common to request the full data frequently.
It might be beneficial to look at some real query plans. I feel like some of these we can reason about but others we might be surprised by. Maybe you have?

Performed load testing with jmeter.

This was on the dev server, right? 2 vCPUs and 2GB RAM

I believe the only read operation with performance requirements is data export. There are different variations:

Format: CSV (zipped or not), OData
Attachments: submission attachments and client audit export
Extras: split select multiple, repeats

I believe we know that these are typically RAM-bound.

make sure that there is no negative impact of this change on submission retrieval (odata, csv, zip)

As far as I can tell, there will have to be some amount of negative impact on the queries for all of these, right? In particular, they all include joins on submissionId and several of them have current=true conditions. I believe createdAt only comes up for submission attachments. What about id, "submissionId", does that come into play when the submissionId is known and the id is needed (so in some of the joins)? That said, all of the conditions are pretty deep in complex queries so I imagine the working set is already quite small when they are evaluated, does that sound right? Maybe that means the indexes don't make much of a difference, if any.

How did you pick these specific indexes to target? Was it only based on the impact on submission processing time or also on some estimate of impact on exports (e.g. something like the high-level reasoning I went through above).

Thinking to delete submissions_draft_index, submissions_formid_index and submissions_formid_instanceid_index from submissions table as well

It would be helpful to get a quick narrative on how you think about these as well.

submissions_formid_index - this one is the one that seems the riskiest to me. It feels like if we have a lot of submissions across a lot of different forms not having this index could slow things down noticeably. We already do have an issue in that area so the tradeoff might not be worth it. This is where looking at some real query plans might be helpful.

lognaturel · 2023-04-17T23:19:30Z

From @sadiqkhoja: there are still multicolumn indexes that cover the same cases as the removed ones.

From postgres docs:

A multicolumn B-tree index can be used with query conditions that involve any subset of the index's columns, but the index is most efficient when there are constraints on the leading (leftmost) columns. The exact rule is that equality constraints on leading columns, plus any inequality constraints on the first column that does not have an equality constraint, will be used to limit the portion of the index that is scanned.

I did not know this.

I can confirm the following indexes remain on submission_defs:

id
"submissionId", current
"submissionId", "instanceId"

I believe that leaves submission attachments without index support because they use createdAt. Could they use the index instead? Or maybe it's deep enough in the query that it doesn't matter.

I verified with explain that read queries with just submissionId equality are cheap. Queries with just instanceId or current equality are at least half as expensive as linear searches (still pretty expensive). That looks ok to me given my understanding of the export-related queries.

For submissions, with the proposed changes there would remain:

id
"formId", "instanceId", draft
"formId", "createdAt"

All of the removed ones are subsets of those. We only use equality conditions against any of those fields. According to the experimentation I did above and my understanding of the queries on submissions, removing draft, formId, (formId, instanceId) are very likely to be covered by the remaining indexes. In fact, this looks more unambiguously safe than the submission_defs changes.

sadiqkhoja · 2023-04-18T16:03:02Z

I believe that leaves submission attachments without index support because they use createdAt. Could they use the index instead? Or maybe it's deep enough in the query that it doesn't matter.

I don't see createdAt being used in the where clause of submission attachments

lognaturel · 2023-04-18T21:57:11Z

I don't see createdAt being used in the where clause of submission attachments

I'm sorry, I meant client audits: https://github.com/getodk/central-backend/blob/master/lib/model/query/client-audits.js#L41

My understanding is that the createdAt index would make a difference for the sorting and that the size of the working set could be fairly large. Do we have a good way to estimate what that impact would be?

sadiqkhoja · 2023-04-19T15:45:49Z

ah! there will be significant impact definitely.

Q: Do we really need to order by createdAt here? Isn't sorting by id sufficient?
A: No, concurrent requests can create submission_defs that have different order of id and createdAt :(

-- in test.getodk
SELECT count(1) FROM 
(
  SELECT id, "createdAt", ROW_NUMBER() OVER (ORDER BY id) rownum FROM submission_defs sd 
) orderbyid 
JOIN (
  SELECT id, "createdAt", ROW_NUMBER() OVER (ORDER BY "createdAt") rownum FROM submission_defs sd 
) orderbycreatedat ON orderbyid.rownum = orderbycreatedat.rownum
WHERE orderbyid.id != orderbycreatedat.id

-- output
-- 19692

I am going to create composite index on (createdAt, id)

improves submission throughput by 20% without any impact on select queries Benchmark: 250_questions form with 20K existing submissions JMeter parameters: 300 thread 60 test duration 1500 per 10 sec target throughput release 40 threads in batch Result: 95/sec throughput with 4% error without this change 115/sec throughput without any error with this change

add (createdAt, id) index on submission_defs add (formId, createdAt, id) index on submissions

lognaturel · 2023-04-19T16:57:26Z

concurrent requests can create submission_defs that have different order of id and createdAt

That's interesting and intersects with discussions we've been having about entity updates through submissions, I think. Does this have to do with postgres transaction rules or something? Like is the id assigned at the beginning of the transaction and createdAt at the end or something like that? Or am I thinking along the wrong lines?

That said, maybe sorting by id still is ok. I don't know that the specific order matters, only that it's relatively stable. My understanding is that order is meaningful only because sometimes humans manually look at the audit export and it's helpful in that case to be able to relate that order to the submission table. The order of the individual log entries within a submission is the most important thing for that. The absolute ideal would be that it be sorted by date/time that the submissions were created on the client but we know that's literally impossible (because clients could be offline and who knows how their clocks are set).

sadiqkhoja · 2023-04-19T17:49:11Z

Does this have to do with postgres transaction rules or something? Like is the id assigned at the beginning of the transaction and createdAt at the end or something like that? Or am I thinking along the wrong lines?

Values for id and createdAt are generated in the same SQL statement. Even then it is possible that both get out of order because same statement can be executed by concurrent transactions independently. DB will call nextval() and clock_timestamp() for each statement without knowledge of each other.

lognaturel · 2023-04-19T21:16:47Z

DB will call nextval() and clock_timestamp() for each statement without knowledge of each other.

Fascinating.

What did you think of my reasoning for submission_defs.createdAt not actually being needed? I'm fine with going with what you have for now but also if you find that reasoning compelling it would be nice to leverage an existing index.

Do users need to run analyze explicitly after indexes change or will they be rebuilt automatically as part of either the migration process or routine vacuum?

sadiqkhoja · 2023-04-19T21:44:54Z

For client audit logs, sorting by just id is fine. But we are sorting submissions by createdAt, id as well and since we don't return id in the response, users may get confused about the order. May be it's not a big deal, you have more experience with the users, I think you should make the call.

We haven't put analyze in the migrations up till now, I guess routine vacuuming should be enough for this case as well.

lognaturel

Couple questions. I'm not confident enough about changing the client audit sort so let's leave that as-is.

lib/model/migrations/20230419-01-optimize-indices-sub-defs.js

lognaturel · 2023-04-20T18:36:29Z

I'm satisfied by the answers to my latest questions and the analysis I wrote up above. I think it can be merged without another pair of eyes but also happy for you to loop in another reviewer if you still find this risky, @sadiqkhoja.

matthew-white assigned sadiqkhoja Apr 2, 2023

sadiqkhoja added 2 commits April 19, 2023 12:51

remove redundant indices from submissions table

1a2eba8

add (createdAt, id) index on submission_defs add (formId, createdAt, id) index on submissions

sadiqkhoja force-pushed the enhance/optimize-indices-sub-defs branch from 596a114 to 1a2eba8 Compare April 19, 2023 16:51

rename migration file to current date

33da5fc

sadiqkhoja marked this pull request as ready for review April 19, 2023 17:49

lognaturel reviewed Apr 19, 2023

View reviewed changes

lib/model/migrations/20230419-01-optimize-indices-sub-defs.js Show resolved Hide resolved

lib/model/migrations/20230419-01-optimize-indices-sub-defs.js Show resolved Hide resolved

lognaturel approved these changes Apr 20, 2023

View reviewed changes

sadiqkhoja merged commit a0b8d52 into getodk:master Apr 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

enhance: remove redundant indices from submission_defs table #805

enhance: remove redundant indices from submission_defs table #805

sadiqkhoja commented Mar 21, 2023 •

edited

Loading

lognaturel commented Apr 17, 2023

lognaturel commented Apr 17, 2023 •

edited

Loading

sadiqkhoja commented Apr 18, 2023

lognaturel commented Apr 18, 2023

sadiqkhoja commented Apr 19, 2023

lognaturel commented Apr 19, 2023

sadiqkhoja commented Apr 19, 2023

lognaturel commented Apr 19, 2023

sadiqkhoja commented Apr 19, 2023

lognaturel left a comment

lognaturel commented Apr 20, 2023

enhance: remove redundant indices from submission_defs table #805

enhance: remove redundant indices from submission_defs table #805

Conversation

sadiqkhoja commented Mar 21, 2023 • edited Loading

What has been done to verify that this works as intended?

Why is this the best possible solution? Were any other approaches considered?

How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?

Does this change require updates to the API documentation? If so, please update docs/api.md as part of this PR.

Before submitting this PR, please make sure you have:

lognaturel commented Apr 17, 2023

lognaturel commented Apr 17, 2023 • edited Loading

sadiqkhoja commented Apr 18, 2023

lognaturel commented Apr 18, 2023

sadiqkhoja commented Apr 19, 2023

lognaturel commented Apr 19, 2023

sadiqkhoja commented Apr 19, 2023

lognaturel commented Apr 19, 2023

sadiqkhoja commented Apr 19, 2023

lognaturel left a comment

Choose a reason for hiding this comment

lognaturel commented Apr 20, 2023

sadiqkhoja commented Mar 21, 2023 •

edited

Loading

lognaturel commented Apr 17, 2023 •

edited

Loading