Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enhance: remove redundant indices from submission_defs table #805

Merged

Conversation

sadiqkhoja
Copy link
Contributor

@sadiqkhoja sadiqkhoja commented Mar 21, 2023

Improves submission throughput by 20% without any impact on select queries.

What has been done to verify that this works as intended?

Performed load testing with jmeter.

Used 250_questions form. For every run, I kept only 20K submissions in the database (having 0 existing submission doesn't truly reflect impact of indices).

Parameters:

Threads: 300
Total test duration: 60s
Precise throughput target: 1500 per 10 sec. Release 40 threads in a batch
(I run tests with various parameters to find out what is the maximum throughput that can be achieved with no index on submission_defs table)

Result:
95/sec throughput with 4% failed submissions without this change
115/sec throughput without any failure with this change

Why is this the best possible solution? Were any other approaches considered?

I want this change to be reviewed under the microscope.

Thinking to delete submissions_draft_index, submissions_formid_index and submissions_formid_instanceid_index from submissions table as well

How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?

We need to really make sure that there is no negative impact of this change on submission retrieval (odata, csv, zip)

Does this change require updates to the API documentation? If so, please update docs/api.md as part of this PR.

NA

Before submitting this PR, please make sure you have:

  • run make test-full and confirmed all checks still pass OR confirm CircleCI build passes
  • verified that any code from external sources are properly credited in comments or that everything is internally sourced

@lognaturel
Copy link
Member

Really great to see some throughput numbers and ideas for increasing it.

A couple of high-level thoughts:

  • It's hard to weigh tradeoffs between submission throughput and export speed and it's too bad that they're linked (but of course they are if we use the same store for both). Export performance is important because that's where the row count can get really significant and it's common to request the full data frequently.
  • It might be beneficial to look at some real query plans. I feel like some of these we can reason about but others we might be surprised by. Maybe you have?

Performed load testing with jmeter.

This was on the dev server, right? 2 vCPUs and 2GB RAM

I believe the only read operation with performance requirements is data export. There are different variations:

  • Format: CSV (zipped or not), OData
  • Attachments: submission attachments and client audit export
  • Extras: split select multiple, repeats

I believe we know that these are typically RAM-bound.

make sure that there is no negative impact of this change on submission retrieval (odata, csv, zip)

As far as I can tell, there will have to be some amount of negative impact on the queries for all of these, right? In particular, they all include joins on submissionId and several of them have current=true conditions. I believe createdAt only comes up for submission attachments. What about id, "submissionId", does that come into play when the submissionId is known and the id is needed (so in some of the joins)? That said, all of the conditions are pretty deep in complex queries so I imagine the working set is already quite small when they are evaluated, does that sound right? Maybe that means the indexes don't make much of a difference, if any.

How did you pick these specific indexes to target? Was it only based on the impact on submission processing time or also on some estimate of impact on exports (e.g. something like the high-level reasoning I went through above).

Thinking to delete submissions_draft_index, submissions_formid_index and submissions_formid_instanceid_index from submissions table as well

It would be helpful to get a quick narrative on how you think about these as well.

submissions_formid_index - this one is the one that seems the riskiest to me. It feels like if we have a lot of submissions across a lot of different forms not having this index could slow things down noticeably. We already do have an issue in that area so the tradeoff might not be worth it. This is where looking at some real query plans might be helpful.

@lognaturel
Copy link
Member

lognaturel commented Apr 17, 2023

From @sadiqkhoja: there are still multicolumn indexes that cover the same cases as the removed ones.

From postgres docs:

A multicolumn B-tree index can be used with query conditions that involve any subset of the index's columns, but the index is most efficient when there are constraints on the leading (leftmost) columns. The exact rule is that equality constraints on leading columns, plus any inequality constraints on the first column that does not have an equality constraint, will be used to limit the portion of the index that is scanned.

I did not know this.

I can confirm the following indexes remain on submission_defs:

  • id
  • "submissionId", current
  • "submissionId", "instanceId"

I believe that leaves submission attachments without index support because they use createdAt. Could they use the index instead? Or maybe it's deep enough in the query that it doesn't matter.

I verified with explain that read queries with just submissionId equality are cheap. Queries with just instanceId or current equality are at least half as expensive as linear searches (still pretty expensive). That looks ok to me given my understanding of the export-related queries.

For submissions, with the proposed changes there would remain:

  • id
  • "formId", "instanceId", draft
  • "formId", "createdAt"

All of the removed ones are subsets of those. We only use equality conditions against any of those fields. According to the experimentation I did above and my understanding of the queries on submissions, removing draft, formId, (formId, instanceId) are very likely to be covered by the remaining indexes. In fact, this looks more unambiguously safe than the submission_defs changes.

@sadiqkhoja
Copy link
Contributor Author

I believe that leaves submission attachments without index support because they use createdAt. Could they use the index instead? Or maybe it's deep enough in the query that it doesn't matter.

I don't see createdAt being used in the where clause of submission attachments

@lognaturel
Copy link
Member

I don't see createdAt being used in the where clause of submission attachments

I'm sorry, I meant client audits: https://github.com/getodk/central-backend/blob/master/lib/model/query/client-audits.js#L41

My understanding is that the createdAt index would make a difference for the sorting and that the size of the working set could be fairly large. Do we have a good way to estimate what that impact would be?

@sadiqkhoja
Copy link
Contributor Author

ah! there will be significant impact definitely.

Q: Do we really need to order by createdAt here? Isn't sorting by id sufficient?
A: No, concurrent requests can create submission_defs that have different order of id and createdAt :(

-- in test.getodk
SELECT count(1) FROM 
(
  SELECT id, "createdAt", ROW_NUMBER() OVER (ORDER BY id) rownum FROM submission_defs sd 
) orderbyid 
JOIN (
  SELECT id, "createdAt", ROW_NUMBER() OVER (ORDER BY "createdAt") rownum FROM submission_defs sd 
) orderbycreatedat ON orderbyid.rownum = orderbycreatedat.rownum
WHERE orderbyid.id != orderbycreatedat.id

-- output
-- 19692

I am going to create composite index on (createdAt, id)

improves submission throughput by 20% without any impact on select queries

Benchmark:
250_questions form with 20K existing submissions
JMeter parameters:
300 thread
60 test duration
1500 per 10 sec target throughput
release 40 threads in batch

Result:
95/sec throughput with 4% error without this change
115/sec throughput without any error with this change
add (createdAt, id) index on submission_defs
add (formId, createdAt, id) index on submissions
@sadiqkhoja sadiqkhoja force-pushed the enhance/optimize-indices-sub-defs branch from 596a114 to 1a2eba8 Compare April 19, 2023 16:51
@lognaturel
Copy link
Member

concurrent requests can create submission_defs that have different order of id and createdAt

That's interesting and intersects with discussions we've been having about entity updates through submissions, I think. Does this have to do with postgres transaction rules or something? Like is the id assigned at the beginning of the transaction and createdAt at the end or something like that? Or am I thinking along the wrong lines?

That said, maybe sorting by id still is ok. I don't know that the specific order matters, only that it's relatively stable. My understanding is that order is meaningful only because sometimes humans manually look at the audit export and it's helpful in that case to be able to relate that order to the submission table. The order of the individual log entries within a submission is the most important thing for that. The absolute ideal would be that it be sorted by date/time that the submissions were created on the client but we know that's literally impossible (because clients could be offline and who knows how their clocks are set).

@sadiqkhoja
Copy link
Contributor Author

Does this have to do with postgres transaction rules or something? Like is the id assigned at the beginning of the transaction and createdAt at the end or something like that? Or am I thinking along the wrong lines?

Values for id and createdAt are generated in the same SQL statement. Even then it is possible that both get out of order because same statement can be executed by concurrent transactions independently. DB will call nextval() and clock_timestamp() for each statement without knowledge of each other.

@sadiqkhoja sadiqkhoja marked this pull request as ready for review April 19, 2023 17:49
@lognaturel
Copy link
Member

DB will call nextval() and clock_timestamp() for each statement without knowledge of each other.

Fascinating.

What did you think of my reasoning for submission_defs.createdAt not actually being needed? I'm fine with going with what you have for now but also if you find that reasoning compelling it would be nice to leverage an existing index.

Do users need to run analyze explicitly after indexes change or will they be rebuilt automatically as part of either the migration process or routine vacuum?

@sadiqkhoja
Copy link
Contributor Author

For client audit logs, sorting by just id is fine. But we are sorting submissions by createdAt, id as well and since we don't return id in the response, users may get confused about the order. May be it's not a big deal, you have more experience with the users, I think you should make the call.

We haven't put analyze in the migrations up till now, I guess routine vacuuming should be enough for this case as well.

Copy link
Member

@lognaturel lognaturel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple questions. I'm not confident enough about changing the client audit sort so let's leave that as-is.

@lognaturel
Copy link
Member

I'm satisfied by the answers to my latest questions and the analysis I wrote up above. I think it can be merged without another pair of eyes but also happy for you to loop in another reviewer if you still find this risky, @sadiqkhoja.

@sadiqkhoja sadiqkhoja merged commit a0b8d52 into getodk:master Apr 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: ✅ done
Development

Successfully merging this pull request may close these issues.

2 participants