
reindex process slows after some hours when Postgres vacuuming falls behind #1822

Closed
punktilious opened this issue Dec 14, 2020 · 6 comments
Labels: bug (Something isn't working) · P1 (Priority 1 - Must Have) · schema-change (a schema change)


@punktilious (Collaborator) commented Dec 14, 2020

After some time, queries against LOGICAL_RESOURCES slow down. After a couple of hours, the reindex status update on this table ends up dominating the overall processing time. This turns out to be a byproduct of the sheer number of updates against this table and the way Postgres handles them: each update leaves behind a dead tuple which must be cleaned up by the vacuum process. Vacuuming is supposed to run automatically in the background, but it looks like autovacuum can't keep up with the update rate.
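One way to confirm this (a diagnostic sketch using the standard pg_stat_user_tables view and the same schema/table names as below) is to compare dead and live tuple counts and the last autovacuum time:

-- dead-tuple buildup and vacuum history for the affected table
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
  FROM pg_stat_user_tables
 WHERE schemaname = 'fhirdata' AND relname = 'logical_resources';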

After stopping the reindex and running a manual vacuum, the performance recovers (from 30s per call back to 10s):

VACUUM (ANALYZE,VERBOSE) fhirdata.logical_resources;

After restarting reindex, the time slowly creeps up again, so this has to be repeated.

Investigate other solutions which avoid the need to update an indexed column. One possible solution is to copy all the rows that need to be processed into a working-queue table and delete each row from that table as it is processed. This table probably does not even need an index, because processing can happen in row order (somewhat arbitrary, but that is OK); a sketch of this idea is below.
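A minimal sketch of that working-queue idea (the table and column names here are illustrative only, not the actual schema):

-- hypothetical work queue, populated once and then drained as resources are processed
CREATE TABLE fhirdata.reindex_queue (logical_resource_id BIGINT NOT NULL);

INSERT INTO fhirdata.reindex_queue (logical_resource_id)
SELECT logical_resource_id FROM fhirdata.logical_resources;

-- each worker claims a batch in (arbitrary) row order and deletes what it has finished,
-- so LOGICAL_RESOURCES itself never needs a bookkeeping update
DELETE FROM fhirdata.reindex_queue
 WHERE logical_resource_id IN (
       SELECT logical_resource_id FROM fhirdata.reindex_queue
        LIMIT 1000 FOR UPDATE SKIP LOCKED)
RETURNING logical_resource_id;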

@punktilious punktilious added the bug Something isn't working label Dec 14, 2020
@punktilious punktilious changed the title reindex reindex process slows after some hours when Postgres vacuuming falls behind Dec 14, 2020
@lmsurpre lmsurpre added the P2 Priority 2 - Should Have label Feb 22, 2021
@lmsurpre (Member)

Options:

  • make vacuuming more aggressive
  • manual (code-driven vs. db-driven) vacuuming
  • other?

Needs investigation as to which is best.

@lmsurpre lmsurpre added this to the Sprint 2021-08 milestone Jun 1, 2021
@lmsurpre (Member)

The LOGICAL_RESOURCES table will still be updated as part of reindex after #2524, but we don't expect it to cause the same contention we were seeing from updating the tstamp.

@lmsurpre (Member)

After trying to reindex a large Postgres db using the new client-driven reindex behavior, I don't think this one is fully addressed.

It ran for over a day, but after some amount of time we ran out of disk space. I don't know whether that was vacuuming-related or just that the space was needed for the new tables/indices in 4.9.0.

However, after increasing disk capacity, the client-driven reindex repeatedly failed with errors like:

2021-07-11 22:37:13.879 fhir-test-server-86b656757d-9drdx fhir-test-server SEVERE inserting parameters
org.postgresql.util.PSQLException: ERROR: multixact "members" limit exceeded
  Detail: This command would create a multixact with 2 members, but the remaining space is only enough for 1 member.
  Hint: Execute a database-wide VACUUM in database with OID 16478 with reduced vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age settings.
  Where: SQL statement "SELECT 1 FROM ONLY "fhirdata"."parameter_names" x WHERE "parameter_name_id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"

I found the following query on the PostgreSQL forums for finding the tables with the oldest multixact IDs:

SELECT oid::regclass, relminmxid, mxid_age(relminmxid) FROM pg_class WHERE relminmxid <> '0' ORDER BY mxid_age(relminmxid) DESC;

Our observation_resource_token_refs table was at the top of this list, so I performed a manual vacuum of it:

vacuum fhirdata.observation_resource_token_refs;

The next morning I noticed that the reindex response times were much faster.

@lmsurpre lmsurpre removed this from the Sprint 2021-09 milestone Jul 12, 2021
@lmsurpre lmsurpre added this to the Sprint 2021-10 milestone Jul 12, 2021
@lmsurpre lmsurpre added P1 Priority 1 - Must Have and removed P2 Priority 2 - Should Have labels Jul 12, 2021
@lmsurpre (Member) commented Jul 15, 2021

Discussed with Robin and we think that we can further mitigate this issue by setting the properties mentioned at https://ibm.github.io/FHIR/guides/FHIRPerformanceGuide#412-tuning-auto-vacuum on the PostgreSQL tables that see a lot of updates/deletes. Specifically:

  • logical_resources
  • all search parameter tables

So basically everything other than the xx_RESOURCES and COMMON_TOKEN_VALUE tables :-)
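A hedged sketch of what those per-table settings could look like (the values here are illustrative; the actual numbers should come from the performance guide linked above):

-- make autovacuum trigger sooner and work harder on heavily updated tables
ALTER TABLE fhirdata.logical_resources SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_vacuum_threshold    = 1000,
    autovacuum_vacuum_cost_limit   = 2000);

-- repeat for each search parameter table, e.g.
ALTER TABLE fhirdata.observation_resource_token_refs SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_vacuum_threshold    = 1000,
    autovacuum_vacuum_cost_limit   = 2000);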

@prb112 (Contributor) commented Jul 20, 2021

PR #2628

@d0roppe (Collaborator) commented Jul 28, 2021

Ran the latest reindex of 79 million resources and it finished without error in 39 hours. This was with the latest fixes for the vacuum settings. I am closing this issue based on this last run.

@d0roppe d0roppe closed this as completed Jul 28, 2021
@lmsurpre lmsurpre added the schema-change a schema change label Aug 4, 2021