Problem Performance (very busy database) #2826
This problem occurred repeatedly over Christmas 2023, making it a little less cheerful.
Occurrence, Sept. 2023: https://helpdesk.library.ualberta.ca/otrs/index.pl?Action=AgentTicketZoom;TicketID=67376
Occurrence, Xmas '23 + Jan. '24: https://helpdesk.library.ualberta.ca/otrs/index.pl?Action=AgentTicketZoom;TicketID=68358

In particular, the 24-hour period highlighted in this graph seems to show the PostgreSQL database reading an average of 10 million rows per second for Jupiter. Also troubling: Jupiter regularly averages reading "several" million rows per second, while DMP does not, even though DMP is not generally subject to the abusive level of web requests Jupiter sometimes sees. (This might be an anomaly in the metrics, or it might be something to worry about.) I have developed a dashboard for our PostgreSQL server in Grafana:
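For reference, the rows-per-second panels presumably derive from PostgreSQL's cumulative statistics counters (via whatever exporter feeds Prometheus). A minimal sketch of where those numbers come from; the database names here are hypothetical:

```sql
-- Cumulative tuple counters per database; a rows-per-second panel would be
-- graphing the rate of change of these counters over time.
SELECT datname,
       tup_returned,  -- rows returned to queries (includes sequential scans)
       tup_fetched    -- rows fetched via index scans
FROM pg_stat_database
WHERE datname IN ('jupiter_production', 'dmp_production');  -- hypothetical names
```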
I may have a way to generate this scale of DB traffic in Staging. First off, know that we're unwilling to dedicate a lot of disk space (in Staging) to retaining Prometheus data for more than the default 14 days, so you can't graph backwards in Grafana for longer than a fortnight.
What are the biggest tables in the DB? How big are they? (In Prod)
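A sketch of one way to answer this from psql:

```sql
-- Ten largest tables by total on-disk size (heap + indexes + TOAST),
-- with the planner's approximate live row count.
SELECT relname,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
       n_live_tup                                    AS approx_rows
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;
```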
(Oh oh - feels like there's not enough data.)
This isn't right! How can the graph show tens of millions of row operations when our largest table is only 390k rows?

I think I get it! I enabled database "query-logging" for 2 hours in Staging while Omar was running a PMPY test that took >24 hours and ingested millions of things. The results pointed at a query against the active_storage_attachments table. The database is small enough to fit in memory, so our hardware is making up for our lack of query optimization by returning results quickly from memory, but this doesn't hold up under load! Look at any query that touches the Active Storage tables.
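The exact mechanism used to enable query logging isn't recorded here; one way to toggle it for a short capture window looks roughly like this:

```sql
-- Log every statement (very verbose; keep the capture window short).
ALTER SYSTEM SET log_statement = 'all';
SELECT pg_reload_conf();

-- ...after ~2 hours, turn it back off:
ALTER SYSTEM SET log_statement = 'none';
SELECT pg_reload_conf();
```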
See also TicketID=68458: Production was "attacked" again this weekend. I turned on slow-query logging in Production, capturing any SQL query that takes longer than 1 second to complete. This captures just the ~1% outliers, as the vast majority of these queries complete in < 250 ms.
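The exact setting used isn't recorded either, but a 1-second slow-query threshold is typically configured like this:

```sql
-- Log only statements that take longer than 1 second to complete.
ALTER SYSTEM SET log_min_duration_statement = '1s';
SELECT pg_reload_conf();
```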
This mirrors what I saw in Staging: that query is, again, a lookup against the Active Storage tables.
Looking at the structure of the active_storage_attachments table:
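The attachment showing the table structure isn't preserved here; assuming the table in question is active_storage_attachments, its columns and existing indexes can be inspected with something like:

```sql
-- Column definitions
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'active_storage_attachments';

-- Existing indexes: is there one usable for lookups by record_id?
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'active_storage_attachments';
```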
A quick, unsystematic test: in a local dev instance, I bulk-ingested ~90k items and timed the query without and with an index on active_storage_attachments.record_id. Results:
Next steps: passing this to others, as my sprint is coming to an end.

Baseline, without an index:
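The actual query and its timings aren't preserved here; the baseline measurement would have looked something like this (the WHERE clause and the id value are placeholders, not the real query):

```sql
-- Without a usable index, expect a sequential scan over the whole table.
EXPLAIN ANALYZE
SELECT *
FROM active_storage_attachments
WHERE record_id = '00000000-0000-0000-0000-000000000000';  -- placeholder value
```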
Create an index:
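Something along these lines (the index name is mine):

```sql
CREATE INDEX index_active_storage_attachments_on_record_id
    ON active_storage_attachments (record_id);
```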
Let's drop the index, to check whether the first (un-indexed) query was slowed by reading the table from disk into memory and the second (indexed) query merely benefited from a warm cache. The result: the query time was similar to the first run without the index, so the speedup comes from the index rather than from caching.
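For completeness, the drop-and-retime step is just:

```sql
DROP INDEX index_active_storage_attachments_on_record_id;
-- ...then re-run the same EXPLAIN ANALYZE query and compare timings.
```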
* add index on active_storage_attachments.record_id
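If the index were added directly in Production (rather than via a normal migration during a quiet window), the non-blocking form might be preferable. A sketch, not necessarily how the eventual fix was shipped:

```sql
-- Builds the index without blocking concurrent writes to the table.
-- Note: CREATE INDEX CONCURRENTLY cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY index_active_storage_attachments_on_record_id
    ON active_storage_attachments (record_id);
```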
We suffered another event late last week, which drove ERA's query response times up to over 8 seconds, so the operational impact was pretty serious. See TicketID=69033 for details.
Pretty fun graphs! 15-minute load averages > 10 per CPU, sustained:
Improve performance of Active Storage lookups #2826
This issue was resolved by the efforts of @ConnorSheremeta & @jefferya, culminating in the 2.9.1 release. It was installed in Production on May 27, 2024, and has eliminated the operational issues with slow response times in Production when the query rate climbs above 6 Hz.
Describe the bug
Neil observed a very busy database via Jupiter starting April 17th; it is intermittent and ongoing. Skylight shows the same trend (https://www.skylight.io/app/applications/Q2pLKz0AwRat/recent/1d/endpoints), which gives us some things to look into on our end. The other thing that might be relevant: Beau and I deployed Jupiter 2.3.5 on April 11th, with the PostgreSQL client bumped from 9 to 12. This is coincident with a user depositing their journal articles (>100) one at a time.