Use the read-only replica for some queries in production #2073
Conversation
☔ The latest upstream changes (presumably #2032) made this pull request unmergeable. Please resolve the merge conflicts.

Force-pushed from 64cddc5 to 1179642
@sgrif If you don't have time to review the PR itself I can reassign, but I'd love to hear your opinion on this change from an architectural standpoint.

This PR focuses on endpoints with a lot of traffic or that are known to be expensive. We could switch more endpoints over time, but even long-term I think we would avoid converting any endpoints related to authentication. My biggest concern is that if the follower database falls behind, we don't want to block a user from logging in (to a newly created account) or from using a newly added API token. (Most endpoints using authentication will need write access for other reasons anyway.) For the endpoints listed above, I don't think there are any issues if we temporarily serve some stale data.

Open questions/concerns:
This should never happen to any extent that would be significant (e.g. more than a fraction of a second).
I don't think we'll be able to build anything for this better than what already exists in PG and Heroku's infra. If we're really concerned about this (I'm not personally), I think your previous point of restricting this only to endpoints which can see stale data acceptably is fine.
Probably, but I think it's fine to try it. We should set some specific performance goals for this change, and if we don't see movement in that direction we can revert it. I think we can probably do this only for search and that would see the largest performance improvement. Have we confirmed that rebuilds of the materialized views replicate correctly?
☔ The latest upstream changes (presumably #2077) made this pull request unmergeable. Please resolve the merge conflicts.

Force-pushed from 1179642 to c503489
I agree with this in principle, but it may be a bit difficult to evaluate in practice. The metrics dashboard is swamped with traffic and response times for crate downloads, so for the endpoints of interest we'll have to extract the response time data from our logs. To throw a concrete number out there, I'd like to see a 10% decrease in response time (for some of the 50th, 95th, and 99th percentiles). I'd also like to see if this change reduces the frequency or size of the momentary spikes in slow download requests.
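For reference, once the per-endpoint response times are pulled out of the logs, the percentile calculation itself is simple. Here is a rough nearest-rank sketch (the sample numbers are made up and this is not code from this PR) that could be run over the samples from the week before and the week after the deploy:

```rust
// Nearest-rank percentile over a set of response times, in milliseconds.
// The sample values below are invented purely for illustration.
fn percentile(sorted_ms: &[u64], pct: f64) -> u64 {
    assert!(!sorted_ms.is_empty() && pct > 0.0 && pct <= 100.0);
    // Nearest-rank: take the ceil(pct/100 * N)-th smallest sample.
    let rank = ((pct / 100.0) * sorted_ms.len() as f64).ceil() as usize;
    sorted_ms[rank - 1]
}

fn main() {
    let mut samples_ms = vec![120, 135, 128, 650, 142, 131, 139, 480, 126, 133];
    samples_ms.sort_unstable();
    for &p in &[50.0, 95.0, 99.0] {
        println!("p{:.0}: {}ms", p, percentile(&samples_ms, p));
    }
}
```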
I haven't done so, but will before enabling in production. For now, I'm going to push this to staging and configure the new environment variable to point to the existing staging database. This will ensure that the "secondary" connection pool is enabled and configured as read-only in staging, even though it is accessing the same read/write database.
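To make the "configured as read-only" part concrete: one way this can be wired up with diesel's r2d2 support (a sketch, not necessarily the exact code in this PR) is a connection customizer that turns on `default_transaction_read_only` for every connection handed out by the secondary pool:

```rust
use diesel::pg::PgConnection;
use diesel::prelude::*;
use diesel::r2d2::{ConnectionManager, CustomizeConnection, Pool};

#[derive(Debug)]
struct ReadOnlyConnection;

impl CustomizeConnection<PgConnection, diesel::r2d2::Error> for ReadOnlyConnection {
    fn on_acquire(&self, conn: &mut PgConnection) -> Result<(), diesel::r2d2::Error> {
        // After this, any write attempted on the connection fails with
        // "cannot execute ... in a read-only transaction".
        diesel::sql_query("SET default_transaction_read_only = 'on'")
            .execute(conn)
            .map_err(diesel::r2d2::Error::QueryError)?;
        Ok(())
    }
}

fn read_only_pool(url: &str) -> Pool<ConnectionManager<PgConnection>> {
    Pool::builder()
        .connection_customizer(Box::new(ReadOnlyConnection))
        .build(ConnectionManager::<PgConnection>::new(url))
        .expect("failed to build read-only pool")
}
```

With something like that in place, accidental writes through the secondary pool fail loudly in staging even though the pool points at the same read/write database.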
I've compared I plan to deploy to production in the next hour or so. (I recently deployed master, so we can quickly roll this back if needed.) @bors: r=sgrif
📌 Commit c503489 has been approved by
Use the read-only replica for some queries in production

Now that we have a read-only replica of the database, we can offload some read-only queries to it. This commit series updates the `Config` and `App` structs to support a primary database and an optional read-only replica.

If the `READ_ONLY_REPLICA_URL` environment variable is set, a database pool of read-only connections will be initialized. If the variable is not set, converted endpoints will fall back to the primary database pool (and are given read/write connections).

Tests do not configure a read-only replica and each test still operates within a single transaction. Potential over-use of the read-only pool (in an endpoint that needs write access) can be detected in staging, by configuring `READ_ONLY_REPLICA_URL` to the same value as `DATABASE_URL`.

Current set of endpoints:

* `/crates/:crate_id/downloads` and `/crates/:crate_id/:version/downloads`
* Search endpoint
* Summary endpoint for front page
* `/crates/:crate_id`
* `/crates/:crate_id/reverse_dependencies`
* `/crates/:crate_id/versions`

r? @sgrif
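As a rough illustration of the shape described above (a minimal sketch assuming diesel's r2d2 pooling; the names are made up and this is not the actual crates.io code), the primary pool plus the optional replica pool with fallback could look like this:

```rust
use diesel::pg::PgConnection;
use diesel::r2d2::{ConnectionManager, Pool};

type PgPool = Pool<ConnectionManager<PgConnection>>;

pub struct App {
    primary_database: PgPool,
    /// `None` unless READ_ONLY_REPLICA_URL is configured.
    read_only_replica_database: Option<PgPool>,
}

impl App {
    pub fn new() -> App {
        let primary_url =
            std::env::var("DATABASE_URL").expect("DATABASE_URL must be set");
        let primary_database = Pool::builder()
            .build(ConnectionManager::<PgConnection>::new(primary_url))
            .expect("failed to build primary pool");

        // Only build the secondary pool when a replica URL is present.
        let read_only_replica_database =
            std::env::var("READ_ONLY_REPLICA_URL").ok().map(|url| {
                Pool::builder()
                    .build(ConnectionManager::<PgConnection>::new(url))
                    .expect("failed to build read-only replica pool")
            });

        App {
            primary_database,
            read_only_replica_database,
        }
    }

    /// Pool used by converted (read-only) endpoints: the replica when it is
    /// configured, otherwise a fallback to the primary (read/write) pool.
    pub fn read_only_pool(&self) -> &PgPool {
        self.read_only_replica_database
            .as_ref()
            .unwrap_or(&self.primary_database)
    }
}
```

A converted endpoint would then call `app.read_only_pool().get()` and receive either a replica connection or an ordinary read/write primary connection, depending on whether `READ_ONLY_REPLICA_URL` is set.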
💥 Test timed out

It looks like Travis received the job, but didn't start running any jobs until 4 minutes ago. I've canceled those jobs; let's try again. @bors retry
💥 Test timed out

Travis again waited 25+ minutes to queue the jobs, so of course they finished 2 minutes after the timeout! There are no status issues reported for Travis; one last try for now. @bors retry
☀️ Test successful - checks-travis
To follow up on this PR, I'd like to document some of the performance improvements I'm seeing in the Heroku PG diagnostics tab. These numbers were collected on Monday evening and are a rough representation of the data shown there. Since the before and after query performance data are from 2 different tabs, I manually scrolled through the graphs to find the min and max values. I can't guarantee that I've found the min/max values over this time range, but I'm confident the numbers below are close. I've also dropped (and noted) up to 2 outliers from each graph, to avoid including points that were far outside the normal variance. I have notes locally on the specific queries, but for the summary below I've elided the details and just provide a rough name for each query.

Crates ordered by downloads descending

* Before (when run on primary): 134ms to 168ms (outlier: 191ms)
* After (when run on read-only replica): 125ms to 141ms (outlier: 149ms)

Crates ordered by updated_at descending

* Before: 131ms to 449ms (outlier: 568ms)
* After: 126ms to 136ms (outlier: 141ms)

Crates ordered by name ascending

* Before: 144ms to 789ms
* After: 129ms to 150ms

Crates ordered by name ascending (with offset)

* Before: 142ms to 223ms (outliers: 453ms, 668ms)
* After: 123ms to 141ms (outlier: 312ms)

Summary

Overall these numbers look really promising. In many cases, the longest times when run against the replica align pretty closely with the shortest times observed when running the query against the primary database.

Additionally, I haven't received any emails from Papertrail triggered by slow (>1s) downloads since deploying this over the weekend. Previously, it was common to receive 1 or 2 emails a day showing a small batch of slow downloads. (That alert triggers with a threshold of 25 slow downloads in a 10 minute window. I see only 21 slow downloads in total since the deploy, and I have an idea of what is causing those, which I'll describe in a separate issue.)

Finally, the following search can be run against the logs, showing that (unexpectedly) slow queries have been virtually eliminated (as logged by PG): `program:app/postgres duration -CURRENT_TIMESTAMP -COPY -refresh_recent_crate -!=`. (This filters out the 4 known slow queries, but it's possible it accidentally filters out extra queries.) There is only 1 batch of 6 slow queries occurring after the read-only replica deploy.
We also moved to Performance-M, so let's not read too much into that one
But these are query times reported by the database itself, so I don't think that has an impact here. Also, I still received several emails related to slow downloads between that change and the read-only replica deploy. However, the performance dynos certainly provide much more consistent overall performance for our web app.
Right, all the query times you report make sense. My point was specifically about slow requests to the downloads endpoint. Historically, spikes appear to have been caused by exhausted IO resources on the dyno.