Use the read-only replica for some queries in production #2073

Merged
bors merged 6 commits into rust-lang:master from prod/read-only-replica on Jan 18, 2020

Conversation

jtgeibel
Member

Now that we have a read-only replica of the database, we can offload some read-only queries to it. This commit series updates the Config and App structs to support a primary database and an optional read-only replica.

If the READ_ONLY_REPLICA_URL environment variable is set, a database pool of read-only connections will be initialized. If the variable is not set, converted endpoints will fall back to the primary database pool (and are given read/write connections).
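
For illustration, the configuration described above could look roughly like the following. This is a minimal sketch: the field and helper names (`replica_database_url`, `db_read_only`, etc.) are hypothetical and not necessarily the identifiers used in this PR.

```rust
// Hypothetical sketch of the primary/optional-replica setup described above.
// Names like `replica_database_url` and `db_read_only` are illustrative only.
use std::env;

use diesel::pg::PgConnection;
use diesel::r2d2::{ConnectionManager, Pool, PooledConnection};

type DbPool = Pool<ConnectionManager<PgConnection>>;
type DbConn = PooledConnection<ConnectionManager<PgConnection>>;

pub struct Config {
    pub database_url: String,
    /// `Some` only when READ_ONLY_REPLICA_URL is set in the environment.
    pub replica_database_url: Option<String>,
}

impl Config {
    pub fn from_env() -> Config {
        Config {
            database_url: env::var("DATABASE_URL").expect("DATABASE_URL must be set"),
            replica_database_url: env::var("READ_ONLY_REPLICA_URL").ok(),
        }
    }
}

pub struct App {
    pub primary_database: DbPool,
    pub read_only_replica_database: Option<DbPool>,
}

impl App {
    /// Converted endpoints call this: use the replica pool when one is
    /// configured, otherwise fall back to the primary (read/write) pool.
    pub fn db_read_only(&self) -> DbConn {
        self.read_only_replica_database
            .as_ref()
            .unwrap_or(&self.primary_database)
            .get()
            .expect("failed to obtain a database connection")
    }
}
```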

Tests do not configure a read-only replica and each test still operates within a single transaction. Potential over-use of the read-only pool (in an endpoint that needs write access) can be detected in staging, by configuring READ_ONLY_REPLICA_URL to the same value as DATABASE_URL.

Current set of Endpoints:

  • /crates/:crate_id/downloads and /crates/:crate_id/:version/downloads
  • Search endpoint
  • Summary endpoint for front page
  • /crates/:crate_id
  • /crates/:crate_id/reverse_dependencies
  • /crates/:crate_id/versions

r? @sgrif

@bors
Contributor

bors commented Jan 6, 2020

☔ The latest upstream changes (presumably #2032) made this pull request unmergeable. Please resolve the merge conflicts.

jtgeibel force-pushed the prod/read-only-replica branch from 64cddc5 to 1179642 on January 6, 2020 03:11
@jtgeibel
Member Author

jtgeibel commented Jan 9, 2020

@sgrif If you don't have time to review the PR itself I can reassign, but I'd love to hear your opinion on this change from an architectural standpoint. This PR focuses on endpoints with a lot of traffic or that are known to be expensive.

We could switch more endpoints over time, but even long-term I think we would avoid converting any endpoints related to authentication. My biggest concern is that if the follower database falls behind, we wouldn't want to block a user from logging in (to a newly created account) or from using a newly added API token. (Most endpoints using authentication will need write access for other reasons anyway.) For the endpoints listed above, I don't think there are any issues if we temporarily serve some stale data.

Open questions/concerns:

  • Should we add our own alerting here to monitor that the follower stays up-to-date? I assume Heroku already has something much more robust than what we could implement, but it's worth considering. (A sketch of what such a check might query is included after this list.)
  • This does move our production environment a bit further away from our staging, CI, and development environments.
  • Is this a good change, but just "too soon"? I think that would be a very reasonable argument. At steady-state we don't appear to be at capacity with our database, but we have been seeing occasional spikes (typically a few times a day) where our response times jump. In particular, I've added a "Slow Downloads (>1s)" query to Papertrail, and the 48 hour velocity graph gives a good overview of the frequency and quantity of affected requests. The fact that the spikes seem to affect both web dynos at the same time and are extremely transient leads me to believe that this is not a noisy neighbor situation. (In some cases, it is very obvious that we were hit with a lot of expensive reverse-dependency queries. Ruling those out, the smaller spikes don't seem to align with any obvious excessive load or expensive queries that would be expected to impact other requests.)
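
On the alerting question above: if such a check were wanted, a Postgres follower's replay lag can be read from the replica itself via pg_last_xact_replay_timestamp(). The snippet below is a hedged, diesel 1.x-style sketch (the helper name is hypothetical, and Heroku's own monitoring may well make it unnecessary):

```rust
use diesel::dsl::sql;
use diesel::pg::PgConnection;
use diesel::prelude::*;
use diesel::sql_types::{Double, Nullable};

/// Returns the follower's replay lag in seconds. Returns `None` when run
/// against a primary, where pg_last_xact_replay_timestamp() is NULL.
fn replication_lag_seconds(conn: &PgConnection) -> QueryResult<Option<f64>> {
    diesel::select(sql::<Nullable<Double>>(
        "extract(epoch from (now() - pg_last_xact_replay_timestamp()))",
    ))
    .get_result(conn)
}
```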

@sgrif
Contributor

sgrif commented Jan 9, 2020

My biggest concern would be that if the follower database falls behind

This should never happen to any extent that would be significant (e.g. more than a fraction of a second)

Should we add our own alerting here to monitor that the follower stays up-to-date? I assume Heroku already has something much more robust than what we could implement, but it's worth considering.

I don't think we'll be able to build anything for this better than what already exists in PG and Heroku's infra. If we're really concerned about this (I'm not personally), I think your previous point of restricting this only to endpoints which can see stale data acceptably is fine.

Is this a good change, but just "too soon"?

Probably, but I think it's fine to try it. We should set some specific performance goals from this change, and if we don't see movement in that direction we can revert it.

I think we can probably do this only for search and that would see the largest performance improvement. Have we confirmed that materialized views being rebuilt is replicating correctly?

@bors
Contributor

bors commented Jan 16, 2020

☔ The latest upstream changes (presumably #2077) made this pull request unmergeable. Please resolve the merge conflicts.

jtgeibel force-pushed the prod/read-only-replica branch from 1179642 to c503489 on January 18, 2020 01:49
@jtgeibel
Member Author

We should set some specific performance goals from this change, and if we don't see movement in that direction we can revert it.

I agree with this in principle, but it may be a bit difficult to evaluate in practice. The metrics dashboard is swamped with traffic and response times for crate downloads, so for the endpoints of interest we'll have to extract the response time data from our logs. To throw a concrete number out there, I'd like to see a 10% decrease in response time (for at least one of the 50th, 95th, and 99th percentiles). I'd also like to see if this change reduces the frequency or size of the momentary slow download request spikes.
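
As a concrete sketch of evaluating those targets (generic helper code, not part of this PR): once the response times for an endpoint are extracted from the logs, the percentiles can be computed with a simple nearest-rank calculation.

```rust
/// Nearest-rank percentile over a set of samples; `p` is in 0.0..=100.0.
/// Generic log-analysis helper, not code from this PR.
fn percentile(samples: &mut [f64], p: f64) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let rank = ((p / 100.0) * samples.len() as f64).ceil() as usize;
    Some(samples[rank.saturating_sub(1).min(samples.len() - 1)])
}

fn main() {
    // Example response times (ms) pulled from the logs for one endpoint.
    let mut times = vec![120.0, 131.0, 128.0, 455.0, 140.0, 133.0, 126.0, 980.0];
    for &p in &[50.0, 95.0, 99.0] {
        println!("p{}: {:?} ms", p, percentile(&mut times, p));
    }
}
```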

Have we confirmed that materialized views being rebuilt is replicating correctly?

I haven't done so, but will before enabling in production.

For now, I'm going to push this to staging and configure the new environment variable to point to the existing staging database. This will ensure that the "secondary" connection pool is enabled and configured as read-only in staging, even though it is accessing the same read/write database.
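
One plausible way for that secondary pool to hand out read-only connections, regardless of which database it points at, is an r2d2 connection customizer that sets default_transaction_read_only on every connection it acquires. This is a hedged sketch of that general approach (type and function names are hypothetical), not necessarily the exact mechanism used in this PR:

```rust
use diesel::connection::SimpleConnection;
use diesel::pg::PgConnection;
use diesel::r2d2::{ConnectionManager, CustomizeConnection, Pool};

#[derive(Debug, Clone, Copy)]
struct ReadOnlyCustomizer;

impl CustomizeConnection<PgConnection, diesel::r2d2::Error> for ReadOnlyCustomizer {
    fn on_acquire(&self, conn: &mut PgConnection) -> Result<(), diesel::r2d2::Error> {
        // Every transaction on this connection is read-only, so an endpoint
        // that accidentally writes through the "replica" pool errors out.
        conn.batch_execute("SET default_transaction_read_only = 't'")
            .map_err(diesel::r2d2::Error::QueryError)
    }
}

fn read_only_pool(url: &str) -> Pool<ConnectionManager<PgConnection>> {
    Pool::builder()
        .connection_customizer(Box::new(ReadOnlyCustomizer))
        .build(ConnectionManager::<PgConnection>::new(url))
        .expect("failed to build the read-only pool")
}
```

With this approach, the staging check described above works as intended: writes through the read-only pool fail even though its URL points at the primary database.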

@jtgeibel
Member Author

Have we confirmed that materialized views being rebuilt is replicating correctly?

I've compared select * from recent_crate_downloads; between the leader and follower databases and the first few pages of results were the same. I've also tested on staging and see no issues.

I plan to deploy to production in the next hour or so. (I recently deployed master, so we can quickly roll this back if needed.)

@bors: r=sgrif

@bors
Contributor

bors commented Jan 18, 2020

📌 Commit c503489 has been approved by sgrif

@bors
Contributor

bors commented Jan 18, 2020

⌛ Testing commit c503489 with merge 9525032...

bors added a commit that referenced this pull request Jan 18, 2020
Use the read-only replica for some queries in production

@bors
Contributor

bors commented Jan 18, 2020

💥 Test timed out

@jtgeibel
Member Author

Test timed out

It looks like Travis received the job, but didn't start running any jobs until 4 minutes ago. I've canceled those jobs; let's try again.

@bors retry

bors added a commit that referenced this pull request Jan 18, 2020
Use the read-only replica for some queries in production

@bors
Contributor

bors commented Jan 18, 2020

⌛ Testing commit c503489 with merge 5420515...

@bors
Contributor

bors commented Jan 18, 2020

💥 Test timed out

@jtgeibel
Member Author

Travis again waited 25+ minutes to queue the jobs, so of course they finished 2 minutes after the timeout! There are no status issues reported for Travis, so one last try for now.

@bors retry

bors added a commit that referenced this pull request Jan 18, 2020
Use the read-only replica for some queries in production

@bors
Contributor

bors commented Jan 18, 2020

⌛ Testing commit c503489 with merge 66c45e0...

@bors
Contributor

bors commented Jan 18, 2020

☀️ Test successful - checks-travis
Approved by: sgrif
Pushing 66c45e0 to master...

bors merged commit c503489 into rust-lang:master on Jan 18, 2020
@jtgeibel
Member Author

To follow up on this PR, I'd like to document some of the performance improvements I'm seeing in the Heroku PG diagnostics tab. These numbers were collected on Monday evening and are a rough representation of the data shown there. Since the before and after query performance data are from 2 different tabs, I manually scrolled through the graphs to find the min and max values. I can't guarantee that I've found the min/max values over this time range, but I'm confident the numbers below are close.

I've also dropped (and noted) up to 2 outliers from each graph, to avoid including points that were far outside the normal variance. I have notes locally on the specific queries, but for the summary below I've elided the details and just provide a rough name for each query.

Crates ordered by downloads descending

  • Before (when run on primary)
    • 134ms to 168ms (outlier: 191ms)
  • After (when run on read-only replica)
    • 125ms to 141ms (outlier: 149ms)

Crates ordered by updated_at descending

  • Before
    • 131ms to 449ms (outlier: 568ms)
  • After
    • 126ms to 136ms (outlier: 141ms)

Crates ordered by name ascending

  • Before
    • 144ms to 789ms
  • After
    • 129ms to 150ms

Crates ordered by name ascending (with offset)

  • Before
    • 142ms to 223ms (outliers: 453ms, 668ms)
  • After
    • 123ms to 141ms (outlier: 312ms)

Summary

Overall these numbers look really promising. In many cases, the longest times when run against the replica align pretty closely with the shortest times observed when running the query against the primary database.

Additionally, I haven't received any emails from Papertrail triggered by slow (>1s) downloads since deploying this over the weekend. Previously, it was common to receive 1 or 2 emails a day showing a small batch of slow downloads. (That alert triggers with a threshold of 25 slow downloads in a 10 minute window. I see only 21 slow downloads in total since the deploy, and I have an idea of what is causing those, which I'll describe in a separate issue.)

Finally, the following search can be run against the logs, showing that (unexpectedly) slow queries have been virtually eliminated (as logged by PG): program:app/postgres duration -CURRENT_TIMESTAMP -COPY -refresh_recent_crate -!=. (This filters out the 4 known slow queries, but it's possible it accidentally filters out extra queries.) There has been only 1 batch of 6 slow queries since the read-only replica deploy.

@sgrif
Contributor

sgrif commented Jan 23, 2020 via email

@jtgeibel
Member Author

We also moved to Performance-M, so let's not read too much into that one

But these are query times reported by the database itself, so I don't think that has an impact here. Also, I still received several emails related to slow downloads between that change and the read-only replica deploy.

However, the performance dynos certainly provide much more consistent overall performance for our web app.

@sgrif
Contributor

sgrif commented Jan 23, 2020 via email
