Use the read-only replica for some queries in production #2073

Merged
bors merged 6 commits into rust-lang:master from prod/read-only-replica on Jan 18, 2020

Conversation

jtgeibel
Member

Now that we have a read-only replica of the database, we can offload some read-only queries to it. This commit series updates the Config and App structs to support a primary database and an optional read-only replica.

If the READ_ONLY_REPLICA_URL environment variable is set, a database pool of read-only connections will be initialized. If the variable is not set, converted endpoints will fall back to the primary database pool (and are given read/write connections).
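
For illustration, the configuration described above could look roughly like the following. This is a minimal sketch: the field and helper names (`replica_database_url`, `db_read_only`, etc.) are hypothetical and not necessarily the identifiers used in this PR.

```rust
// Hypothetical sketch of the primary/optional-replica setup described above.
// Names like `replica_database_url` and `db_read_only` are illustrative only.
use std::env;

use diesel::pg::PgConnection;
use diesel::r2d2::{ConnectionManager, Pool, PooledConnection};

type DbPool = Pool<ConnectionManager<PgConnection>>;
type DbConn = PooledConnection<ConnectionManager<PgConnection>>;

pub struct Config {
    pub database_url: String,
    /// `Some` only when READ_ONLY_REPLICA_URL is set in the environment.
    pub replica_database_url: Option<String>,
}

impl Config {
    pub fn from_env() -> Config {
        Config {
            database_url: env::var("DATABASE_URL").expect("DATABASE_URL must be set"),
            replica_database_url: env::var("READ_ONLY_REPLICA_URL").ok(),
        }
    }
}

pub struct App {
    pub primary_database: DbPool,
    pub read_only_replica_database: Option<DbPool>,
}

impl App {
    /// Converted endpoints call this: use the replica pool when one is
    /// configured, otherwise fall back to the primary (read/write) pool.
    pub fn db_read_only(&self) -> DbConn {
        self.read_only_replica_database
            .as_ref()
            .unwrap_or(&self.primary_database)
            .get()
            .expect("failed to obtain a database connection")
    }
}
```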

Tests do not configure a read-only replica and each test still operates within a single transaction. Potential over-use of the read-only pool (in an endpoint that needs write access) can be detected in staging, by configuring READ_ONLY_REPLICA_URL to the same value as DATABASE_URL.

Current set of Endpoints:

  • /crates/:crate_id/downloads and /crates/:crate_id/:version/downloads
  • Search endpoint
  • Summary endpoint for front page
  • /crates/:crate_id
  • /crates/:crate_id/reverse_dependencies
  • /crates/:crate_id/versions

r? @sgrif

@bors
Contributor

bors commented Jan 6, 2020

☔ The latest upstream changes (presumably #2032) made this pull request unmergeable. Please resolve the merge conflicts.

jtgeibel force-pushed the prod/read-only-replica branch from 64cddc5 to 1179642 on January 6, 2020 03:11
@jtgeibel
Member Author

jtgeibel commented Jan 9, 2020

@sgrif If you don't have time to review the PR itself I can reassign, but I'd love to hear your opinion on this change from an architectural standpoint. This PR focuses on endpoints with a lot of traffic or that are known to be expensive.

We could switch more endpoints over time, but even long-term I think we would avoid converting any endpoints related to authentication. My biggest concern is that if the follower database falls behind, we wouldn't want to block a user from logging in (to a newly created account) or from using a newly added API token. (Most endpoints using authentication will need write access for other reasons anyway.) For the endpoints listed above, I don't think there are any issues if we temporarily serve some stale data.

Open questions/concerns:

  • Should we add our own alerting here to monitor that the follower stays up-to-date? I assume Heroku already has something much more robust than what we could implement, but it's worth considering. (A sketch of what such a check might query is included after this list.)
  • This does move our production environment a bit further away from our staging, CI, and development environments.
  • Is this a good change, but just "too soon"? I think that would be a very reasonable argument. At steady-state we don't appear to be at capacity with our database, but we have been seeing occasional spikes (typically a few times a day) where our response times jump. In particular, I've added a "Slow Downloads (>1s)" query to Papertrail, and the 48 hour velocity graph gives a good overview of the frequency and quantity of affected requests. The fact that the spikes seem to affect both web dynos at the same time and are extremely transient leads me to believe that this is not a noisy neighbor situation. (In some cases, it is very obvious that we were hit with a lot of expensive reverse-dependency queries. Ruling those out, the smaller spikes don't seem to align with any obvious excessive load or expensive queries that would be expected to impact other requests.)
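
On the alerting question above: if such a check were wanted, a Postgres follower's replay lag can be read from the replica itself via pg_last_xact_replay_timestamp(). The snippet below is a hedged, diesel 1.x-style sketch (the helper name is hypothetical, and Heroku's own monitoring may well make it unnecessary):

```rust
use diesel::dsl::sql;
use diesel::pg::PgConnection;
use diesel::prelude::*;
use diesel::sql_types::{Double, Nullable};

/// Returns the follower's replay lag in seconds. Returns `None` when run
/// against a primary, where pg_last_xact_replay_timestamp() is NULL.
fn replication_lag_seconds(conn: &PgConnection) -> QueryResult<Option<f64>> {
    diesel::select(sql::<Nullable<Double>>(
        "extract(epoch from (now() - pg_last_xact_replay_timestamp()))",
    ))
    .get_result(conn)
}
```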

@sgrif
Contributor

sgrif commented Jan 9, 2020

My biggest concern would be that if the follower database falls behind

This should never happen to any extent that would be significant (e.g. more than a fraction of a second)

Should we add our own alerting here to monitor that the follower stays up-to-date? I assume Heroku already has something much more robust than what we could implement, but it's worth considering.

I don't think we'll be able to build anything for this better than what already exists in PG and Heroku's infra. If we're really concerned about this (I'm not personally), I think your previous point of restricting this only to endpoints which can see stale data acceptably is fine.

Is this a good change, but just "too soon"?

Probably, but I think it's fine to try it. We should set some specific performance goals from this change, and if we don't see movement in that direction we can revert it.

I think we can probably do this only for search and that would see the largest performance improvement. Have we confirmed that materialized views being rebuilt is replicating correctly?

@bors
Contributor

bors commented Jan 16, 2020

☔ The latest upstream changes (presumably #2077) made this pull request unmergeable. Please resolve the merge conflicts.

jtgeibel force-pushed the prod/read-only-replica branch from 1179642 to c503489 on January 18, 2020 01:49
@jtgeibel
Member Author

We should set some specific performance goals from this change, and if we don't see movement in that direction we can revert it.

I agree with this in principle, but it may be a bit difficult to evaluate in practice. The metrics dashboard is swamped with traffic and response times for crate downloads, so for the endpoints of interest we'll have to extract the response time data from our logs. To throw a concrete number out there, I'd like to see a 10% decrease in response time (for at least one of the 50th, 95th, and 99th percentiles). I'd also like to see if this change reduces the frequency or size of the momentary slow download request spikes.
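
As a concrete sketch of evaluating those targets (generic helper code, not part of this PR): once the response times for an endpoint are extracted from the logs, the percentiles can be computed with a simple nearest-rank calculation.

```rust
/// Nearest-rank percentile over a set of samples; `p` is in 0.0..=100.0.
/// Generic log-analysis helper, not code from this PR.
fn percentile(samples: &mut [f64], p: f64) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let rank = ((p / 100.0) * samples.len() as f64).ceil() as usize;
    Some(samples[rank.saturating_sub(1).min(samples.len() - 1)])
}

fn main() {
    // Example response times (ms) pulled from the logs for one endpoint.
    let mut times = vec![120.0, 131.0, 128.0, 455.0, 140.0, 133.0, 126.0, 980.0];
    for &p in &[50.0, 95.0, 99.0] {
        println!("p{}: {:?} ms", p, percentile(&mut times, p));
    }
}
```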

Have we confirmed that materialized views being rebuilt is replicating correctly?

I haven't done so, but will before enabling in production.

For now, I'm going to push this to staging and configure the new environment variable to point to the existing staging database. This will ensure that the "secondary" connection pool is enabled and configured as read-only in staging, even though it is accessing the same read/write database.
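
One plausible way for that secondary pool to hand out read-only connections, regardless of which database it points at, is an r2d2 connection customizer that sets default_transaction_read_only on every connection it acquires. This is a hedged sketch of that general approach (type and function names are hypothetical), not necessarily the exact mechanism used in this PR:

```rust
use diesel::connection::SimpleConnection;
use diesel::pg::PgConnection;
use diesel::r2d2::{ConnectionManager, CustomizeConnection, Pool};

#[derive(Debug, Clone, Copy)]
struct ReadOnlyCustomizer;

impl CustomizeConnection<PgConnection, diesel::r2d2::Error> for ReadOnlyCustomizer {
    fn on_acquire(&self, conn: &mut PgConnection) -> Result<(), diesel::r2d2::Error> {
        // Every transaction on this connection is read-only, so an endpoint
        // that accidentally writes through the "replica" pool errors out.
        conn.batch_execute("SET default_transaction_read_only = 't'")
            .map_err(diesel::r2d2::Error::QueryError)
    }
}

fn read_only_pool(url: &str) -> Pool<ConnectionManager<PgConnection>> {
    Pool::builder()
        .connection_customizer(Box::new(ReadOnlyCustomizer))
        .build(ConnectionManager::<PgConnection>::new(url))
        .expect("failed to build the read-only pool")
}
```

With this approach, the staging check described above works as intended: writes through the read-only pool fail even though its URL points at the primary database.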

@jtgeibel
Member Author

Have we confirmed that materialized views being rebuilt is replicating correctly?

I've compared select * from recent_crate_downloads; between the leader and follower databases and the first few pages of results were the same. I've also tested on staging and see no issues.

I plan to deploy to production in the next hour or so. (I recently deployed master, so we can quickly roll this back if needed.)

@bors: r=sgrif

@bors
Contributor

bors commented Jan 18, 2020

📌 Commit c503489 has been approved by sgrif

@bors
Contributor

bors commented Jan 18, 2020

⌛ Testing commit c503489 with merge 9525032...

bors added a commit that referenced this pull request Jan 18, 2020
Use the read-only replica for some queries in production

@bors
Contributor

bors commented Jan 18, 2020

💥 Test timed out

@jtgeibel
Member Author

Test timed out

It looks like Travis received the job, but didn't start running any jobs until 4 minutes ago. I've canceled those jobs; let's try again.

@bors retry

bors added a commit that referenced this pull request Jan 18, 2020
Use the read-only replica for some queries in production

@bors
Contributor

bors commented Jan 18, 2020

⌛ Testing commit c503489 with merge 5420515...

@bors
Contributor

bors commented Jan 18, 2020

💥 Test timed out

@jtgeibel
Member Author

Travis again waited 25+ minutes to queue the jobs, so of course they finished 2 minutes after the timeout! There are no status issues reported for Travis, so one last try for now.

@bors retry

bors added a commit that referenced this pull request Jan 18, 2020
Use the read-only replica for some queries in production

@bors
Contributor

bors commented Jan 18, 2020

⌛ Testing commit c503489 with merge 66c45e0...

@bors
Contributor

bors commented Jan 18, 2020

☀️ Test successful - checks-travis
Approved by: sgrif
Pushing 66c45e0 to master...

bors merged commit c503489 into rust-lang:master on Jan 18, 2020
@jtgeibel
Member Author

To follow up on this PR, I'd like to document some of the performance improvements I'm seeing in the Heroku PG diagnostics tab. These numbers were collected on Monday evening and are a rough representation of the data shown there. Since the before and after query performance data are from 2 different tabs, I manually scrolled through the graphs to find the min and max values. I can't guarantee that I've found the min/max values over this time range, but I'm confident the numbers below are close.

I've also dropped (and noted) up to 2 outliers from each graph, to avoid including points that were far outside the normal variance. I have notes locally on the specific queries, but for the summary below I've elided the details and just provide a rough name for each query.

Crates ordered by downloads descending

  • Before (when run on primary)
    • 134ms to 168ms (outlier: 191ms)
  • After (when run on read-only replica)
    • 125ms to 141ms (outlier: 149ms)

Crates ordered by updated_at descending

  • Before
    • 131ms to 449ms (outlier: 568ms)
  • After
    • 126ms to 136ms (outlier: 141ms)

Crates ordered by name ascending

  • Before
    • 144ms to 789ms
  • After
    • 129ms to 150ms

Crates ordered by name ascending (with offset)

  • Before
    • 142ms to 223ms (outliers: 453ms, 668ms)
  • After
    • 123ms to 141ms (outlier: 312ms)

Summary

Overall these numbers look really promising. In many cases, the longest times when run against the replica align pretty closely with the shortest times observed when running the query against the primary database.

Additionally, I haven't received any emails from Papertrail triggered by slow (>1s) downloads since deploying this over the weekend. Previously, it was common to receive 1 or 2 emails a day showing a small batch of slow downloads. (That alert triggers with a threshold of 25 slow downloads in a 10 minute window. I see only 21 slow downloads in total since the deploy, and I have an idea of what is causing those, which I'll describe in a separate issue.)

Finally, the following search can be run against the logs, showing that (unexpectedly) slow queries have been virtually eliminated (as logged by PG): program:app/postgres duration -CURRENT_TIMESTAMP -COPY -refresh_recent_crate -!=. (This filters out the 4 known slow queries, but it's possible it accidentally filters out extra queries.) There has been only 1 batch of 6 slow queries since the read-only replica deploy.

@sgrif
Contributor

sgrif commented Jan 23, 2020 via email

@jtgeibel
Member Author

We also moved to Performance-M, so let's not read too much into that one

But these are query times reported by the database itself, so I don't think that has an impact here. Also, I still received several emails related to slow downloads between that change and the read-only replica deploy.

However, the performance dynos certainly provide much more consistent overall performance for our web app.

@sgrif
Contributor

sgrif commented Jan 23, 2020 via email
