Too many database connections are being opened #3316

Closed
misaugstad opened this issue Jul 27, 2023 · 9 comments · Fixed by #3333

@misaugstad
Member

Brief description of problem/feature

When I try to connect to our prod databases via the command line, I'm getting the following error:

psql: error: FATAL:  remaining connection slots are reserved for non-replication superuser connections

This had happened about a year ago as well to a worse degree, where we were getting proxy errors on our servers. It's possible that this has been happening occasionally since then though.

Unfortunately, I did not provide enough detail in my email thread with @mechanicjay... I even say that I thought I had figured out the offending code, but I didn't say where it was, and I don't see any commits in that timeframe that would have addressed it 🥴 What we did conclude is that our application is creating too many database connections for some reason.

From me in the thread:

Based on this and some more testing I just did, I think I've found the offending code. Not sure why we are creating quite that many connections right now, but I'll see what I can do.

And later on this is me quoting @mechanicjay and responding:

Based on observed behavior right now, it looks like sidewalk opens 4 connections to the db when it starts and sits on them, then appears to open/close connections as needed while the software does things

I'm seeing something similar. I'm also seeing that after the code that requires more connections is run, our application might just not be closing those connections in a timely fashion. I ran the code that caused the sidewalk-pittsburgh database to jump up to 26 connections; that code has finished running, and all of those connections are still open.

Do you think the issue is that each application is not letting go of connections, and since we are actually running 10 copies of the application, the connection pooler isn't aware of the other applications and doesn't know to stop hoarding the connections?

Potential solution(s)

We'll want to test our application on our local dev env, looking at how many connections are being opened. Then we should try doing different operations to figure out what might be causing the spike in database connections. I think I remember it being queries for Validate and/or Gallery that were the culprit.
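
One way to watch the connection counts while we do that testing is a query against the standard pg_stat_activity view from psql. This is just a suggested example query for counting connections per database and state, not code from our app:

-- Example diagnostic query (suggested, not from our codebase):
-- open connections per database, broken down by state.
SELECT datname, state, count(*) AS connections
FROM pg_stat_activity
GROUP BY datname, state
ORDER BY datname, connections DESC;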

@misaugstad misaugstad added the Bug label Jul 27, 2023
@misaugstad
Member Author

Could be related to #1809

@mechanicjay
Collaborator

Indeed, we're seeing this on the production machine again and it's impacting some other things using the db cluster.

I found sidewalk-sea sitting on 15 connections just now for no good reason; I restarted the instance to clear the connections.

@misaugstad
Member Author

This is actually causing an issue when deploying a new version to test today! Got this error when trying to test one of the servers:

Cannot connect to database [default]

And it had this same message in the logs:

Failed to obtain initial connection Sleeping for 0ms and trying again. Attempts left: 0. Exception: null.Message:FATAL: remaining connection slots are reserved for non-replication superuser connections

@mechanicjay this issue is probably getting worse the more servers we add, huh...

@misaugstad
Member Author

This has happened on a bunch of test servers after our most recent deploy (see #3325). Going to try to push to develop and see if this gets resolved during a deploy to prod. Otherwise I'll email support to get the test servers back up today.

Either way, looks like fixing this is becoming a higher priority!

@misaugstad
Member Author

Here's the exact error message from the Auckland server logs:

2023-07-27 22:42:45,583 - [ERROR] - from com.jolbox.bonecp.hooks.AbstractConnectionHook in play-internal-execution-context-1
Failed to acquire connection to jdbc:postgresql://docker-test.cs.washington.edu:5433/sidewalk-auckland Sleeping for 1000ms and trying again. Attempts left: 10. Exception: null.Message:FATAL: remaining connection slots are reserved for non-replication superuser connections

@jonfroehlich
Member

In thinking through this further, is it possible that the ~4-5 idle connections that open on startup are related to the database connection pool?

This is essentially how a thread pool works: you initialize the pool with a fixed size, and it is then a shared resource for the life of the app. But I'm not sure whether that analogy holds for database connections.

@misaugstad
Member Author

@mechanicjay opened up another 100 connections for us, bringing us to a max of 200 :)
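
(Raising that cap is presumably a change to PostgreSQL's max_connections on the db host; I'm assuming the exact mechanism here, since it isn't shown in the thread. In postgresql.conf it would look something like the line below, and it requires a restart of the cluster to take effect.)

max_connections = 200   # assumed postgresql.conf change; requires a server restart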

@dylanbun also just updated the code that queries the config table to ensure that we aren't opening extra connections when getting those configs.

The last thing I'd like to do is to update the configuration of our database connection pooler, with the configs described here. Current defaults are determined by the play-jdbc library, found in .ivy2/cache/com.typesafe.play/play-jdbc_2.10/srcs/play-jdbc_2.10-2.3.10-sources.jar. Below are some proposed updates.

db.default.idleMaxAge=30 second  # default is 10 minutes
db.default.acquireRetryDelay=3 second  # default is 1 second
db.default.acquireRetryAttempts=20  # default is 10

The goal for idleMaxAge is to force the connection pooler to let go of idle connections more quickly. For acquireRetryDelay and acquireRetryAttempts, I'm trying to give us more time to search for a connection if the db doesn't connect immediately, giving time for other cities to let go of their idle connections.

@jonfroehlich
Member

Are we sure that dropping idleMaxAge from 10 mins to 30 seconds is reasonable? That's a 95% reduction—seems pretty drastic. What are the implications?

All the other shifts from defaults seem good to me.

@misaugstad
Member Author

Are we sure that dropping idleMaxAge from 10 mins to 30 seconds is reasonable? That's a 95% reduction—seems pretty drastic

Honestly, not 100% sure. It would mean that we would get hit with the overhead of connecting to the database whenever no new queries are run in that city during those 30 seconds. How long is that overhead? Really hard to say. This stackoverflow post seems to guess at 67-100 ms, though having looked through their post, I think ours would be faster than that. And not every delay means a delay for the end user.

If someone is using the Explore page, they're going to hit the database more often than every 30 seconds. On Validate it might be more borderline for some users. Maybe let's bring it up to 1-2 minutes? Maybe 90 seconds?

With the retry delay and such, I kind of want to have acquireRetryDelay * acquireRetryAttempts > idleMaxAge so that there's actually enough time to drop connections before a server gives up on an auto-deploy and never connects. So I'll update those configs a bit to get past the idleMaxAge bar we set.

ALSO, I just found the minConnectionsPerPartition parameter! It was set to 5, which is why we have 5 open connections at all times. I just did some tests on my local environment, and it looks like loading the landing page requires 4 connections, loading Validate opens 5, and loading Explore opens 8. Maybe we bring this down to 4 for now, so we aren't sitting on quite as many connections when most of our servers aren't being used at any given time?
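
Putting the numbers above together, the pool settings might end up looking roughly like this. The 90-second idleMaxAge and the 5-second/20-attempt retry window are illustrative values based on the discussion, not final; with those numbers, 5 s x 20 attempts = 100 s, which clears the 90-second idleMaxAge bar:

# Illustrative values only, based on the discussion above; not final.
db.default.idleMaxAge=90 seconds            # default is 10 minutes; 30 seconds was the first proposal
db.default.acquireRetryDelay=5 seconds      # 5 s x 20 attempts = 100 s > idleMaxAge
db.default.acquireRetryAttempts=20          # default is 10
db.default.minConnectionsPerPartition=4     # currently 5; one fewer idle connection per server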
