Too many database connections are being opened #3316

Closed
misaugstad opened this issue Jul 27, 2023 · 9 comments · Fixed by #3333

@misaugstad
Member

Brief description of problem/feature

When I try to connect to our prod databases via the command line, I'm getting the following error:

psql: error: FATAL:  remaining connection slots are reserved for non-replication superuser connections

This had happened about a year ago as well to a worse degree, where we were getting proxy errors on our servers. It's possible that this has been happening occasionally since then though.

Unfortunately, I did not provide enough detail in my email thread with @mechanicjay... I even say that I thought I had figured out the offending code, but I didn't say where it was, and I don't see any commits in that timeframe that would have addressed it 🥴 What we did conclude is that our application is creating too many database connections for some reason.

From me in the thread:

Based on this and some more testing I just did, I think I've found the offending code. Not sure why we are creating quite that many connections right now, but I'll see what I can do.

And later on this is me quoting @mechanicjay and responding:

Based on observed behavior right now, it looks like sidewalk opens 4 connections to the db when it starts and sits on them, then appears to open/close connections as needed while the software does things

I'm seeing something similar. I'm also seeing that after the code that requires more connections is run, our application might just not be closing those connections in a timely fashion. I ran the code that caused the sidewalk-pittsburgh database to jump up to 26 connections; that code has finished running, and all of those connections are still open.

Do you think the issue is that each application is not letting go of connections, and since we are actually running 10 copies of the application, the connection pooler isn't aware of the other applications and doesn't know to stop hoarding the connections?

Potential solution(s)

We'll want to test our application on our local dev env, looking at how many connections are being opened. Then we should try doing different operations to figure out what might be causing the spike in database connections. I think I remember it being queries for Validate and/or Gallery that were the culprit.
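
One way to watch the connection counts while we do that testing is a query against the standard pg_stat_activity view from psql. This is just a suggested example query for counting connections per database and state, not code from our app:

-- Example diagnostic query (suggested, not from our codebase):
-- open connections per database, broken down by state.
SELECT datname, state, count(*) AS connections
FROM pg_stat_activity
GROUP BY datname, state
ORDER BY datname, connections DESC;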

@misaugstad misaugstad added the Bug label Jul 27, 2023
@misaugstad
Member Author

Could be related to #1809

@mechanicjay
Collaborator

Indeed, we're seeing this on the production machine again and it's impacting some other things using the db cluster.

I found sidewalk-sea sitting on 15 connections just now for no good reason; I restarted the instance to clear the connections.

@misaugstad
Member Author

This is actually causing an issue when deploying a new version to test today! Got this error when trying to test one of the servers:

Cannot connect to database [default]

And it had this same message in the logs:

Failed to obtain initial connection Sleeping for 0ms and trying again. Attempts left: 0. Exception: null.Message:FATAL: remaining connection slots are reserved for non-replication superuser connections

@mechanicjay this issue is probably getting worse the more servers we add, huh...

@misaugstad
Member Author

This has happened on a bunch of test servers after our most recent deploy (see #3325). Going to try to push to develop and see if this gets resolved during a deploy to prod. Otherwise I'll email support to get the test servers back up today.

Either way, looks like fixing this is becoming a higher priority!

@misaugstad
Member Author

Here's the exact error message from the Auckland server logs:

2023-07-27 22:42:45,583 - [ERROR] - from com.jolbox.bonecp.hooks.AbstractConnectionHook in play-internal-execution-context-1
Failed to acquire connection to jdbc:postgresql://docker-test.cs.washington.edu:5433/sidewalk-auckland Sleeping for 1000ms and trying again. Attempts left: 10. Exception: null.Message:FATAL: remaining connection slots are reserved for non-replication superuser connections

@jonfroehlich
Member

In thinking through this further, is it possible that the ~4-5 idle connections that open on startup are related to the database connection pool?

This is essentially how a thread pool works: you initialize the pool with a fixed size, and it is then a shared resource for the life of the app. But I'm not sure whether that analogy holds for database connections.

@misaugstad
Member Author

@mechanicjay opened up another 100 connections for us, bringing us to a max of 200 :)
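
(Raising that cap is presumably a change to PostgreSQL's max_connections on the db host; I'm assuming the exact mechanism here, since it isn't shown in the thread. In postgresql.conf it would look something like the line below, and it requires a restart of the cluster to take effect.)

max_connections = 200   # assumed postgresql.conf change; requires a server restart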

@dylanbun also just updated the code that queries the config table to ensure that we aren't opening extra connections when getting those configs.

The last thing I'd like to do is to update the configuration of our database connection pooler, with the configs described here. Current defaults are determined by the play-jdbc library, found in .ivy2/cache/com.typesafe.play/play-jdbc_2.10/srcs/play-jdbc_2.10-2.3.10-sources.jar. Below are some proposed updates.

db.default.idleMaxAge=30 second  # default is 10 minutes
db.default.acquireRetryDelay=3 second  # default is 1 second
db.default.acquireRetryAttempts=20  # default is 10

The goal for idleMaxAge is to force the connection pooler to let go of idle connections more quickly. For acquireRetryDelay and acquireRetryAttempts, I'm trying to give us more time to search for a connection if the db doesn't connect immediately, giving time for other cities to let go of their idle connections.

@jonfroehlich
Member

Are we sure that dropping idleMaxAge from 10 mins to 30 seconds is reasonable? That's a 95% reduction—seems pretty drastic. What are the implications?

All the other shifts from defaults seem good to me.

@misaugstad
Member Author

Are we sure that dropping idleMaxAge from 10 mins to 30 seconds is reasonable? That's a 95% reduction—seems pretty drastic

Honestly, not 100% sure. It would mean that we would get hit with the overhead of connecting to the database whenever no new queries are run in that city during those 30 seconds. How long is that overhead? Really hard to say. This stackoverflow post seems to guess at 67-100 ms, though having looked through their post, I think ours would be faster than that. And not every delay means a delay for the end user.

If someone is using the Explore page, they're going to hit the database more often than every 30 seconds. On Validate it might be more borderline for some users. Maybe let's bring it up to 1-2 minutes? Maybe 90 seconds?

With the retry delay and such, I kind of want to have acquireRetryDelay * acquireRetryAttempts > idleMaxAge so that there's actually enough time to drop connections before a server gives up on an auto-deploy and never connects. So I'll update those configs a bit to get past the idleMaxAge bar we set.

ALSO, I just found the minConnectionsPerPartition parameter! It was set to 5, which is why we have 5 open connections at all times. I just did some tests on my local environment, and it looks like loading the landing page requires 4 connections, loading Validate opens 5, and loading Explore opens 8. Maybe we bring this down to 4 for now, so we aren't sitting on quite as many connections when most of our servers aren't being used at any given time?
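
Putting the numbers above together, the pool settings might end up looking roughly like this. The 90-second idleMaxAge and the 5-second/20-attempt retry window are illustrative values based on the discussion, not final; with those numbers, 5 s x 20 attempts = 100 s, which clears the 90-second idleMaxAge bar:

# Illustrative values only, based on the discussion above; not final.
db.default.idleMaxAge=90 seconds            # default is 10 minutes; 30 seconds was the first proposal
db.default.acquireRetryDelay=5 seconds      # 5 s x 20 attempts = 100 s > idleMaxAge
db.default.acquireRetryAttempts=20          # default is 10
db.default.minConnectionsPerPartition=4     # currently 5; one fewer idle connection per server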
