
Provide custom function for max_lifetime to prevent stampeding reconnections #2854

Open · 1 comment
Labels: enhancement (New feature or request)

NathanFlurry commented Nov 4, 2023

Is your feature request related to a problem? Please describe.

When max_lifetime is enabled on a service, SQLx recreates all of the initial connections at the same time, every max_lifetime interval. This causes excessive load on the database and causes requests to hang until the connections have been reopened.

Take this example configuration:

PgPoolOptions::new()
    .max_lifetime(Duration::from_secs(30 * 60))
    .min_connections(8)

Assuming no more than 8 connections were ever created in the pool, all 8 would be recreated at the same time every 30 minutes, causing queries to hang in the meantime.

Similarly, if the service is always under high load (even with min_connections = 0), the connections created at startup would have a stampeding effect when they all reconnect at exactly the same time every 30 minutes. The current defaults have max_lifetime enabled, so the default configuration is prone to this issue.

It looks like there's already code in place to limit the number of connections reaped per cycle, which limits how many connections are affected at once when min_connections is configured (it is not in the default configuration).

Here's an example of the connection rate from some minimal testing:

min_connections = 0, max_lifetime = 5 minutes, running under load

[image: chart of connection rate over time]

Describe the solution you'd like

I'd like to be able to control max_lifetime for each individual connection by providing a function that returns a duration. This way, randomness (jitter) can be applied to max_lifetime so connections don't all expire simultaneously.

async-nats has a similar implementation for reconnect_delay_callback.
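
For illustration, here's a minimal sketch of what such an API could look like; max_lifetime_fn is a hypothetical method that does not exist in SQLx today:

use std::time::Duration;
use rand::Rng;
use sqlx::postgres::PgPoolOptions;

// `max_lifetime_fn` is hypothetical; SQLx does not currently expose it.
let pool = PgPoolOptions::new()
    .max_lifetime_fn(|| {
        // 30-minute base lifetime plus up to 5 minutes of random jitter,
        // so connections don't all expire at the same instant.
        let jitter = rand::thread_rng().gen_range(0..300);
        Duration::from_secs(30 * 60 + jitter)
    })
    .min_connections(8)
    .connect("postgres://localhost/app")
    .await?;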

Describe alternatives you've considered

  • Applying default randomness (jitter) to max_lifetime.
  • Disabling max_lifetime and leaving test_before_acquire set to true, which should work fine as long as the database itself has no per-connection memory leaks (see the sketch below).
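
For reference, the second alternative maps onto the existing PoolOptions API roughly like this (a sketch; test_before_acquire already defaults to true):

use sqlx::postgres::PgPoolOptions;

// Disable lifetime-based reaping entirely and rely on health checks at
// acquire time instead (`test_before_acquire` defaults to true).
let pool_options = PgPoolOptions::new()
    .max_lifetime(None)
    .test_before_acquire(true);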
@abonander (Collaborator) commented
I think this would be good to take care of at the same time as #2848 (comment)

However, since our pool implementation is more opinionated than a generic pool, I think a callback that returns a Duration is a bit too fiddly for most users but may also be too inflexible for some.

I'm thinking of a couple of different approaches here, which aren't necessarily mutually exclusive:

An async before_connect callback which can be used to insert an arbitrary delay, but could also be used to modify the ConnectOptions that will be used to connect (something that has been requested before, though I can't find an issue to link to):

impl<DB: Database> PoolOptions<DB> {

    /// Set an async callback to be invoked before beginning a connection attempt.
    ///
    /// The callback is passed a reference to the `ConnectOptions` that will be used
    /// to open the connection, and is free to return a modified version.
    ///
    /// The async nature of the callback allows for any pre-connection work to be done,
    /// such as a DNS lookup to resolve a domain for a high-availability server,
    /// or to insert an artificial delay to mitigate a potential stampeding herd effect.
    pub fn before_connect<F>(self, before_connect: F) -> Self
    where
        F: for<'a> Fn(&'a DB::ConnectOptions) -> BoxFuture<'a, crate::Result<Cow<'a, DB::ConnectOptions>>>,
        F: Send + Sync + 'static,
    { ... }

}

In my mind, this would only be invoked at the beginning of acquire() attempting to connect, but it could be invoked on retries as well for a custom backoff implementation (in which case it should also be passed an integer for the current number of connection attempts).
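
To make the shape concrete, here's a sketch of how a caller might use the proposed before_connect to add jitter (the API above is a proposal, not something SQLx exposes today):

use std::borrow::Cow;
use std::time::Duration;
use rand::Rng;

// Sketch against the proposed API: sleep a random 0-500 ms before each
// connection attempt, then hand back the options unmodified.
let pool_options = PgPoolOptions::new().before_connect(|opts| {
    Box::pin(async move {
        let jitter_ms = rand::thread_rng().gen_range(0..500u64);
        tokio::time::sleep(Duration::from_millis(jitter_ms)).await;
        // Return the options unmodified (`Cow::Borrowed`), or a modified copy.
        Ok(Cow::Borrowed(opts))
    })
});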

For a more high-level interface though, I was thinking something like this (mimicking backoff::ExponentialBackoffBuilder):

impl<DB: Database> PoolOptions<DB> {

    /// Set the base value for initial and backoff delay calculations for connection attempts.
    ///
    /// This value will have random jitter applied to mitigate a potential stampeding herd effect.
    ///
    /// See [`.connect_delay_randomization_factor()`][Self::connect_delay_randomization_factor].
    pub fn base_connect_delay(self, delay: impl Into<Option<Duration>>) -> Self { ... }

    // I would like to come up with a more concise name for this,
    // or use something with a rigid definition in statistics, like variance.
    /// The maximum random jitter factor, between 0 and 1, applied to the connection delay.
    ///
    /// A value of 0.5 means that each connect delay will have a random jitter of +/- 50% applied.
    pub fn connect_delay_randomization_factor(self, factor: f64) -> Self { ... }

    /// The exponential growth factor for connection backoffs.
    ///
    /// The backoff delay of each subsequent connection attempt for a given `acquire()` call will be multiplied by this factor,
    /// before jitter is applied.
    pub fn connect_delay_multiplier(self, multiplier: f64) -> Self { ... }
}
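
As a usage sketch (again, these builder methods are proposed, not existing API), the configuration and resulting delays might look like:

use std::time::Duration;

// 100 ms base delay, +/- 50% jitter, doubling on each retry within a
// single `acquire()` call:
//   attempt 1: ~50-150 ms, attempt 2: ~100-300 ms, attempt 3: ~200-600 ms.
let pool_options = PgPoolOptions::new()
    .base_connect_delay(Duration::from_millis(100))
    .connect_delay_randomization_factor(0.5)
    .connect_delay_multiplier(2.0);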
