
Provide custom function for max_lifetime to prevent stampeding reconnections #2854

Open · 1 comment
Labels: enhancement (New feature or request)

NathanFlurry commented Nov 4, 2023

Is your feature request related to a problem? Please describe.

When max_lifetime is enabled on a service, SQLx recreates all of the initial connections at the same time, every max_lifetime interval. This causes excessive load on the database and causes requests to hang until the connections have been reopened.

Take this example configuration:

PgPoolOptions::new()
    .max_lifetime(Duration::from_secs(30 * 60))
    .min_connections(8)

Assuming no more than 8 connections were ever created in the pool, all 8 would be recreated at the same time every 30 minutes, causing queries to hang in the meantime.

Similarly, if the service is always under high load (even with min_connections = 0), the connections created at startup would have a stampeding effect when they all reconnect at exactly the same time every 30 minutes. The current defaults have max_lifetime enabled, so the default configuration is prone to this issue.

It looks like there's already code in place to limit the number of connections reaped per cycle, which limits how many connections are affected at once when min_connections is configured (it is not in the default configuration).

Here's an example of the connection rate from some minimal testing:

min_connections = 0, max_lifetime = 5 minutes, running under load

[image: chart of connection rate over time]

Describe the solution you'd like

I'd like to be able to control max_lifetime for each individual connection by providing a function that returns a duration. This way, randomness (jitter) can be applied to max_lifetime so connections don't all expire simultaneously.

async-nats has a similar implementation for reconnect_delay_callback.
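
For illustration, here's a minimal sketch of what such an API could look like; max_lifetime_fn is a hypothetical method that does not exist in SQLx today:

use std::time::Duration;
use rand::Rng;
use sqlx::postgres::PgPoolOptions;

// `max_lifetime_fn` is hypothetical; SQLx does not currently expose it.
let pool = PgPoolOptions::new()
    .max_lifetime_fn(|| {
        // 30-minute base lifetime plus up to 5 minutes of random jitter,
        // so connections don't all expire at the same instant.
        let jitter = rand::thread_rng().gen_range(0..300);
        Duration::from_secs(30 * 60 + jitter)
    })
    .min_connections(8)
    .connect("postgres://localhost/app")
    .await?;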

Describe alternatives you've considered

  • Applying default randomness (jitter) to max_lifetime.
  • Disabling max_lifetime and leaving test_before_acquire set to true, which should work fine as long as the database itself has no per-connection memory leaks (see the sketch below).
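
For reference, the second alternative maps onto the existing PoolOptions API roughly like this (a sketch; test_before_acquire already defaults to true):

use sqlx::postgres::PgPoolOptions;

// Disable lifetime-based reaping entirely and rely on health checks at
// acquire time instead (`test_before_acquire` defaults to true).
let pool_options = PgPoolOptions::new()
    .max_lifetime(None)
    .test_before_acquire(true);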
@abonander (Collaborator) commented
I think this would be good to take care of at the same time as #2848 (comment)

However, since our pool implementation is more opinionated than a generic pool, I think a callback that returns a Duration is a bit too fiddly for most users but may also be too inflexible for some.

I'm thinking of a couple of different approaches here, which aren't necessarily mutually exclusive:

An async before_connect callback which can be used to insert an arbitrary delay, but could also be used to modify the ConnectOptions that will be used to connect (something that has been requested before, though I can't find an issue to link to):

impl<DB: Database> PoolOptions<DB> {

    /// Set an async callback to be invoked before beginning a connection attempt.
    ///
    /// The callback is passed a reference to the `ConnectOptions` that will be used
    /// to open the connection, and is free to return a modified version.
    ///
    /// The async nature of the callback allows for any pre-connection work to be done,
    /// such as a DNS lookup to resolve a domain for a high-availability server,
    /// or to insert an artificial delay to mitigate a potential stampeding herd effect.
    pub fn before_connect<F>(self, before_connect: F) -> Self
    where
        F: for<'a> Fn(&'a DB::ConnectOptions) -> BoxFuture<'a, crate::Result<Cow<'a, DB::ConnectOptions>>>,
        F: Send + Sync + 'static,
    { ... }

}

In my mind, this would only be invoked at the beginning of acquire() attempting to connect, but it could be invoked on retries as well for a custom backoff implementation (in which case it should also be passed an integer for the current number of connection attempts).
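
To make the shape concrete, here's a sketch of how a caller might use the proposed before_connect to add jitter (the API above is a proposal, not something SQLx exposes today):

use std::borrow::Cow;
use std::time::Duration;
use rand::Rng;

// Sketch against the proposed API: sleep a random 0-500 ms before each
// connection attempt, then hand back the options unmodified.
let pool_options = PgPoolOptions::new().before_connect(|opts| {
    Box::pin(async move {
        let jitter_ms = rand::thread_rng().gen_range(0..500u64);
        tokio::time::sleep(Duration::from_millis(jitter_ms)).await;
        // Return the options unmodified (`Cow::Borrowed`), or a modified copy.
        Ok(Cow::Borrowed(opts))
    })
});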

For a more high-level interface though, I was thinking something like this (mimicking backoff::ExponentialBackoffBuilder):

impl<DB: Database> PoolOptions<DB> {

    /// Set the base value for initial and backoff delay calculations for connection attempts.
    ///
    /// This value will have random jitter applied to mitigate a potential stampeding herd effect.
    ///
    /// See [`.connect_delay_randomization_factor()`][Self::connect_delay_randomization_factor].
    pub fn base_connect_delay(self, delay: impl Into<Option<Duration>>) -> Self { ... }

    // I would like to come up with a more concise name for this,
    // or use something with a rigid definition in statistics, like variance.
    /// The maximum random jitter factor, between 0 and 1, applied to the connection delay.
    ///
    /// A value of 0.5 means that each connect delay will have a random jitter of +/- 50% applied.
    pub fn connect_delay_randomization_factor(self, factor: f64) -> Self { ... }

    /// The exponential growth factor for connection backoffs.
    ///
    /// The backoff delay of each subsequent connection attempt for a given `acquire()` call will be multiplied by this factor,
    /// before jitter is applied.
    pub fn connect_delay_multiplier(self, multiplier: f64) -> Self { ... }
}
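
As a usage sketch (again, these builder methods are proposed, not existing API), the configuration and resulting delays might look like:

use std::time::Duration;

// 100 ms base delay, +/- 50% jitter, doubling on each retry within a
// single `acquire()` call:
//   attempt 1: ~50-150 ms, attempt 2: ~100-300 ms, attempt 3: ~200-600 ms.
let pool_options = PgPoolOptions::new()
    .base_connect_delay(Duration::from_millis(100))
    .connect_delay_randomization_factor(0.5)
    .connect_delay_multiplier(2.0);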
