Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PoolInner::acquire() does not try the idle queue after a transient connection failure #2848

Open
abonander opened this issue Nov 1, 2023 · 1 comment · May be fixed by #3582
Open

PoolInner::acquire() does not try the idle queue after a transient connection failure #2848

abonander opened this issue Nov 1, 2023 · 1 comment · May be fixed by #3582
Labels
bug pool Related to SQLx's included connection pool

Comments

@abonander
Copy link
Collaborator

While thinking about potential sources of the infamous PoolTimedOut error, I realized that there's an interesting failure mode to acquire().

Once it decides to open a new connection, that's all it tries to do:

// Attempt to connect...
return self.connect(deadline, guard).await;

If a nonfatal connection error happens, it just continues in the backoff loop in connect() and never touches the idle queue again:

Ok(Err(Error::Database(error))) if error.is_transient_in_connect_phase() => (),

It will continue to do this until the timeout if the transient error does not resolve itself.

Right now, only the Postgres driver overrides DatabaseError::is_transient_in_connect_phase(), but one of the error codes it considers transient is the "too many connections" error:

// too_many_connections
// This may be returned if we just un-gracefully closed a connection,
// give the database a chance to notice it and clean it up.
"53300",

This means that if the max_connections of the pool exceeds what is currently available on the server, tasks can get stuck in a loop trying to open new connections despite there being idle connections available, leading to surprising PoolTimedOut errors.

This is potentially the cause of some such issues being reported, although it's only likely to occur with the Postgres driver.

@abonander abonander added bug pool Related to SQLx's included connection pool labels Nov 1, 2023
@abonander
Copy link
Collaborator Author

I forgot to mention how I think this should be solved.

The logic of acquire() should be changed to race between acquiring a connection from the idle queue and opening a new connection.

To avoid wasting connection attempts, the connection process should be spawned as a separate task so that if the acquire() call wins the race, the connection is still allowed to complete. If the return type of the task is crate::Result<PoolConnection> then it should automatically return to the pool if the task completes after its handle is dropped.

The acquire() logic should wait a short amount of time before spawning the connection task to avoid growing the pool too eagerly or stampeding the server. This could even be configurable (#2854).

github-merge-queue bot pushed a commit to matter-labs/zksync-era that referenced this issue Feb 7, 2024
## What ❔

- Increases `max_connections` in postgres (locally) from 100 to 200.
- Hopefully, it'll reduce the problems with integration tests in CI.

## Why ❔

- A single server (main node or EN) instance by default uses a
connection pool size of 50, and it spawns several additional unique
connections, making the total slightly higher than 50.
- In integration tests, we run the server and the EN simultaneously,
resulting in more than 100 connections being utilized.
- We recently bumped SQLx to 0.7, which [seems to have problems with
that](launchbadge/sqlx#2848).
- ...with 200 connections supported, we will be below the threshold
again, so it might help.

## Checklist

<!-- Check your PR fulfills the following items. -->
<!-- For draft PRs check the boxes as you complete them. -->

- [ ] PR title corresponds to the body of PR (we generate changelog
entries from PRs).
- [ ] Tests for the changes have been added / updated.
- [ ] Documentation comments have been added / updated.
- [ ] Code has been formatted via `zk fmt` and `zk lint`.
- [ ] Spellcheck has been run via `zk spellcheck`.
- [ ] Linkcheck has been run via `zk linkcheck`.
@abonander abonander linked a pull request Oct 29, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug pool Related to SQLx's included connection pool
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant