I think I found a bug in `Pool` which sometimes causes unexpected `PoolTimedOut` errors. Here is a minimal reproducible test case:
```rust
use sqlx::Sqlite;

#[tokio::test(flavor = "multi_thread")]
async fn race_condition_reproduction_case() {
    let pool: sqlx::Pool<Sqlite> = sqlx::pool::PoolOptions::new()
        .max_connections(1)
        .connect_timeout(std::time::Duration::from_secs(1))
        .connect(":memory:")
        .await
        .unwrap();

    sqlx::query("CREATE TABLE foo (name TEXT, value TEXT)")
        .execute(&pool)
        .await
        .unwrap();

    sqlx::query("CREATE TABLE bar (name TEXT, value TEXT)")
        .execute(&pool)
        .await
        .unwrap();
}
```
The failure is not deterministic; the test might need to be run several times before it fails.
Note: I decreased the connect timeout to make the test fail faster, but the failure still occurs with the default timeout. Also, the failure is most likely to occur when `max_connections` is 1, but I think it can happen whenever the number of subsequent queries is higher than `max_connections`.
What I think happens is a race condition. The first query acquires a connection and then releases it. But the release happens in a spawned task, so the second query might call `acquire` first. It gets to this point, then a context switch happens and the release of the first query finishes, waking a `Waiter` if there is one (there are none in this case). Then context switches back to `acquire`, which runs the `poll_fn` that pushes a new waiter onto the queue. But that waiter never gets woken, because the wake-up already happened, so the call eventually times out.
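For concreteness, here is a minimal model of that interleaving. This is only a sketch: the struct and the `idle_conns`/`waiters` field names stand in for sqlx's internal pool state and are my assumptions, not its actual code.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;
use std::task::Waker;

struct Connection;

struct PoolInner {
    idle_conns: Mutex<VecDeque<Connection>>,
    waiters: Mutex<VecDeque<Waker>>,
}

impl PoolInner {
    // Runs in the spawned release task after the first query finishes.
    fn release(&self, conn: Connection) {
        self.idle_conns.lock().unwrap().push_back(conn);
        // Wake one waiter, if any. In the buggy interleaving this runs
        // after acquire's idle check but before its waiter is pushed,
        // so there is nobody to wake.
        if let Some(waker) = self.waiters.lock().unwrap().pop_front() {
            waker.wake();
        }
    }

    // Simplified poll step of acquire for the second query.
    fn try_acquire(&self, waker: &Waker) -> Option<Connection> {
        // 1. No idle connection yet, so we decide to wait...
        if let Some(conn) = self.idle_conns.lock().unwrap().pop_front() {
            return Some(conn);
        }
        //    <- release() runs entirely in this window: it pushes the
        //       connection and finds no waiters to wake.
        // 2. ...and register a waiter that will never be woken.
        self.waiters.lock().unwrap().push_back(waker.clone());
        None
    }
}
```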
A simple fix which seems to work for me is to check whether `idle_conns` is still empty before pushing the waiter:
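In terms of the sketch above, that check would look roughly like this; again an assumption about the shape of the code, not the actual patch:

```rust
impl PoolInner {
    // Sketch of the first attempted fix: look at the idle queue once
    // more immediately before registering the waiter.
    fn try_acquire_with_recheck(&self, waker: &Waker) -> Option<Connection> {
        if let Some(conn) = self.idle_conns.lock().unwrap().pop_front() {
            return Some(conn);
        }
        // Re-check: a release may have slipped in since the first look.
        if let Some(conn) = self.idle_conns.lock().unwrap().pop_front() {
            return Some(conn);
        }
        //    <- release() can still run in this window, so the race is
        //       narrowed but not eliminated (see the EDIT below).
        self.waiters.lock().unwrap().push_back(waker.clone());
        None
    }
}
```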
EDIT: after a bit more testing, this doesn't seem to be a sufficient fix. The only synchronization between `idle_conns.push` and `idle_conns.is_empty` is via the waiters, but in a case like the one I described there are no waiters yet, so depending on timing the `idle_conns.is_empty()` call might still return `true` and the problem persists ☹️
EDIT 2: it seems that moving the `idle_conns.is_empty()` check after the `let waiter = waiter.get_or_insert(...` line does the trick. That seems to eliminate the last race condition, namely when the `is_empty` check runs first, then `release` runs and notifies the waiters, and finally `Waiter::push_new` is called.
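In the model above, that reordering amounts to registering the waiter first and only then re-checking the idle queue; once more a sketch under the same assumed names, not the real sqlx code:

```rust
impl PoolInner {
    // Sketch of the EDIT 2 fix: register the waiter *before* the final
    // idle-queue check. A release that runs before the registration is
    // caught by the re-check; one that runs after it finds the waiter
    // and wakes it. Either way the wakeup can no longer be lost.
    fn try_acquire_fixed(&self, waker: &Waker) -> Option<Connection> {
        if let Some(conn) = self.idle_conns.lock().unwrap().pop_front() {
            return Some(conn);
        }
        self.waiters.lock().unwrap().push_back(waker.clone());
        // Final check, now safely after the waiter is registered. If it
        // yields a connection, a complete fix would also remove the
        // now-stale waiter before returning.
        self.idle_conns.lock().unwrap().pop_front()
    }
}
```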