Conversation

@ThomWright
Contributor

Does your PR solve an issue?

Fixes #4005

Is this a breaking change?

No

Discussion

I've put together a PoC of offloading the SHA256 HMAC generation into a blocking executor, taking it off the main event loop.
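
The gist, as a simplified sketch (not the exact diff — the real signature and error handling differ; this assumes the hmac and sha2 crates and Tokio for brevity):

use hmac::{Hmac, Mac};
use sha2::Sha256;

// The CPU-bound part: SCRAM's Hi(), a PBKDF2-style iterated HMAC.
fn hi_sync(password: &str, salt: &[u8], iter_count: u32) -> Vec<u8> {
    let mut mac = Hmac::<Sha256>::new_from_slice(password.as_bytes())
        .expect("HMAC accepts keys of any length");

    // First iteration: U_1 = HMAC(password, salt || INT(1)).
    mac.update(salt);
    mac.update(&1u32.to_be_bytes());
    let mut u = mac.finalize_reset().into_bytes();
    let mut hi = u.to_vec();

    // Remaining iterations: U_n = HMAC(password, U_{n-1}), XORed together.
    for _ in 1..iter_count {
        mac.update(u.as_slice());
        u = mac.finalize_reset().into_bytes();
        hi.iter_mut().zip(u.iter()).for_each(|(a, &b)| *a ^= b);
    }

    hi
}

// The PoC wraps that in spawn_blocking so it runs off the event loop.
async fn hi(password: String, salt: Vec<u8>, iter_count: u32) -> Vec<u8> {
    tokio::task::spawn_blocking(move || hi_sync(&password, &salt, iter_count))
        .await
        .expect("blocking task panicked")
}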

There's also a little unit test demonstrating how long this code takes. On my machine this takes ~50ms.

I can remove the test and/or put something more useful in there before merging.

@abonander
Collaborator

The majority of the time is likely being spent in this loop:

for _ in 1..iter_count {
    // One round: U_n = HMAC(password, U_{n-1}), then XOR U_n into the
    // running result. This is the PBKDF2-style core of SCRAM's Hi().
    mac.update(u.as_slice());
    u = mac.finalize_reset().into_bytes();
    hi = hi.iter().zip(u.iter()).map(|(&a, &b)| a ^ b).collect();
}

(It looks like there's an off-by-one error, but the first iteration is outside the loop.)

The iteration count is stored with the hashed password, but it's set for new passwords by the scram_iterations parameter, which defaults to 4096: https://www.postgresql.org/docs/current/runtime-config-connection.html#RUNTIME-CONFIG-CONNECTION-AUTHENTICATION

As the Tokio docs point out, spawn_blocking isn't really designed for CPU-bound code. Because it assumes blocking tasks will spend most of their time actually blocked (e.g. on I/O), it will spawn new blocking threads whenever none are idle, up to a very high limit (512 by default). Though it's not nearly as well documented, the situation with async-std appears to be similar.

So if your server is using a pool with a max_connections set to 100 and you get a burst of traffic, you could end up spawning 100 threads at the same time, all trying to max out the CPU. That sounds counter-productive if you're trying to keep your tail latency low.

I've worked around this in the past using a shared semaphore to limit the number of blocking tasks spawned at once, but I'm not sure that's appropriate here.
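
For reference, that workaround looks roughly like this (a sketch, not sqlx code; the limit of 4 permits is arbitrary):

use std::sync::OnceLock;
use tokio::sync::Semaphore;

// Cap how many CPU-heavy blocking tasks can run at once. Excess callers
// wait on the semaphore instead of spawning yet another thread that
// competes for CPU time with the runtime's worker threads.
static CPU_PERMITS: OnceLock<Semaphore> = OnceLock::new();

async fn spawn_cpu_bound<T, F>(f: F) -> T
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    let permits = CPU_PERMITS.get_or_init(|| Semaphore::new(4));
    // The permit is held until the blocking task finishes.
    let _permit = permits.acquire().await.expect("semaphore is never closed");

    tokio::task::spawn_blocking(f)
        .await
        .expect("blocking task panicked")
}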

However, we could instead just yield to the executor here every so many iterations.

There's tokio::task::coop::consume_budget(), which is specifically designed to be called on every iteration of a CPU-bound algorithm: it only yields once the task's "budget" (as currently implemented, just a counter that starts at 128 and decrements on every call) is depleted. However, it accesses a thread-local on every call, which can have non-trivial overhead.
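
Usage would look something like this (sketch; a trivial loop body stands in for the HMAC rounds):

// Sketch of cooperative yielding via Tokio's budget mechanism.
async fn checksum(data: &[u8]) -> u64 {
    let mut sum = 0u64;
    for &byte in data {
        sum = sum.wrapping_add(byte as u64); // stand-in for one HMAC round

        // Decrements the task's coop budget; only actually yields once
        // the budget is depleted, but touches a thread-local every call.
        tokio::task::coop::consume_budget().await;
    }
    sum
}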

There's also no equivalent for async-std (or smol or async-global-executor, which we're going to be adding support for).

Assuming you haven't changed the default for scram_iterations, that means your system completed 4096 iterations in ~50ms, or on the order of 10us per iteration. The prevailing advice is to yield to the executor at least every 10-100 microseconds, so we could start by yielding every 10 iterations.

And it just so happens that we already have rt::yield_now() for this.
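
With hi() made async, the loop would become something like this (sketch; the modulus is illustrative):

for i in 1..iter_count {
    mac.update(u.as_slice());
    u = mac.finalize_reset().into_bytes();
    hi = hi.iter().zip(u.iter()).map(|(&a, &b)| a ^ b).collect();

    // Hand control back to the executor every 10 iterations (~100µs of
    // work by the estimate above). rt::yield_now() is sqlx's
    // runtime-agnostic wrapper, so this works on Tokio and async-std.
    if i % 10 == 0 {
        rt::yield_now().await;
    }
}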

To prevent spending too much time doing synchronous work on the event loop, we yield every few iterations of the HMAC-SHA256.
@ThomWright
Contributor Author

Hey! Sorry, I missed this message. I was checking the issue and not the PR... Thank you for the thorough response!

As the Tokio docs point out, spawn_blocking isn't really designed for CPU-bound code.

Ah yes, I see!

From the tokio docs:

If your code is CPU-bound and you wish to limit the number of threads used to run it, you should use a separate thread pool dedicated to CPU bound tasks.

This sounds like a potential solution, albeit a bit more involved. And I guess we'd like to avoid bringing in rayon as a dependency?

The majority of the time is likely being spent in this loop:

Yep, looks like it (which makes sense, it's where we do the 4096 iterations):

---- connection::sasl::tests::hi_duration stdout ----
[sqlx-postgres/src/connection/sasl.rs:273:9] duration = 40.899417ms

---- connection::sasl::tests::final_message_duration stdout ----
[sqlx-postgres/src/connection/sasl.rs:265:9] duration = 41.37775ms

And it just so happens that we already have rt::yield_now() for this.

Let's give this a go, at least to start.

On my machine, yielding every 10 iterations, it usually yields every ~100us as expected, but the gap is sometimes much longer. Even yielding every 5 iterations, I occasionally saw 250us between yields.

It's never going to be perfect though, so I've implemented it to yield every 5 iterations, which usually gives us ~50us between yields.

Let me know if you'd like something different!

Is the benchmarking stuff still used?

@abonander
Collaborator

From the tokio docs:

If your code is CPU-bound and you wish to limit the number of threads used to run it, you should use a separate thread pool dedicated to CPU bound tasks.

This sounds like a potential solution, albeit a bit more involved. And I guess we'd like to avoid bringing in rayon as a dependency?

Yeah, for a few reasons:

  • it's not exactly a lightweight dependency
  • the threads will compete for CPU time with the runtime threads, possibly making tail latency worse
  • once spawned, the threads will sit there idle the entire time they're not being used, consuming memory space for their stacks
    • if there was an idle timeout, you'd sometimes have to wait for a new thread to be spawned, also increasing latency
  • the SASL calculation must still run to completion even if the future is cancelled, which could theoretically be used in a DoS attack
    • building cancellation into it would be more work than this change

On my machine, yielding every 10 iterations, it usually yields every ~100us as expected, but the gap is sometimes much longer. Even yielding every 5 iterations, I occasionally saw 250us between yields.

Yeah, that's going to depend on the exact timing of various things. That could just be the thread itself getting pre-empted at some point, which would occur occasionally on that time scale.

If we really wanted to be more precise here, we could measure the time spent in the loop and ensure we yield at least every 50-100us, but that's probably overkill.
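
Something like this (sketch, modifying the loop shown earlier):

use std::time::{Duration, Instant};

// Yield based on elapsed time rather than iteration count, so the gap
// between yields stays bounded even if some iterations run long.
const YIELD_INTERVAL: Duration = Duration::from_micros(50);

let mut last_yield = Instant::now();
for _ in 1..iter_count {
    // ... one HMAC round, as above ...

    if last_yield.elapsed() >= YIELD_INTERVAL {
        rt::yield_now().await;
        last_yield = Instant::now();
    }
}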

The solution you've arrived at is pretty much exactly what I was thinking of.

Is the benchmarking stuff still used?

You mean this?

#[cfg(all(test, not(debug_assertions)))]
#[bench]
fn bench_sasl_hi(b: &mut test::Bencher) {
    use test::black_box;

    let mut rng = rand::thread_rng();
    let nonce: Vec<u8> = std::iter::repeat(())
        .map(|()| rng.sample(rand::distributions::Alphanumeric))
        .take(64)
        .collect();

    b.iter(|| {
        let _ = hi(
            test::black_box("secret_password"),
            test::black_box(&nonce),
            test::black_box(4096),
        );
    });
}

I didn't realize that existed, honestly. I didn't write this code myself. We don't run benchmarks in our CI suite.

That appears to be using libtest benchmarking which seems unlikely to ever be stabilized. And it wouldn't compile anyways because #[bench] doesn't support async fns.

We've since switched to using criterion, which also has native support for async fns. But it doesn't work as well to benchmark internal stuff like this since you have to create a separate bench target and you can't rely on the automatic harness generation.

So it's an open question of what to do with this benchmark function.

You could keep it as-is, but instead of trying to make it an async fn, just wrap the call in sqlx_core::rt::test_block_on.
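
i.e. roughly this (sketch, assuming hi() is now an async fn; the nonce setup is elided):

#[cfg(all(test, not(debug_assertions)))]
#[bench]
fn bench_sasl_hi(b: &mut test::Bencher) {
    // ... build `nonce` as in the original ...
    b.iter(|| {
        // test_block_on drives the now-async hi() to completion, so the
        // bench fn itself can stay synchronous for #[bench].
        sqlx_core::rt::test_block_on(hi(
            test::black_box("secret_password"),
            test::black_box(&nonce),
            test::black_box(4096),
        ))
    });
}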

It's certainly possible to adapt it to criterion, but it's a little more involved and perhaps a bit much to walk you through.

Or you could just delete it (or leave it for posterity but comment it out).

I'm not sure of the right answer, here. I'm leaning towards just deleting it, though.

@ThomWright
Contributor Author

I'm not sure of the right answer, here. I'm leaning towards just deleting it, though.

Done. I guess it'll still be in the commit history for posterity if we ever want it.

Any idea when the next release might be published? I'd be keen to test out a pre-release with a real(ish) workload to see what effect it has.

@ThomWright marked this pull request as ready for review September 4, 2025 10:38
@abonander merged commit 482c942 into launchbadge:main Sep 6, 2025
104 checks passed
@abonander
Collaborator

Any idea when the next release might be published? I'd be keen to test out a pre-release with a real(ish) workload to see what effect it has.

There's so many PRs with breaking changes that still need to be reviewed and merged, so I'm not sure. But I'm working on it.

@AlexTMjugador
Contributor

AlexTMjugador commented Sep 6, 2025

Any idea when the next release might be published? I'd be keen to test out a pre-release with a real(ish) workload to see what effect it has.

If you don't plan to publish your crates on crates.io, or are bound by a policy against doing so, you can use a [patch] section in your Cargo.toml to pull in whatever SQLx commit you prefer, including commits from a fork of yours, and give any unreleased changes a try.
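
For example, pointing at the merge commit above (adjust the rev, or use your own fork; note the patched source must still be semver-compatible with the sqlx version your project requests):

# In your application's Cargo.toml:
[patch.crates-io]
sqlx = { git = "https://github.com/launchbadge/sqlx", rev = "482c942" }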

JosiahParry pushed a commit to JosiahParry/sqlx that referenced this pull request Sep 24, 2025
* Yield while salting password for PG SASL

To prevent spending too much time doing synchronous work on the event loop, we yield every few iterations of the HMAC-SHA256.

* Remove unused bench