Conversation

@ThomWright
Contributor

Does your PR solve an issue?

Fixes #4005

Is this a breaking change?

No

Discussion

I've put together a PoC of offloading the SHA256 HMAC generation into a blocking executor, taking it off the main event loop.
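
The gist, as a simplified sketch (not the exact diff — the real signature and error handling differ; this assumes the hmac and sha2 crates and Tokio for brevity):

use hmac::{Hmac, Mac};
use sha2::Sha256;

// The CPU-bound part: SCRAM's Hi(), a PBKDF2-style iterated HMAC.
fn hi_sync(password: &str, salt: &[u8], iter_count: u32) -> Vec<u8> {
    let mut mac = Hmac::<Sha256>::new_from_slice(password.as_bytes())
        .expect("HMAC accepts keys of any length");

    // First iteration: U_1 = HMAC(password, salt || INT(1)).
    mac.update(salt);
    mac.update(&1u32.to_be_bytes());
    let mut u = mac.finalize_reset().into_bytes();
    let mut hi = u.to_vec();

    // Remaining iterations: U_n = HMAC(password, U_{n-1}), XORed together.
    for _ in 1..iter_count {
        mac.update(u.as_slice());
        u = mac.finalize_reset().into_bytes();
        hi.iter_mut().zip(u.iter()).for_each(|(a, &b)| *a ^= b);
    }

    hi
}

// The PoC wraps that in spawn_blocking so it runs off the event loop.
async fn hi(password: String, salt: Vec<u8>, iter_count: u32) -> Vec<u8> {
    tokio::task::spawn_blocking(move || hi_sync(&password, &salt, iter_count))
        .await
        .expect("blocking task panicked")
}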

There's also a little unit test demonstrating how long this code takes. On my machine this takes ~50ms.

I can remove the test and/or put something more useful in there before merging.

@abonander
Collaborator

The majority of the time is likely being spent in this loop:

for _ in 1..iter_count {
    // One round: U_n = HMAC(password, U_{n-1}), then XOR U_n into the
    // running result. This is the PBKDF2-style core of SCRAM's Hi().
    mac.update(u.as_slice());
    u = mac.finalize_reset().into_bytes();
    hi = hi.iter().zip(u.iter()).map(|(&a, &b)| a ^ b).collect();
}

(It looks like there's an off-by-one error, but the first iteration is outside the loop.)

The iteration count is stored with the hashed password, but it's set for new passwords by the scram_iterations parameter, which defaults to 4096: https://www.postgresql.org/docs/current/runtime-config-connection.html#RUNTIME-CONFIG-CONNECTION-AUTHENTICATION

As the Tokio docs point out, spawn_blocking isn't really designed for CPU-bound code. Because it assumes blocking tasks will spend most of their time actually blocked (e.g. on I/O), it will spawn new blocking threads whenever none are idle, up to a very high limit (512 by default). Though it's not nearly as well documented, the situation with async-std appears to be similar.

So if your server is using a pool with a max_connections set to 100 and you get a burst of traffic, you could end up spawning 100 threads at the same time, all trying to max out the CPU. That sounds counter-productive if you're trying to keep your tail latency low.

I've worked around this in the past using a shared semaphore to limit the number of blocking tasks spawned at once, but I'm not sure that's appropriate here.
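
For reference, that workaround looks roughly like this (a sketch, not sqlx code; the limit of 4 permits is arbitrary):

use std::sync::OnceLock;
use tokio::sync::Semaphore;

// Cap how many CPU-heavy blocking tasks can run at once. Excess callers
// wait on the semaphore instead of spawning yet another thread that
// competes for CPU time with the runtime's worker threads.
static CPU_PERMITS: OnceLock<Semaphore> = OnceLock::new();

async fn spawn_cpu_bound<T, F>(f: F) -> T
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    let permits = CPU_PERMITS.get_or_init(|| Semaphore::new(4));
    // The permit is held until the blocking task finishes.
    let _permit = permits.acquire().await.expect("semaphore is never closed");

    tokio::task::spawn_blocking(f)
        .await
        .expect("blocking task panicked")
}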

However, we could instead just yield to the executor here every so many iterations.

There's tokio::task::coop::consume_budget(), which is specifically designed to be called on every iteration of a CPU-bound algorithm: it only yields once the task's "budget" (as currently implemented, just a counter that starts at 128 and decrements on every call) is depleted. However, it accesses a thread-local on every call, which can have non-trivial overhead.
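
Usage would look something like this (sketch; a trivial loop body stands in for the HMAC rounds):

// Sketch of cooperative yielding via Tokio's budget mechanism.
async fn checksum(data: &[u8]) -> u64 {
    let mut sum = 0u64;
    for &byte in data {
        sum = sum.wrapping_add(byte as u64); // stand-in for one HMAC round

        // Decrements the task's coop budget; only actually yields once
        // the budget is depleted, but touches a thread-local every call.
        tokio::task::coop::consume_budget().await;
    }
    sum
}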

There's also no equivalent for async-std (or smol or async-global-executor, which we're going to be adding support for).

Assuming you haven't changed the default for scram_iterations, that means your system completed 4096 iterations in ~50ms, or on the order of 10us per iteration. The prevailing advice is to yield to the executor at least every 10-100 microseconds, so we could start by yielding every 10 iterations.

And it just so happens that we already have rt::yield_now() for this.
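
With hi() made async, the loop would become something like this (sketch; the modulus is illustrative):

for i in 1..iter_count {
    mac.update(u.as_slice());
    u = mac.finalize_reset().into_bytes();
    hi = hi.iter().zip(u.iter()).map(|(&a, &b)| a ^ b).collect();

    // Hand control back to the executor every 10 iterations (~100µs of
    // work by the estimate above). rt::yield_now() is sqlx's
    // runtime-agnostic wrapper, so this works on Tokio and async-std.
    if i % 10 == 0 {
        rt::yield_now().await;
    }
}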

To prevent spending too much time doing synchronous work on the event loop, we yield every few iterations of the HMAC-SHA256.
@ThomWright
Contributor Author

Hey! Sorry, I missed this message. I was checking the issue and not the PR... Thank you for the thorough response!

As the Tokio docs point out, spawn_blocking isn't really designed for CPU-bound code.

Ah yes, I see!

From the tokio docs:

If your code is CPU-bound and you wish to limit the number of threads used to run it, you should use a separate thread pool dedicated to CPU bound tasks.

This sounds like a potential solution, albeit a bit more involved. And I guess we'd like to avoid bringing in rayon as a dependency?

The majority of the time is likely being spent in this loop:

Yep, looks like it (which makes sense, it's where we do the 4096 iterations):

---- connection::sasl::tests::hi_duration stdout ----
[sqlx-postgres/src/connection/sasl.rs:273:9] duration = 40.899417ms

---- connection::sasl::tests::final_message_duration stdout ----
[sqlx-postgres/src/connection/sasl.rs:265:9] duration = 41.37775ms

And it just so happens that we already have rt::yield_now() for this.

Let's give this a go, at least to start.

On my machine, yielding every 10 iterations, it usually yields every ~100us as expected, but the gap is sometimes much longer. Even yielding every 5 iterations, I occasionally saw 250us between yields.

It's never going to be perfect though, so I've implemented it to yield every 5 iterations, which usually gives us ~50us between yields.

Let me know if you'd like something different!

Is the benchmarking stuff still used?

@abonander
Collaborator

From the tokio docs:

If your code is CPU-bound and you wish to limit the number of threads used to run it, you should use a separate thread pool dedicated to CPU bound tasks.

This sounds like a potential solution, albeit a bit more involved. And I guess we'd like to avoid bringing in rayon as a dependency?

Yeah, for a few reasons:

  • it's not exactly a lightweight dependency
  • the threads will compete for CPU time with the runtime threads, possibly making tail latency worse
  • once spawned, the threads will sit there idle the entire time they're not being used, consuming memory space for their stacks
    • if there was an idle timeout, you'd sometimes have to wait for a new thread to be spawned, also increasing latency
  • the SASL calculation must still run to completion even if the future is cancelled, which could theoretically be used in a DoS attack
    • building cancellation into it would be more work than this change

On my machine, yielding every 10 iterations, it usually yields every ~100us as expected, but the gap is sometimes much longer. Even yielding every 5 iterations, I occasionally saw 250us between yields.

Yeah, that's going to depend on the exact timing of various things. That could just be the thread itself getting pre-empted at some point, which would occur occasionally on that time scale.

If we really wanted to be more precise here, we could measure the time spent in the loop and ensure we yield at least every 50-100us, but that's probably overkill.
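
Something like this (sketch, modifying the loop shown earlier):

use std::time::{Duration, Instant};

// Yield based on elapsed time rather than iteration count, so the gap
// between yields stays bounded even if some iterations run long.
const YIELD_INTERVAL: Duration = Duration::from_micros(50);

let mut last_yield = Instant::now();
for _ in 1..iter_count {
    // ... one HMAC round, as above ...

    if last_yield.elapsed() >= YIELD_INTERVAL {
        rt::yield_now().await;
        last_yield = Instant::now();
    }
}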

The solution you've arrived at is pretty much exactly what I was thinking of.

Is the benchmarking stuff still used?

You mean this?

#[cfg(all(test, not(debug_assertions)))]
#[bench]
fn bench_sasl_hi(b: &mut test::Bencher) {
    use test::black_box;

    let mut rng = rand::thread_rng();
    let nonce: Vec<u8> = std::iter::repeat(())
        .map(|()| rng.sample(rand::distributions::Alphanumeric))
        .take(64)
        .collect();

    b.iter(|| {
        let _ = hi(
            test::black_box("secret_password"),
            test::black_box(&nonce),
            test::black_box(4096),
        );
    });
}

I didn't realize that existed, honestly. I didn't write this code myself. We don't run benchmarks in our CI suite.

That appears to be using libtest benchmarking which seems unlikely to ever be stabilized. And it wouldn't compile anyways because #[bench] doesn't support async fns.

We've since switched to using criterion, which also has native support for async fns. But it doesn't work as well to benchmark internal stuff like this since you have to create a separate bench target and you can't rely on the automatic harness generation.

So it's an open question of what to do with this benchmark function.

You could keep it as-is, but instead of trying to make it an async fn, just wrap the call in sqlx_core::rt::test_block_on.
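
i.e. roughly this (sketch, assuming hi() is now an async fn; the nonce setup is elided):

#[cfg(all(test, not(debug_assertions)))]
#[bench]
fn bench_sasl_hi(b: &mut test::Bencher) {
    // ... build `nonce` as in the original ...
    b.iter(|| {
        // test_block_on drives the now-async hi() to completion, so the
        // bench fn itself can stay synchronous for #[bench].
        sqlx_core::rt::test_block_on(hi(
            test::black_box("secret_password"),
            test::black_box(&nonce),
            test::black_box(4096),
        ))
    });
}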

It's certainly possible to adapt it to criterion, but it's a little more involved and perhaps a bit much to walk you through.

Or you could just delete it (or leave it for posterity but comment it out).

I'm not sure of the right answer, here. I'm leaning towards just deleting it, though.

@ThomWright
Contributor Author

I'm not sure of the right answer, here. I'm leaning towards just deleting it, though.

Done. I guess it'll still be in the commit history for posterity if we ever want it.

Any idea when the next release might be published? I'd be keen to test out a pre-release with a real(ish) workload to see what effect it has.

@ThomWright marked this pull request as ready for review September 4, 2025 10:38
@abonander merged commit 482c942 into launchbadge:main Sep 6, 2025
104 checks passed
@abonander
Collaborator

Any idea when the next release might be published? I'd be keen to test out a pre-release with a real(ish) workload to see what effect it has.

There's so many PRs with breaking changes that still need to be reviewed and merged, so I'm not sure. But I'm working on it.

@AlexTMjugador
Contributor

AlexTMjugador commented Sep 6, 2025

Any idea when the next release might be published? I'd be keen to test out a pre-release with a real(ish) workload to see what effect it has.

If you don't plan to publish your crates on crates.io, or are bound by a policy against doing so, you can use a [patch] section in your Cargo.toml to pull in whatever SQLx commit you prefer, including commits from a fork of yours, and give any unreleased changes a try.
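
For example, pointing at the merge commit above (adjust the rev, or use your own fork; note the patched source must still be semver-compatible with the sqlx version your project requests):

# In your application's Cargo.toml:
[patch.crates-io]
sqlx = { git = "https://github.com/launchbadge/sqlx", rev = "482c942" }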

JosiahParry pushed a commit to JosiahParry/sqlx that referenced this pull request Sep 24, 2025
* Yield while salting password for PG SASL

To prevent spending too much time doing synchronous work on the event loop, we yield every few iterations of the HMAC-SHA256.

* Remove unused bench