Consider using st3 for work-stealing #32

james7132 · 2022-12-05T21:08:34Z

st3 looks to be a fairly promising public crate for implementing tokio's !Send local worker queues which eliminate most of the atomic operations when popping from the local queue.

Right now async_executor uses concurrent_queue for these purposes, but realistically only the workstealing operations need to be threadsafe, the local worker popping from the queue does not actually need to be Send.

Some additional scrutiny on it's thread safety model and some API compatibility additions are definitely needed (see asynchronics/st3#2).

The text was updated successfully, but these errors were encountered:

notgull · 2022-12-06T06:51:12Z

In theory, I'd like to avoid adding things like this to this crate. One of the strengths of this crate (as well as most of the other crates in this repository) is that the code is easy to reason around. It makes it easier to understand it at first glance, and makes it easier to trust it. While the current work-stealing implementation can be improved (I actually have a branch where I'm working on this), I think that keeping it simple is probably the best option. There are other crates for those who want more optimized executors.

That being said, this brings up another issue. The local queues shouldn't really have to be thread-safe that often. Could we replace them with a Mutex<VecDeque> instead of a ConcurrentQueue, now that Mutexes are more finely grained?

james7132 · 2022-12-06T08:43:04Z

In theory, I'd like to avoid adding things like this to this crate. One of the strengths of this crate (as well as most of the other crates in this repository) is that the code is easy to reason around. It makes it easier to understand it at first glance, and makes it easier to trust it. While the current work-stealing implementation can be improved (I actually have a branch where I'm working on this), I think that keeping it simple is probably the best option. There are other crates for those who want more optimized executors.

I'd argue that the user facing logic for st3 is as simple (or even simpler) than concurrent_queue. The only thing that might be more warranted is the underlying thread safety model, which is both being tested by loom and miri on both ends.

That being said, this brings up another issue. The local queues shouldn't really have to be thread-safe that often. Could we replace them with a Mutex instead of a ConcurrentQueue, now that Mutexes are more finely grained?

Isn't this still going to be about the same overhead? The mutex will always at least incur one atomic fence as it checks the lock status if not a full syscall if it's actually locked.

notgull · 2022-12-13T14:57:09Z

The only thing that might be more warranted is the underlying thread safety model, which is both being tested by loom and miri on both ends.

I’d rather avoid depending on something like this if it’s still unstable. I’ll leave this issue open for once st3 is in a more stable state.

Isn't this still going to be about the same overhead? The mutex will always at least incur one atomic fence as it checks the lock status if not a full syscall if it's actually locked.

In smol-rs/event-listener#34, I found that a concurrent-queue was slower than a naïve singly atomically linked list when the list is expected to be empty. Given that the concurrency among the local queues is likely to be very low, I think it’s worth at least an experiment.

james7132 · 2024-04-21T23:00:25Z

Tested out this and saw some substantial improvements in the single-threaded cases, but a major regression in some of the multithreaded cases. The single threaded cases make sense given that popping off a local worker no longer requires any atomic fences. However, it's unclear why there's such a large regression in the spawn_one and spawn_batch benchmarks, while the other multithreaded cases are so mixed.

I suspect it has some unusual interactions with the locking we're doing while stealing/spawning tasks. I'll test this out again with the benchmarks in #112 soon.

executor::create        time:   [680.82 ns 681.07 ns 681.36 ns]
                        change: [-38.393% -38.318% -38.236%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

single_thread/executor::spawn_one
                        time:   [988.49 ns 1.0130 µs 1.0432 µs]
                        change: [-32.591% -27.985% -22.563%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe
single_thread/executor::spawn_batch
                        time:   [35.322 µs 38.898 µs 43.128 µs]
                        change: [-43.524% -36.397% -28.635%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe
single_thread/executor::spawn_many_local
                        time:   [4.8108 ms 4.8590 ms 4.9103 ms]
                        change: [-9.3121% -7.8849% -6.5257%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild
single_thread/executor::spawn_recursively
                        time:   [39.891 ms 40.090 ms 40.296 ms]
                        change: [-21.519% -20.623% -19.731%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
single_thread/executor::yield_now
                        time:   [4.5668 ms 4.6005 ms 4.6468 ms]
                        change: [-13.465% -12.791% -11.894%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

multi_thread/executor::spawn_one
                        time:   [15.152 µs 15.592 µs 15.942 µs]
                        change: [+575.08% +618.49% +667.97%] (p = 0.00 < 0.05)
                        Performance has regressed.
multi_thread/executor::spawn_batch
                        time:   [57.882 µs 62.464 µs 67.913 µs]
                        change: [+59.846% +70.150% +79.827%] (p = 0.00 < 0.05)
                        Performance has regressed.
multi_thread/executor::spawn_many_local
                        time:   [28.761 ms 28.858 ms 28.954 ms]
                        change: [-8.3918% -6.4482% -4.4179%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
Benchmarking multi_thread/executor::spawn_recursively: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 18.2s, or reduce sample count to 20.
multi_thread/executor::spawn_recursively
                        time:   [180.14 ms 180.45 ms 180.79 ms]
                        change: [-3.0653% -2.7293% -2.4013%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
multi_thread/executor::yield_now
                        time:   [23.286 ms 23.361 ms 23.432 ms]
                        change: [+4.6276% +5.2948% +5.9920%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild

notgull added the enhancement New feature or request label Sep 28, 2023

james7132 linked a pull request Apr 22, 2024 that will close this issue

Replace the local worker queues with st3's #115

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consider using st3 for work-stealing #32

Consider using st3 for work-stealing #32

james7132 commented Dec 5, 2022 •

edited

Loading

notgull commented Dec 6, 2022

james7132 commented Dec 6, 2022

notgull commented Dec 13, 2022

james7132 commented Apr 21, 2024 •

edited

Loading

Consider using st3 for work-stealing #32

Consider using st3 for work-stealing #32

Comments

james7132 commented Dec 5, 2022 • edited Loading

notgull commented Dec 6, 2022

james7132 commented Dec 6, 2022

notgull commented Dec 13, 2022

james7132 commented Apr 21, 2024 • edited Loading

james7132 commented Dec 5, 2022 •

edited

Loading

james7132 commented Apr 21, 2024 •

edited

Loading