
tweak threading model for streamers #206

Closed
scottlamb opened this issue Mar 18, 2022 · 3 comments
Labels
performance Performance—throughput and latency rust Rust backend work required

Comments

scottlamb (Owner) commented Mar 18, 2022

In 307a388 I tweaked the threading model for the streamers: before, when using Retina, the RTSP code ran on the main tokio runtime and sent frames over a channel to the per-stream writer threads, which wrote them to disk. Now each stream does everything on its per-stream writer thread.
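As a rough std-only sketch of the two models (plain threads and `std::sync::mpsc` stand in for the real tokio/retina machinery, and the `Frame` type is invented for illustration):

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical simplified frame type; the real ones come from retina.
struct Frame {
    data: Vec<u8>,
}

// Model before 307a388: the RTSP code (running elsewhere) hands each frame
// over a channel to a per-stream writer thread, which writes it to disk.
fn channel_model(frames: Vec<Frame>) -> usize {
    let (tx, rx) = mpsc::channel::<Frame>();
    let writer = thread::spawn(move || {
        // In the real code this writes to disk; here we just count bytes.
        rx.iter().map(|f| f.data.len()).sum::<usize>()
    });
    for f in frames {
        tx.send(f).unwrap(); // one hand-off per frame
    }
    drop(tx); // close the channel so the writer loop ends
    writer.join().unwrap()
}

// Model after 307a388: the per-stream thread does everything itself —
// fetch *and* write — so there is no per-frame hand-off at all.
fn all_on_writer_thread(frames: Vec<Frame>) -> usize {
    thread::spawn(move || frames.into_iter().map(|f| f.data.len()).sum::<usize>())
        .join()
        .unwrap()
}
```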

I expected this to modestly improve CPU usage (at least as much as previous experiments to reduce the multi-thread runtime to 1 thread or to use the current-thread runtime) because it has fewer thread hand-offs per frame and because it avoids the multi-thread runtime's spinning that I previously mentioned here.

Instead, it's a bit slower. On one setup, instead of CPU falling from ~13–14% (currently using a 1-thread multi-thread runtime) to ≤12%, it rose to ~16%. Not huge, but still worse when I was expecting better. I guess spreading the common-case work of each recvfrom syscall over more threads is worse than paying the extra hand-offs per frame.

Ideally I'd be doing writes once per key frame (mentioned previously at #34 and #116). Then it probably makes more sense to use the channel, and the extra buffering there can additionally keep the RTSP stream from stalling even if disk writes momentarily have poor latency.
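A minimal sketch of the once-per-key-frame idea (the `Frame` type, its fields, and the function name are made up for illustration):

```rust
// Hypothetical frame type for illustration.
struct Frame {
    is_key: bool,
    data: Vec<u8>,
}

/// Buffers `frame` and, when a key frame arrives with earlier frames pending,
/// returns the completed batch so the writer does one write per key frame.
fn push_frame(pending: &mut Vec<Frame>, frame: Frame) -> Option<Vec<Frame>> {
    let batch = if frame.is_key && !pending.is_empty() {
        // A new group of pictures starts; hand off everything buffered so far.
        Some(std::mem::take(pending))
    } else {
        None
    };
    pending.push(frame);
    batch
}
```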

I probably won't look at this until after getting rid of moonfire-nvr config, which complicates things because it isn't inherently async.

@scottlamb scottlamb added rust Rust backend work required performance Performance—throughput and latency labels Mar 18, 2022
@scottlamb scottlamb added the bug label Apr 2, 2022

scottlamb commented Apr 2, 2022

Oh, there's an actual bug in the new model. On shutdown, we can get panics like the following:

panic at '/home/slamb/git/retina/src/client/mod.rs:219:22': teardown Sender shouldn't be dropped: RecvError(())
I20220401 16:49:34.810 main moonfire_nvr::cmds::run] Waiting for TEARDOWN requests to complete.
E20220401 16:49:37.682 main moonfire_nvr] panic at '/home/slamb/git/retina/src/client/mod.rs:219:22': teardown Sender shouldn't be dropped: RecvError(())

Backtrace:
   0: failure::backtrace::internal::InternalBacktrace::new
             at /home/slamb/.cargo/registry/src/github.com-1ecc6299db9ec823/failure-0.1.8/src/backtrace/internal.rs:46:44
   1: failure::backtrace::Backtrace::new
             at /home/slamb/.cargo/registry/src/github.com-1ecc6299db9ec823/failure-0.1.8/src/backtrace/mod.rs:121:35
   2: moonfire_nvr::panic_hook
             at src/main.rs:93:13
   3: core::ops::function::Fn::call
             at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/ops/function.rs:70:5
   4: core::ops::function::impls::<impl core::ops::function::Fn<A> for &F>::call
             at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/ops/function.rs:237:13
   5: std::panicking::rust_panic_with_hook
             at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panicking.rs:610:17
   6: std::panicking::begin_panic_handler::{{closure}}
             at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panicking.rs:502:13
   7: std::sys_common::backtrace::__rust_end_short_backtrace
             at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/sys_common/backtrace.rs:139:18
   8: rust_begin_unwind
             at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panicking.rs:498:5
   9: core::panicking::panic_fmt
             at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/panicking.rs:116:14
  10: core::result::unwrap_failed
             at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/result.rs:1690:5
  11: core::result::Result<T,E>::expect
             at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/result.rs:975:23
  12: retina::client::SessionGroup::await_teardown::{{closure}}
             at /home/slamb/git/retina/src/client/mod.rs:217:17
  13: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/future/mod.rs:84:19
  14: moonfire_nvr::cmds::run::inner::{{closure}}
             at src/cmds/run/mod.rs:425:43
  15: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/future/mod.rs:84:19
  16: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/future/future.rs:123:9
  17: <&mut F as core::future::future::Future>::poll
             at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/future/future.rs:111:9
  18: moonfire_nvr::cmds::run::async_run::{{closure}}::{{closure}}
             at /home/slamb/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.17.0/src/macros/select.rs:505:49
  19: <tokio::future::poll_fn::PollFn<F> as core::future::future::Future>::poll
             at /home/slamb/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.17.0/src/future/poll_fn.rs:38:9
  20: moonfire_nvr::cmds::run::async_run::{{closure}}
             at src/cmds/run/mod.rs:182:5
  21: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/future/mod.rs:84:19
  22: tokio::park::thread::CachedParkThread::block_on::{{closure}}
             at /home/slamb/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.17.0/src/park/thread.rs:263:54
  23: tokio::coop::with_budget::{{closure}}
             at /home/slamb/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.17.0/src/coop.rs:102:9
  24: std::thread::local::LocalKey<T>::try_with
             at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/thread/local.rs:412:16
  25: std::thread::local::LocalKey<T>::with
             at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/thread/local.rs:388:9
  26: tokio::coop::with_budget
             at /home/slamb/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.17.0/src/coop.rs:95:5
     tokio::coop::budget
             at /home/slamb/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.17.0/src/coop.rs:72:5
      tokio::park::thread::CachedParkThread::block_on
             at /home/slamb/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.17.0/src/park/thread.rs:263:31
  27: tokio::runtime::enter::Enter::block_on
             at /home/slamb/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.17.0/src/runtime/enter.rs:151:13
  28: tokio::runtime::thread_pool::ThreadPool::block_on
             at /home/slamb/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.17.0/src/runtime/thread_pool/mod.rs:73:9
  29: tokio::runtime::Runtime::block_on
             at /home/slamb/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.17.0/src/runtime/mod.rs:477:43
  30: moonfire_nvr::cmds::run::run
             at src/cmds/run/mod.rs:147:13
  31: moonfire_nvr::Args::run
             at src/main.rs:68:29
  32: moonfire_nvr::main
             at src/main.rs:143:9

This happens because the code assumes the tokio runtime that was used to create the session is still around, when it has actually been shut down without waiting for its tasks to complete.
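The failure mode can be reproduced in miniature with `std::sync::mpsc` (retina's teardown channel is tokio-based, but the semantics are analogous): if whatever holds the `Sender` goes away before sending, the receiver gets a `RecvError`, and an `.expect()` on that result panics.

```rust
use std::sync::mpsc;

// Stand-in for the teardown task: it owns the Sender and would normally
// send a result once TEARDOWN completes.
fn teardown_result() -> Result<(), mpsc::RecvError> {
    let (tx, rx) = mpsc::channel::<()>();
    // The runtime is shut down without the task ever running, so the
    // Sender is simply dropped...
    drop(tx);
    // ...and the waiting side sees RecvError. await_teardown's
    // `.expect("teardown Sender shouldn't be dropped")` turns this
    // into the panic above.
    rx.recv()
}
```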

@scottlamb scottlamb changed the title tweak threading model for streamers tweak threading model for streamers, fix panic at '/home/slamb/git/retina/src/client/mod.rs:219:22': teardown Sender shouldn't be dropped: RecvError(()) Apr 2, 2022
@scottlamb scottlamb changed the title tweak threading model for streamers, fix panic at '/home/slamb/git/retina/src/client/mod.rs:219:22': teardown Sender shouldn't be dropped: RecvError(()) tweak threading model for streamers, fix teardown Sender shouldn't be dropped: RecvError(()) Apr 2, 2022

scottlamb commented Apr 13, 2022

Hmm, the panic on shutdown should be fixed with 7b0a489, but the performance moved further in the wrong direction. With a 1-thread "multi-threaded" tokio reactor on my main setup:

| commit | CPU | flamegraph | notes |
| --- | --- | --- | --- |
| ffmpeg (unsure of exact commit) | ~16% | flamegraph-moonfire-with-ffmpeg | old test results, flamegraph; see here |
| retina (unsure of exact commit) | ~14% | flamegraph-moonfire-with-retina | same commit as above, different command-line parameters |
| v0.7.2-1-g307a388 (I think) | ~16% | (don't have handy) | old test results, flamegraph; regression with move to separate per-stream single-thread reactor |
| v0.7.3-1-g5e7d558 | ~15% | flamegraph-v0.7.3-1-g5e7d558 | as of today; likely improved by new Rust and tokio versions, including this one that promises significant performance improvements |
| v0.7.3-2-g7b0a489 | ~19% | flamegraph-v0.7.3-2-g7b0a489 | as of today; another regression |
| v0.7.3-3-g967834c | ~16% | flamegraph-v0.7.3-3-g967834c | clawed back some performance via a tokio::task::spawn per frame |

Looks like having the block_on calls on one thread outside the reactor is worse than using the channel for one hand-off per frame. It might be incurring a bunch of handoffs for futures it launches or something.

So I'm going to keep this issue open for now. I could go back to the more performant model in which the stream threads pull from a channel, but that has the downside I mentioned in the 7b0a489 commit description: the previous session could still be open when we start another. And I don't think it's the absolute best model either. I think that's having the streamers started from moonfire_nvr::cmds::run::run be async and push data once per key frame to a single thread per sample file directory. That requires more work, but as I've mentioned elsewhere, batching frames will be helpful for audio support anyway.
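The per-directory model described above might look roughly like this (std threads and a byte-vector payload stand in for the real async streamers and sample-file types, which are assumptions here):

```rust
use std::sync::mpsc;
use std::thread;

// One writer thread per sample file directory; many streamers clone the
// Sender and push one batch per key frame. The writer thread returns the
// total bytes it "wrote" (here it just counts them).
fn spawn_dir_writer() -> (mpsc::Sender<Vec<u8>>, thread::JoinHandle<usize>) {
    let (tx, rx) = mpsc::channel::<Vec<u8>>();
    let handle = thread::spawn(move || rx.iter().map(|batch| batch.len()).sum::<usize>());
    (tx, handle)
}
```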

Update: 967834c now does a tokio::task::spawn per frame, and we're basically back where we were with ffmpeg. Good enough for now.

@scottlamb scottlamb reopened this Apr 13, 2022
@scottlamb scottlamb changed the title tweak threading model for streamers, fix teardown Sender shouldn't be dropped: RecvError(()) tweak threading model for streamers Apr 13, 2022
@scottlamb scottlamb removed the bug label Apr 13, 2022
scottlamb added a commit that referenced this issue Apr 14, 2022