use larger read buffers #5

Open
scottlamb opened this issue Jun 30, 2021 · 1 comment
Labels
performance Performance (latency, CPU/RAM efficiency)

Comments

@scottlamb
Owner

scottlamb commented Jun 30, 2021

I was mildly surprised Moonfire NVR's CPU usage didn't go down noticeably when switching from ffmpeg to retina. There's not much CPU used in retina's code itself but I think it's making too many syscalls because it's using buffers that have too little available space. The histogram below counts reads that filled the buffer (and thus will require a follow-up syscall) bucketed by the available space in the buffer when the read started:

[slamb@nuc ~]$ cat recvsize.bt
#!/usr/bin/bpftrace

tracepoint:syscalls:sys_enter_recvfrom /pid == (uint64) $1/ {
    @sizes[tid] = (int64) args->size;
}

tracepoint:syscalls:sys_exit_recvfrom /pid == (uint64) $1/ {
    if (@sizes[tid] > 0 && args->ret == @sizes[tid]) {
        @full_read_sizes = hist(@sizes[tid]);
    }
    delete(@sizes[tid]);
}

interval:s:60 { exit() }
[slamb@nuc ~]$ sudo ./recvsize.bt "$(pidof moonfire-nvr)"
Attaching 3 probes...


@full_read_sizes:
[1]                    1 |                                                    |
[2, 4)                 0 |                                                    |
[4, 8)                 6 |                                                    |
[8, 16)               16 |                                                    |
[16, 32)              47 |                                                    |
[32, 64)             125 |                                                    |
[64, 128)            267 |@                                                   |
[128, 256)           531 |@@@                                                 |
[256, 512)          1119 |@@@@@@                                              |
[512, 1K)           8319 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1K, 2K)            3132 |@@@@@@@@@@@@@@@@@@@                                 |
[2K, 4K)             149 |                                                    |
[4K, 8K)              42 |                                                    |

That's ~230 times per second the buffer filled (across 12 video streams); in most cases it was reading into a buffer with less than 1 KiB available.
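
For anyone checking the arithmetic, here is a small sketch that recomputes those two figures from the histogram above. The bucket counts are copied from the bpftrace output and the 60 seconds comes from the `interval:s:60` probe; the names `RETINA_FULL_READS` and `summarize` are made up for illustration.

```rust
/// (bucket lower bound in bytes, count of full reads), copied from the
/// retina histogram above. The constant name is hypothetical.
const RETINA_FULL_READS: [(u64, u64); 13] = [
    (1, 1),
    (2, 0),
    (4, 6),
    (8, 16),
    (16, 47),
    (32, 125),
    (64, 267),
    (128, 531),
    (256, 1119),
    (512, 8319),
    (1024, 3132),
    (2048, 149),
    (4096, 42),
];

/// Returns (total full reads, full reads per second, fraction of full reads
/// that started with less than 1 KiB of buffer space available).
fn summarize(hist: &[(u64, u64)], secs: u64) -> (u64, f64, f64) {
    let total: u64 = hist.iter().map(|&(_, count)| count).sum();
    let below_1k: u64 = hist
        .iter()
        .filter(|&&(lower, _)| lower < 1024)
        .map(|&(_, count)| count)
        .sum();
    (
        total,
        total as f64 / secs as f64,
        below_1k as f64 / total as f64,
    )
}
```

`summarize(&RETINA_FULL_READS, 60)` gives 13754 full reads in total, a bit over 229 per second, with roughly three quarters of them in buckets below 1 KiB.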

I'm letting tokio_util::codec::Framed do the buffer management now, but I think I should do it myself instead. Or at least call reserve(4096) before returning from Codec::decode (regardless of whether it was able to pull a full message).
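
A rough, std-only sketch of why the spare capacity matters. This is not retina's code and does not use tokio_util; `read_once` and `count_reads` are hypothetical helpers, and a `Cursor` stands in for the socket (unlike a real socket, it always has all remaining data ready). The idea is just that guaranteeing a minimum amount of free space before every read cuts down the number of reads that fill the buffer and force a follow-up syscall.

```rust
use std::io::{self, Cursor, Read};

/// Performs one read into `buf` after making sure `min_spare` bytes of space
/// are available. Returns (bytes read, whether the read filled that space).
fn read_once<R: Read>(
    src: &mut R,
    buf: &mut Vec<u8>,
    min_spare: usize,
) -> io::Result<(usize, bool)> {
    let old_len = buf.len();
    // Grow the buffer so the read has exactly `min_spare` bytes to fill.
    buf.resize(old_len + min_spare, 0);
    let n = src.read(&mut buf[old_len..])?;
    // Keep only the bytes that were actually read.
    buf.truncate(old_len + n);
    Ok((n, n == min_spare))
}

/// Drains `data` with the given spare-space policy and returns
/// (total reads that returned data, reads that filled the available space).
fn count_reads(data: &[u8], min_spare: usize) -> io::Result<(usize, usize)> {
    let mut src = Cursor::new(data);
    let mut buf = Vec::new();
    let (mut reads, mut full) = (0, 0);
    loop {
        let (n, filled) = read_once(&mut src, &mut buf, min_spare)?;
        if n == 0 {
            return Ok((reads, full)); // end of input
        }
        reads += 1;
        if filled {
            full += 1;
        }
        // Pretend the decoder consumed everything it was given.
        buf.clear();
    }
}
```

With 8192 bytes pending, `count_reads(&data, 512)` takes 16 reads, all of them full, while `count_reads(&data, 4096)` takes 2. In a real `Decoder`, the equivalent step would be the `reserve(4096)` call on the source buffer mentioned above.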

@scottlamb
Owner Author

Actually, that's a lot better than ffmpeg already. /shruggie Running with ffmpeg for a minute, the buffer filled ~8000 times per second.

$ sudo ./recvsize.bt "$(pidof moonfire-nvr)"
Attaching 3 probes...


@full_read_sizes:
[1]               161433 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2, 4)            159932 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[4, 8)                59 |                                                    |
[8, 16)            12876 |@@@@                                                |
[16, 32)            5692 |@                                                   |
[32, 64)           27575 |@@@@@@@@                                            |
[64, 128)          11172 |@@@                                                 |
[128, 256)          6422 |@@                                                  |
[256, 512)          7465 |@@                                                  |
[512, 1K)          10387 |@@@                                                 |
[1K, 2K)           78285 |@@@@@@@@@@@@@@@@@@@@@@@@@                           |

Looking at actual CPU rates (by diffing /sys/fs/cgroup/cpu/system.slice/moonfire-nvr.service/cpuacct.usage):

config                                            cpu usage (% of one core)
ffmpeg                                            16%
retina, multi-threaded tokio runtime, 4 threads   21%
retina, multi-threaded tokio runtime, 1 thread    14%
retina, current thread tokio runtime              12%
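
A minimal sketch of that diffing, assuming cgroup v1, where `cpuacct.usage` holds the cgroup's cumulative CPU time in nanoseconds. The function names are hypothetical; the idea is to read the file twice, some known interval apart, and divide the difference by the elapsed wall-clock time.

```rust
use std::time::Duration;

/// Reads a cumulative CPU-time counter (nanoseconds) from a file such as
/// .../moonfire-nvr.service/cpuacct.usage. Hypothetical helper.
fn read_usage_ns(path: &str) -> std::io::Result<u64> {
    let text = std::fs::read_to_string(path)?;
    text.trim()
        .parse::<u64>()
        .map_err(|e| std::io::Error::new(std::io::ErrorKind::InvalidData, e))
}

/// Converts two counter samples taken `elapsed` apart into a percentage of
/// one core: 100 * (CPU nanoseconds used) / (wall-clock nanoseconds elapsed).
fn cpu_percent_of_one_core(before_ns: u64, after_ns: u64, elapsed: Duration) -> f64 {
    let used_ns = after_ns.saturating_sub(before_ns) as f64;
    100.0 * used_ns / elapsed.as_nanos() as f64
}
```

For example, a counter that advances by 1.2 seconds of CPU time over a 10 second window works out to 12% of one core.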

tl;dr version of investigating where that CPU is going: it has more to do with Moonfire NVR's current tokio and thread setup than with retina.

The flamegraph says tokio wastes a bit of CPU on sched_yield here; it's less with fewer threads, and it goes away with the current-thread runtime. I think this sched_yield loop is silly, and maybe I'll eventually convince the tokio folks of that.

There are also a fair number of thread handoffs, because currently I'm using retina in the tokio threads and handing off to a thread per stream every frame to write data. Eventually I'll have one writer thread per sample file directory (2 instead of 12 in this deployment) and only write once per GOP (every 1 or 2 seconds instead of every 1/10th to 1/30th of a second), which will reduce the handoffs.
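
To illustrate the once-per-GOP idea, here is a hedged, std-only sketch. It is not Moonfire NVR's code: `Frame` and `write_by_gop` are hypothetical, and the "writer" only counts bytes instead of touching disk. The reader side accumulates frames and hands a whole GOP to the writer thread each time the next key frame arrives, so the number of handoffs tracks GOPs rather than frames.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for a received video frame.
struct Frame {
    is_key: bool,
    data: Vec<u8>,
}

/// Hands frames to a single writer thread one GOP at a time.
/// Returns (number of handoffs, total bytes the writer received).
fn write_by_gop(frames: Vec<Frame>) -> (usize, usize) {
    let (tx, rx) = mpsc::channel::<Vec<Frame>>();

    // Writer thread: receives whole GOPs until the sender is dropped.
    let writer = thread::spawn(move || {
        let mut bytes = 0usize;
        for gop in rx {
            bytes += gop.iter().map(|f| f.data.len()).sum::<usize>();
        }
        bytes
    });

    let mut handoffs = 0;
    let mut gop = Vec::new();
    for frame in frames {
        // A key frame starts a new GOP, so flush the previous one first.
        if frame.is_key && !gop.is_empty() {
            tx.send(std::mem::take(&mut gop)).unwrap();
            handoffs += 1;
        }
        gop.push(frame);
    }
    // Flush the final, possibly partial, GOP.
    if !gop.is_empty() {
        tx.send(gop).unwrap();
        handoffs += 1;
    }
    drop(tx); // closes the channel so the writer's loop ends
    (handoffs, writer.join().unwrap())
}
```

With 60 frames and a key frame every 30, this makes 2 handoffs instead of 60.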

@scottlamb scottlamb added the performance Performance (latency, CPU/RAM efficiency) label Jul 26, 2021
scottlamb added a commit to scottlamb/moonfire-nvr that referenced this issue Mar 18, 2022
* switch the config interface over to use Retina and make the test
  button honor rtsp_transport = udp.

* adjust the threading model of the Retina streaming code.

  Before, it spawned a background future that read from the runtime and
  wrote to a channel. Other calls read from this channel.

  After, it does work directly from within the block_on calls (no
  channels).

  The immediate motivation was that the config interface didn't have
  another runtime handy. And passing in a current thread runtime
  deadlocked. I later learned this is a difference between
  Runtime::block_on and Handle::block_on. The former will drive IO and
  timers; the latter will not.

  But avoiding so many thread hand-offs is also more efficient: it
  saves both the context switches and the extra spinning that
  tokio appears to do, as mentioned here:
  scottlamb/retina#5 (comment)

  This may not be the final word on the threading model. Eventually
  I may not have per-stream writing threads at all. But I think it will
  be easier to look at this after getting rid of the separate
  `moonfire-nvr config` subcommand in favor of a web interface.

* in tests, read `.mp4` files via the `mp4` crate rather than ffmpeg.
  The annoying part is that this doesn't parse edit lists; oh well.

* simplify the `Opener` interface. Formerly, it'd take either a RTSP
  URL or a path to a `.mp4` file, and they'd share some code because
  they both sometimes used ffmpeg. Now, they're totally different
  libraries (`retina` vs `mp4`). Pull the latter out to a `testutil`
  module with a different interface that exposes more of the `mp4`
  stuff. Now `Opener` is just for RTSP.

* simplify the h264 module. It had a lot of logic to deal with Annex B.
  Retina doesn't use this encoding.

Fixes #36
Fixes #126