
audio support #34

Open · scottlamb opened this issue Mar 9, 2018 · 4 comments
Labels: enhancement · rust (Rust backend work required) · schema change (Requires a database schema change; see new-schema branch)

scottlamb commented Mar 9, 2018

It'd be nice to support audio. I think there are some non-trivial things to work out, though:

  • ffmpeg's rtsp support has a nasty bug when an audio stream is present; see poor behavior when camera has audio enabled #36. [edit: see investigation here. This bug may be rarer than I realized, so perhaps it's not really a blocker for most people.]
  • time unit. I picked 90 kHz as the fundamental unit of time because that works out well for RTSP/H.264 video. But I think audio codecs use different rates (such as 8 kHz). Do we use the lowest common multiple of the supported audio and video rates? If it's much higher, when does that wrap / when does converting it to a float for JavaScript lose precision? (See the back-of-envelope sketch after this list.) One option: have a higher frequency in the database schema for timestamps on recording rows. In the video index, keep 90 kHz or even less to save storage space. The final frame is implicitly the rest of the duration of the total recording.
  • Does this affect where we can seek to / how we generate edit lists?
  • Need to define an audio_index like video_index.
  • We have to interleave the video and audio at some reasonably fine-grained interval (every key frame?). Does this significantly blow up the in-memory .mp4 data structures? How do we represent the interleaving?
  • Generating the sample entry box, similar to what we do for H.264 video.
  • Frequency adjustment: as described in design/time.md, Moonfire NVR adjusts the timestamps of video frames to correct for inaccurate clocks in the cameras. Can we do those adjustments on audio without noticeably changing the pitch? Or would we have to decode/re-encode the audio? How much CPU would this require?
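For the precision question in the time-unit bullet, a back-of-envelope sketch (my own rough numbers, nothing decided):

```rust
// Back-of-envelope only: how many years of timestamps fit below 2^53 ticks, the
// largest integer a JavaScript number (f64) can represent exactly.
fn years_until_precision_loss(ticks_per_second: u64) -> f64 {
    const TWO_POW_53: f64 = 9_007_199_254_740_992.0;
    const SECONDS_PER_YEAR: f64 = 365.25 * 24.0 * 3600.0;
    TWO_POW_53 / ticks_per_second as f64 / SECONDS_PER_YEAR
}

fn main() {
    // Current 90 kHz timebase: roughly 3,000+ years, so no practical concern.
    println!("90 kHz:    ~{:.0} years", years_until_precision_loss(90_000));
    // The LCM of 90 kHz, 48 kHz, and 44.1 kHz is 35.28 MHz: only ~8 years, so a
    // timebase that high would need a recent epoch (or 64-bit-safe handling in the UI).
    println!("35.28 MHz: ~{:.0} years", years_until_precision_loss(35_280_000));
}
```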
scottlamb added this to the wishlist milestone on Mar 9, 2018
scottlamb added the schema change (Requires a database schema change; see new-schema branch) label and removed the milestone-wishlist label on Mar 9, 2018
scottlamb added a commit that referenced this issue Feb 14, 2019
The 091217b workaround of telling ffmpeg to only request the video
stream works perfectly fine for now. I'll revisit when adding audio
support (#34).

Fixes #36
scottlamb commented

I've heard from a few people, most recently @jlpoolen here, that this is an important basic feature, so I'm considering trying to support it for 1.0. I tweaked the list of things to consider above; a few of them still seem a little daunting, though. Frequency adjustment is the newest addition.

scottlamb modified the milestones: wishlist, 1.0? on Jun 28, 2020

scottlamb commented Jun 30, 2020

I've been thinking through a design that addresses these issues. It's certainly not simple, but I'm gaining confidence we can do it.

First, I've rejected the idea of messing with audio durations. I don't want to deal with (lossy, messy, maybe slow) decoding and re-encoding of AAC. I'd rather Moonfire NVR just pass along the original audio as it does the video. And obviously I don't want to do weird things to the pitch, so I can't just change the sample rate without redoing that encoding.

So what about frequency adjustment? I don't want to give it up. Sometimes, over the course of days or weeks, small drift can add up to tens of seconds, which is really noticeable without this adjustment. But I think we can change how it works. Basically, keep the frame indices in terms of uncorrected durations. At the recording row level, keep track of both the corrected start/end and the uncorrected duration.

When exporting a single camera's .mp4 to view in a player like VLC, use the corrected start/end to find the right recordings, then generate the .mp4 with uncorrected durations. The slight mismatch between the duration that was requested and what we actually generate isn't really a problem: reasonable .mp4 files are short enough that there's not much mismatch, and it wouldn't be a big problem anyway. We can already include timestamps in the video (as a subtitle track, and most cameras support including them in the encoded video as well), so a human can still find the time they're looking for.

When trying to view multiple cameras in sync through the Moonfire NVR browser UI, we'd need to make small tweaks to the HTMLVideoElement.playbackRate anyway because browsers don't keep multiple video elements perfectly in sync. Having this small additional drift between cameras doesn't make that significantly harder.
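As a sketch of what that might look like at the row level (field names here only echo the idea; this is not a final schema):

```rust
/// Sketch of the recording-row idea, not a final schema. Offsets and durations are 90 kHz ticks.
struct RecordingRow {
    /// Clock-corrected start time; with `wall_duration_90k`, used to find recordings by time.
    wall_start_90k: i64,
    /// Corrected duration (what actually elapsed on the wall clock).
    wall_duration_90k: i32,
    /// Sum of the uncorrected frame durations, as stored in the index and used in the .mp4.
    media_duration_90k: i32,
}

impl RecordingRow {
    /// Maps an offset on the corrected (wall) timeline to the uncorrected (media)
    /// timeline by simple linear scaling.
    fn wall_to_media(&self, wall_off_90k: i32) -> i32 {
        debug_assert!(wall_off_90k <= self.wall_duration_90k);
        (i64::from(wall_off_90k) * i64::from(self.media_duration_90k)
            / i64::from(self.wall_duration_90k)) as i32
    }
}
```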

I've also rejected the idea of adjusting the video durations so that IDR frames land exactly on an audio frame boundary. My Dahua camera (when using AAC at 48 kHz, which seems like the most reasonable of its possible encodings if you want to understand voices) produces an audio frame roughly every 20 ms. Adjusting a single frame's timing by up to 20 ms seems like it'd be pretty noticeable (visibly choppy). Smearing that adjustment over a whole (typically 1- or 2-second) GOP is probably okay timing-wise (~2% rate change). But it'd be weird for live streams. Right now we send full GOPs at once anyway because of how the browser-side APIs work, but I hope that with WebCodecs we'll be able to send each frame immediately. This smear would either prevent us from doing so (since we'd have to wait until we know the GOP's adjusted duration) or require us to adjust the durations after sending the frames to the browser, which seems like a messy protocol addition.

I think that means that recordings from the same RTSP session need to overlap. Specifically, either:

  • Have separate audio and video recording tables. Split each type at its convenient time, which won't be exactly the same. Join them back up on playback. OR
  • Have one recording with both, but the audio starts a little (tens of milliseconds) sooner than the video, and the video ends a little later than the audio (or vice versa).

I haven't decided between the two yet. I think either could work; it's just a matter of which is easier to understand and implement.

Another variation: duplicate the overlapping part of the audio into two recordings, and mark how much needs to be trimmed from the start/end on playback. The extra disk space shouldn't be too noticeable. When composing adjacent recordings into one saved .mp4 or SourceBuffer, adjust things to only include one copy of the duplicate part.
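To make that last variation concrete (illustrative only; none of this exists in the schema):

```rust
/// Illustrative only: per-recording bookkeeping for the "duplicate the overlapping
/// audio" variation, in 90 kHz ticks. Playing a single recording needs no trimming;
/// when composing adjacent recordings into one .mp4 or SourceBuffer, these amounts
/// say how much duplicated audio to drop so only one copy remains.
struct AudioOverlap {
    /// Audio at the start of this recording that also appears at the end of the previous one.
    duplicated_from_prev_90k: i32,
    /// Audio at the end of this recording that also appears at the start of the next one.
    duplicated_into_next_90k: i32,
}
```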

ffmpeg's rtsp support has a nasty bug when an audio stream is present; see #36.

Given that this bug may be triggered more rarely than I thought, I'm going to say it's not a blocker. I'd definitely like to either have it fixed or move to a new, pure-Rust RTSP library, but I think there's no reason we can't try out audio support in the meantime. It'd be optional anyway; at worst, folks turn it off and are no worse off than they'd be otherwise.

time unit ... One option: have a higher frequency in the database schema for timestamps on recording rows. In the video index, keep 90 kHz or even less to save storage space. The final frame is implicitly the rest of the duration of the total recording.

I think this option will work fine. We can choose a database row-level timebase that's the least common multiple of all the reasonable audio sampling rates (44.1 kHz, 48 kHz) and, if necessary, adjust the epoch so that the JavaScript precision limit of 2^53 is far enough in the future. Then we can make the video durations fit this timebase or a fraction of it. It's certainly not a problem for a video frame's timing to be off by less than a millisecond.
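Concretely (my own arithmetic, just sanity-checking the idea):

```rust
fn gcd(a: u64, b: u64) -> u64 {
    if b == 0 { a } else { gcd(b, a % b) }
}

fn main() {
    // Least common multiple of the reasonable audio sampling rates.
    let (a, b) = (44_100u64, 48_000u64);
    let timebase = a / gcd(a, b) * b;
    assert_eq!(timebase, 7_056_000); // Hz

    // Audio samples convert exactly; 90 kHz video ticks don't (factor 78.4), but the
    // per-frame rounding error is a fraction of a microsecond, far below a millisecond.
    assert_eq!(timebase / 44_100, 160);
    assert_eq!(timebase / 48_000, 147);
    assert_eq!(timebase % 90_000, 36_000); // not an exact multiple of 90 kHz

    // At this rate, 2^53 ticks is roughly 40 years, so the epoch may need adjusting to
    // keep JavaScript's exact-integer limit comfortably in the future.
    let years = (1u64 << 53) as f64 / timebase as f64 / (365.25 * 24.0 * 3600.0);
    println!("~{:.0} years of exact f64 representation", years);
}
```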

Does this affect where we can seek to / how we generate edit lists?

We can still seek anywhere. We do need edit lists even when starting at the "beginning" of a recording, because the video and audio will never begin at exactly the same time. That's apparently just the normal way of doing things when generating .mp4 files from RTP; ISO/IEC 14496-12 section H.3.2 describes how to do it.
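Roughly what that looks like (fields paraphrased from the spec's EditListBox; the 30 ms scenario is made up):

```rust
/// Paraphrase of an ISO/IEC 14496-12 EditListBox ("elst") entry; illustration only.
struct EditListEntry {
    /// Length of this edit, in movie-timescale units.
    segment_duration: u64,
    /// Where in the track's media this edit starts, in media-timescale units (-1 = empty edit).
    media_time: i64,
    /// Playback rate for this edit; normally 1.0.
    media_rate: f32,
}

/// Made-up scenario: the audio track's first sample is 30 ms earlier than the video's
/// first frame. Starting the audio's single edit 30 ms into its media lines the two
/// tracks up so both begin at presentation time zero.
fn audio_edit_list(audio_timescale: u64, presented_duration_movie_ts: u64) -> Vec<EditListEntry> {
    let skip_media_units = (audio_timescale * 30 / 1000) as i64; // 30 ms of audio
    vec![EditListEntry {
        segment_duration: presented_duration_movie_ts,
        media_time: skip_media_units,
        media_rate: 1.0,
    }]
}
```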

Need to define an audio_index like video_index.

TBD but doable.
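One hypothetical shape for it (not a committed design): because AAC-LC frames each cover a fixed 1,024 samples, the duration is implicit from the sample rate, so the index might only need per-frame byte sizes, unlike video_index's per-frame durations.

```rust
/// Hypothetical audio_index layout, not a committed design.
struct AudioIndex {
    /// 1024 for AAC-LC; fixed per stream.
    samples_per_frame: u32,
    /// e.g. 8_000 or 48_000.
    sample_rate: u32,
    /// Byte length of each AAC frame, in order; durations are implicit.
    frame_sizes: Vec<u32>,
}

impl AudioIndex {
    /// Total duration of the indexed audio, in 90 kHz ticks (rounded down).
    fn duration_90k(&self) -> u64 {
        let samples = self.frame_sizes.len() as u64 * u64::from(self.samples_per_frame);
        samples * 90_000 / u64::from(self.sample_rate)
    }
}
```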

We have to interleave the video and audio at some reasonably fine-grained interval (every key frame?). Does this significantly blow up the in-memory .mp4 data structures? How do we represent the interleaving?

The .mp4 format doesn't say how tight the interleaving has to be, but it's probably best for the generated .mp4 to interleave either GOP-by-GOP or in strictly increasing wall-clock order, to avoid extra HTTP range requests as the browser seeks around. Likewise, it's probably best to do the same on disk so that no seeking is required to read both simultaneously.

It seems easiest if we can store the samples in exactly the order we receive them in the RTP stream, rather than having fancy buffering in the write path. But ISO/IEC 14496-12 section H.3.2 notes that "Audio and video streams may not be perfectly interleaved in terms of presentation times in transmission order [in the incoming RTP stream]." If an IDR frame can arrive out of order with respect to the audio sample before it, we'll need to do some buffering to put the audio sample in the right recording. That's annoying (and the write path already feels ugly) but it's still fundamentally possible.
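Roughly the shape of that buffering (a sketch with stand-in types, not the actual write path):

```rust
// Sketch with stand-in types, not the actual write path. When a split is planned at an
// IDR frame, the previous recording has to stay open until the audio timestamps catch
// up to the IDR, because an audio frame with an earlier presentation time may still
// arrive later in transmission order.
use std::collections::VecDeque;

struct Frame {
    pts_90k: i64,
    is_video: bool,
    data: Vec<u8>,
}

enum Disposition {
    /// Belongs in the previous (about-to-close) recording.
    Previous(Frame),
    /// Parked for the new recording; the previous recording must stay open.
    Parked,
    /// Parked, and audio has caught up to the IDR, so the previous recording can close.
    ParkedAndFinalize,
}

struct PendingSplit {
    /// Presentation time of the IDR frame that will start the next recording.
    idr_pts_90k: i64,
    /// The IDR frame and everything received after it, destined for the next recording.
    parked_for_next: VecDeque<Frame>,
}

impl PendingSplit {
    fn on_frame(&mut self, f: Frame) -> Disposition {
        if !f.is_video && f.pts_90k < self.idr_pts_90k {
            return Disposition::Previous(f); // late audio that predates the split point
        }
        // Audio at/after the IDR: assuming audio itself arrives in order, no earlier
        // audio remains, so the previous recording can be finalized.
        let caught_up = !f.is_video;
        self.parked_for_next.push_back(f);
        if caught_up {
            Disposition::ParkedAndFinalize
        } else {
            Disposition::Parked
        }
    }
}
```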

My concern about the blowup was referring to the "slices" part of the mp4::File. (Here is some example debugging output.) Currently the video sample data of each recording is a single chunk (the VideoSampleData lines). If the generated .mp4 file has both tracks in it, that should basically continue to be true—we can keep it in the same order it's already in. If we are just serving one track (though Media Source Extensions can support two tracks in one SourceBuffer), we need a much finer-grained mapping of .mp4 byte ranges to file byte ranges. But I think we don't need to compute those for the whole file ahead of time or keep the whole mapping in memory at once. We can fairly easily compute the size of a given SampleData slice. Maybe it can be composed of another Slices object, so we compute its mappings when we get to that part of the file, then throw them away afterward.

Generating the sample entry box, similar to what we do for H.264 video.

No design mystery here; just work to do.

scottlamb added a commit that referenced this issue Aug 5, 2020
This splits the schema and playback path. The recording path still
adjusts the frame durations and always says the wall and media durations
are the same. I expect to change that in a following commit. I wouldn't
be surprised if that shakes out some bugs in this portion.
scottlamb commented

Found my first bug in the new wall vs media duration distinction: the live viewing stuff is mixing them up for the durations within a recording, causing a panic.

thread 'tokio-runtime-worker' panicked at 'wall_off_90k=540000 wall_duration_90k=539960 media_duration_90k=540000', db/recording.rs:48:5
stack backtrace:
   0: <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt
   1: core::fmt::write
   2: std::io::Write::write_fmt
   3: std::panicking::default_hook::{{closure}}
   4: std::panicking::default_hook
   5: std::panicking::rust_panic_with_hook
   6: rust_begin_unwind
   7: std::panicking::begin_panic_fmt
   8: moonfire_db::recording::wall_to_media
   9: moonfire_nvr::mp4::Segment::new
  10: moonfire_nvr::mp4::FileBuilder::append
  11: moonfire_nvr::web::Service::stream_live_m4s_chunk::{{closure}}::{{closure}}
  12: moonfire_db::db::LockedDatabase::list_recordings_by_id
  13: moonfire_nvr::web::Service::stream_live_m4s_chunk::{{closure}}
  14: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
  15: moonfire_nvr::web::Service::stream_live_m4s_ws::{{closure}}
...

Currently the API is that the offsets are given to mp4::FileBuilder::append via relative wall times. That doesn't work out well for the live viewer stuff, which wants to precisely locate frames to ensure there's no overlap with the prior one. I guess I'll change it to take media times, then convert back to wall times for the subtitle track.
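The conversion back is just the inverse linear scaling (a sketch, not the real db/recording.rs code):

```rust
/// Sketch only, not the real db/recording.rs code: map an offset on the uncorrected
/// (media) timeline back to the corrected (wall) timeline, e.g. for subtitle timestamps.
fn media_to_wall(media_off_90k: i32, media_duration_90k: i32, wall_duration_90k: i32) -> i32 {
    debug_assert!(media_off_90k <= media_duration_90k);
    (i64::from(media_off_90k) * i64::from(wall_duration_90k)
        / i64::from(media_duration_90k)) as i32
}
```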

scottlamb added a commit that referenced this issue Aug 7, 2020
This broke with the media vs wall duration split, part of #34.
scottlamb added the rust (Rust backend work required) label on May 3, 2021
scottlamb mentioned this issue on Aug 29, 2021

Suxsem commented Nov 12, 2024

Hi @scottlamb, thank you for the nice software; I'm using it with great success.

I really miss the ability to record audio. My cameras encode audio in AAC, so there should be no need to transcode it. Any progress on this feature? I'm a developer, so maybe I can help. Thanks!
