1/1000 timebase causes deviation when converting with other protocols #3
Great topic, we will add this to our backlog to investigate.
we at Kaltura are having issues with AAC audio timestamps that were cropped by some players due to lack of timestamp precision, exactly because of this.
Can you provide more details on what you want to see on the wire and where? Perhaps a description of the end-to-end path of a packet.
I apologize for the misleading use of "composition offset"; I meant chunk types 0, 1 and 2, i.e. those containing timestamp information. A new optional timestamp extension field is expected to be placed after the audio/video tag header: a field named timestampFraction UB[2], which would represent the fractional part of the timestamp at 100-nanosecond resolution = 10^-7 s. The field would require capacity for up to 10^4 distinct values, i.e. up to 2 additional bytes. Why 100 ns? Because 100-nanosecond units are the file timestamp resolution used by operating systems and are considered the maximum feasible time resolution for timestamps. The required resolution and number of bits are subject to discussion.
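To make the proposed field concrete, here is a minimal sketch (in Python, with a hypothetical function name) of how a 100-nanosecond fractional part could be derived from an absolute timestamp; values 0..9999 need 14 bits, which is why up to 2 extra bytes are mentioned:

```python
def split_timestamp_100ns(t_ns: int) -> tuple[int, int]:
    """Split an absolute timestamp (nanoseconds) into the classic
    millisecond RTMP timestamp plus a fractional part expressed in
    100-nanosecond units (the proposed timestampFraction field)."""
    ms = t_ns // 1_000_000                  # coarse RTMP timestamp
    frac_100ns = (t_ns % 1_000_000) // 100  # 0..9999 -> fits in 14 bits
    return ms, frac_100ns
```

For example, `split_timestamp_100ns(1_234_567_800)` yields `(1234, 5678)`: 1234 whole milliseconds plus 5678 hundred-nanosecond units.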
the current timescale of 1000/second is more than adequate for the original intended purpose of synchronizing video, audio, and data messages within human perception for playback in Flash Player, and given how the Flash timing model works. however, i agree that the coarseness of the timescale is annoying when transmultiplexing for other formats (like MP4 or M2TS) or environments (Safari's implementation of Media Source Extensions will have audible pops if audio frame timestamps aren't accurate to within a sample or two). currently, to accurately transmultiplex to a format like MP4 or M2TS, way too much knowledge about the audio codecs in use is required in order to keep sample-accurate time, along with annoying heuristics to allow for discontinuities in the original RTMP timestamps.

i strongly recommend against changes to the Chunk Stream (like extending the timestamps in some way). this would need to be negotiated in some way right at the beginning, which most likely would require a new RTMP Chunk Stream version number (the current publicly documented version of the RTMP Chunk Stream is version 3; later version numbers are used for other proprietary things, and having a number greater than one of those would imply support for the proprietary and undocumented extensions to the Chunk Stream). further, this kind of change would just be for the Chunk Stream transport for RTMP; it wouldn't address RTMFP or FLV. rather than changing the Chunk Stream, what if there was a new (*)

(*) perhaps instead of "all following audio messages until superseded", the offset should only apply to "the next audio message(s) having the same ordinary RTMP timestamp", which would simplify some other processing, especially for expiring/abandoning one of these along with the audio message it goes with.

i think it's likely that if there's a high resolution offset, it'll probably be different frame-to-frame; otherwise a permanent sub-millisecond offset isn't useful, since that's just shifting the entire millisecond-accurate timeline for synchronization, and that's below human perceptibility. for some time i'd been thinking the right way to address this problem would be to negotiate a different timescale (number of timestamp ticks per second) in the
Thank you for the detailed feedback and suggestions. Here are my thoughts:
Regarding the timescale, the current 1000/second is adequate for its original purpose of synchronizing video, audio, and data messages within human perception for playback in Flash Player, considering the Flash timing model. However, I agree that the coarseness of this timescale can be problematic when transmuxing to other formats like MP4 or M2TS or dealing with environments like Safari's Media Source Extensions, where accurate timestamps are critical to avoid audio issues. At the risk of being redundant, the suggested proposal sounds like:
Questions:
This proposal aims to provide a more precise and flexible timestamping mechanism that will facilitate smoother transmuxing and compatibility with high-precision environments. @zenomt thanks for the suggestions on how to solve this. Your insights are invaluable in shaping this approach. Looking forward to all the feedback and any further suggestions!
i think the issue raised by @igorshevach is more to do with transmuxing and playback in systems (like Safari) where more precise timestamps are required to avoid audio glitches. i wouldn't worry about more precise timing for video or data unless or until it actually becomes a problem. i don't think that's likely until we're talking about >100 frames/second, and even then we most likely would have acceptable fidelity and jitter at rates approaching 250 Hz. also, unlike audio playback, video frames don't have an inherent duration and are supposed to be presented at the specified time, whereas audio frames do have inherent duration, and the up-to-a-millisecond error of the timestamps is what can lead to audio pops and glitches.

@veovera i was proposing a new separate message to precede each audio message, having the same timestamp as that message, to encode a high resolution offset. however, having a whole other message is a lot of extra bytes, which can add up in FLVs and on the wire. instead i think i like your insight of "E-RTMP Audio isn't 'done' yet" better.

i'd suggest having a new "CodedFrames with high resolution offset" AudioPacketType with a new signed 16 bit field in a fixed position between the FourCc and the coded data, encoding a + or - from the RTMP timestamp in units of 1/32768000 second (about 30.5 nanoseconds). if nanosecond precision is needed, make it signed 24 bits in units of 1/8388608000 second (about 0.119 nanoseconds). keep the existing "CodedFrames" AudioPacketType as-is, for when the offset is 0 or unknown.

using a new AudioPacketType for "CodedFrames with high resolution offset" is the simplest all-around i think, because most processing stages and simple forwarders just need to recognize that packet type as a "coded frame" and treat it as such (like for applying a transmission deadline). it's only a transmux or final playback & rendering that would need to take the high res offset into account.
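A rough sketch of the encoding suggested above (function names are hypothetical): with ticks of 1/32768000 second there are exactly 32768 ticks per millisecond, so after rounding the coarse timestamp to the nearest millisecond, the signed offset stays within about ±16384 ticks, comfortably inside a signed 16-bit field:

```python
TICKS_PER_SECOND = 32_768_000  # 1 tick = 1/32_768_000 s, about 30.5 ns
TICKS_PER_MS = 32_768          # exactly 32768 ticks per millisecond

def encode_hires(precise_seconds: float) -> tuple[int, int]:
    """Encode a precise time as (coarse RTMP ms, signed 16-bit tick offset).
    Rounding the coarse part to the nearest ms keeps the offset within
    roughly +/-16384 ticks, i.e. within +/-0.5 ms."""
    total_ticks = round(precise_seconds * TICKS_PER_SECOND)
    ms = (total_ticks + TICKS_PER_MS // 2) // TICKS_PER_MS  # round to nearest ms
    offset = total_ticks - ms * TICKS_PER_MS                # signed residual
    return ms, offset

def decode_hires(ms: int, offset: int) -> float:
    """Recover the precise time in seconds from the pair."""
    return (ms * TICKS_PER_MS + offset) / TICKS_PER_SECOND
```

For example, one 1024-sample AAC frame at 44100 Hz (about 23.2199 ms) encodes as coarse timestamp 23 with offset +7207 ticks, and decodes back to within one tick of the original.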
note: the only reason to use a signed high-res offset instead of unsigned is to accommodate different rounding policies for the traditional coarse RTMP timestamp (1 ms accuracy); that is, "round to nearest ms" or "round down by truncating the fraction of a ms". it would be much simpler to say that the offset is unsigned 16 (or 24) bits of fractional milliseconds after the coarse RTMP timestamp. i'm not sure accommodating different rounding policies is necessary or desirable, particularly since processing the coarse timestamps when fine timing is needed today must allow for an up-to-a-ms error. having it be signed is more flexible and allows for either rounding policy, but requires a smidge more effort by processors to correctly handle a negative offset. but note that today's "composition time offset" in video for AVC and HEVC is already signed, and processors need to do the right thing there too. i have no strong preference either way, but i think going with signed costs very little and retains more flexibility for use cases we might not be seeing right away.
I would like to thank everyone involved for the thoughts and propositions. I feel the direction of the solution is now correct. I only want to emphasize that we should not underestimate the significance of timestamp correctness, judging from the codec implementations in use and the quality of the equipment. I think that by extending both audio and video tag headers, we ensure that no matter what codecs come into use in the future, no other additions will be needed in this regard. @zenomt can you please elaborate on how the rounding decision is made? Is it documented elsewhere?
A timescale of 1/1000 will obviously cause many issues. This is why nginx-rtmp and SRS, when converting RTMP to MPEGTS, do not rely on the RTMP timestamp but instead recalculate timestamps based on the AAC sample count; otherwise there would be audible audio noise problems. However, this method has many potential pitfalls and does not solve the issue of insufficient timestamp precision in RTMP. It only accurately recalculates the audio timestamps for MPEGTS, which has a timescale of 90000, making it 90 times more precise than RTMP.

RTMP timestamp rollover is a significant potential risk. A 24-bit timestamp will wrap around approximately every few hours, and different software implementations of extended timestamps are inconsistent. This makes it difficult to verify whether a given implementation truly complies with the standard. MPEGTS uses a longer timestamp, and it is recommended to use more bits to avoid rollover issues. WebRTC's RTP timestamp has an even shorter effective length, making it more prone to rollover. Lengths shorter or longer than 24 bits are not problematic in themselves; longer ones avoid rollover, while shorter ones wrap around more quickly. I personally recommend using a longer bit length, since current network bandwidth and the high audio/video bitrates in use allow for longer timestamps, avoiding rollover issues and supporting a more precise timescale.
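For reference, the wraparound periods behind this concern work out as follows (a quick illustration, not from any spec):

```python
def wrap_period_seconds(bits: int, ticks_per_second: int) -> float:
    """Seconds until an unsigned timestamp of the given width wraps around."""
    return (1 << bits) / ticks_per_second

# RTMP chunk-stream 24-bit millisecond timestamp: wraps in ~4.66 hours
hours_rtmp24 = wrap_period_seconds(24, 1_000) / 3600
# RTMP 32-bit extended timestamp: ~49.7 days
days_rtmp32 = wrap_period_seconds(32, 1_000) / 86_400
# RTP's 32-bit timestamp at a 90 kHz video clock: ~13.3 hours
hours_rtp90k = wrap_period_seconds(32, 90_000) / 3600
```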
if you don't rely on the RTMP timestamps at all, then audio and video will go out of sync if there are any missing audio frames, or if the actual audio sample rate is different from the nominal sample rate, even by a little bit. to work properly, using the AAC sample count also requires some heuristics looking at the RTMP timestamps to see if you're "close enough" to the RTMP time to decide you haven't missed one or more frames, or that the sample clock hasn't drifted too far from the wall clock. if there's too big of a discrepancy, you need to signal a discontinuity and resynchronize.
not if you use RTMP timestamps for their intended purpose. :) RTMP's timestamps were intended to synchronize audio, video, and data for playback in Flash Player. when there's an audio track, the timestamp of each audio message establishes/snaps the "current system time" at the instant of that message's first decoded audio sample being played, and then the system time advances with real time as long as audio is still playing up to the next audio message and its timestamp. video and data frames are then rendered according to the system time. this can cause video frame rendering jitter of up to 1ms, which is still more accurate than can be reproduced with your monitor for nearly all practical values of "your monitor".
@zenomt On the contrary, using RTMP timestamps will lead to audio noise. Initially, SRS used RTMP timestamps, which caused issues, so it switched to recalculating timestamps using AAC sample counts. In fact, nginx-rtmp does this too. There is a very detailed analysis in ossrs/srs#547 (comment). In short, the RTMP timestamp is not accurate: for 44100 Hz audio, each audio frame is 1024/44100 ≈ 23.2 ms.
The audio frame timestamp gets set to 23 ms, losing about 0.2 ms of data, and this is what causes the audio noise when converting to HLS. Right now, using RTMP timestamps makes converting RTMP to HLS easy to calculate, as you only need to multiply by 90, but it is not correct.
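A small sketch of the drift being described, assuming an implementation that truncates each AAC frame duration to whole milliseconds (real encoders typically alternate 23/24 ms by rounding, which bounds the drift, but the sub-millisecond error per frame is intrinsic):

```python
SAMPLE_RATE = 44_100
FRAME_SAMPLES = 1_024  # samples per AAC frame

exact_ms = FRAME_SAMPLES / SAMPLE_RATE * 1000  # ~23.2199 ms per frame
error_per_frame_ms = exact_ms - int(exact_ms)  # ~0.22 ms lost by truncating

# over one minute of audio, the truncated clock falls behind noticeably:
frames_per_minute = SAMPLE_RATE * 60 / FRAME_SAMPLES       # ~2584 frames
drift_ms_per_minute = error_per_frame_ms * frames_per_minute  # ~568 ms
```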
my point is that using RTMP timestamps as intended (that is, using the RTMP/Flash timing model) does not lead to audio noise or desynchronization. the RTMP/Flash timing model does not involve scheduling audio samples to play back at a particular time; rather, the timestamps of the audio messages and the continuous playback of samples at their natural sampling rate establishes the clock against which video and data messages are rendered. when playing back audio in a system that schedules audio samples to play at a particular time, then yes, RTMP timestamps have insufficient precision to align to within a single audio sample.
Thank you for the thoughtful discussion on this topic. There are several approaches to solving this problem, and while there isn’t a single 'right' way, what follows is our formal proposed solution that maintains compatibility with standard timestamp tracking practices. I encourage you to review it and share any feedback you may have. E-RTMP Specification
Writeup: We are enhancing both audio and video RTMP messages by adding the optional capability to apply nanosecond offsets to the standard 32-bit RTMP timestamps, which are in milliseconds. When required, this enhancement allows us to fine-tune the presentation time of each message within the media streams with much greater precision. The nanosecond offset is particularly useful for addressing RTMP's timescale limitations and improving compatibility with formats like MP4 and M2TS, as well as supporting environments like Safari's Media Source Extensions. By applying this fine-grained offset, we can ensure that audio, video, and data streams remain perfectly synchronized across various media formats and playback environments, without needing to alter the core 32-bit RTMP timestamps.

However, it's important to note that the nanosecond offset in Enhanced RTMP (E-RTMP) is optional and should only be used when higher precision is necessary for specific audio and/or video messages. In this specification, when the VideoPacketType or AudioPacketType is identified as

We considered various approaches, such as not allowing multiple offsets, replacing the old value, or supporting the combination of offsets. Ultimately, we opted to support combining offsets to enhance the system's flexibility, even though this feature may be rarely required and only in specific scenarios. After processing the nanosecond offset, it is integrated with the existing timestamp handling logic to adjust the presentation time of the media samples as necessary.

Looking ahead, we plan to explore adding other types of timestamp offsets related to composition, decoding, and other aspects of media playback, further expanding our capability to fine-tune the presentation of media streams.

So, who wants to test this? :)
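In other words, a consumer would reconstruct the precise presentation time roughly as follows (a sketch with a hypothetical function name; offsets are combined by summation here, while the exact combination rule is defined by the specification):

```python
def presentation_time_ns(rtmp_ts_ms: int, nano_offsets=()) -> int:
    """Combine the coarse 32-bit millisecond RTMP timestamp with zero or
    more optional nanosecond offsets to get a high-precision time."""
    return rtmp_ts_ms * 1_000_000 + sum(nano_offsets)
```

For example, a frame stamped 23 ms carrying a +219954 ns offset is presented at 23219954 ns, i.e. about 23.22 ms; with no offset the coarse timestamp is used unchanged.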
@veovera what is the testing procedure? do you provide a sample rtmp stream in the specified format, or encoder software?
@veovera: i have a few concerns about the proposal above.
if you're already planning on signaling support with a capsEx flag, i'd recommend the much simpler approach of a new "coded frames with extra precision" type that includes a field to get to nanoseconds. that way the coded frame and its high-precision timestamp are atomically bound, which solves all of the transmission deadline, reorder, and "sequence special" problems, and is much less overhead compared to the alternative. and i would recommend against being able to shift by more than one ms, or of having different possible precisions (since then you need to signal support for new precisions too).
@zenomt Thanks for your detailed feedback! To ensure I understand your points correctly:
@igorshevach VSO does not provide sample E-RTMP/FLV streams, files, or encoder software directly. We rely on the community to create and share such resources. While there is no sample content specifically for enhanced timestamps at this time, we hope those interested in this capability will be able to test it within their own setups and contribute back. If enhanced timestamp capability is what you're looking for, we hope you find the specification straightforward for implementing E-RTMP in your solution. We welcome any feedback and contributions to help refine and enhance the specification based on real-world use. The feedback we've received so far has been very compelling, and we look forward to any further input you or anyone else may have!
if using the "TimestampOffsets" message, i'm suggesting that there only be one offset in it, because i don't believe there's a reasonable use for > 1ms of offset, and the bit shifting won't work if there are different kinds of offsets you're trying to combine together in the same message. if there's a compelling reason i don't currently understand to encode offsets > 1ms, then i'd say that repeating the

but i'm really suggesting not having the "TimestampOffsets" message, and instead having a new type of Coded Frames message that includes 3 more bytes to encode the number of additional (and i think it should be signed, so + or -) nanoseconds (and only nanoseconds) to add to the RTMP timestamp to get the "high precision" timestamp. support for this new coded frames message could be negotiated between client and server with a capsEx flag, maybe called

if a server tells a client that it supports the high precision coded frames messages, then it (BCP 14) MUST also be prepared to translate those messages to the normal-precision coded frame types when forwarding those messages to a client that didn't signal that it understands them.

PS. if you really really want to have the
closing the loop on my objections: after an offline conversation with @veovera , i see i missed & misunderstood a crucial point in the current proposal. i thought the proposal was for a separate message that would apply a nanosecond offset to following RTMP messages (and would therefore have a huge additional on-the-wire overhead). however, i'm 💯 on board with the actual proposal of an optional field inside the same RTMP message to apply a high-res offset. i have some minor concerns on how much code it'll take to properly handle this case, both for parsing and potentially rewriting for clients that don't understand this new message type. i'm hoping to have time this weekend to try it out to see if it's onerous or no big deal (my gut feeling is "not that big a deal" but i want to make sure).
Great to hear, and thank you for taking the time to clarify the details. After our offline conversation I made some clarifications in the specification. The updated information is linked below. E-RTMP Specification
Once the feedback for this feature has been solidified we will merge the feature/timestamp-offset branch into the main branch. |
@veovera i read through the new revision above, and it looks good. i haven't implemented it yet -- i'm still thinking through the cleanest way for that. there are still two things that are nagging at me though, but they are minor things that are more about the encoding than the general idea:
in taking a step back, it occurred to me that this is more like an "option" added to the RTMP message, similar to an RTP extension header or to the "message options" in

what if, instead of a "TimestampOffsets" packet type, there was an "Option" packet that had a 4-bit type, 4-bit payload length (in bytes), and then that many bytes (or maybe that many plus one, so you could have 1-16 bytes instead of 0-15) of payload. instead of a "more coming" bit, you could just have more "Option" packets, with the constraint that all the "Option"s had to come first in the message. that would allow other kinds of options in the future, if there was ever a need, and they wouldn't be constrained to just different kinds of timestamp offsets.

the most important part, though, is that the only check that needs to be done when sending to a peer is "do they understand the Option packet", and the "filter out" transform is now just "filter them all out" (implemented by "just skip over all the options bytes when forwarding"). peers that understand "Option" packets at all can skip over option types they don't understand because their lengths are explicit. a peer could/should still signal whether it understands particular option types, in case that's important to the other peer.

an enhanced audio message with a nanosecond offset option could look like
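To illustrate, a parser for this suggested (hypothetical, not-in-spec) "Option" layout might look like the sketch below. The end-of-options marker (type 0) is an extra assumption for the sketch, since the comment above doesn't define how options are terminated:

```python
def parse_options(body: bytes) -> tuple[list[tuple[int, bytes]], bytes]:
    """Parse leading hypothetical 'Option' packets from a message body:
    each is one byte holding a 4-bit type and a 4-bit (length - 1)
    field, followed by 1..16 payload bytes. Type 0 is assumed to mark
    end-of-options. Returns (options, remaining bytes). A forwarder
    that doesn't understand an option type can still skip it, because
    the length is explicit."""
    options, i = [], 0
    while i < len(body):
        opt_type = body[i] >> 4
        if opt_type == 0:            # assumed end-of-options marker
            i += 1
            break
        length = (body[i] & 0x0F) + 1
        options.append((opt_type, body[i + 1 : i + 1 + length]))
        i += 1 + length
    return options, body[i:]
```

Stripping all options when forwarding to a legacy peer is then just taking the second element of the returned pair.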
this approach isn't as clean for video though, since the VideoPacketType would be repeated each time. however, this approach does enable "enhanced RTMP" peers to apply enhanced message options even to legacy RTMP audio and video messages, if that would ever be beneficial.
also, unless i'm missing something, i think the "Fetch audioPacketType once more after processing audio timestamp offsets" here leaves the parser having read only 4 bits and not being at a byte boundary to continue processing, where it would be at a byte boundary if there hadn't been the audio timestamp offsets packet.
the video pseudocode looks to have the same problem.
Great catch! Yes, it looks like there is a bug in the pseudocode where we end up not on a byte boundary. This means that instead of a 20-bit offset we can actually have a 24-bit offset (16 bits would not be enough) to make sure we are aligned on a byte boundary. I'll update this in the documentation. Thank you for pointing it out! Also, I'm currently reviewing the additional suggestion...
Hi folks, I wanted to follow up on this issue to see if there are any outstanding concerns or blockers that could affect the merging of the latest PR. Please let me know if there's anything specific that needs to be addressed, or if there are any other concerns that could impact the integration of the updated spec. If not, we should be able to move forward with the merge. The updated information is linked below. E-RTMP Specification
If any feedback is received that necessitates critical changes, we will address it accordingly. Otherwise, we will proceed with merging the
@veovera while i haven't yet implemented these changes (and especially the stripping logic), i don't see any problems just from inspection.
sorry for jumping late into this topic. I don't especially like the idea of adding a nanosecond offset to the current timestamp; IMO we are still going to have rounding errors, but now at the nanosecond level instead of the millisecond level. Instead, we could have a "sample count" timestamp based on the

We could even have a small 24-bit value, which would wrap around approx. every 5 minutes at a 48 kHz clock rate. Just my 2 cents.
while true, nanosecond timestamps have greater resolving power than any practical clock that would be providing those timestamps. i think if 1/90,000 second is good enough for MPEG, 1/1,000,000,000 second is good enough for RTMP.

the whole reason for timestamps is to synchronize different tracks (audio, video) together. an audio sample clock and a video frame count can't on their own allow for synchronization; you'd need something like RTP's Sender Reports to align sample counts, plus you'd need to know the sampling rate for each message (RTMP allows codecs and sampling rates to change message-to-message). sample clocks (samples/second) drift just like wall clocks do, too.

the main use case mentioned for high-resolution timestamps was to allow transmux/conversion to other container formats (like MP4 or M2TS) for use on platforms (like MSE in Safari, which produces audible playback artifacts if the audio timestamps are a little off) without having to use heuristics (like decoding the sample rate in whatever codec and counting samples to advance the clock, while correcting for sample/wall clock drift) to extend the 1kHz timestamps to 90kHz timestamps.

consider too that video codecs like AVC and HEVC don't have intrinsic frame rates, and instead just encode frames for playback at a timestamp that's kept in the container format.
nanoseconds have more precision, except when the fraction is a recurring number. I agree with you on all the considerations about a/v sync, but my feedback (and the issue, if I understood it correctly) was about how to convert from RTMP's 1000-based clock (or a 10000000-based one) to a different clock-rate timestamp without having to do calculations and handle rounding errors. I don't think I will replace my sample-counting timestamping for audio with this new nanosecond offset.
also, if we are worried about audio drift, the nanosecond approach will not work, as we would need to count received samples anyway to detect the drift. Having both the wall-clock timestamp and the sample-rate timestamp would allow us to easily detect drift.
Hey everyone, really engaging discussion on timestamping in RTMP! I appreciate the points raised regarding both the nanosecond offset and the sample count timestamp approach. The devil is in the details and in the solutions you are working with. I do support the explanation given here -> GitHub Issue.

After considering the discussion around nanosecond timestamps and comparing them with MPEG's 90 kHz timestamping approach, I'd like to provide some thoughts on why I believe the nanosecond offset is a practical solution. The current RTMP millisecond timestamp is good enough in most cases and has held up for decades. That said, there are situations where a more granular timestamp (note it is optional, not a must) is desirable. For instance, it could be handy when transmuxing to an MPEG stream. MPEG uses a 90 kHz clock for timestamping (PTS/DTS), and this has been highly effective for ensuring synchronization between audio and video. With a granularity of around 11 microseconds per tick, it strikes a balance between precision and practicality, allowing for accurate sync across streams without introducing excessive computational overhead. Why Nanoseconds Work for RTMP:
A lot of the edge cases described in the thread are nuanced and depend on implementation details. While the sample counting approach has its merits in certain situations, nanosecond timestamps may offer broader flexibility and fewer challenges, particularly when considering issues like message parsing and handling sample rate variations. Nanosecond timestamps provide a more straightforward and reliable approach without adding dependencies on codec-specific behaviors, and they align well with the principles behind other successful standards. Additionally, they can greatly simplify transmuxing to/from formats like MP4 or M2TS, ensuring smoother interoperability across various container formats. If there is a need to introduce sample counting in the future, we could always leverage ModEx to introduce that capability without disrupting the existing infrastructure. For these reasons, I believe the nanosecond offset is a strong way forward for ensuring the flexibility and accuracy RTMP needs to stay relevant in modern media delivery systems.
I just merged
Please provide feedback if you encounter any specific issues with the latest specification.
Closing this issue now. Please reopen if further adjustments are needed. Thank you so much everyone for a very engaging discussion!
The timebase of RTMP/FLV is 1/1000, while for MPEGTS or other protocols it might be 1/90000. When converting MPEGTS to RTMP/FLV, there will be some small deviation, which can cause stuttering.
Is it possible to support other timebases in enhanced RTMP?
See issue SRS#512 and SRS#547
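A quick numeric illustration of that deviation, using one NTSC frame duration (3003 ticks at 90 kHz, i.e. 29.97 fps):

```python
def mpegts_roundtrip(pts_90k: int) -> int:
    """Convert a 90 kHz MPEG-TS PTS to RTMP's 1/1000 timebase and back."""
    ms = round(pts_90k / 90)  # RTMP keeps whole milliseconds only
    return ms * 90

pts = 3003                         # one 29.97 fps frame = 33.3667 ms
recovered = mpegts_roundtrip(pts)  # rounds to 33 ms -> 2970 ticks
deviation = pts - recovered        # 33 ticks, about 0.37 ms per conversion
```

Only PTS values that are whole multiples of 90 survive the round trip exactly; everything else picks up a sub-millisecond error, which is the deviation described above.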