1/1000 timebase causes deviation when converting with other protocols #3
Great topic, we will add this to our backlog to investigate.
we at Kaltura are having issues with AAC audio timestamps that were cropped by some players due to lack of timestamp precision, exactly because of this.
Can you provide more details on what you want to see on the wire and where? Perhaps a description of the end-to-end path of a packet.
I apologize for the misleading use of "composition offset"; I meant chunk types 0, 1 and 2, i.e. those containing timestamp information. A new optional timestamp extension field is expected to be placed after the audio/video tag header: a field named timestampFraction UB[2], which would represent the fractional part of the timestamp at 100-nanosecond resolution = 10^-7 s. The field would require capacity for up to 10^4 distinct values, i.e. up to 2 additional bytes. Why 100 ns? Because 100-nanosecond units are the file timestamp resolution used by operating systems and are considered the maximum feasible time resolution for timestamps. The required resolution and number of bits are subject to discussion.
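To make the proposed field concrete, here is a minimal sketch (in Python, with a hypothetical function name) of how a 100-nanosecond fractional part could be derived from an absolute timestamp; values 0..9999 need 14 bits, which is why up to 2 extra bytes are mentioned:

```python
def split_timestamp_100ns(t_ns: int) -> tuple[int, int]:
    """Split an absolute timestamp (nanoseconds) into the classic
    millisecond RTMP timestamp plus a fractional part expressed in
    100-nanosecond units (the proposed timestampFraction field)."""
    ms = t_ns // 1_000_000                  # coarse RTMP timestamp
    frac_100ns = (t_ns % 1_000_000) // 100  # 0..9999 -> fits in 14 bits
    return ms, frac_100ns
```

For example, `split_timestamp_100ns(1_234_567_800)` yields `(1234, 5678)`: 1234 whole milliseconds plus 5678 hundred-nanosecond units.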
the current timescale of 1000/second is more than adequate for the original intended purpose of synchronizing video, audio, and data messages within human perception for playback in Flash Player, and given how the Flash timing model works. however, i agree that the coarseness of the timescale is annoying when transmultiplexing for other formats (like MP4 or M2TS) or environments (Safari's implementation of Media Source Extensions will have audible pops if audio frame timestamps aren't accurate to within a sample or two). currently, to accurately transmultiplex to a format like MP4 or M2TS, way too much knowledge about the audio codecs in use is required in order to keep sample-accurate time, along with annoying heuristics to allow for discontinuities in the original RTMP timestamps.

i strongly recommend against changes to the Chunk Stream (like extending the timestamps in some way). this would need to be negotiated in some way right at the beginning, which most likely would require a new RTMP Chunk Stream version number (the current publicly documented version of the RTMP Chunk Stream is version 3; later version numbers are used for other proprietary things, and having a number greater than one of those would imply support for the proprietary and undocumented extensions to the Chunk Stream). further, this kind of change would just be for the Chunk Stream transport for RTMP; it wouldn't address RTMFP or FLV. rather than changing the Chunk Stream, what if there was a new (*)

(*) perhaps instead of "all following audio messages until superseded", the offset should only apply to "the next audio message(s) having the same ordinary RTMP timestamp", which would simplify some other processing, especially for expiring/abandoning one of these along with the audio message it goes with.

i think it's likely that if there's a high resolution offset, it'll probably be different frame-to-frame; otherwise a permanent sub-millisecond offset isn't useful, since that's just shifting the entire millisecond-accurate timeline for synchronization, and that's below human perceptibility. for some time i'd been thinking the right way to address this problem would be to negotiate a different timescale (number of timestamp ticks per second) in the
Thank you for the detailed feedback and suggestions. Here are my thoughts:
Regarding the timescale, the current 1000/second is adequate for its original purpose of synchronizing video, audio, and data messages within human perception for playback in Flash Player, considering the Flash timing model. However, I agree that the coarseness of this timescale can be problematic when transmuxing to other formats like MP4 or M2TS or dealing with environments like Safari's Media Source Extensions, where accurate timestamps are critical to avoid audio issues. At the risk of being redundant, the suggested proposal sounds like:
Questions:
This proposal aims to provide a more precise and flexible timestamping mechanism that will facilitate smoother transmuxing and compatibility with high-precision environments. @zenomt thanks for the suggestions on how to solve this. Your insights are invaluable in shaping this approach. Looking forward to all the feedback and any further suggestions!
i think the issue raised by @igorshevach is more to do with transmuxing and playback in systems (like Safari) where more precise timestamps are required to avoid audio glitches. i wouldn't worry about more precise timing for video or data unless or until it actually becomes a problem. i don't think that's likely until we're talking about >100 frames/second, and even then we most likely would have acceptable fidelity and jitter at rates approaching 250 Hz. also, unlike audio playback, video frames don't have an inherent duration and are supposed to be presented at the specified time, whereas audio frames do have inherent duration, and the up-to-a-millisecond error of the timestamps is what can lead to audio pops and glitches.

@veovera i was proposing a new separate message to precede each audio message, having the same timestamp as that message, to encode a high resolution offset. however, having a whole other message is a lot of extra bytes, which can add up in FLVs and on the wire. instead i think i like your insight of "E-RTMP Audio isn't 'done' yet" better.

i'd suggest having a new "CodedFrames with high resolution offset" AudioPacketType with a new signed 16 bit field in a fixed position between the FourCc and the coded data, encoding a + or - from the RTMP timestamp in units of 1/32768000 second (about 30.5 nanoseconds). if nanosecond precision is needed, make it signed 24 bits in units of 1/8388608000 second (about 0.119 nanoseconds). keep the existing "CodedFrames" AudioPacketType as-is, for when the offset is 0 or unknown.

using a new AudioPacketType for "CodedFrames with high resolution offset" is the simplest all-around i think, because most processing stages and simple forwarders just need to recognize that packet type as a "coded frame" and treat it as such (like for applying a transmission deadline). it's only a transmux or final playback & rendering that would need to take the high res offset into account.
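A rough sketch of the encoding suggested above (function names are hypothetical): with ticks of 1/32768000 second there are exactly 32768 ticks per millisecond, so after rounding the coarse timestamp to the nearest millisecond, the signed offset stays within about ±16384 ticks, comfortably inside a signed 16-bit field:

```python
TICKS_PER_SECOND = 32_768_000  # 1 tick = 1/32_768_000 s, about 30.5 ns
TICKS_PER_MS = 32_768          # exactly 32768 ticks per millisecond

def encode_hires(precise_seconds: float) -> tuple[int, int]:
    """Encode a precise time as (coarse RTMP ms, signed 16-bit tick offset).
    Rounding the coarse part to the nearest ms keeps the offset within
    roughly +/-16384 ticks, i.e. within +/-0.5 ms."""
    total_ticks = round(precise_seconds * TICKS_PER_SECOND)
    ms = (total_ticks + TICKS_PER_MS // 2) // TICKS_PER_MS  # round to nearest ms
    offset = total_ticks - ms * TICKS_PER_MS                # signed residual
    return ms, offset

def decode_hires(ms: int, offset: int) -> float:
    """Recover the precise time in seconds from the pair."""
    return (ms * TICKS_PER_MS + offset) / TICKS_PER_SECOND
```

For example, one 1024-sample AAC frame at 44100 Hz (about 23.2199 ms) encodes as coarse timestamp 23 with offset +7207 ticks, and decodes back to within one tick of the original.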
note: the only reason to use a signed high-res offset instead of unsigned is to accommodate different rounding policies for the traditional coarse RTMP timestamp (1 ms accuracy); that is, "round to nearest ms" or "round down by truncating the fraction of a ms". it would be much simpler to say that the offset is unsigned 16 (or 24) bits of fractional milliseconds after the coarse RTMP timestamp. i'm not sure accommodating different rounding policies is necessary or desirable, particularly since processing the coarse timestamps when fine timing is needed today must allow for an up-to-a-ms error. having it be signed is more flexible and allows for either rounding policy, but requires a smidge more effort by processors to correctly handle a negative offset. but note that today's "composition time offset" in video for AVC and HEVC is already signed, and processors need to do the right thing there too. i have no strong preference either way, but i think going with signed costs very little and retains more flexibility for use cases we might not be seeing right away.
I would like to thank everyone involved for the thoughts and propositions. I feel the direction of the solution is now correct. I only want to emphasize that we should not underestimate the significance of timestamp correctness, judging from the codec implementations in use and the quality of the equipment. I think that by extending both audio and video tag headers, we ensure that no matter what codecs come into use in the future, no other additions will be needed in this regard. @zenomt can you please elaborate on how the rounding decision is made? Is it documented elsewhere?
A timescale of 1/1000 will obviously cause many issues. This is why nginx-rtmp and SRS, when converting RTMP to MPEGTS, do not rely on the RTMP timestamp but instead recalculate timestamps based on the AAC sample count; otherwise there would be audible audio noise problems. However, this method has many potential pitfalls and does not solve the issue of insufficient timestamp precision in RTMP. It only accurately recalculates the audio timestamps for MPEGTS, which has a timescale of 90000, making it 90 times more precise than RTMP.

RTMP timestamp rollover is a significant potential risk. A 24-bit timestamp will wrap around approximately every few hours, and different software implementations of extended timestamps are inconsistent. This makes it difficult to verify whether a given implementation truly complies with the standard. MPEGTS uses a longer timestamp, and it is recommended to use more bits to avoid rollover issues. WebRTC's RTP timestamp has an even shorter effective length, making it more prone to rollover. Lengths shorter or longer than 24 bits are not problematic in themselves; longer ones avoid rollover, while shorter ones wrap around more quickly. I personally recommend using a longer bit length, since current network bandwidth and the high audio/video bitrates in use allow for longer timestamps, avoiding rollover issues and supporting a more precise timescale.
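For reference, the wraparound periods behind this concern work out as follows (a quick illustration, not from any spec):

```python
def wrap_period_seconds(bits: int, ticks_per_second: int) -> float:
    """Seconds until an unsigned timestamp of the given width wraps around."""
    return (1 << bits) / ticks_per_second

# RTMP chunk-stream 24-bit millisecond timestamp: wraps in ~4.66 hours
hours_rtmp24 = wrap_period_seconds(24, 1_000) / 3600
# RTMP 32-bit extended timestamp: ~49.7 days
days_rtmp32 = wrap_period_seconds(32, 1_000) / 86_400
# RTP's 32-bit timestamp at a 90 kHz video clock: ~13.3 hours
hours_rtp90k = wrap_period_seconds(32, 90_000) / 3600
```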
if you don't rely on the RTMP timestamps at all, then audio and video will go out of sync if there are any missing audio frames, or if the actual audio sample rate is different from the nominal sample rate, even by a little bit. to work properly, using the AAC sample count also requires some heuristics looking at the RTMP timestamps to see if you're "close enough" to the RTMP time to decide you haven't missed one or more frames, or that the sample clock hasn't drifted too far from the wall clock. if there's too big of a discrepancy, you need to signal a discontinuity and resynchronize.
not if you use RTMP timestamps for their intended purpose. :) RTMP's timestamps were intended to synchronize audio, video, and data for playback in Flash Player. when there's an audio track, the timestamp of each audio message establishes/snaps the "current system time" at the instant of that message's first decoded audio sample being played, and then the system time advances with real time as long as audio is still playing up to the next audio message and its timestamp. video and data frames are then rendered according to the system time. this can cause video frame rendering jitter of up to 1ms, which is still more accurate than can be reproduced with your monitor for nearly all practical values of "your monitor".
@zenomt On the contrary, using RTMP timestamps will lead to audio noise. Initially, SRS used RTMP timestamps, which caused issues, so it switched to recalculating timestamps using AAC sample counts. In fact, nginx-rtmp does this too. There is a very detailed analysis in ossrs/srs#547 (comment). In short, the RTMP timestamp is not accurate: for 44100 Hz audio, each audio frame is 1024/44100 ≈ 23.2 ms.
The audio frame timestamp gets set to 23 ms, losing about 0.2 ms of data, and this is what causes the audio noise when converting to HLS. Right now, using RTMP timestamps makes converting RTMP to HLS easy to calculate, as you only need to multiply by 90, but it is not correct.
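A small sketch of the drift being described, assuming an implementation that truncates each AAC frame duration to whole milliseconds (real encoders typically alternate 23/24 ms by rounding, which bounds the drift, but the sub-millisecond error per frame is intrinsic):

```python
SAMPLE_RATE = 44_100
FRAME_SAMPLES = 1_024  # samples per AAC frame

exact_ms = FRAME_SAMPLES / SAMPLE_RATE * 1000  # ~23.2199 ms per frame
error_per_frame_ms = exact_ms - int(exact_ms)  # ~0.22 ms lost by truncating

# over one minute of audio, the truncated clock falls behind noticeably:
frames_per_minute = SAMPLE_RATE * 60 / FRAME_SAMPLES       # ~2584 frames
drift_ms_per_minute = error_per_frame_ms * frames_per_minute  # ~568 ms
```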
my point is that using RTMP timestamps as intended (that is, using the RTMP/Flash timing model) does not lead to audio noise or desynchronization. the RTMP/Flash timing model does not involve scheduling audio samples to play back at a particular time; rather, the timestamps of the audio messages and the continuous playback of samples at their natural sampling rate establishes the clock against which video and data messages are rendered. when playing back audio in a system that schedules audio samples to play at a particular time, then yes, RTMP timestamps have insufficient precision to align to within a single audio sample.
Thank you for the thoughtful discussion on this topic. There are several approaches to solving this problem, and while there isn’t a single 'right' way, what follows is our formal proposed solution that maintains compatibility with standard timestamp tracking practices. I encourage you to review it and share any feedback you may have. E-RTMP Specification
Writeup: We are enhancing both audio and video RTMP messages by adding the optional capability to apply nanosecond offsets to the standard 32-bit RTMP timestamps, which are in milliseconds. When required, this enhancement allows us to fine-tune the presentation time of each message within the media streams with much greater precision. The nanosecond offset is particularly useful for addressing RTMP's timescale limitations and improving compatibility with formats like MP4 and M2TS, as well as supporting environments like Safari's Media Source Extensions. By applying this fine-grained offset, we can ensure that audio, video, and data streams remain perfectly synchronized across various media formats and playback environments, without needing to alter the core 32-bit RTMP timestamps.

However, it's important to note that the nanosecond offset in Enhanced RTMP (E-RTMP) is optional and should only be used when higher precision is necessary for specific audio and/or video messages. In this specification, when the VideoPacketType or AudioPacketType is identified as

We considered various approaches, such as not allowing multiple offsets, replacing the old value, or supporting the combination of offsets. Ultimately, we opted to support combining offsets to enhance the system's flexibility, even though this feature may be rarely required and only in specific scenarios. After processing the nanosecond offset, it is integrated with the existing timestamp handling logic to adjust the presentation time of the media samples as necessary.

Looking ahead, we plan to explore adding other types of timestamp offsets related to composition, decoding, and other aspects of media playback, further expanding our capability to fine-tune the presentation of media streams.

So, who wants to test this? :)
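In other words, a consumer would reconstruct the precise presentation time roughly as follows (a sketch with a hypothetical function name; offsets are combined by summation here, while the exact combination rule is defined by the specification):

```python
def presentation_time_ns(rtmp_ts_ms: int, nano_offsets=()) -> int:
    """Combine the coarse 32-bit millisecond RTMP timestamp with zero or
    more optional nanosecond offsets to get a high-precision time."""
    return rtmp_ts_ms * 1_000_000 + sum(nano_offsets)
```

For example, a frame stamped 23 ms carrying a +219954 ns offset is presented at 23219954 ns, i.e. about 23.22 ms; with no offset the coarse timestamp is used unchanged.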
@veovera what is the testing procedure? do you provide a sample rtmp stream in the specified format, or encoder software?
@veovera: i have a few concerns about the proposal above.
if you're already planning on signaling support with a capsEx flag, i'd recommend the much simpler approach of a new "coded frames with extra precision" type that includes a field to get to nanoseconds. that way the coded frame and its high-precision timestamp are atomically bound, which solves all of the transmission deadline, reorder, and "sequence special" problems, and is much less overhead compared to the alternative. and i would recommend against being able to shift by more than one ms, or of having different possible precisions (since then you need to signal support for new precisions too).
@zenomt Thanks for your detailed feedback! To ensure I understand your points correctly:
@igorshevach VSO does not provide sample E-RTMP/FLV streams, files, or encoder software directly. We rely on the community to create and share such resources. While there is no sample content specifically for enhanced timestamps at this time, we hope those interested in this capability will be able to test it within their own setups and contribute back. If enhanced timestamp capability is what you're looking for, we hope you find the specification straightforward for implementing E-RTMP in your solution. We welcome any feedback and contributions to help refine and enhance the specification based on real-world use. The feedback we've received so far has been very compelling, and we look forward to any further input you or anyone else may have!
if using the "TimestampOffsets" message, i'm suggesting that there only be one offset in it, because i don't believe there's a reasonable use for > 1ms of offset, and the bit shifting won't work if there are different kinds of offsets you're trying to combine together in the same message. if there's a compelling reason i don't currently understand to encode offsets > 1ms, then i'd say that repeating the

but i'm really suggesting not having the "TimestampOffsets" message, and instead having a new type of Coded Frames message that includes 3 more bytes to encode the number of additional (and i think it should be signed, so + or -) nanoseconds (and only nanoseconds) to add to the RTMP timestamp to get the "high precision" timestamp. support for this new coded frames message could be negotiated between client and server with a capsEx flag, maybe called

if a server tells a client that it supports the high precision coded frames messages, then it (BCP 14) MUST also be prepared to translate those messages to the normal-precision coded frame types when forwarding those messages to a client that didn't signal that it understands them.

PS. if you really really want to have the
closing the loop on my objections: after an offline conversation with @veovera , i see i missed & misunderstood a crucial point in the current proposal. i thought the proposal was for a separate message that would apply a nanosecond offset to following RTMP messages (and would therefore have a huge additional on-the-wire overhead). however, i'm 💯 on board with the actual proposal of an optional field inside the same RTMP message to apply a high-res offset. i have some minor concerns on how much code it'll take to properly handle this case, both for parsing and potentially rewriting for clients that don't understand this new message type. i'm hoping to have time this weekend to try it out to see if it's onerous or no big deal (my gut feeling is "not that big a deal" but i want to make sure).
Great to hear, and thank you for taking the time to clarify the details. After our offline conversation I made some clarifications in the specification. The updated information is linked below. E-RTMP Specification
Once the feedback for this feature has been solidified we will merge the feature/timestamp-offset branch into the main branch. |
@veovera i read through the new revision above, and it looks good. i haven't implemented it yet -- i'm still thinking through the cleanest way for that. there are still two things that are nagging at me though, but they are minor things that are more about the encoding than the general idea:
in taking a step back, it occurred to me that this is more like an "option" added to the RTMP message, similar to an RTP extension header or to the "message options" in

what if, instead of a "TimestampOffsets" packet type, there was an "Option" packet that had a 4-bit type, 4-bit payload length (in bytes), and then that many bytes (or maybe that many plus one, so you could have 1-16 bytes instead of 0-15) of payload. instead of a "more coming" bit, you could just have more "Option" packets, with the constraint that all the "Option"s had to come first in the message. that would allow other kinds of options in the future, if there was ever a need, and they wouldn't be constrained to just different kinds of timestamp offsets.

the most important part, though, is that the only check that needs to be done when sending to a peer is "do they understand the Option packet", and the "filter out" transform is now just "filter them all out" (implemented by "just skip over all the options bytes when forwarding"). peers that understand "Option" packets at all can skip over option types they don't understand because their lengths are explicit. a peer could/should still signal whether it understands particular option types, in case that's important to the other peer.

an enhanced audio message with a nanosecond offset option could look like
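To illustrate, a parser for this suggested (hypothetical, not-in-spec) "Option" layout might look like the sketch below. The end-of-options marker (type 0) is an extra assumption for the sketch, since the comment above doesn't define how options are terminated:

```python
def parse_options(body: bytes) -> tuple[list[tuple[int, bytes]], bytes]:
    """Parse leading hypothetical 'Option' packets from a message body:
    each is one byte holding a 4-bit type and a 4-bit (length - 1)
    field, followed by 1..16 payload bytes. Type 0 is assumed to mark
    end-of-options. Returns (options, remaining bytes). A forwarder
    that doesn't understand an option type can still skip it, because
    the length is explicit."""
    options, i = [], 0
    while i < len(body):
        opt_type = body[i] >> 4
        if opt_type == 0:            # assumed end-of-options marker
            i += 1
            break
        length = (body[i] & 0x0F) + 1
        options.append((opt_type, body[i + 1 : i + 1 + length]))
        i += 1 + length
    return options, body[i:]
```

Stripping all options when forwarding to a legacy peer is then just taking the second element of the returned pair.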
this approach isn't as clean for video though, since the VideoPacketType would be repeated each time. however, this approach does enable "enhanced RTMP" peers to apply enhanced message options even to legacy RTMP audio and video messages, if that would ever be beneficial.
also, unless i'm missing something, i think the "Fetch audioPacketType once more after processing audio timestamp offsets" here leaves the parser having read only 4 bits and not being at a byte boundary to continue processing, where it would be at a byte boundary if there hadn't been the audio timestamp offsets packet.
the video pseudocode looks to have the same problem.
Great catch! Yes, it looks like there is a bug in the pseudocode where we end up not on a byte boundary. This means that instead of a 20-bit offset we can actually have a 24-bit offset (16 bits would not be enough) to make sure we are aligned on a byte boundary. I'll update this in the documentation. Thank you for pointing it out! Also, I'm currently reviewing the additional suggestion...
Hi folks, I wanted to follow up on this issue to see if there are any outstanding concerns or blockers that could affect the merging of the latest PR. Please let me know if there's anything specific that needs to be addressed, or if there are any other concerns that could impact the integration of the updated spec. If not, we should be able to move forward with the merge. The updated information is linked below. E-RTMP Specification
If any feedback is received that necessitates critical changes, we will address it accordingly. Otherwise, we will proceed with merging the
@veovera while i haven't yet implemented these changes (and especially the stripping logic), i don't see any problems just from inspection.
sorry for jumping late into this topic. I don't especially like the idea of adding a nanosecond offset to the current timestamp; IMO we are still going to have rounding errors, but now at the nanosecond level instead of the millisecond level. Instead, we could have a "sample count" timestamp based on the

We could even have a small 24-bit value, which would wrap around approx. every 5 minutes at a 48 kHz clock rate. Just my 2 cents.
while true, nanosecond timestamps have greater resolving power than any practical clock that would be providing those timestamps. i think if 1/90,000 second is good enough for MPEG, 1/1,000,000,000 second is good enough for RTMP.

the whole reason for timestamps is to synchronize different tracks (audio, video) together. an audio sample clock and a video frame count can't on their own allow for synchronization; you'd need something like RTP's Sender Reports to align sample counts, plus you'd need to know the sampling rate for each message (RTMP allows codecs and sampling rates to change message-to-message). sample clocks (samples/second) drift just like wall clocks do, too.

the main use case mentioned for high-resolution timestamps was to allow transmux/conversion to other container formats (like MP4 or M2TS) for use on platforms (like MSE in Safari, which produces audible playback artifacts if the audio timestamps are a little off) without having to use heuristics (like decoding the sample rate in whatever codec and counting samples to advance the clock, while correcting for sample/wall clock drift) to extend the 1kHz timestamps to 90kHz timestamps.

consider too that video codecs like AVC and HEVC don't have intrinsic frame rates, and instead just encode frames for playback at a timestamp that's kept in the container format.
nanoseconds have more precision, except when the fraction is a recurring number. I agree with you on all the considerations about a/v sync, but my feedback (and the issue, if I understood it correctly) was about how to convert from RTMP's 1000-based clock (or a 10000000-based one) to a different clock-rate timestamp without having to do calculations and handle rounding errors. I don't think I will replace my sample-counting timestamping for audio with this new nanosecond offset.
also, if we are worried about audio drift, the nanosecond approach will not work, as we would need to count received samples anyway to detect the drift. Having both the wall-clock timestamp and the sample-rate timestamp would allow us to easily detect drift.
Hey everyone, really engaging discussion on timestamping in RTMP! I appreciate the points raised regarding both the nanosecond offset and the sample count timestamp approach. The devil is in the details and in the solutions you are working with. I do support the explanation given here -> GitHub Issue.

After considering the discussion around nanosecond timestamps and comparing them with MPEG's 90 kHz timestamping approach, I'd like to provide some thoughts on why I believe the nanosecond offset is a practical solution. The current RTMP millisecond timestamp is good enough in most cases and has held up for decades. That said, there are situations where a more granular timestamp (note it is optional, not a must) is desirable. For instance, it could be handy when transmuxing to an MPEG stream. MPEG uses a 90 kHz clock for timestamping (PTS/DTS), and this has been highly effective for ensuring synchronization between audio and video. With a granularity of around 11 microseconds per tick, it strikes a balance between precision and practicality, allowing for accurate sync across streams without introducing excessive computational overhead. Why Nanoseconds Work for RTMP:
A lot of the edge cases described in the thread are nuanced and depend on implementation details. While the sample counting approach has its merits in certain situations, nanosecond timestamps may offer broader flexibility and fewer challenges, particularly when considering issues like message parsing and handling sample rate variations. Nanosecond timestamps provide a more straightforward and reliable approach without adding dependencies on codec-specific behaviors, and they align well with the principles behind other successful standards. Additionally, they can greatly simplify transmuxing to/from formats like MP4 or M2TS, ensuring smoother interoperability across various container formats. If there is a need to introduce sample counting in the future, we could always leverage ModEx to introduce that capability without disrupting the existing infrastructure. For these reasons, I believe the nanosecond offset is a strong way forward for ensuring the flexibility and accuracy RTMP needs to stay relevant in modern media delivery systems.
I just merged
Please provide feedback if you encounter any specific issues with the latest specification.
Closing this issue now. Please reopen if further adjustments are needed. Thank you so much everyone for a very engaging discussion!
The timebase of RTMP/FLV is 1/1000, while for MPEGTS or other protocols it might be 1/90000. When converting MPEGTS to RTMP/FLV, there will be some small deviation, which can cause stuttering.
Is it possible to support other timebases in enhanced RTMP?
See issue SRS#512 and SRS#547
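A quick numeric illustration of that deviation, using one NTSC frame duration (3003 ticks at 90 kHz, i.e. 29.97 fps):

```python
def mpegts_roundtrip(pts_90k: int) -> int:
    """Convert a 90 kHz MPEG-TS PTS to RTMP's 1/1000 timebase and back."""
    ms = round(pts_90k / 90)  # RTMP keeps whole milliseconds only
    return ms * 90

pts = 3003                         # one 29.97 fps frame = 33.3667 ms
recovered = mpegts_roundtrip(pts)  # rounds to 33 ms -> 2970 ticks
deviation = pts - recovered        # 33 ticks, about 0.37 ms per conversion
```

Only PTS values that are whole multiples of 90 survive the round trip exactly; everything else picks up a sub-millisecond error, which is the deviation described above.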