Too much video_decodeRIL:image pool allocated for H.264 #337
How have you determined the number of buffers allocated in each of your cases?
As explained in omxplayer issue 236, all the info came from vcdbg, plus a ton of tests with my video library. Each 1080p buffer is ~3MB.
Please try running "sudo vcgencmd set_logging level=64" before your test, and "sudo vcdbg log msg" afterwards. It should log a line similar to: The IL layer will then increase N by the value set under OMX_IndexParamBrcmExtraBuffers (default 2, to cope with double buffering the display), and then apply max(7, ...), i.e. never fewer than 7. I'm not 100% certain as to the reason behind the minimum, but I would take a guess that it links to supporting higher framerates at lower resolutions. Please note that omxplayer does change OMX_IndexParamBrcmExtraBuffers, as deinterlacing requires extra buffers to be held as references. Were any of your clips interlaced? Edit: corrected the vcgencmd line (=64, not =128).
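In other words, the rule being described amounts to something like the following. This is an illustrative C sketch only, not the actual firmware code, and the names are invented for clarity:

    /* Illustrative sketch of the buffer-count rule described above --
     * not the actual firmware source; the names are invented. */
    static unsigned image_pool_size(unsigned codec_required_frames, /* refs + frame being decoded */
                                    unsigned extra_buffers)         /* OMX_IndexParamBrcmExtraBuffers, default 2 */
    {
        unsigned n = codec_required_frames + extra_buffers;
        if (n < 7)      /* the historic max(7, ...) floor discussed in this issue; */
            n = 7;      /* removed by a firmware change later in the thread        */
        return n;
    }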
Format/Info : Advanced Video Codec
video_decode:1:RIL: setup image requirements (1280x720(1280x720)x5)
sudo /opt/vc/bin/vcdbg pools shows object 0 through object 7.
N shows as 5; maybe the extra one is the current frame, but I'm still checking. Then, if I understand correctly, besides writing to video_scheduler and rendering from video_render, it allocates TWO extra buffers for double buffering. BUT the scheduler-plus-renderer mechanism already provides a "double buffer" to eliminate tearing. So effectively 8 buffers are allocated, which means there is an error somewhere. Also, in the omxplayer issue there was a comment by popcornmix with min(7, DPB+2), not max 7, even though H.264 caps at 16 reference frames. I will continue running other files later, but I don't have interlaced ones. The limit in the hardware decode block is a macroblock line of 1920, so anything bigger than that does not run. I have already run some heavyweight High profile level 4.1, 16-ref 1080p@60fps files, and a 120fps one, with 96MB or at most 128MB of GPU RAM, so I don't want to limit the capabilities; but the lower the GPU memory, the bigger the ARM memory, so everyone benefits. omxplayer's video_queue allocates from ARM memory, but video_fifo comes from the available GPU_MEM. The fifo is more important for sync issues, so any wasted memory could go to the fifo.
If you've got 4 reference frames contributing to the current frame, then you've got to be able to hold all of those and the current one in memory, hence 5 does make sense. popcornmix quoted it as "max(7, (DPB size + 2))", same as me. Not min. I haven't got my laptop with me today so can't check under what other conditions it adds anything else. As I read it last night, it only adds the value specified by the IL parameter OMX_IndexParamBrcmExtraBuffers.
Sorry, I expressed that poorly. The thing is, the current buffer is written during decoding, then copied to the scheduler and then copied to the renderer. Effectively a "triple buffering" is in effect, so 4+2 would allocate 6. Edit: testing is ongoing, but a 2REF file allocated 17 pools.
Format/Info : Advanced Video Codec
There should be no copying going on. References to the images in the pool are passed down the pipe. The renderer will not remove the current buffer from the display until a new one is presented. Generally a scheduler wants the frame a bit ahead of time or risks being late. The decoder hardware needs about 25ms to decode the frame. Between those you need 3 frames circulating, plus the reference frames needed by the decoder. The codec seems to already add one frame for the "currently being decoded frame", hence +2 for the others. You have a parameter available that allows you to alter the number of buffers allocated. If you have a real issue then feel free to alter the omxplayer code and create a pull request. I can't see there being a great need here to try and alter the firmware when it has a chance of causing glitching. With regard to your 17 images allocated, if you've got particular files that cause odd effects then you're going to have to post them (or at least the first few seconds) somewhere. One of us can then analyse what is going on in the codec, as that may be a real parsing bug.
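For reference, the parameter being referred to is OMX_IndexParamBrcmExtraBuffers, mentioned earlier in the thread. A minimal, illustrative sketch of setting it through the IL API, in the style of the hello_pi examples; error handling is omitted, the component handle is assumed to come from elsewhere, and exact header locations and init boilerplate may differ slightly:

    #include <string.h>
    #include <IL/OMX_Core.h>
    #include <IL/OMX_Broadcom.h>

    /* Illustrative sketch: ask OMX.broadcom.video_decode for extra output
     * buffers.  'decoder' is assumed to be a handle already obtained via
     * OMX_GetHandle(). */
    static OMX_ERRORTYPE set_extra_buffers(OMX_HANDLETYPE decoder, OMX_U32 count)
    {
        OMX_PARAM_U32TYPE extra;
        memset(&extra, 0, sizeof(extra));
        extra.nSize = sizeof(extra);
        extra.nVersion.nVersion = OMX_VERSION;
        extra.nU32 = count;   /* firmware default is 2; omxplayer passes 3 when
                                 deinterlacing might be needed */
        return OMX_SetParameter(decoder, OMX_IndexParamBrcmExtraBuffers, &extra);
    }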
DPB + current + render + sched -> the minimum is 4, not 7! If there is a reason, I am willing to test it and lower the firmware default for everybody. Maybe a reduction to 5 is enough. The 2REF sample with problems from earlier: http://killzone.dl.playstation.net/killzone/kzsf_multiplayer/KZSF_HIGHTBITRATE_H264.zip
@popcornmix I don't know the full reason why (I suspect) TJG set it as a minimum of 7, but personally I don't see it as a big failing so would suggest leaving it as is. It was probably to get around issues with streams that incorrectly defined the number of reference frames required, hence assuming 4 reference frames + the display frames = 7 as the minimum. The 2REF clip posted is a level 5.0 clip, 1080P60. BCM2835 is only specified to support 4.1, so anything above is an unsupported bonus. Reading the code and how it parses the headers, it will only drop the number of frames it declares it requires if both sps->vui_parameters_present_flag and sps->vui_parameters.bitstream_restriction_flag are present, and then it uses the sps->vui_parameters.max_dec_frame_buffering value.
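Put as a sketch, the decision described reads roughly as below. This is illustrative only, with an invented sps_t struct mirroring the H.264 SPS field names; it is not the firmware source:

    /* Illustrative version of the rule above: only trust the stream's own
     * max_dec_frame_buffering when the VUI bitstream restriction section is
     * present; otherwise assume the worst-case DPB implied by the level. */
    typedef struct {
        int vui_parameters_present_flag;
        struct {
            int      bitstream_restriction_flag;
            unsigned max_dec_frame_buffering;
        } vui_parameters;
    } sps_t;   /* invented for this example */

    static unsigned frames_to_request(const sps_t *sps, unsigned level_dpb_frames)
    {
        if (sps->vui_parameters_present_flag &&
            sps->vui_parameters.bitstream_restriction_flag)
            return sps->vui_parameters.max_dec_frame_buffering + 1; /* + working frame */
        return level_dpb_frames + 1;
    }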
Well, DXVA compatibility skips such checks by default via level-skip, because too many encoders wrongly set level 5.0/5.1 and/or set DPB=16. There are so many issues with this that the compatibility skips became standardized in the community (GPU drivers were forced to implement loose checks). By the level table, level 5.0 = 1080p@72fps, and I think the RPi HDMI output is limited to 60Hz anyway. So the firmware correctly calculates the level 5 limits and therefore allocates 14 buffers for the sample? Why, if in theory its hardware is level 4.1? Or a fix could cap it at the level 4.2 limits, because OMX_IndexParamImagePoolSize defaults to 1080p (1080p@60Hz effective limit). A 2560x1600 sample appeared to decode OK in the vcdbg level=64 log dump and used 8 image pools of 8MB, but no image is shown, so treat that with care. Unless video_sched can do a better job of sync control, maybe a reduction alone is not going to improve sync; perhaps a more robust video_scheduler or advanced sync control is needed? From the MediaInfo documentation: "About "ReFrames" field in MediaInfo, provide the value of "max_num_ref_frames" in seq_parameter_set_data( ) of ISO/IEC 14496-10 header." So MediaInfo is reading out the DPB size, not the number of reference frames (see the x264 source). At worst, in the 16-DPB case, DPB + current + render + sched maxes out at 19. TJG's minimum is probably based on refs with B-frames, with normal pyramid. That is not very common, because realtime encoders don't use B-frames, let alone pyramid; Blu-ray compliance only allows strict pyramid, for example.
H264 hardware block limits are independent from the HDMI output block. The spec for the H264 hardware block is nominally max level 4.1, i.e. 1080p30. Certainly for encode there is a line buffer in the hardware (probably actually a macroblock row in size), and that is 2048 pixels in width. Therefore I'm not surprised anything wider than that fails. popcornmix is trying to track back in the source control system for when the max(7) was introduced. Nothing as yet, and not really the highest priority.
After a little less than 200 samples, here is a summary list:
1finalBattle: pools 7, log 2, High L4
3b24kbps: pools 7, log 4, ref=1 but with pyramid, and encoded in 2008 with ref=1, so 2 extra for B-pyramid
4IncepT: pools 8, log 5, --ref 3 pyramid=2
A bunch of "vc_mem_map: length 8192 is too big" messages made it a little painful to test it all.
There are quite a number of issues here; let me try and disentangle them a little:

1. Minimum of seven buffers

I think this is very old code which has probably survived longer than it should have (although it's lost its explanatory comments in the process...) The oldest code we have access to (from August 2011) notes it as "add two frames for display, use minimum of seven frames, two in decode, three in pending list, two in display"; I suspect that the original code goes back to 2006-7, and in particular predates the codec having any way to tell the outside world what shape the stream is. What it's trying to achieve is two reference buffers (for older codecs -- bear in mind video_decode doesn't just do H.264), three buffers for latency adaptation (and a working frame), and the two buffers needed for downstream pipeline stages. Now, bear in mind that the code was originally targeted at mobile phones and similar: deploying fixes is hard, and customers (of the chip manufacturer) tend to be very displeased by any kind of regressions. Software codecs also tend to be a little more inclined to need more latency adaptation, depending on how close they are to (or past...) the edge of performance limits -- the old code actually sets a minimum of 14 for VP6, for example, in order to ensure performance (probably on 200MHz VC3 hardware). Consequently, the minimum buffer level has never been removed, in a spirit of conservatism. (In a world where H.264 can use more reference frames, it makes less sense in any case...) I think we should be able to remove it, but it will take a certain amount of experimentation to check how much latency adaptation is needed. The scheduler and renderer don't actually provide any, because the buffer can get sent to the display arbitrarily soon after delivery to the scheduler; eliminating all latency buffering is OK for single decode use cases, but could cause problems when the decoder hardware is being time-multiplexed (including cases where the encoder and decoder run in parallel, since some hardware is shared).

2. "Four extra buffers"

I have a limited amount of test material available right now, but a quick test showed (as expected) a stream with four reference frames (and max_dpb_frames = 4) requesting five image buffers, and seven being allocated -- video_decode's extra_frames parameter defaults to two to account for the downstream pipeline stages. We do need #ref + 1 for the decoder itself: the downstream stages won't let go of a frame until they receive another one, as far as I recall, so the decoder needs an extra working frame in order not to deadlock the pipeline. (With extra_buffers set to zero, without the extra frame we'd also stall if the DPB were entirely full of reference frames.) Interlaced streams will cause extra_frames to be set to three, because the deinterlace module will (I believe) hang on to three buffers at a time -- I'll check exactly what the behaviour is on this. Using fewer buffers in this case would cause the pipeline to deadlock. In the case of your streams (or at least the KZSF stream that you uploaded), OMXPlayer is under the impression that the stream is interlaced, and therefore sets extra_buffers to 3; however, it's not sufficiently convinced of it actually to create a deinterlace component and its downstream buffer pool... I will look into why it does this!

3. 2REF file allocates 17 image buffers

In this case, we're using the DPB size derived from the level to determine the size of the DPB; in the absence of the bitstream restriction section of the VUI, we have no guarantee that someone won't gratuitously code frames backwards just because they can. While this is silly, it happens both in formal conformance test streams and in some customer streams. As @6by9 comments above, this results in 13 frames for the DPB plus a working frame; for some reason (which I will look into) OMXPlayer sort-of-thinks the stream is interlaced and adds an extra one.

4. Level 5.0 support

The VC4 accelerator hardware only claims support for 4.1 (with a restriction that the picture width must be 1920 or less), it's true. However, if presented with streams at higher levels, the decoder will simply process them on a best-efforts basis, again, subject to the 1920 width restriction as that's a hardware limitation. (There's also a height restriction, but that's not usually a problem.) For streams that are at too high a frame-rate, you simply won't get real-time performance, but for (e.g.) transcoding applications, that can be absolutely fine. For streams with supported image size and frame rate, but with a level higher than 4.1, it will simply change the buffering requirements (again, in the absence of max_dpb_frames / num_reorder_frames). I'm slightly puzzled by your assertion that 2560x1600 was working, because the hardware has line buffers that just won't do that!

5. "Increasing reorder buffer"

It's a little difficult to follow what you're saying here, but I think the gist is "can the decoder just allocate more image buffers when it finds it needs them?" Please correct me if that isn't the case. There are two issues here, one historical and one practical. The original VC3 decoder firmware (dec2) couldn't cope with reallocation of memory at all: it worked with whatever it was given at the start of time. This, obviously, is a problem with streams that have new sequences part way through with increased memory use. The dec3 decoder was built from the ground up to be able to do allocation dynamically. However, it only does so at sequence boundaries. This is partly because some customers wanted systems that used static memory allocation, in order to guarantee that applications didn't interfere with each other -- bear in mind that VC4 hardware isn't always used with Linux or Android -- and partly because the underlying image pool abstraction doesn't support changing pool sizes after creation time. Now, clearly, the image pool mechanics could be changed to allow extra image buffers to be added to the pool after creation, and the decoder could be changed to call that interface when it found more reordering than it had previously accounted for. However (a) the existing mechanics only allow for propagation of (and reaction to) allocation errors at sequence boundaries, so would have to be reworked, (b) image allocation is slightly tricky because of hardware addressing limitations, and (c) it's rather beyond the scope of the work I'm currently budgeted to do on the decoder... There's also an issue with speed of allocation. dec3 is designed to allow most of its internal buffers (the CDB, for example) to start small, and be expanded dynamically on demand. However, in practice, when this was first used with Android, it turned out that allocating the memory was really slow, and caused repeated stutters as the decoder progressively ramped up the amount of buffering being used.
Hence in practice, the default buffer sizes are set rather larger than originally intended (although it's possible to override them, in theory). The Raspberry Pi environment is probably a bit more friendly in that respect, but I'd worry about making stream playback stop being smooth, particularly as you'd also like to remove the extra latency tolerance!
OMXplayer requests 3 extra buffers unless it's been told (by --nodeinterlace) that the stream definitely isn't interlaced. This allows it to insert the deinterlacer into the pipeline if it discovers (via a displayed frame) that the stream requires it; if it only had two extra buffers, it wouldn't be able to do this, since (as noted above) it's impossible dynamically to add buffers to an image pool.
The advanced deinterlacer currently retains three images on its source side, but actually doesn't need to (since it operates internally on a different image format, and keeps converted copies in the output side image pool). It isn't actually necessary to keep the images on the source side at all. The fast deinterlacer can operate on the source image directly, but only needs to store one image. Changing the advanced deinterlace to release the images as soon as they've been converted would mean that it wouldn't need an extra buffer in the decoder's pool, so we wouldn't need to set extra_buffers to three under any conditions.
Awesome response!
1: The two display buffers are probably for "double buffering", but you said only pointers pass between components, so besides the "on-screen" one, the second display buffer is never written to (as with tearing artifacts), so can it be eliminated? This latency adaptation doesn't use frame-dropping to restore A/V sync in my tests. Conservatism is fine as long as videos don't lag. Freed memory could go to "video_fifo" if that helps. The malloc pool is fixed at 13MB with 10MB free during decode while reloc is starving, so I felt there was an issue somewhere.
4: The 1920 line buffer limitation was my hunch, but the vcdbg msg log (level=64) was populated with normal decode messages. It is not "really" working, because no image is displayed.
5: Yes, see FFmpeg h264.c lines 890~932: http://ffmpeg.org/doxygen/trunk/h264_8c_source.html#l00890
obs.: DivX3/5 tests showed the log requiring 3 buffers and 7 pools in total. A VP8 Big Buck Bunny 480p sample doesn't play for me.
Hi @deborah-c: "The advanced deinterlacer currently retains three images on its source side, but actually doesn't need to (since it operates internally on a different image format, and keeps converted copies in the output side image pool). It isn't actually necessary to keep the images on the source side at all." Catch up with you in a fortnight.
@6by9: Yes, I'm currently looking at some of the backlog of video decoder issues for them, for a few days. I believe (after a certain amount of digging around) that the source images will be copied in the deinterlace code even when they're in the same format as the source. However, releasing the images early without breaking the metadata associated with them requires slightly more finesse. I'm experimenting at the moment.
Agreed. It does look like it always does a convert_blt.
@brunogm0 We do not operate a "double buffer" as such. The video scheduler holds pending frames, and forwards them to the renderer when the media clock indicates that it's time to do so. The renderer will switch images at a vsync, and only then release the old image for reuse. If you don't have a working frame in addition to these two, then you reduce the time available for decoding and can't guarantee real time operation. The discussion here is entirely about image buffering. The latency adaptation here is both for A/V sync, and to provide for differences in decode time between frames, latency in interrupt response or processor availability, and so on. The CPB is separate, and is internal to the decoder. I'm not quite sure what you mean here by "video_fifo". Do you mean the CPB? If so, it's considerably smaller than a single 1080p buffer in any case, so there's not a lot of mileage in changing its size. Decode messages will keep being generated even when the image size exceeds the decoder's capabilities; this is because the codec is still processing the data stream (metadata and slice headers, at least) in the hope that some day there might be a sequence it can handle... The statement "reorder only occurs with B-frame" is not necessarily true for H.264. Neither is it true that all streams without reordering use pic_order_cnt_type 2. There is a huge range of bizarre streams that break most assumptions one could possibly make that aren't actually written into the standard! The decoder deliberately uses only one block from the malloc heap (with the core instance data structure in it); all resizable structures are in the relocatable heap, by design. This is pretty fundamental to its structure, and not something that could be readily changed. It's also not really desirable in any case, since heap fragmentation can be a significant problem. Bear in mind that the decoder has many use cases that are not quite so simple. DivX will ask for two buffers for reference frames, plus a working frame; as before, OMXplayer will add three further frames unless you tell it that there's definitely no deinterlace needed, so the usage is unsurprising. VP8 is not implemented by the hardware codec.
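Conceptually, the scheduler/renderer behaviour described in the first paragraph can be pictured as something like the following. This is purely illustrative pseudo-C with invented types and functions, not the actual pipeline code:

    /* Purely conceptual: the scheduler forwards a pending image reference
     * once the media clock reaches its timestamp; the renderer latches the
     * new image at the next vsync and only then releases the previous one
     * back to the decoder's pool.  Only references move, nothing is copied. */
    void scheduler_tick(frame_queue_t *pending, renderer_t *renderer,
                        int64_t media_time_us)
    {
        image_ref_t *next = queue_peek(pending);
        if (next && next->pts_us <= media_time_us) {
            queue_pop(pending);
            renderer_present(renderer, next);  /* old image freed after the vsync swap */
        }
    }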
@6by9 Yes, that's exactly what I thought a few hours back, but I think you'll find it's more complicated than that ;-)
OK, bizarre streams are bizarre! omxplayer's video_fifo controls buffering for the OMX.broadcom.video_decode port (130); vcdbg reloc shows it as a bunch of 80k RIL buffers. So instead of buffering in these ports, does the "pending" list do a smarter job? Example: 3 pending buffers with timestamps 10, 20, 30. The clock is at 8; tick, tick, show one; then things slow down and the clock is at 29, but does "pending/sched" show frame two instead of waiting a tick and skipping to the last frame, i.e. effectively dropping to ensure sync? Some samples I tested with a high video_fifo play perfectly for 45 seconds, then video_fifo goes negative until 50s, and even after the sound has finished it keeps showing previous frames, so no frame-drop and no rescheduling whatsoever. Given this, would 4 pending frames contribute more to lag than video_fifo? (I'm trying to work out the interplay between these RIL buffers and the image pools.) I uploaded a sample I know is 60fps, but it plays OK for a minute with fifo=32.
An experimental patch now allows the advanced deinterlace to operate without requiring extra buffers in the decoder output pool; this obviates the requirement for OMXPlayer to set extra_buffers to 3 when --nodeinterlace isn't specified.
Nice, I can test it. obs: Sorry, I was travelling.
See: raspberrypi/firmware#353
kernel: dwc_otg: fixup read-modify-write in critical paths
Be more careful about read-modify-write on registers that the FIQ touches
kernel: w1-gpio: Sort out the pullup/parasitic power tangle
See: raspberrypi/firmware#348
kernel: BCM270x_DT: Add i2c0_baudrate and i2c1_baudrate parameters
See: http://www.raspberrypi.org/forums/viewtopic.php?f=28&t=97314#p683127
kernel: HiFiBerry Amp: fix device-tree problems
See: raspberrypi/linux#790
firmware: Camera: Fix for reset stc timestamping mode
firmware: deinterlace: reduce memory use on input side for advanced deinterlace
See: raspberrypi/firmware#337
firmware: Kill max(7,) in buffer allocation
See: raspberrypi/firmware#337
Firmware and omxplayer are updated. You can add: to config.txt to revert to the previous behaviour.
Updating now... results A.S.A.P.
1creepy: pools 4, log 2, Base L3, 277x288p - reduced by three
2u2: pools 5, log 3, High L4 - reduced by two
2kzsf: pools 16, log 14, L5 - reduced by one; a 480k fifo plays correctly, but the audio fifo hits 0.29/6.07 at its lowest on PCM; the same sample with FLAC plays better! Using more than a 1360k video fifo gives more lag. Also, a video-queue bigger than 4M gives lag; 3~3.7M is better.
2testgravacao: pools 7 + 3 fx, log 5, High L4, interlaced
3vortex: pools 6, log 4 - reduced by one
3Into: pools 6, log 4 - reduced by one
Most samples play fine with a 320k video fifo, including birds80mbit. The few that lag have problems with audio buffering, so increasing the video fifo to 1360k helps but cannot solve the audio sync problem. Testing with audio_queue 1, video_queue 3.7, video_fifo 0.32 (audio_fifo 0.02 helps sometimes). For the purposes of this issue, the interlaced FX case is not working; if fx_early_release works then it's OK to close. There are confirmed reductions in buffers, for a practical minimum of 4. It plays very smoothly, with a confirmed increase in video capability from birds60Mbit lag-free to birds80Mbit lag-free. Birds90Mbit plays with a little lag even with a 24M video fifo + 40M video queue. audio_renderRIL: sent 0/779 samples
@brunogm0 your last message isn't easy to read.
The advanced deinterlace patch is not working and the 3 extra buffers are still used. This issue can be closed once fx_early_release works. Sorry for the confusing data; I tried to report the new behaviour for each major sample. I only have two interlaced samples. If you want, I can upload one.
How are you testing the interlaced files? |
Testing with the correct omxplayer and firmware versions; the fx_early_release option is not working. So the patch to omxplayer doesn't touch the FX pool?
The patch to omxplayer will reduce memory usage by one frame (when --nodeinterlace is not used) with any firmware version, with or without fx_no_early_release. However, for some streams it may hang with older firmware, or with fx_no_early_release=1. With the latest firmware and with default settings (fx_no_early_release=0), it should not hang.
OK, closing. Just to point out that 1080p@60Hz now works in a very large number of samples. It's smoother with the cppcheck performance issue fix.
I initially posted in omxplayer, but I guess this allocation comes from the firmware. So, as I posted here: popcornmix/omxplayer#236
The explanation for the two extra buffers for the scheduler and renderer is fine.
My updated question, which has gone months without an answer, is why the minimum of 7. For the 1-ref case there should be 3 total buffers, but with this formula there would be 4 extra buffers, resulting in too much GPU memory pressure. By the way, I am an expert in codecs, and the "should be" minimum of 7 is nowhere in the H.264 standard. From the spec there is only the DPB allocation, capped at 16:
DPB = min(floor(MaxDpbMbs / (PicWidthInMbs * FrameHeightInMbs)), 16)
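As a concrete check of that formula (MaxDpbMbs values taken from Table A-1 of the spec; 1080p is coded as 1920x1088, i.e. 120x68 macroblocks), a small worked example:

    #include <stdio.h>

    /* Worked example of the formula above for a 1920x1088 (1080p) frame.
     * MaxDpbMbs values come from Table A-1 of ISO/IEC 14496-10. */
    int main(void)
    {
        const unsigned pic_width_in_mbs    = 1920 / 16;                     /* 120  */
        const unsigned frame_height_in_mbs = 1088 / 16;                     /* 68   */
        const unsigned frame_mbs = pic_width_in_mbs * frame_height_in_mbs;  /* 8160 */

        unsigned dpb_l41 = 32768  / frame_mbs;  /* level 4.1: MaxDpbMbs = 32768  */
        unsigned dpb_l50 = 110400 / frame_mbs;  /* level 5.0: MaxDpbMbs = 110400 */
        if (dpb_l41 > 16) dpb_l41 = 16;
        if (dpb_l50 > 16) dpb_l50 = 16;

        /* Prints 4 and 13.  A level 5.0 1080p stream with no VUI bitstream
         * restriction therefore implies 13 DPB frames + 1 working frame = 14,
         * and with omxplayer's 3 extra buffers that matches the 17 pools seen. */
        printf("level 4.1: %u DPB frames, level 5.0: %u DPB frames\n", dpb_l41, dpb_l50);
        return 0;
    }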
So, as I presented, even in the maximum-references case only 2 extra buffers are needed.