
Too much video_decodeRIL:image pool allocated for H.264 #337

Closed
brunogm0 opened this issue Dec 21, 2014 · 32 comments

@brunogm0

I initially posted this in omxplayer, but I guess this allocation comes from the firmware. See my original post: popcornmix/omxplayer#236

The explanation for the two extra buffers for the scheduler and renderer is fine.
My updated question, which has gone unanswered for months, is why the minimum of 7. For the 1-ref case there should be 3 buffers in total, but with this formula there are 4 extra buffers, resulting in too much GPU memory pressure. By the way, I have codec expertise, and the "should be" minimum of 7 appears nowhere in the H.264 standard. What the spec defines is the DPB allocation, with a maximum of 16:
DPB = min(floor(MaxDpbMbs / (PicWidthInMbs * FrameHeightInMbs)), 16)

So, as I presented, even for the maximum-references case there are 2 extra buffers.
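
To make that concrete, here is a quick sketch of the level-derived bound (my own illustration, using the MaxDpbMbs constants from Table A-1 of the spec; this is not firmware code):

```c
/* Sketch: level-derived DPB frame count per the spec formula above.
 * MaxDpbMbs values are the Table A-1 constants for the levels discussed here. */
#include <stdio.h>

static int max_dpb_frames(int max_dpb_mbs, int width, int height)
{
    int pic_width_in_mbs    = (width  + 15) / 16;
    int frame_height_in_mbs = (height + 15) / 16;
    int frames = max_dpb_mbs / (pic_width_in_mbs * frame_height_in_mbs);
    return frames < 16 ? frames : 16;
}

int main(void)
{
    printf("1080p @ L4.0/4.1 (MaxDpbMbs 32768):  %d frames\n",
           max_dpb_frames(32768, 1920, 1080));    /* -> 4  */
    printf("1080p @ L5.0     (MaxDpbMbs 110400): %d frames\n",
           max_dpb_frames(110400, 1920, 1080));   /* -> 13 */
    return 0;
}
```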

@6by9

6by9 commented Jan 5, 2015

How have you determined the number of buffers allocated in each of your cases?

@brunogm0
Author

brunogm0 commented Jan 5, 2015

As explained in omxplayer issue 236, all the info comes from vcdbg, plus a ton of tests with my video library.
So for the 16-ref case it shows 20 instead of 18.

(Each 1080p buffer is about 3 MB.)

@6by9

6by9 commented Jan 5, 2015

Please try running "sudo vcgencmd set_logging level=64" before your test, and "sudo vcdbg log msg" afterwards. It should log a line similar to:
"video_decode:RIL: setup image requirements (BWxBH (DWxDH) x N)"
where BWxBH is the buffer width by buffer height, DWxDH is the display width/height, and N is the number of buffers the decoder has determined are needed. (There may be a line which reads "new" instead of "setup" - that is when the settings have been updated based on decoding the header bytes.)
Does N represent the value you expect from the stream?

The IL layer will then increase N by the value set under OMX_IndexParamBrcmExtraBuffers (default 2, to cope with double buffering the display), and then take max(7, result), i.e. enforce a minimum of 7. I'm not 100% certain of the reason behind the max, but I would guess that it links to supporting higher framerates at lower resolutions.
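
In code terms, my understanding of the sizing is roughly this (a paraphrase of the behaviour described, not the actual firmware source):

```c
/* Paraphrase of the pool sizing described above - not the real firmware code.
 * codec_images  = frames the codec asked for (references + current frame)
 * extra_buffers = OMX_IndexParamBrcmExtraBuffers (default 2)               */
static unsigned int image_pool_size(unsigned int codec_images,
                                    unsigned int extra_buffers)
{
    unsigned int n = codec_images + extra_buffers;
    return n < 7 ? 7 : n;   /* the max(7, ...) minimum being questioned here */
}
/* e.g. a 1-ref stream: (1 + 1) + 2 = 4, bumped up to 7;
 *      a 4-ref stream: (4 + 1) + 2 = 7, unchanged.      */
```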

Please note that omxplayer does change OMX_IndexParamBrcmExtraBuffers as deinterlacing requires extra buffers to be held as references. Were any of your clips interlaced?

Edit to correct vcgencmd line (=64, not =128)

@brunogm0
Author

brunogm0 commented Jan 6, 2015

Format/Info : Advanced Video Codec
Format profile : Main@L4.0
Format settings, CABAC : Yes
Format settings, ReFrames : 4 frames
Duration : 3mn 59s
Bit rate : 4 594 Kbps
Width : 1 280 pixels
Height : 720 pixels
Display aspect ratio : 16:9
Frame rate mode : Variable
Frame rate : 59.940 fps
Minimum frame rate : 59.920 fps
Maximum frame rate : 59.960 fps
Color space : YUV
Chroma subsampling : 4:2:0
Bit depth : 8 bits
Scan type : Progressive
Stream size : 131 MiB (96%)

video_decode:1:RIL: setup image requirements (1280x720(1280x720)x5)

sudo /opt/vc/bin/vcdbg pools
video_decodeRIL:image pool:
pool_flags : 0x00000042: DIRECT
magic : pOOl
object_size : 1.4M (1433728)
nobjects : 8 ######allocated buffers########
allocated : true
overhead_size : 64
destroying : false
refcount : 1
alloc_fails : 541

object 0
magic : p00J
refcount : 1
mem : 136
offset : 0
last_thread.acquire : ILVDecode
last_thread.release : ILVDecode

object 1
magic : p00J
refcount : 2
mem : 137
offset : 0
last_thread.acquire : ILVDecode
last_thread.release : ILVDecode

object 2
magic : p00J
refcount : 1
mem : 138
offset : 0
last_thread.acquire : ILVDecode
last_thread.release : H264#0 Outer

object 3
magic : p00J
refcount : 1
mem : 139
offset : 0
last_thread.acquire : ILVScheduler
last_thread.release : ILVScheduler

object 4
magic : p00J
refcount : 1
mem : 140
offset : 0
last_thread.acquire : ILVScheduler
last_thread.release : ILVScheduler

object 5
magic : p00J
refcount : 2
mem : 141
offset : 0
last_thread.acquire : ILVDecode
last_thread.release : ILVDecode

object 6
magic : p00J
refcount : 1
mem : 142
offset : 0
last_thread.acquire : ILVDecode
last_thread.release : H264#0 Outer

object 7
magic : p00J
refcount : 2
mem : 143
offset : 0
last_thread.acquire : ILVDecode
last_thread.release : ILVDecode

N shows as 5; maybe the extra one is the current frame, but I'm only checking. If I understand correctly, besides writing to video_scheduler and rendering from video_render, it allocates TWO extra buffers for double buffering. BUT the mechanism of a scheduler plus a renderer already provides "double buffering" to eliminate tearing. Effectively 8 buffers are allocated, so there is an error somewhere. Also, in the omxplayer issue there was a comment by popcornmix quoting min(7, DPB+2), not max 7, even though H.264 limits the DPB to 16 references. I will continue running other files later, but I don't have interlaced ones. The limit in the hardware decode block is a macroblock line of 1920, so anything wider than that does not run.

I have already run some heavyweight High-profile Level 4.1 16-ref 1080p@60fps clips and a 120fps one, with 96 MB or at most 128 MB of GPU RAM, so I don't want to limit the capabilities; but the lower the GPU memory, the more ARM memory is available, so everyone benefits. omxplayer's video_queue allocates from ARM memory, but video_fifo comes from the available GPU memory. The fifo matters more for sync issues, so any wasted memory could go to the fifo.

@6by9

6by9 commented Jan 6, 2015

If you've got 4 reference frames contributing to the current frame, then you've got to be able to hold all of those and the current one in memory, hence 5 does make sense.

Popcornmix quoted it as "max (7, (DPB size + 2))", same as me. Not min.

I haven't got my laptop with me today so can't check under what other conditions it adds anything else. As I read it last night, it only adds the value specified by IL parameter OMX_IndexParamBrcmExtraBuffers.
The default value has been set to keep the pipe running smoothly in 99% of cases. You can alter that value to your heart's content (even setting it negative) to get any memory requirement you want; just don't come complaining when things don't decode! "With great power comes great responsibility."

@brunogm0
Author

brunogm0 commented Jan 6, 2015

Sorry, I expressed that poorly. The point is that the current buffer is written during decoding, then copied to the scheduler and then copied to the renderer; effectively a "triple buffering" is in effect, so 4+2 would allocate 6.
OK, but the pipe is not clearly defined, and for low-ref streams the minimum of 7 can surely be reduced; whether it should be 3 (based on a minimum of DPB+2) or more, let's leave to testing.

Edit: testing is ongoing, but a 2-ref file allocated 17 pools:
video_decode:54:RIL: setup image requirements (1920x1088(1920x1088)x14)

Format/Info : Advanced Video Codec
Format profile : Main@L5.0
Format settings, CABAC : No
Format settings, ReFrames : 2 frames
Format settings, GOP : M=2, N=24
Codec ID : avc1
Codec ID/Info : Advanced Video Coding
Duration : 1mn 14s
Bit rate : 59.9 Mbps
Width : 1 920 pixels
Height : 1 080 pixels
Display aspect ratio : 16:9
Frame rate mode : Constant
Frame rate : 59.940 fps
Color space : YUV
Chroma subsampling : 4:2:0
Bit depth : 8 bits
Scan type : Progressive
Bits/(Pixel*Frame) : 0.482
Stream size : 531 MiB (97%)
Language : English
Encoded date : UTC 2013-11-01 22:10:04

@6by9

6by9 commented Jan 6, 2015

There should be no copying going on. References to the images in the pool are passed down the pipe.

The renderer will not remove the current buffer from the display until a new one is presented. Generally a scheduler wants the frame a bit ahead of time or it risks being late. The decoder hardware needs about 25ms to decode a frame. Between those you need 3 frames circulating, plus the reference frames needed by the decoder. The codec seems to already add one frame for the "currently being decoded" frame, hence +2 for the others.

You have a parameter available that allows you to alter the number of buffers allocated. If you have a real issue then feel free to alter the omxplayer code and create a pull request. I can't see there being a great need here to try and alter the firmware when it has a chance of causing glitching.

With regard to your 17 images allocated: if you've got particular files that cause odd effects then you're going to have to post them (or at least the first few seconds) somewhere. One of us can then analyse what is going on in the codec, as that may be a real parsing bug.

@brunogm0
Author

brunogm0 commented Jan 6, 2015

DPB + current + render + sched -> the minimum is 4, not 7! If there was a reason, I am willing to test it and help lower the firmware default for everybody. Maybe a reduction to 5 is enough.

The problematic 2-ref sample from earlier: http://killzone.dl.playstation.net/killzone/kzsf_multiplayer/KZSF_HIGHTBITRATE_H264.zip

@6by9

6by9 commented Jan 7, 2015

@popcornmix I don't know the full reason why (I suspect) TJG set it as a minimum of 7, but personally I don't see it as a big failing so would suggest leaving as is. It was probably to get around issues with streams that incorrectly defined the number of reference frames required, hence assuming 4 reference frames + the display frames = 7 as the minimum.
The risk is very high for causing other videos not to play back smoothly. Your call, but I'd suggest "won't fix" here.

The 2REF clip posted is a level 5.0 clip, 1080P60. BCM2835 is only specified to support 4.1, so anything above is an unsupported bonus.

Reading the code and how it parses the headers, it will only drop the number of frames it declares it requires if both sps->vui_parameters_present_flag and sps->vui_parameters.bitstream_restriction_flag are present, and then it uses the sps->vui_parameters.max_dec_frame_buffering value.
bitstream_restriction_flag is not set in your stream, therefore it goes solely on the level of 5.0. Floor(110400/(120*68)) = 13, plus the one for the current frame means the codec requests 14 frames. Sounds like a valid thing to do in my book. I don't know what your analysis tool (mediainfo?) is doing to determine the number of reference frames - feel free to analyse it yourself and report back, but to me it appears that the firmware is doing the safest thing it can based on the info provided.
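
In rough pseudo-C, the decision looks something like this (my paraphrase of the behaviour described above, not the actual firmware parser; the struct is just a stand-in for the parsed SPS):

```c
/* Rough sketch of the described header-parsing decision - not firmware code.
 * The struct is a stand-in for the parsed SPS; field names follow the spec. */
struct sps_info {
    int vui_parameters_present_flag;
    int bitstream_restriction_flag;
    int max_dec_frame_buffering;
    int max_dpb_mbs;          /* MaxDpbMbs implied by level_idc (Table A-1) */
    int pic_width_in_mbs;
    int frame_height_in_mbs;
};

static int frames_required(const struct sps_info *sps)
{
    if (sps->vui_parameters_present_flag && sps->bitstream_restriction_flag)
        return sps->max_dec_frame_buffering + 1;    /* +1: frame being decoded */

    /* No bitstream restriction info: fall back to the level-derived DPB size. */
    int dpb = sps->max_dpb_mbs / (sps->pic_width_in_mbs * sps->frame_height_in_mbs);
    if (dpb > 16)
        dpb = 16;
    return dpb + 1;   /* KZSF: 110400 / (120 * 68) = 13, +1 = 14 */
}
```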

@brunogm0
Author

brunogm0 commented Jan 7, 2015

Well, DXVA compatibility level checks are skipped by default because too many encoders wrongly set level 5.0/5.1 and/or set DPB=16. There are so many issues about this that the compatibility skips were standardized in the community. (GPU drivers were forced to implement loose checks.)

By the table, level 5.0 = 1080p@72fps, and I think the RPi HDMI output is limited to 60 Hz. So the firmware correctly calculates the level 5 limits and therefore allocates 14 buffers for the sample? Why, if in theory its hardware is level 4.1? Or a fix could cap at the level 4.2 limits, since OMX_IndexParamImagePoolSize defaults to 1080p (1080p@60Hz effective limit).

A 2560x1600 sample appeared to decode OK in the vcdbg level=64 log and used 8 image pools of 8 MB, but no image is shown; it may deserve analysis.

Unless video_sched can do a better job of sync control, maybe a reduction alone is not going to improve sync; perhaps a more robust video_scheduler or advanced sync control is needed?

"About "ReFrames" field in MediaInfo, provide the value of "max_num_ref_frames" in seq_parameter_set_data( ) of ISO/IEC 14496-10 header." So MediaInfo is reading out the DPB size, not the number of reference frames.

x264 source:
sps->vui.i_num_reorder_frames = param->i_bframe_pyramid ? 2 : param->i_bframe ? 1 : 0;
/* extra slot with pyramid so that we don't have to override order of forgetting old pictures */
sps->vui.i_max_dec_frame_buffering = sps->i_num_ref_frames = X264_MIN(X264_REF_MAX, X264_MAX4(param->i_frame_reference, 1 + sps->vui.i_num_reorder_frames, param->i_bframe_pyramid ? 4 : 1, param->i_dpb_size));

At least for the 16-DPB case, plus current + render + sched, the maximum would be 19.
Then ref+B gives 2; with pyramid it is 4. The minimum-7 case corresponds to B-pyramid normal. Pools could be 4 as I put earlier, but check for B-frames (+1, regardless of the number of B-frames) and for B-pyramid (+2, or +1 in the strict case).

TJG's minimum is probably based on refs with B-frames and normal pyramid. That is not very common, because realtime encoders don't use B-frames, let alone pyramid; Blu-ray compliance only allows pyramid-strict, for example.
So fix it to 4 for streams without B-frames. Let me look at where in the SPS the pyramid and B-frame usage can be checked.

@6by9

6by9 commented Jan 7, 2015

H264 hardware block limits are independent of the HDMI output block. The spec for the H264 hardware block is nominally max level 4.1, i.e. 1080P30.
The codebase was shared with the next generation of hardware which I believe could support level 5.0, hence it doesn't throw an error on it. Also it can generally decode the data, just not in realtime.

Certainly for encode there is a line buffer in the hardware (probably actually a macroblock row in size), and that is 2048 pixels in width. Therefore I'm not surprised anything wider than that fails.

popcornmix is trying to track back in the source control system for when the max(7) happened. Nothing as yet, and not really the highest priority.

@brunogm0
Author

After a little fewer than 200 samples, here is a summary list:
#DPB-sample# #vcdbg pools# #log msg level=64 requirements# #profile, comments#

1finalBattle pools 7 log 2 highL4 *
1FXV pools 7 log 2 high bframes=0*
1akb pools 7 log 2 mainL4.1
1redband pools 7 log 2 highL4
1creepy pools 7 log 2 baseL3 277x288p
1Asteroid pools 7 log 2 highL3.1 720p youtube
1dead pools 8 log 5 highL4 *
1Unreal pools 7 log 2 highL3.1
1lift pools 7 log 2
1orianthi pools 7 log 2 baseL2.1
1Entering pools 7 log 2 highL3.1
1Eva pools 7 log 2
2u2 pools 7 log 3 highL4
2salad pools 7 log 3 mainL2.2
2KZpro pools 8 log 5 nocabac mainL4.0*
_2testgravacao pools 8+3fx log 5 highL4 _interlaced*, --nodeinterlace pool=7
2big_epi pools 7 log 3 mainL3
2MemoFlash pools 7 log 3 highL5.1 --ref=1 bframes=2 2008
2earth pools 8 log 5 mainL4
*2Final13 pools 8 log 5 *

2RedReel pools 8 log 5 * applequicktime 2009
2versus13 pools 13 log 10 2011 main *
2KZSF pools 17 log 14 ***L5

3b24kbps pools 7 log 4 ref=1 but pyramid and encoded in 2008 with ref=1 so 2 extra for B-pyramid
3vortex pools 7 log 4 highL4.1 109mbps, max 40m
3nmsGame pools 7 log 4 highL5.1 47~240mbps
3TD pools 7 log 4 mainL5.1
*
3nms_infi pools 8 log 5 main@L5.1 52~240mbps *
3arg pools 7 log 4 mainL2.1
3t1440 pools 7 log 4
3ALJ pools 7 log 4 high
4gabi pools 7 log 5 interlaced --nodeinterlace/native; deinter pools 8+3fx
4forza pools 8 log 5 high4.2

4Persona pools 8 log 5 high10pL3.1 --ref=2 pyramid=2 2011
4ff13 pools 8 log 5 pyramid=2 ref=3
4rotoscan pools 8 log 5 pyramid=2 ref=3 2012
4ken pools 8 log 5 highL4.2 Most videos log ref+cur but pools is registering +3, render sched,?
#truehd versus ac3 double audio use#
4Paramount pools 8 log 5 highL4.1

4IncepT pools 8 log 5 --ref 3 pyramid=2
5puncher pools 13 log 10**** highL4.1
5FCL pools 10* log 7 _L5
6sword pools 10 log 7 baseL3
6shinNinja pools 10 log 7 high10pL4.1
6Accel pools 10 log 7 high10L5.1 pyramid=2 2012
8tayo pools 12 log 9 HighL5.1 x264ref5,pyramid,480p
8SVS pools 12 log 9 highL5.1 --ref5 pyramid
9fate pools 13 log 10 high@l4.1
9psy pools 13 log 10 high10L4.1
9hatsuki pools 13 log 10 highL4.1
*_11Crack pools 15 log 12 2008 ref=8 pyramid **
**11MepInamix pools 15 log 12 2008 ref=8 pyramid=1
12Hinata pools 16 log 13 highL5.1 480p
12Koop pools 16 log 13 highL5.0 2009 pyramid
16hotd pools 20 log 17 high10L5.1
16mepMyWorld pools 20 log 17 #all 16-ref files tested fall into this pools alloc
16nostroHidden pools 20 log 17

A bunch of "vc_mem_map: length 8192 is too big" messages made it a bit of a pain to test everything.
Some of the samples marked with * can be further analysed in a parser to compare flags. x264 samples after Oct 2008 are correct; the problem is num_reorder_frames being badly missed (other encoders maybe still get it wrong). A rough fix tested in FFmpeg but scrapped was exactly the solution the firmware uses on the 2-DPB KZSF sample to allocate 14. When decoding with VAAPI the sample in question uses the DPB, but I still have to look at the code in ffmpeg: "[ffmpeg/video] h264: Increasing reorder buffer to 1"
A few issues can be diagnosed from this table, but assuming the minimum of 7 == 3 ref + cur + sched + render + ???, in all cases where the DPB wasn't respected the codec showed auto-detection, so the minimum can safely be lowered to 3, including a B-frame.
I'm pretty sure the extra buffers besides current, sched and render don't need to count towards the CODEC minimum. I have tested some mods (they need cleaning) to omxplayer, and the KZSF sample and others are very fluid compared to the original. Also vbv_bufsize=78125 (1.25 s * 62500 from level 4.1) shows problems even with video_fifo=78, video_queue=50, gpu_mem=136.

@deborah-c

There are quite a number of issues here; let me try and disentangle them a little:

1. Minimum of seven buffers

I think this is very old code which has probably survived longer than it should have (although it's lost its explanatory comments in the process...) The oldest code we have access to (from August 2011) notes it as "add two frames for display, use minimum of seven frames, two in decode, three in pending list, two in display"; I suspect that the original code goes back to 2006-7, and in particular predates the codec having any way to tell the outside world what shape the stream is.

What it's trying to achieve is two reference buffers (for older codecs -- bear in mind video_decode doesn't just do H.264), three buffers for latency adaptation (and a working frame), and the two buffers needed for downstream pipeline stages. Now, bear in mind that the code was originally targeted at mobile phones and similar: deploying fixes is hard, and customers (of the chip manufacturer) tend to be very displeased by any kind of regressions. Software codecs also tend to be a little more inclined to need more latency adaptation, depending on how close they are to (or past...) the edge of performance limits -- the old code actually sets a minimum of 14 for VP6, for example, in order to ensure performance (probably on 200MHz VC3 hardware). Consequently, the minimum buffer level has never been removed, in a spirit of conservatism. (In a world where H.264 can use more reference frames, it makes less sense in any case...)

I think we should be able to remove it, but it will take a certain amount of experimentation to check how much latency adaptation is needed. The scheduler and renderer don't actually provide any, because the buffer can get sent to the display arbitrarily soon after delivery to the scheduler; eliminating all latency buffering is OK for single decode use cases, but could cause problems when the decoder hardware is being time-multiplexed (including cases where the encoder and decoder run in parallel, since some hardware is shared).

2. "Four extra buffers"

I have a limited amount of test material available right now, but a quick test showed (as expected) a stream with four reference frames (and max_dpb_frames = 4) requesting five image buffers, and seven being allocated -- video_decode's extra_frames parameter defaults to two to account for the downstream pipeline stages.

We do need #ref + 1 for the decoder itself: the downstream stages won't let go of a frame until they receive another one, as far as I recall, so the decoder needs an extra working frame in order not to deadlock the pipeline. (With extra_buffers set to zero, without the extra frame we'd also stall if the DPB were entirely full of reference frames.)

Interlaced streams will cause extra_frames to be set to three, because the deinterlace module will (I believe) hang on to three buffers at a time -- I'll check exactly what the behaviour is on this. Using fewer buffers in this case would cause the pipeline to deadlock.

In the case of your streams (or at least the KZSF stream that you uploaded), OMXPlayer is under the impression that the stream is interlaced, and therefore sets extra_buffers to 3; however, it's not sufficiently convinced of it actually to create a deinterlace component and its downstream buffer pool... I will look into why it does this!

3. 2REF file allocates 17 image buffers

In this case, we're using the DPB size derived from the level to determine the size of the DPB; in the absence of the bitstream restriction section of the VUI, we have no guarantee that someone won't gratuitously code frames backwards just because they can. While this is silly, it happens both in formal conformance test streams and in some customer streams. As @6by9 comments above, this results in 13 frames for the DPB plus a working frame; for some reason (which I will look into) OMXPlayer sort-of-thinks the stream is interlaced and adds an extra one.

4. Level 5.0 support

The VC4 accelerator hardware only claims support for 4.1 (with a restriction that the picture width must be 1920 or less), it's true. However, if presented with streams at higher levels, the decoder will simply process them on a best-efforts basis, again, subject to the 1920 width restriction as that's a hardware limitation. (There's also a height restriction, but that's not usually a problem.)

For streams that are at too high a frame-rate, you simply won't get real-time performance, but for (e.g.) transcoding applications, that can be absolutely fine. For streams with supported image size and frame rate, but with a level higher than 4.1, it will simply change the buffering requirements (again, in the absence of max_dpb_frames / num_reorder_frames).

I'm slightly puzzled by your assertion that 2560x1600 was working, because the hardware has line buffers that just won't do that!

5. "Increasing reorder buffer"

It's a little difficult to follow what you're saying here, but I think the gist is "can the decoder just allocate more image buffers when it finds it needs them?" Please correct me if that isn't the case.

There are two issues here, one historical and one practical. The original VC3 decoder firmware (dec2) couldn't cope with reallocation of memory at all: it worked with whatever it was given at the start of time. This, obviously, is a problem with streams that have new sequences part way through with increased memory use. The dec3 decoder was built from the ground up to be able to do allocation dynamically. However, it only does so at sequence boundaries. This is partly because some customers wanted systems that used static memory allocation, in order to guarantee that applications didn't interfere with each other -- bear in mind that VC4 hardware isn't always used with Linux or Android -- and partly because the underlying image pool abstraction doesn't support changing pool sizes after creation time.

Now, clearly, the image pool mechanics could be changed to allow extra image buffers to be added to the pool after creation, and the decoder could be changed to call that interface when it found more reordering than it had previously accounted for. However (a) the existing mechanics only allow for propagation of (and reaction to) allocation errors at sequence boundaries, so would have to be reworked, (b) image allocation is slightly tricky because of hardware addressing limitations, and (c) it's rather beyond the scope of the work I'm currently budgeted to do on the decoder...

There's also an issue with speed of allocation. dec3 is designed to allow most of its internal buffers (the CDB, for example) to start small, and be expanded dynamically on demand. However, in practice, when this was first used with Android, it turned out that allocating the memory was really slow, and caused repeated stutters as the decoder progressively ramped up the amount of buffering being used. Hence in practice, the default buffer sizes are set rather larger than originally intended (although it's possible to override them, in theory). The Raspberry Pi environment is probably a bit more friendly in that respect, but I'd worry about making stream playback stop being smooth, particularly as you'd also like to remove the extra latency tolerance!

@deborah-c

OMXplayer requests 3 extra buffers unless it's been told (by --nodeinterlace) that the stream definitely isn't interlaced. This allows it to insert the deinterlacer into the pipeline if it discovers (via a displayed frame) that the stream requires it; if it only had two extra buffers, it wouldn't be able to do this, since (as noted above) it's impossible dynamically to add buffers to an image pool.
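
For reference, an IL client sets that parameter along these lines (a minimal sketch using the standard OMX IL call and the Broadcom vendor index; the port number, the header/version setup and the missing error handling are my simplifications, so check omxplayer's own code for the exact usage):

```c
/* Sketch only: adjusting the decoder's extra output buffers via
 * OMX_IndexParamBrcmExtraBuffers. Assumes the Broadcom IL headers and an
 * already-created video_decode component handle; port 131 (video_decode
 * output, as far as I recall) and the lack of error handling are simplifications. */
#include <string.h>
#include <IL/OMX_Core.h>
#include <IL/OMX_Broadcom.h>

static OMX_ERRORTYPE set_extra_buffers(OMX_HANDLETYPE decoder,
                                        OMX_U32 port, OMX_U32 extra)
{
    OMX_PARAM_U32TYPE param;
    memset(&param, 0, sizeof(param));
    param.nSize = sizeof(param);
    param.nVersion.s.nVersionMajor = OMX_VERSION_MAJOR;
    param.nVersion.s.nVersionMinor = OMX_VERSION_MINOR;
    param.nPortIndex = port;   /* e.g. 131 for the video_decode output port */
    param.nU32 = extra;        /* 2 by default; 3 if deinterlace may be inserted */
    return OMX_SetParameter(decoder, OMX_IndexParamBrcmExtraBuffers, &param);
}
```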

@deborah-c

The advanced deinterlacer currently retains three images on its source side, but actually doesn't need to (since it operates internally on a different image format, and keeps converted copies in the output side image pool). It isn't actually necessary to keep the images on the source side at all. The fast deinterlacer can operate on the source image directly, but only needs to store one image.

Changing the advanced deinterlace to release the images as soon as they've been converted would mean that it wouldn't need an extra buffer in the decoder's pool, so we wouldn't need to set extra_buffers to three under any conditions.

@brunogm0
Author

Awesome response!
ftp://helpedia.com/pub/multimedia/x264/testvideos/

1: The two display buffers are probably for "double buffering", but you said only pointers pass between components, so besides the "on-screen" one, the second display buffer is not written at any time (as with tearing artifacts), so can it be eliminated?
DECODE+CABAC?
Is the pending list for scheduler A/V-timestamp sync (latency adaptation) or for the CodedPictureBuffer?

This latency adaptation doesn't use frame dropping to restore A/V sync in my tests.

Conservatism is fine if videos don't lag. Freed memory could go to "video_fifo" if it helps. The malloc pool is fixed at 13 MB with 10 MB free during decode while the reloc heap is starving, so I felt there was an issue somewhere.
Maybe the pending buffer matters more for smooth output than video_fifo; perhaps then a rebalancing of the allocation?

4: The 1920 line buffer limitation was my hunch, but the vcdbg msg log (level=64) was populated with normal decode messages. It is not "really" working, because no image is displayed.

5: Yes, FFMPEG h264.c lines 890~932 http://ffmpeg.org/doxygen/trunk/h264_8c_source.html#l00890
Problems arise when num_reorder_frames is not signalled. Its range is [num_ref_frames, MaxDpbSize]. Older encoders (2008) don't use num_ref_frames as the DPB size, and so need vui.max_dec_frame_buffering. If that is not present, the range above gives a minimum of num_ref_frames*2. Reordering only occurs with B-frames, and streams without them have pic_order_cnt_type=2, x264 commit d48f5f8c83a6ea5fbdd309b2ab6284bdf96550a9. The initial CPB is then 0.9 * vbv_bufsize; if unrestricted, I have to research a little. Can some of those structures use the other vcdbg malloc pool?

Note: DivX3/5 tests showed log requirements of 3 buffers and 7 pools in total. A VP8 Big Buck Bunny 480p sample doesn't play for me.

@6by9

6by9 commented Jan 22, 2015

Hi @deborah-c
"it's rather beyond the scope of the work I'm currently budgeted to do on the decoder..."
Does this mean you're working officially for Pi Towers at the moment? Useful to know if you are. I was picking GV's brain over a few codec queries, but no longer have him easily available.

"The advanced deinterlacer currently retains three images on its source side, but actually doesn't need to (since it operates internally on a different image format, and keeps converted copies in the output side image pool). It isn't actually necessary to keep the images on the source side at all."
Not all codecs need that format conversion, hence why I suspect deinterlace was kept the same for all potential sources.

Catch up with you in a fortnight.

@deborah-c

@6by9: Yes, I'm currently looking at some of the backlog of video decoder issues for them, for a few days.

I believe (after a certain amount of digging around) that the source images will be copied in the deinterlace code even when they're in the same format as the source. However, releasing the images early without breaking the metadata associated with them requires slightly more finesse. I'm experimenting at the moment.

@6by9

6by9 commented Jan 22, 2015

Agreed. It does look like it always does a convert_blt.
.image appears to be used only as a flag and then released. Call img_release_func at the end of set_src_image and set .image to 1 to designate "in use", and then only actually call img_release_func in the other places if it's != NULL && != 1?
I'll leave you to play. I've generally been taking a softly-softly approach to making big changes, and still don't see a significant issue here.

@deborah-c

@brunogm0 We do not operate a "double buffer" as such. The video scheduler holds pending frames, and forwards them to the renderer when the media clock indicates that it's time to do so. The renderer will switch images at a vsync, and only then release the old image for reuse.

If you don't have a working frame in addition to these two, then you reduce the time available for decoding and can't guarantee real time operation.

The discussion here is entirely about image buffering. The latency adaptation here is both for A/V sync, and to provide for differences in decode time between frames, latency in interrupt response or processor availability, and so on. The CPB is separate, and is internal to the decoder.

I'm not quite sure what you mean here by "video_fifo". Do you mean the CPB? If so, it's considerably smaller than a single 1080p buffer in any case, so there's not a lot of mileage in changing its size.

Decode messages will keep being generated even when the image size exceeds the decoder's capabilities; this is because the codec is still processing the data stream (metadata and slice headers, at least) in the hope that some day there might be a sequence it can handle...

The statement "reorder only occurs with B-frame" is not necessarily true for H.264. Neither is it true that all streams without reordering use pic_order_cnt_type 2. There is a huge range of bizarre streams that break most assumptions one could possibly make that aren't actually written into the standard!

The decoder deliberately uses only one block from the malloc heap (with the core instance data structure in it); all resizable structures are in the relocatable heap, by design. This is pretty fundamental to its structure, and not something that could be readily changed. It's also not really desirable in any case, since heap fragmentation can be a significant problem. Bear in mind that the decoder has many use cases that are not quite so simple.

DivX will ask for two buffers for reference frames, plus a working frame; as before, OMXplayer will add three further frames unless you tell it that there's definitely no deinterlace needed, so the usage is unsurprising. VP8 is not implemented by the hardware codec.

@deborah-c

@6by9 Yes, that's exactly what I thought a few hours back, but I think you'll find it's more complicated than that ;-)

@brunogm0
Author

OK, bizarre streams are bizarre! omxplayer's video_fifo controls buffering for the OMX.broadcom.video_decode port (130); vcdbg reloc shows it as a bunch of 80k RIL buffers. So actually, instead of buffering in those ports, does the "pending" list do a smarter job?

Example: 3 pending buffers with timestamps 10, 20, 30. The clock is 8; tick, tick, show one. Then things slow down and the clock is 29: does "pending/sched" show frame two instead of waiting a tick and skipping to the last frame, effectively dropping to keep sync? Some samples I tested with a high video_fifo play perfectly for 45 seconds, then video_fifo goes negative until 50 s, and even once the sound has finished it is still showing previous frames, so no frame dropping and no rescheduling whatsoever. Given this, would 4 pending frames contribute more against lag than video_fifo? (I'm trying to work out the conflict between these RIL buffers and the pools.)

I uploaded a sample I know is 60fps, but it plays OK for a minute with fifo=32:
vbv-bufsize=78125 http://d-h.st/S4p
vbv-bufsize=6000 http://d-h.st/3n8

@deborah-c

An experimental patch now allows the advanced deinterlace to operate without requiring extra buffers in the decoder output pool; this obviates the requirement for OMXPlayer to set extra_buffers to 3 when --nodeinterlace isn't specified.

@brunogm0
Author

brunogm0 commented Feb 4, 2015

Nice, I can test it.

Note: sorry, I was travelling.

popcornmix added a commit that referenced this issue Feb 4, 2015
See: #353

kernel: dwc_otg: fixup read-modify-write in critical paths
Be more careful about read-modify-write on registers that the FIQ touches

kernel: w1-gpio: Sort out the pullup/parasitic power tangle
See: #348

kernel: BCM270x_DT: Add i2c0_baudrate and i2c1_baudrate parameters
See: http://www.raspberrypi.org/forums/viewtopic.php?f=28&t=97314#p683127

kernel: HiFiBerry Amp: fix device-tree problems
See: raspberrypi/linux#790

firmware: Camera: Fix for reset stc timestamping mode

firmware: deinterlace: reduce memory use on input side for advanced deinterlace
See: #337

firmware: Kill max(7,) in buffer allocation
See: #337
popcornmix added a commit to Hexxeh/rpi-firmware that referenced this issue Feb 4, 2015 (same commit message as above)
@popcornmix
Contributor

Firmware and omxplayer are updated.
The max(7, ..) is removed from the pool size.
The extra video buffer is removed from omxplayer.
Can you test?

You can add:
vd_min_images=7
fx_no_early_release=1

to config.txt to revert back to the previous behaviour.

@brunogm0
Author

brunogm0 commented Feb 4, 2015

updating now... results A.S.A.P.

1creepy pools 4 log 2 baseL3 277x288p #$reduced three
1finalBattle pools 4 log 2 highL4 #$ reduce three, plays with 320kfifo 3m vqueue 1m aqueue (8860k )
1Asteroid pools 4 log 2 highL3.1 720pyoutube #$reduced three plays 320kfifo
V:4.09s 320k/320k A: 6.32 5.99s/6.34s Cv: 3047k Ca:s 107k
1FXV pools 4 log 2 high bframes=0 #$ reduce three plays 320kfifo
1akb pools 4 log 2 mainL4.1 #$ reduce three plays 480kfifo
1dead pools 7 log 5 HL4 #$reduced one but strange sample

2u2 pools 5 log 3 highL4 #$ reduced two
2RedReel pools 7 log 5 quicktime2009 20mbps max, 6.5mbps avg #$ redux one needs 640kfifo
2u2 pools 5 log 3 highL4 #$ reduced one
2salad pools 5 log 3 mainL2.2 #$ reduced two
2KZpro pools 7 log 5 nocabac mainL4.0* #$reduced one +5mfifo?
_2testgravacao pools 7+ 3fx log 5 highL4 _interlaced*, nodeinterlace pools 7 #$ reduced one in advanced deinterlace
2big_epi pools 5 log 3 mainL3 #$ reduced two, 320kfifo
2MemoFlash pools 7 log 3 highL5.1 --ref=1 bframes=2 2008 #$ reduced two, 320kfifo %%memoflash3 needs 400kfifo
2earthNight pools 7 log 5 mainL4*** #$ reduced one, 320kfifo
2Final13 pools 7 log 5 ** #$ reduced one, 320kfifo
2RedReel pools 7 log 5 ** applequicktime 2009 #$ redux one 640kfifo
2versus13 pools 12 log 10 2011 main ** #$ redux one, 400kfifo

2kzsf pools 16 log 14 ****L5 #$ redux one, 480kfifo plays correctly but audio fifo 0.29/6.07 at lowest on PCM; the same sample with FLAC plays better! Using more than 1360k vfifo gives bigger lag. Also a video_queue bigger than 4M gives lag; 3~3.7 is better.

2testgravacao pools 7+ 3fx log 5 highL4 interlaced
#$ reduced one with advanced deinterlace maintains the 3 FX pools, nodeinterlace remains the same.
4gabi pools 7+ 3fx log 5 MainL4 interlaced
#$ reduced one with advanced deinterlace maintains the 3 FX pools, nodeinterlace remains the same.
Same behavior with vd_min_images=7 fx_no_early_release=1 !!!

3vortex pools 6 log 4 #$reduced one
3b24kbps ref=1 but pyramid ant encoded in 2008 with ref=1 so 2 extra dpb for old pyramid bframe. pools 6 log 4 #$ redux one
3vortex pools 6 log 4 highL4.1 109mbps, max 40m #$reduction of one buffer
3nmsGame pools 6 log 4 highL5.1 47~240mbps #$ redux one 960kfifo has lag in first 2s scene, smoother difficult sample
3TD pools 6 log 4 mainL5.1
* #$ redux one, 720kfifo
3nms_infi pools 7 log 5 main@L5.1 52~240mbps * #$ redux one, ?1280k fifo video faster than audio. async negative a lot!, smoother than before very difficult sample
3arg pools 6 log 4 mainL2.1 #$ redux one, 320kfifo
3t1440 pools 6 log 4 #$ reduced one,fifo320k 2560x1440 garbage but plays and shows something!
3ALJ pools 6 log 4 high #$ reduced one +960kfifo!!!

3Into pools 6 log 4 #$reduced one
4Paramount pools 7 log 5 #$reduced one
5FCL pools 9 log 7 L5 #$reduced one, strange
8Lia pools 11 log 9 ##reduced one can play with gpumem=64 if video_fifo=240k and FB1360x768
16Demi pools 19 log 17 #$reduced one, most 16 ref samples fall into this.

Most samples play fine with a 320k vfifo, including birds80mbit. The few that lag have problems with audio buffering, so increasing vfifo to 1360k helps but cannot solve the audio sync problem. Testing with audio_queue 1, video_queue 3.7, video_fifo 0.32 (audio_fifo 0.02 helps sometimes).

For the purposes of this issue, the interlaced FX path is not working; once fx early release works, it's OK to close. There are confirmed reductions in buffers, down to a practical minimum of 4. It's playing very smoothly, with a confirmed increase in video capability from birds60Mbit lag-free to birds80Mbit lag-free. Birds90Mbit plays with a little lag even with 24M vfifo + 40M vqueue.

audio_renderRIL: sent 0/779 samples
debug_sym: AccessVideoCoreMemory: mmap failed: Invalid argument(22)
Unable to read log message header from 0xdffffffc

@popcornmix
Contributor

@brunogm0 your last message isn't easy to read.
Can you confirm if you are happy that the original issue posted (too many buffers allocated) is resolved? If you have other issues, it's best to open a new issue.

@brunogm0
Author

The advanced deinterlace patch is not working and the 3 extra buffers are still used. This issue can be closed once fx early release works.

Sorry for the confusing data; I tried to report the new behaviour for each major sample. I only have two interlaced samples. If you want, I can upload one?

@popcornmix
Contributor

How are you testing the interlaced files?
There should be 1 fewer buffer allocated than before, but it needs this patch:
popcornmix/omxplayer@ced2387 to get the memory
Yes, more buffers are required when deinterlacing - that is unavoidable.

@brunogm0
Author

I'm testing with the correct omxplayer and firmware versions.
These options (vd_min_images=7 fx_no_early_release=1) don't restore the original behaviour for the interlaced path, or for samples that used 8 pools. The min option does work to increase the count for samples that now allocate 5 or 4.

The fx early-release change is not taking effect. So the patch to omxplayer doesn't touch the FX pool?

@popcornmix
Contributor

The patch to omxplayer will reduce memory usage by one frame (when --nodeinterlace is not used) with any firmware version, with or without fx_no_early_release.

However for some streams, it may hang with older firmware, or with fx_no_early_release=1. With latest firmware and with default settings (fx_no_early_release=0), it should not hang.

@brunogm0
Author

OK, closing.

Just to point out that 1080p@60Hz now works in a very large number of samples, and is smoother with the cppcheck performance-issue fix.

neuschaefer pushed a commit to neuschaefer/raspi-binary-firmware that referenced this issue Feb 27, 2017 (same commit message as above)