Hardware Decoding 10x faster than Software Decoding? #443
Comments
What does "speed" mean? Is it GHz? Time it takes to do something? |
The speed is something that ffmpeg outputs. It indicates the velocity of its progress through the file relative to realtime playback.
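To put a rough formula on it (my paraphrase of that output, not ffmpeg documentation):

$$\text{speed} \approx \frac{\text{media time processed}}{\text{wall-clock time elapsed}}$$

So speed=2.0x means a 60-minute file gets through the pipeline in roughly 30 minutes, and anything below 1.0x cannot keep up with realtime playback.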
The two videos here are showing these tests in action, including a visualization of the transcoding pipeline topology:
For convenience, all-in-one (ff binary + test files): https://1drv.ms/u/c/8a9863d7afb15f9b/Ebkn1YXBEBlMutUN5NuzO1oB7NSZNnrKyNdpJtJqKkrxhw?e=jqUq9B Just unzip and you can run the commands in the Excel file.
@softworkz These are interesting results. But you are totally missing the point. The discussion was about efficiency (power consumption), not decoding speed. I never said that a HW unit can decode 10x as fast as a CPU. That would be ridiculous - why would they add such an overpowered HW unit, wasting silicon area? I said that it uses 10x less energy during playback, thus saving battery and keeping the device cool. And I gave you proof that the numbers are actually even higher for most modern codecs.

It is a rule of thumb that an ASIC can perform an algorithm about an order of magnitude more efficiently than a general-purpose CPU (of course, strongly depending on the actual algorithm and implementation details). That's why they are used so often. Modern CPUs are also starting to integrate ASICs for crypto algorithms, since these are becoming more and more common and would otherwise eat a considerable amount of CPU power. CPU manufacturers surely would not do that if it did not have a considerable effect. And crypto mining of the popular coins is mainly done on ASICs now, since they are so much more efficient, allowing much higher revenues than GPU mining.
I wonder if it would be possible to speed up hwdownload and hwupload by parallelizing them (doing multiple downloads and uploads at the same time). Theoretically, the speed of PCIe 3.0 x16 should be high enough for download + upload of 4K frames in real time. And when running on an iGPU, things do not even have to go through PCIe. Why is it so slow then? FFmpeg 7 does run filters in a graph in parallel, but a single hwdownload or hwupload only processes one frame at a time.
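For a rough sense of scale (assuming 8-bit 4K NV12 frames at 60 fps and the nominal PCIe 3.0 x16 rate of about 15.75 GB/s):

$$3840 \times 2160 \times 1.5\ \text{bytes} \approx 12.4\ \text{MB per frame}$$

$$12.4\ \text{MB} \times 60\ \text{fps} \approx 0.75\ \text{GB/s per direction} \approx 1.5\ \text{GB/s for download + upload}$$

That is roughly a tenth of the nominal link bandwidth, so the raw transfer rate alone should not be the limiting factor.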
The PCIe 3 port is not the bottleneck; a 4090 can barely saturate it.
There's locking for D3D11 frame access in ffmpeg. I have removed that in our ffmpeg, because ffmpeg filtering is (was) single-threaded, but that brought just a small improvement in certain cases.
What I've often been wondering is why they can't just remap the memory instead of copying in the case of iGPUs - it's the same memory anyway.
I have not worked with the code from newer versions, but running filters in parallel can only mean that multiple filters can execute in parallel. Given the architecture, it's not possible to have a single filter executing in parallel.
Remapping for iGPUs is available in DirectX 12.
Oh, and there's a doubling involved. When you upload or download a D3D texture, you get a pointer in CPU memory for accessing the data, but you don't "own" the data, so you need to copy it to or from your own memory range. It doesn't double PCIe bandwidth, but memory bandwidth. And CPU time for copying.
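A minimal sketch of that download path (hypothetical names and layout; assumes a staging texture created with D3D11_USAGE_STAGING and D3D11_CPU_ACCESS_READ, and a single-slice source texture, whereas decoders often hand out array textures that would need CopySubresourceRegion instead):

```cpp
#include <d3d11.h>
#include <cstdint>
#include <cstring>

// Copy one decoded D3D11 texture plane into a caller-owned buffer.
// ctx, gpu_tex, staging_tex, dst, dst_pitch, row_bytes and height are illustrative.
void download_plane(ID3D11DeviceContext* ctx,
                    ID3D11Texture2D* gpu_tex,
                    ID3D11Texture2D* staging_tex,
                    uint8_t* dst, size_t dst_pitch,
                    size_t row_bytes, unsigned height)
{
    // First transfer: the GPU copies the decoder output into the CPU-readable staging texture.
    ctx->CopyResource(staging_tex, gpu_tex);

    // Map returns a pointer into driver-owned memory, only valid until Unmap,
    // so a second, CPU-side copy into our own buffer is unavoidable.
    D3D11_MAPPED_SUBRESOURCE mapped = {};
    if (SUCCEEDED(ctx->Map(staging_tex, 0, D3D11_MAP_READ, 0, &mapped))) {
        const uint8_t* src = static_cast<const uint8_t*>(mapped.pData);
        for (unsigned y = 0; y < height; ++y) {
            // RowPitch is often wider than the visible row, so copy row by row.
            std::memcpy(dst + y * dst_pitch, src + y * mapped.RowPitch, row_bytes);
        }
        ctx->Unmap(staging_tex, 0);
    }
}
```

The first transfer happens in CopyResource; the Map/memcpy pass is the second copy, which costs system memory bandwidth and CPU time rather than PCIe bandwidth.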
There's also the requirement of using array textures with D3D11, which is why it's slower than DXVA2 - or wait - I think that was just a requirement for QSV with D3D11.
I'm afraid it's not even close...
Reproducing
Here's an Excel file including all the ffmpeg commands: SubtitleBurnInTests.xlsx
The test.mkv is "Samsung Dubai", which you can find on DemoLandia.net
Subtitles file: subs.zip
Assessing the Results
General
First of all, this aligns exactly with what I wrote in #439 (comment) and the subsequent posts below.
My laptop (Tiger Lake) has similar graphics to my PC (Rocket Lake), which is why the results are similar. Unfortunately, the older laptop I have is too old for this. Feel free to run these tests on weaker machines. You will see somewhat different results and relations, but all of the following conclusions will generally hold true (exceptions are always possible).
When assessing transcoding performance results, they often appear odd and unexpected. It is important to understand that:
You may see a max speed of 7.6 for decoding a stream, but it may still be possible for the GPU to process a second stream at the same speed, without affecting the speed of the first one
It's more useful to think in terms of "faster, slower, much faster, much slower", which is a lot more relatable than factors
This is why I said that you cannot reasonably talk in factors when trying to make comparisons in this area.
Observations
That's essentially what I said, and you can easily see that it's true when comparing the last two lines for Intel and Nvidia
Yet, my statements aren't based on some synthetic test results. Up until 2 or 3 years ago, we regularly received user reports about stuttering audio, where it turned out to be caused by transcodes with subtitle burn-in. After half a year of research and testing, we made the change in the stable release to use sw decoding instead of hw decoding in those specific cases.
After this change, we have rarely seen any such report. Many users run our server on NAS devices with older, non-high-end CPUs, and this change helped lift the transcoding speed over the critical bar (1.0x; anything below cannot play fluidly) for many of them.
Final Notes
Nr 1
If it was in the context of FFmpegInteropX playback that you got that "10x faster" impression, then you might have failed to consider the following:
When comparing decoding speed while switching FFmpegInteropX between hw decoding and ffmpeg sw decoding, you are not comparing "sw decoding" to "hw decoding". Instead you are actually comparing "sw decoding + hw uploading" to "hw decoding without data transfer".
Nr 2
Yes, that's why I wanted to tell you about it.
There's not much point in telling you things you already know 😆
But it's not about "I know something that you don't know" - it's about knowledge transfer. Since FFmpegInteropX is driving our Xbox app now, we have a natural interest in making it even better.