avfilter/tonemapx: add simd optimized tonemapx #407
Got 74fps with the tonemapx.avx on 5950x on Windows. I'm quite pleased with the performance. Further adding yuv420p10 input and yuv420p output support should squeeze out more FPS, because IIRC yuv420p1x to p01x conversion in FFmpeg is written in C, and other encoders such as libx265 and svt-av1 do not support nv12 input. |
Probably not that much on bandwidth-rich systems, as this conversion is mainly a memory operation rather than a compute operation. It could be useful on bandwidth-constrained platforms: if we process yuv420p directly, the frame copy is further reduced and our avx implementation could see more improvements. |
We really can't expect all our users to have an i9/R9/Apple Silicon Max. FWIW, in this simple test it already clearly hurts performance, although it wouldn't be so noticeable in real-world testing because this version is already much better than the zscale+tonemap.orig.
|
Do you have time for a reference c implementation of yuv420p? I can port SIMD version to it. |
Not yet. But here are some examples. Just convert it to use in a for loop. The difference between nv12/p01x and yuv420p/yuv420p1x is only whether U and V are interleaved. jellyfin-ffmpeg/debian/patches/0005-add-cuda-tonemap-impl.patch Lines 745 to 883 in 44e1bf0
|
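To illustrate the point about interleaving, here is a minimal sketch of how one chroma sample would be fetched in each 8-bit layout. The helper names `load_uv_nv12` and `load_uv_yuv420p` are hypothetical and do not come from the patch; strides are assumed to be in bytes, as with FFmpeg's `linesize`.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helpers, for illustration only (not from the patch).
 * (cx, cy) is a position in the subsampled chroma grid. */

/* nv12: a single chroma plane with U and V interleaved (UVUV...). */
static void load_uv_nv12(const uint8_t *uv, ptrdiff_t uv_stride,
                         int cx, int cy, uint8_t *u, uint8_t *v)
{
    const uint8_t *row = uv + cy * uv_stride;
    *u = row[2 * cx];      /* even bytes hold U */
    *v = row[2 * cx + 1];  /* odd bytes hold V  */
}

/* yuv420p: U and V live in two separate planes. */
static void load_uv_yuv420p(const uint8_t *up, ptrdiff_t u_stride,
                            const uint8_t *vp, ptrdiff_t v_stride,
                            int cx, int cy, uint8_t *u, uint8_t *v)
{
    *u = up[cy * u_stride + cx];
    *v = vp[cy * v_stride + cx];
}
```

The luma plane is addressed the same way in both layouts, so only the chroma load and store of a per-pixel loop should need to change. The 10-bit variants (p01x, yuv420p1x) would follow the same pattern with 16-bit samples.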
gnu.org is down for 9 hours now and failing the pipeline |
ffmpeg crashes after inserting a downscaling filter and running for a while. For example, these resolutions. It can't be reproduced in tonemapx.c. I guess it has something to do with the edges of the image that cannot be accelerated by SIMD.
|
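For context on the edge guess above, this is a generic sketch of how a row kernel usually handles widths that are not a multiple of the vector width: full blocks go through the vector path and the remainder goes through a scalar tail. It is plain C, with an inner loop standing in for one SIMD operation. `BLOCK`, `clip10` and `process_row` are hypothetical names, not code from tonemapx.

```c
#include <stdint.h>

#define BLOCK 8  /* assumed vector width in pixels */

/* Clamp to the 10-bit range [0, 1023]. */
static uint16_t clip10(int v)
{
    return v < 0 ? 0 : v > 1023 ? 1023 : (uint16_t)v;
}

/* Hypothetical row kernel: add an offset with 10-bit clipping. */
static void process_row(uint16_t *dst, const uint16_t *src,
                        int width, int offset)
{
    int x = 0;

    /* Main loop: full blocks only, so no access goes past `width`. */
    for (; x + BLOCK <= width; x += BLOCK)
        for (int i = 0; i < BLOCK; i++)  /* stands in for one SIMD op */
            dst[x + i] = clip10(src[x + i] + offset);

    /* Scalar tail: the remaining width % BLOCK pixels at the right edge. */
    for (; x < width; x++)
        dst[x] = clip10(src[x] + offset);
}
```

If the main loop instead rounded the width up to a full block, the last store of each row would run past the row end, which is the kind of out-of-bounds access the comment above suspects.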
How long did you let it run? I cannot reproduce on my machine. |
interesting, only occurs with windows builds |
It happens randomly. Not sure if this has to do with gcc options and versions. https://github.com/jellyfin/jellyfin-ffmpeg/blob/jellyfin/Dockerfile.win64.in#L20 Maybe try another builder? |
The issue occurs frequently on Windows but never on Linux. Windows reports an access violation, which doesn't make sense to me if all memory accesses are to legal locations on Linux. |
Does the compiler emit the same assembly code from the intrinsics? |
|
I'll take a look at it later. You can drop the vulkan related scripts first. |
The problem now looks super stupid to me. I commented out both the reads from and the writes to the framebuffer, and it still reports an access violation. |
Well I found the cause. GCC generated code sequences from |
I guess this uncertainty from the compiler is one of the reasons why upstream FFmpeg only accepts assembly code. |
Well, it turns out that it is not that simple. Now I really believe it is due to gcc+windows. I can work around this issue by reducing the To make it even worse, it seems that capping the concurrency only in tonemapx is not enough: it has to be a global option, which means every filter in the chain has to be concurrency-capped. With a global concurrency of 24 and only 1 job spawned for tonemapx, the access violation still happens after a few moments. |
Take it easy. We still have several months to investigate before JF 10.10. Could it be related to LTO and GCC auto-vectorization? |
What works is writing the whole block back to memory and reading from there (no usage of I did, however, observe that zscale can support a huge amount of concurrency (1920x1080 works with 24 and 1928x1080 works with 16) after the Edit: it seems like zscale works even without modification? At least it works with |
Have you seen these? MinGW-w64 seems to be quite fragile in handling AVX. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54412 edit: https://github.com/search?q=use-unaligned-vector-move&type=code |
Well, it is worse than I thought. The problem now is that the sse version has the same issue and I have no idea why. It does support higher concurrency, but with too many threads the access violation eventually shows up even with only sse. Still, zscale works better. |
Perhaps you should try compiling with MSVC to see if this is just another mingw gcc issue. |
Will figure it out later. |
The access violation with many threads happens even with msvc (btw our codebase is not really compatible with msvc, and I had to comment out a lot of things to make it compile). Perhaps I have to try that project generator to debug with Visual Studio and see what happens... |
I "fixed" the access violation on Windows by refactoring the memory store logic and using more stable range clipping. The new memory store logic improved performance for all platforms: i9-12900 with AVX2: 77fps->87fps (Linux) Now AVX is usable on Windows with compiler flag |
A question about the yuv420p implementation: if most software decoders and encoders are not expecting p01x frames, isn't it safe to just drop support for such frames and support yuv420p exclusively? For what use case is p01x preferred? |
There are also some warnings that can be eliminated before merging. There should be more in the GH actions log. For example:
|
It seems that ffmpeg has a edit:
|
I couldn't care less about those ancient compilers. I even dropped debian buster support due to its ancient gcc. Extend the macro checking for configure flag and disable with |
Indeed, it is not practical to test all functions. How about checking with compiler version? When the gcc/clang version is lower than required, disable SIMD. Other compilers can be reasonably ignored. |
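A compiler-version gate along those lines could look roughly like the sketch below. The macro names and the minimum versions are placeholders, since the thread does not state the real thresholds; `__GNUC__` and `__clang_major__` are the standard predefined macros. Clang is tested first because it also defines `__GNUC__` for compatibility.

```c
/* Placeholder thresholds: the actual minimum versions are not given
 * in this thread, so treat these numbers as assumptions. */
#define MIN_GCC_MAJOR   10
#define MIN_CLANG_MAJOR 11

#if defined(__clang__)
   /* Check clang first: it also defines __GNUC__. */
#  define TONEMAPX_SIMD_OK (__clang_major__ >= MIN_CLANG_MAJOR)
#elif defined(__GNUC__)
#  define TONEMAPX_SIMD_OK (__GNUC__ >= MIN_GCC_MAJOR)
#else
   /* Other compilers: fall back to the C implementation. */
#  define TONEMAPX_SIMD_OK 0
#endif

/* Hypothetical helper: 1 if the SIMD paths may be built, else 0. */
static int tonemapx_simd_enabled(void)
{
    return TONEMAPX_SIMD_OK;
}
```

The intrinsics code would then sit behind `#if TONEMAPX_SIMD_OK`, so an older toolchain still builds the filter with the C path only.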
Done |
I made a windows clang build for testing: https://github.com/gnattu/jellyfin-ffmpeg/releases/tag/win64-clang This performs faster than the gcc version(at least on my own machine), but more testing is needed. |
0080-add-simd-optimized-tonemapx-filter.patch I made some minor improvements:
Two Q:
|
This seems unnecessary because you have to test compiler version after all. GCC has
It should compile, though I cannot guarantee if it really works.
This should work on all modern compilers, not sure for ancient gcc |
At least this ensures that the current gcc/clang environment is capable of handling some kind of intrinsics before actually building ffmpeg, no? If As for whether the contents of the headers are out of date, I think that is beyond our scope. If it's really necessary, we can add them manually, just like you did for
20.04 will reach EOL in April 2025. We can try enabling it and see.
I think we don't care about those ancient compilers. |
Co-authored-by: Nyanmisaka <nst799610810@gmail.com>
This includes NEON for ARMv8, SSE for x86-64-v2 and AVX+FMA for x86-64-v3
Test result with 4K HEVC 10bit HLG input, encoding with libx264 veryfast using bt2390:
Intel Core i9-12900:
tonemapx.c: 57fps
tonemapx.sse: 74fps
tonemapx.avx: 77fps
Apple M1 Max:
tonemapx.c: 43fps
tonemapx.neon: 57fps
For comparison, original zscale+tonemap simd results:
Intel Core i9-12900:
tonemap.avx: 40fps
tonemap.sse: 40fps
tonemap.c: 32fps
Apple M1 Max:
tonemap.neon: 44fps
tonemap.c: 35fps
The original implementation is so memory heavy that dual-channel desktop CPUs easily become memory bound, due to the intermediate RGBF32 framebuffer shared with zscale. Tonemapx lowers the bandwidth requirement, which brings a significant performance gain to bandwidth-limited platforms. Even on the bandwidth-rich M1 Max it still provides a significant performance boost due to a better cache hit rate.
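As a rough back-of-the-envelope illustration of the bandwidth argument, the sketch below compares the size of one frame in the two representations. It assumes "4K" means 3840x2160, that the intermediate buffer is planar float32 RGB (3 planes, 4 bytes per sample), and that the 10-bit frame is p010. The function names are hypothetical and nothing here was measured.

```c
#include <stdint.h>

/* One planar float32 RGB frame: 3 planes, 4 bytes per sample. */
static uint64_t rgbf32_frame_bytes(uint64_t w, uint64_t h)
{
    return w * h * 3 * 4;
}

/* One p010 frame: full-resolution luma plus one half-resolution
 * interleaved UV plane, 2 bytes per sample. */
static uint64_t p010_frame_bytes(uint64_t w, uint64_t h)
{
    uint64_t luma   = w * h * 2;
    uint64_t chroma = (w / 2) * (h / 2) * 2 * 2;
    return luma + chroma;
}
```

Under these assumptions, `rgbf32_frame_bytes(3840, 2160)` is 99,532,800 bytes and `p010_frame_bytes(3840, 2160)` is 24,883,200 bytes. The float intermediate is therefore about four times the size of the frame it was derived from, and it has to be written once and read back once for every frame.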
Changes
Issues
Replaces #401