avfilter/tonemapx: add simd optimized tonemapx #407

gnattu · 2024-06-25T13:07:29Z

This includes NEON for ARMv8, SSE for x86-64-v2 and AVX+FMA for x86-64-v3

Test result with 4K HEVC 10bit HLG input, encoding with libx264 veryfast using bt2390:

Intel Core i9-12900:

tonemapx.c: 57fps
tonemapx.sse: 74fps
tonemapx.avx: 77fps

Apple M1 Max:

tonemapx.c:43fps
tonemapx.neon: 57fps

For comparison, original zscale+tonemap simd results:

Intel Core i9-12900:

tonemap.avx: 40fps
tonemap.sse: 40fps
tonemap.c: 32fps

Apple M1 Max:

tonemap.neon: 44fps
tonemap.c: 35fps

The original implementation is too memory heavy that dual-channel desktop CPUs are easily memory bounded due to the intermediate RGBF32 framebuffer sharing with zscale. Tonemapx lowered the the bandwidth requirement which brings significant performance gain to bandwidth limited platforms. Even for bandwidth-rich M1 Max it still provides significant performance boost due to better cache hitrate.

Changes

Fixed an issue that the input parameter is not really initialized
Added neon, avx and sse code path
Use function pointer instead of branching in each thread for tonemap function selection

Issues

Replaces #401

nyanmisaka · 2024-06-25T15:22:33Z

Got 74fps with the tonemapx.avx on 5950x on Windows. I'm quite pleased with the performance.

Further adding yuv420p10 input and yuv420p output support should squeeze out more FPS, because IIRC yuv420p1x to p01x conversion in FFmpeg is written in C, and other encoders such as libx265 and svt-av1 do not support nv12 input.

gnattu · 2024-06-25T15:28:29Z

Further adding yuv420p10 input and yuv420p output support should squeeze out more FPS, because IIRC yuv420p1x to p01x conversion in FFmpeg is written in C, and other encoders such as libx265 and svt-av1 do not support nv12 input.

Probably not that much for bandwidth rich systems as this conversion is mainly memory operation not compute operation, could be useful for bandwidth constrained platforms if we process yuv420p directly as the frame copy is further reduced and our avx implementation could see more improvements.

nyanmisaka · 2024-06-25T15:51:48Z

Further adding yuv420p10 input and yuv420p output support should squeeze out more FPS, because IIRC yuv420p1x to p01x conversion in FFmpeg is written in C, and other encoders such as libx265 and svt-av1 do not support nv12 input.

Probably not that much for bandwidth rich systems as this conversion is mainly memory operation not compute operation, could be useful for bandwidth constrained platforms if we process yuv420p directly as the frame copy is further reduced and our avx implementation could see more improvements.

We really can't expect all our users to have i9/R9/Apple Silicon Max. FWIW in this simple test it already clearly hurts performance, although it wouldn't be so noticeable in a real world testing because this version is already much better than the zscale+tonemap.orig.

./ffmpeg -threads 16 -i 4K_HLG.MP4 -an -sn -vf format=yuv420p10 -vframes 1000 -f null -

frame= 1000 fps=251 q=-0.0 Lsize=N/A time=00:00:16.71 bitrate=N/A speed=4.19x

./ffmpeg -threads 16 -i 4K_HLG.MP4 -an -sn -vf format=yuv420p10,format=p010 -vframes 1000 -f null -

frame= 1000 fps=168 q=-0.0 Lsize=N/A time=00:00:16.71 bitrate=N/A speed=2.81x

gnattu · 2024-06-25T16:13:12Z

We really can't expect all our users to have i9/R9/Apple Silicon Max. FWIW in this simple test it already clearly hurts performance, although it wouldn't be so noticeable in a real world testing because this version is already much better than the zscale+tonemap.orig.
./ffmpeg -threads 16 -i 4K_HLG.MP4 -an -sn -vf format=yuv420p10 -vframes 1000 -f null -
frame= 1000 fps=251 q=-0.0 Lsize=N/A time=00:00:16.71 bitrate=N/A speed=4.19x
./ffmpeg -threads 16 -i 4K_HLG.MP4 -an -sn -vf format=yuv420p10,format=p010 -vframes 1000 -f null -
frame= 1000 fps=168 q=-0.0 Lsize=N/A time=00:00:16.71 bitrate=N/A speed=2.81x

Do you have time for a reference c implementation of yuv420p? I can port SIMD version to it.

nyanmisaka · 2024-06-25T16:43:05Z

We really can't expect all our users to have i9/R9/Apple Silicon Max. FWIW in this simple test it already clearly hurts performance, although it wouldn't be so noticeable in a real world testing because this version is already much better than the zscale+tonemap.orig.
./ffmpeg -threads 16 -i 4K_HLG.MP4 -an -sn -vf format=yuv420p10 -vframes 1000 -f null -
frame= 1000 fps=251 q=-0.0 Lsize=N/A time=00:00:16.71 bitrate=N/A speed=4.19x
./ffmpeg -threads 16 -i 4K_HLG.MP4 -an -sn -vf format=yuv420p10,format=p010 -vframes 1000 -f null -
frame= 1000 fps=168 q=-0.0 Lsize=N/A time=00:00:16.71 bitrate=N/A speed=2.81x
Do you have time for a reference c implementation of yuv420p? I can port SIMD version to it.

Not yet. But here are some examples. Just convert it to use in a for loop. The difference between nv12/p01x and yuv420p/yuv420p1x is only whether U and V are interleaved.

jellyfin-ffmpeg/debian/patches/0005-add-cuda-tonemap-impl.patch

Lines 745 to 883 in 44e1bf0

    
           +static __inline__ __device__ T read_sample(const FFCUDAFrame& frame, int x, int y) 
        
           +{ 
        
           +    T* ptr = (T*)(frame.data[p] + (y * frame.linesize[p])); 
        
           +    return ptr[x]; 
        
           +} 
        
           + 
        
           +// Per-format read functions 
        
           +static __inline__ __device__ ushort3 read_p016(const FFCUDAFrame& frame, int x, int y) 
        
           +{ 
        
           +    return make_ushort3(read_sample<unsigned short, 0>(frame, x,          y), 
        
           +                        read_sample<unsigned short, 1>(frame, (x & ~1),     y / 2), 
        
           +                        read_sample<unsigned short, 1>(frame, (x & ~1) + 1, y / 2)); 
        
           +} 
        
           + 
        
           +static __inline__ __device__ ushort3 read_p010(const FFCUDAFrame& frame, int x, int y) 
        
           +{ 
        
           +    ushort3 val = read_p016(frame, x, y); 
        
           +    return make_ushort3(val.x >> 6, 
        
           +                        val.y >> 6, 
        
           +                        val.z >> 6); 
        
           +} 
        
           + 
        
           +static __inline__ __device__ ushort3 read_yuv420p16(const FFCUDAFrame& frame, int x, int y) 
        
           +{ 
        
           +    return make_ushort3(read_sample<unsigned short, 0>(frame, x,      y), 
        
           +                        read_sample<unsigned short, 1>(frame, x / 2, y / 2), 
        
           +                        read_sample<unsigned short, 2>(frame, x / 2, y / 2)); 
        
           +} 
        
           + 
        
           +static __inline__ __device__ ushort3 read_yuv420p10(const FFCUDAFrame& frame, int x, int y) 
        
           +{ 
        
           +    ushort3 val = read_yuv420p16(frame, x, y); 
        
           +    return make_ushort3(val.x >> 6, 
        
           +                        val.y >> 6, 
        
           +                        val.z >> 6); 
        
           +} 
        
           + 
        
           +// Generic read functions 
        
           +static __inline__ __device__ ushort3 read_px(const FFCUDAFrame& frame, int x, int y) 
        
           +{ 
        
           +    if (fmt_src == AV_PIX_FMT_P010) 
        
           +        return read_p010(frame, x, y); 
        
           +    else if (fmt_src == AV_PIX_FMT_P016) 
        
           +        return read_p016(frame, x, y); 
        
           +    else 
        
           +        return make_ushort3(0, 0, 0); 
        
           +} 
        
           + 
        
           +static __inline__ __device__ float sample_to_float(unsigned short i) 
        
           +{ 
        
           +    return (float)i / ((1 << depth_src) - 1); 
        
           +} 
        
           + 
        
           +static __inline__ __device__ float3 pixel_to_float3(ushort3 flt) 
        
           +{ 
        
           +    return make_float3(sample_to_float(flt.x), 
        
           +                       sample_to_float(flt.y), 
        
           +                       sample_to_float(flt.z)); 
        
           +} 
        
           + 
        
           +static __inline__ __device__ float3 read_px_flt(const FFCUDAFrame& frame, int x, int y) 
        
           +{ 
        
           +    return pixel_to_float3(read_px(frame, x, y)); 
        
           +} 
        
           + 
        
           +// Single-sample write function 
        
           +template<int p, class T> 
        
           +static __inline__ __device__ void write_sample(const FFCUDAFrame& frame, int x, int y, T sample) 
        
           +{ 
        
           +    T* ptr = (T*)(frame.data[p] + (y * frame.linesize[p])); 
        
           +    ptr[x] = sample; 
        
           +} 
        
           + 
        
           +// Per-format write functions 
        
           +static __inline__ __device__ void write_nv12_2x2(const FFCUDAFrame& frame, int x, int y, ushort3 a, ushort3 b, ushort3 c, ushort3 d, ushort3 chroma) 
        
           +{ 
        
           +    write_sample<0>(frame, x,     y,     (unsigned char)a.x); 
        
           +    write_sample<0>(frame, x + 1, y,     (unsigned char)b.x); 
        
           +    write_sample<0>(frame, x,     y + 1, (unsigned char)c.x); 
        
           +    write_sample<0>(frame, x + 1, y + 1, (unsigned char)d.x); 
        
           + 
        
           +    write_sample<1>(frame, (x & ~1),     y / 2, (unsigned char)chroma.y); 
        
           +    write_sample<1>(frame, (x & ~1) + 1, y / 2, (unsigned char)chroma.z); 
        
           +} 
        
           + 
        
           +static __inline__ __device__ void write_yuv420p_2x2(const FFCUDAFrame& frame, int x, int y, ushort3 a, ushort3 b, ushort3 c, ushort3 d, ushort3 chroma) 
        
           +{ 
        
           +    write_sample<0>(frame, x,     y,     (unsigned char)a.x); 
        
           +    write_sample<0>(frame, x + 1, y,     (unsigned char)b.x); 
        
           +    write_sample<0>(frame, x,     y + 1, (unsigned char)c.x); 
        
           +    write_sample<0>(frame, x + 1, y + 1, (unsigned char)d.x); 
        
           + 
        
           +    write_sample<1>(frame, x / 2, y / 2, (unsigned char)chroma.y); 
        
           +    write_sample<2>(frame, x / 2, y / 2, (unsigned char)chroma.z); 
        
           +} 
        
           + 
        
           +static __inline__ __device__ void write_p016_2x2(const FFCUDAFrame& frame, int x, int y, ushort3 a, ushort3 b, ushort3 c, ushort3 d, ushort3 chroma) 
        
           +{ 
        
           +    write_sample<0>(frame, x,     y,     (unsigned short)a.x); 
        
           +    write_sample<0>(frame, x + 1, y,     (unsigned short)b.x); 
        
           +    write_sample<0>(frame, x,     y + 1, (unsigned short)c.x); 
        
           +    write_sample<0>(frame, x + 1, y + 1, (unsigned short)d.x); 
        
           + 
        
           +    write_sample<1>(frame, (x & ~1),     y / 2, (unsigned short)chroma.y); 
        
           +    write_sample<1>(frame, (x & ~1) + 1, y / 2, (unsigned short)chroma.z); 
        
           +} 
        
           + 
        
           +static __inline__ __device__ void write_p010_2x2(const FFCUDAFrame& frame, int x, int y, ushort3 a, ushort3 b, ushort3 c, ushort3 d, ushort3 chroma) 
        
           +{ 
        
           +    write_sample<0>(frame, x,     y,     (unsigned short)(a.x << 6)); 
        
           +    write_sample<0>(frame, x + 1, y,     (unsigned short)(b.x << 6)); 
        
           +    write_sample<0>(frame, x,     y + 1, (unsigned short)(c.x << 6)); 
        
           +    write_sample<0>(frame, x + 1, y + 1, (unsigned short)(d.x << 6)); 
        
           + 
        
           +    write_sample<1>(frame, (x & ~1),     y / 2, (unsigned short)(chroma.y << 6)); 
        
           +    write_sample<1>(frame, (x & ~1) + 1, y / 2, (unsigned short)(chroma.z << 6)); 
        
           +} 
        
           + 
        
           +static __inline__ __device__ void write_yuv420p16_2x2(const FFCUDAFrame& frame, int x, int y, ushort3 a, ushort3 b, ushort3 c, ushort3 d, ushort3 chroma) 
        
           +{ 
        
           +    write_sample<0>(frame, x,     y,     (unsigned short)a.x); 
        
           +    write_sample<0>(frame, x + 1, y,     (unsigned short)b.x); 
        
           +    write_sample<0>(frame, x,     y + 1, (unsigned short)c.x); 
        
           +    write_sample<0>(frame, x + 1, y + 1, (unsigned short)d.x); 
        
           + 
        
           +    write_sample<1>(frame, x / 2, y / 2, (unsigned short)chroma.y); 
        
           +    write_sample<2>(frame, x / 2, y / 2, (unsigned short)chroma.z); 
        
           +} 
        
           + 
        
           +static __inline__ __device__ void write_yuv420p10_2x2(const FFCUDAFrame& frame, int x, int y, ushort3 a, ushort3 b, ushort3 c, ushort3 d, ushort3 chroma) 
        
           +{ 
        
           +    write_sample<0>(frame, x,     y,     (unsigned short)(a.x << 6)); 
        
           +    write_sample<0>(frame, x + 1, y,     (unsigned short)(b.x << 6)); 
        
           +    write_sample<0>(frame, x,     y + 1, (unsigned short)(c.x << 6)); 
        
           +    write_sample<0>(frame, x + 1, y + 1, (unsigned short)(d.x << 6)); 
        
           + 
        
           +    write_sample<1>(frame, x / 2, y / 2, (unsigned short)(chroma.y << 6)); 
        
           +    write_sample<2>(frame, x / 2, y / 2, (unsigned short)(chroma.z << 6)); 
        
           +}

gnattu · 2024-06-26T02:03:42Z

gnu.org is down for 9 hours now and failing the pipeline

nyanmisaka · 2024-06-26T02:49:46Z

ffmpeg crashes after inserting a downscaling filter and running for a while.
scale=s=1920x1080:flags=fast_bilinear,tonemapx=...

For example, these resolutions.
1280x720
1920x1080
1920x1088
2560x1440

It can't be reproduced in tonemapx.c. I guess it has something to do with the edges of the image that cannot be accelerated by SIMD.

- <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
- <System>
  <Provider Name="Application Error" /> 
  <EventID Qualifiers="0">1000</EventID> 
  <Version>0</Version> 
  <Level>2</Level> 
  <Task>100</Task> 
  <Opcode>0</Opcode> 
  <Keywords>0x80000000000000</Keywords> 
  <TimeCreated SystemTime="2024-06-26T02:48:40.9359314Z" /> 
  <EventRecordID>216441</EventRecordID> 
  <Correlation /> 
  <Execution ProcessID="0" ThreadID="0" /> 
  <Channel>Application</Channel> 
  <Computer>pc</Computer> 
  <Security /> 
  </System>
- <EventData>
  <Data>ffmpeg.exe</Data> 
  <Data>0.0.0.0</Data> 
  <Data>667ad11b</Data> 
  <Data>avfilter-9.dll</Data> 
  <Data>9.3.100.0</Data> 
  <Data>667ad11b</Data> 
  <Data>c0000005</Data> 
  <Data>000000000032dab4</Data> 
  <Data>3818</Data> 
  <Data>01dac77359173dbe</Data> 
  <Data>C:\Users\usr\Desktop\jellyfin-ffmpeg_6.0.1-7-portable_win64\ffmpeg.exe</Data> 
  <Data>C:\Users\usr\Desktop\jellyfin-ffmpeg_6.0.1-7-portable_win64\avfilter-9.dll</Data> 
  <Data>7b50b856-5989-4ef4-ad9f-03ded7842dfe</Data> 
  <Data /> 
  <Data /> 
  </EventData>
  </Event>

gnattu · 2024-06-26T03:12:14Z

How long did you let it run? I cannot reproduce on my machine.

gnattu · 2024-06-26T03:16:14Z

interesting, only occurs with windows builds

nyanmisaka · 2024-06-26T03:40:18Z

It happens randomly. Not sure if this has to do with gcc options and versions.

https://github.com/jellyfin/jellyfin-ffmpeg/blob/jellyfin/Dockerfile.win64.in#L20
https://github.com/jellyfin/jellyfin-ffmpeg/blob/jellyfin/builder/images/base-win64/Dockerfile#L43

Maybe try another builder? ./builder/makeimage.sh win64 gpl && ./builder/build.sh win64 gpl

gnattu · 2024-06-26T03:45:17Z

It happens randomly.

The issue occurs frequently on Windows but never on Linux. Windows reports an access violation, which doesn't make sense to me if all memory accesses are to legal locations on Linux.

nyanmisaka · 2024-06-26T03:50:18Z

It happens randomly.

The issue occurs frequently on Windows but never on Linux. Windows reports an access violation, which doesn't make sense to me if all memory accesses are to legal locations on Linux.

Does the compiler emit the same assembly code from the intrinsics?

gnattu · 2024-06-26T03:51:27Z

./builder/makeimage.sh win64 gpl gives me 18.92 ../src/meson.build:28:12: ERROR: Program 'llvm-dlltool dlltool' not found or not executable when building libplacebo...

nyanmisaka · 2024-06-26T03:54:50Z

./builder/makeimage.sh win64 gpl gives me 18.92 ../src/meson.build:28:12: ERROR: Program 'llvm-dlltool dlltool' not found or not executable when building libplacebo...

I'll take a look at it later. You can drop the vulkan related scripts first.

gnattu · 2024-06-26T04:46:21Z

The problem is now super stupid to me. I commented out both read and write to/from the framebuffer operation and it is still telling me access violation.

gnattu · 2024-06-26T06:52:35Z

Well I found the cause. GCC generated code sequences from _mm256_extract_epi32 and _mm_extract_epi32 intrinsics includes optimizations that do not play well with windows and will cause access violation. By storing the whole register to memory and read from there workarounds the access violation on windows.

nyanmisaka · 2024-06-26T07:03:20Z

Well I found the cause. GCC generated code sequences from _mm256_extract_epi32 and _mm_extract_epi32 intrinsics includes optimizations that do not play well with windows and will cause access violation. By storing the whole register to memory and read from there workarounds the access violation on windows.

I guess this uncertainty from the compiler is one of the reasons why upstream FFmpeg only accepts assembly code.

gnattu · 2024-06-26T15:07:01Z

Well it turns out that it is not that simple.

Now I really believe it is due to gcc+windows. I can workaround this issue by reducing the -filter_threads to a lower value like 8. When this value is set to something like 16 or higher, it is super easy to trigger the access violation again.

To make it even worse, it seems like only cap the concurrency in tonemapx is not enough and this has to be a global option which means all filters in the chain has to be concurrency capped.

With a global concurrency of 24 and we only spawn 1 job for tonemapx, the access violation still happens after a few moments.

nyanmisaka · 2024-06-26T15:22:27Z

Well it turns out that it is not that simple.

Now I really believe it is due to gcc+windows. I can workaround this issue by reducing the -filter_threads to a lower value like 8. When this value is set to something like 16 or higher, it is super easy to trigger the access violation again.

To make it even worse, it seems like only cap the concurrency in tonemapx is not enough and this has to be a global option which means all filters in the chain has to be concurrency capped.

With a global concurrency of 24 and we only spawn 1 job for tonemapx, the access violation still happens after a few moments.

Take it easy. We still have several months to investigate before JF 10.10.

Could it be related to LTO and GCC auto-vectorization?
https://github.com/jellyfin/jellyfin-ffmpeg/blob/jellyfin/debian/patches/0035-enable-gcc-vectorization-and-lto-auto.patch

gnattu · 2024-06-26T15:55:09Z

__attribute__((optimize("-fno-tree-vectorize"))) on simd functions made little to no difference

What works is that by implementing the write whole block back to memory and read from there(no usage of _mm256_extract_epi32 ) plus a global threads cap of 8 and a normal video resolution (like 1920x1080), and an abnormal output of 1928x1080 requires even lower thread limit to 4.

I however observed that zscale can support a huge amount of concurrency (with 1920x1080 works with 24 and 1928x1080 works with 16) after the _mm256_extract_epi32 workarounds being implemented. Maybe ffmpeg's native scaling filter is having some problem?

Edit: it seems like zscale works even without modification? At least it works with -filter_threads 16 and all the resolution combination you mentioned.

nyanmisaka · 2024-06-26T16:31:39Z

Have you seen these? MingGW-W64 seems to be quite fragile in handling AVX.

https://stackoverflow.com/questions/71859992/what-is-causing-this-memory-access-violation-error-0xc0000005-when-using-eigen

https://stackoverflow.com/questions/30928265/mingw64-is-incapable-of-32-byte-stack-alignment-required-for-avx-on-windows-x64

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54412

edit:
If this is the reason, a workaround is to add the compiler option -Wa,-muse-unaligned-vector-move to mingw.

https://github.com/search?q=use-unaligned-vector-move&type=code

gnattu · 2024-06-26T16:58:50Z

Have you seen these? MingGW-W64 seems to be quite fragile in handling AVX.

https://stackoverflow.com/questions/71859992/what-is-causing-this-memory-access-violation-error-0xc0000005-when-using-eigen

https://stackoverflow.com/questions/30928265/mingw64-is-incapable-of-32-byte-stack-alignment-required-for-avx-on-windows-x64

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54412

Well it is worse than I think. The problem now is that the sse version also has such problems and I have no idea why. It does support higher concurrency though, but when the threads is too much the access violation will eventually come even with only sse. Still, zscale works better.

nyanmisaka · 2024-06-26T17:08:02Z

Have you seen these? MingGW-W64 seems to be quite fragile in handling AVX.
https://stackoverflow.com/questions/71859992/what-is-causing-this-memory-access-violation-error-0xc0000005-when-using-eigen
https://stackoverflow.com/questions/30928265/mingw64-is-incapable-of-32-byte-stack-alignment-required-for-avx-on-windows-x64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54412

Well it is worse than I think. The problem now is that the sse version also has such problems and I have no idea why. It does support higher concurrency though, but when the threads is too much the access violation will eventually come even with only sse. Still, zscale works better.

Perhaps you should try compiling with MSVC to see if this is just another mingw gcc issue.
https://github.com/ShiftMediaProject/FFVS-Project-Generator

gnattu · 2024-06-26T17:15:00Z

Perhaps you should try compiling with MSVC to see if this is just another mingw gcc issue. https://github.com/ShiftMediaProject/FFVS-Project-Generator

Will figure it out later. -Wa,-muse-unaligned-vector-move does not help so this is another issue i think. But I still removed avx on windows due to its known bug.

gnattu · 2024-06-26T19:18:10Z

The access violation with many threads even happens with msvc(btw our codebase is not really compatible with msvc and have to comment out a lot of things to make it compile). Perhaps I have to try that project generator to debug with visual studio to see what happens...

gnattu · 2024-06-28T17:36:04Z

I "fixed" the access violation on Windows by refactoring the memory store logic and using more stable range clipping.

The new memory store logic improved performance for all platforms:

i9-12900 with AVX2: 77fps->87fps (Linux)
M1 Max with NEON: 57fps->60fps

Now AVX is usable on Windows with compiler flag -Wa,-muse-unaligned-vector-move. The performance impact of this flag is around 5%. i9-12900 got 82fps on Windows.

gnattu · 2024-06-28T18:17:14Z

A question about the yuv420p implementation: If most of the software decoder and encoders are not expecting p01x frames, isn't it safe to just drop the support for such frames and supports yuv420 exclusively? For what use case the p01x is preferred?

debian/patches/0080-add-tonemapx-filter.patch

nyanmisaka · 2024-06-30T11:35:30Z

There are also some warnings that can be eliminated before merging. There should be more in the GH actions log.

For example:

libavfilter/vf_tonemapx.c: In function ‘tonemap_frame_p016_p010_2_nv12’:
libavfilter/vf_tonemapx.c:495:13: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
  495 |             int r00 = r[0], g00 = g[0], b00 = b[0];
      |             ^~~
libavfilter/vf_tonemapx.c: In function ‘tonemap_frame_420p10_2_420p’:
libavfilter/vf_tonemapx.c:585:13: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
  585 |             int r00 = r[0], g00 = g[0], b00 = b[0];
      |             ^~~
libavfilter/vf_tonemapx.c: In function ‘tonemap_frame_420p10_2_420p10’:
libavfilter/vf_tonemapx.c:675:13: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
  675 |             int r00 = r[0], g00 = g[0], b00 = b[0];
      |             ^~~
libavfilter/vf_tonemapx.c: In function ‘tonemap_frame_p016_p010_2_p016_p010’:
libavfilter/vf_tonemapx.c:767:13: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
  767 |             int r00 = r[0], g00 = g[0], b00 = b[0];
      |             ^~~
libavfilter/vf_tonemapx.c: In function ‘filter_slice_planar10’:
libavfilter/vf_tonemapx.c:851:43: warning: passing argument 1 of ‘s->tonemap_func_planar10’ from incompatible pointer type [-Wincompatible-pointer-types]
  851 |     s->tonemap_func_planar10(out->data[0] + out->linesize[0] * slice_start,
      |                              ~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                           |
      |                                           uint8_t * {aka unsigned char *}
libavfilter/vf_tonemapx.c:851:43: note: expected ‘uint16_t *’ {aka ‘short unsigned int *’} but argument is of type ‘uint8_t *’ {aka ‘unsigned char *’}
libavfilter/vf_tonemapx.c:852:43: warning: passing argument 2 of ‘s->tonemap_func_planar10’ from incompatible pointer type [-Wincompatible-pointer-types]
  852 |                              out->data[1] + out->linesize[1] * AV_CEIL_RSHIFT(slice_start, desc->log2_chroma_h),
libavfilter/vf_tonemapx.c:852:43: note: expected ‘uint16_t *’ {aka ‘short unsigned int *’} but argument is of type ‘uint8_t *’ {aka ‘unsigned char *’}
libavfilter/vf_tonemapx.c:853:43: warning: passing argument 3 of ‘s->tonemap_func_planar10’ from incompatible pointer type [-Wincompatible-pointer-types]
  853 |                              out->data[2] + out->linesize[2] * AV_CEIL_RSHIFT(slice_start, desc->log2_chroma_h),
libavfilter/vf_tonemapx.c:853:43: note: expected ‘uint16_t *’ {aka ‘short unsigned int *’} but argument is of type ‘uint8_t *’ {aka ‘unsigned char *’}
libavfilter/vf_tonemapx.c: In function ‘filter_slice_biplanar10’:
libavfilter/vf_tonemapx.c:870:45: warning: passing argument 1 of ‘s->tonemap_func_biplanar10’ from incompatible pointer type [-Wincompatible-pointer-types]
  870 |     s->tonemap_func_biplanar10(out->data[0] + out->linesize[0] * slice_start,
      |                                ~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                             |
      |                                             uint8_t * {aka unsigned char *}
libavfilter/vf_tonemapx.c:870:45: note: expected ‘uint16_t *’ {aka ‘short unsigned int *’} but argument is of type ‘uint8_t *’ {aka ‘unsigned char *’}
libavfilter/vf_tonemapx.c:871:45: warning: passing argument 2 of ‘s->tonemap_func_biplanar10’ from incompatible pointer type [-Wincompatible-pointer-types]
  871 |                                out->data[1] + out->linesize[1] * AV_CEIL_RSHIFT(slice_start, desc->log2_chroma_h),
libavfilter/vf_tonemapx.c:871:45: note: expected ‘uint16_t *’ {aka ‘short unsigned int *’} but argument is of type ‘uint8_t *’ {aka ‘unsigned char *’}

nyanmisaka · 2024-06-30T14:44:35Z

It seems that ffmpeg has a HAVE_INTRINSICS_NEON generated from configure. Is it necessary to add one for SSE and AVX as well? Just for completeness, in case our downstream users (e.g. SynoCommunity) are using legacy compilers and resulting in breakages.

edit:

Disable corresponding SIMD code when header files or funcs are not available
When disabling SIMD using ffmpeg flag, the corresponding code should be excluded --disable-neon
Runtime SIMD check

gnattu · 2024-06-30T15:08:05Z

It seems that ffmpeg has a HAVE_INTRINSICS_NEON generated from configure. Is it necessary to add one for SSE and AVX as well? Just for completeness, in case our downstream users (e.g. SynoCommunity) are using legacy compilers and resulting in breakages.

edit:

Disable corresponding SIMD code when header files or funcs are not available

When disabling SIMD using ffmpeg flag, the corresponding code should be excluded --disable-neon

Runtime SIMD check

I couldn't care less about those ancient compilers. I even dropped debian buster support due to its ancient gcc. Extend the macro checking for configure flag and disable with --disable-neon is probably all I'm going to do. Check for header existence is out of scope at least to me, and checking for function availability is almost impossible unless you write a super long configure script to test compile the whole file and then define the disable intrinsics macro, but I'm not going to do that

nyanmisaka · 2024-06-30T15:15:26Z

It seems that ffmpeg has a HAVE_INTRINSICS_NEON generated from configure. Is it necessary to add one for SSE and AVX as well? Just for completeness, in case our downstream users (e.g. SynoCommunity) are using legacy compilers and resulting in breakages.
edit:

Disable corresponding SIMD code when header files or funcs are not available

When disabling SIMD using ffmpeg flag, the corresponding code should be excluded --disable-neon

Runtime SIMD check

I couldn't care less about those ancient compilers. I even dropped debian buster support due to its ancient gcc. Extend the macro checking for configure flag and disable with --disable-neon is probably all I'm going to do. Check for header existence is out of scope at least to me, and checking for function availability is almost impossible unless you write a super long configure script to test compile the whole file and then define the disable intrinsics macro, but I'm not going to do that

Indeed, it is not practical to test all functions. How about checking with compiler version? When the gcc/clang version is lower than required, disable SIMD. Other compilers can be reasonably ignored.

gnattu · 2024-06-30T16:41:38Z

Indeed, it is not practical to test all functions. How about checking with compiler version? When the gcc/clang version is lower than required, disable SIMD. Other compilers can be reasonably ignored.

Done

docker-build-win64.sh

debian/patches/0080-add-tonemapx-filter.patch

gnattu · 2024-07-01T13:53:50Z

I made a windows clang build for testing: https://github.com/gnattu/jellyfin-ffmpeg/releases/tag/win64-clang

This performs faster than the gcc version(at least on my own machine), but more testing is needed.

nyanmisaka · 2024-07-02T15:57:16Z

0080-add-simd-optimized-tonemapx-filter.patch

I made some minor improvements:

Briefly test immtrin.h availability instead of relying entirely on compiler version
Comment out intrin code when disabling intrinsics, to prevent them from still being compiled and possibly throwing errors
Fix warning: unused variable 'cpu_flags' when intrinsics is completely disabled

Two Q:

Any chance to re-enable intrinsics for gcc 9? Our focal/20.04 build still uses it.
Seems that immintrin.h already includes emmintrin.h + smmintrin.h, maybe just including immintrin.h is enough?

gnattu · 2024-07-02T18:36:27Z

Briefly test immtrin.h availability instead of relying entirely on compiler version

This seems unnecessary because you have to test compiler version after all. GCC has immtrin.h since 4.x (forget which precisely), but what is available in that header file varies largely which means intrinsics will be unavailable even with the presence of this file and you cannot test all instructions. Unless there is a stupid compiler that lies about itself and pretend to be a modern gcc/clang and does not implement x86 intrinsics.

Any chance to re-enable intrinsics for gcc 9? Our focal/20.04 build still uses it.

It should compile, though I cannot guarantee if it really works.

Seems that immintrin.h already includes emmintrin.h + smmintrin.h, maybe just including immintrin.h is enough?

This should work on all modern compilers, not sure for ancient gcc

nyanmisaka · 2024-07-03T03:49:42Z

This seems unnecessary because you have to test compiler version after all. GCC has immtrin.h since 4.x (forget which precisely), but what is available in that header file varies largely which means intrinsics will be unavailable even with the presence of this file and you cannot test all instructions. Unless there is a stupid compiler that lies about itself and pretend to be a modern gcc/clang and does not implement x86 intrinsics.

At least this ensures that the current gcc/clang environment is capable of handling some kind of intrinsics before actually building ffmpeg, no?

If immintrin.h or one of its included headers is missing, or the package containing it is not installed or is corrupted, then simply checking the version of gcc/clang is not sufficient.

As for whether the contents of the headers are out of date, I think that is beyond our scope. If it's really necessary, we can add them manually, just like you did for _mm_storeu_si32() in gcc10-.

It should compile, though I cannot guarantee if it really works.

20.04 will reach EOL in April 2025. We can try enabling it and see.

This should work on all modern compilers, not sure for ancient gcc

I think we don't care about those ancient compilers.

Co-authored-by: Nyanmisaka <nst799610810@gmail.com>

resolved

gnattu requested a review from a team June 25, 2024 13:07

gnattu mentioned this pull request Jun 25, 2024

avfilter/tonemap: add simd implementation for sse and neon #401

Closed

gnattu mentioned this pull request Jun 28, 2024

Fix current CI #408

Merged

nyanmisaka reviewed Jun 30, 2024

View reviewed changes

gnattu added 3 commits June 30, 2024 21:42

tonemapx/avfilter: don't redefine white and float eps

c6c64b3

avfilter/tonemapx: use AVFILTER instead of FFMPEG_VF as header prefix

6fb9416

avfilter/tonemapx: code cleanup

8694c6a

avfilter/tonemapx: merge sse and avx multiplication

96654ff

avfilter/tonemapx: add compile time check for simd optimizations

6200f53

avfilter/tonemapx: correctly disable neon at compile time

85dc409

nyanmisaka reviewed Jul 1, 2024

View reviewed changes

docker-build-win64.sh Outdated Show resolved Hide resolved

debian/patches/0080-add-tonemapx-filter.patch Outdated Show resolved Hide resolved

gnattu added 3 commits July 1, 2024 14:24

avfilter/tonemapx: fix indent

5b15f00

ci: move windows gcc workaround to configure script

713e707

avfilter/tonemapx: fix more indent

5c9a090

gnattu and others added 2 commits July 3, 2024 12:03

avfilter/tonemapx: compiler config improvements

9432639

Co-authored-by: Nyanmisaka <nst799610810@gmail.com>

avfilter/tonemapx: allow gcc9 to use intrinsics

5f0da24

nyanmisaka approved these changes Jul 3, 2024

View reviewed changes

avfilter/tonemapx: always target specific extension

10cd1a1

nyanmisaka approved these changes Jul 6, 2024

View reviewed changes

gnattu merged commit b28524f into jellyfin Jul 7, 2024
25 checks passed

gnattu deleted the simd-tonemapx branch July 7, 2024 14:40

This was referenced Jul 7, 2024

[Test only] Test perf of tonemapx filter #402

Closed

Add software tonemap filter support jellyfin/jellyfin#12270

Merged

avfilter/tonemapx: add simd optimized tonemapx #407

avfilter/tonemapx: add simd optimized tonemapx #407

Conversation

gnattu commented Jun 25, 2024

nyanmisaka commented Jun 25, 2024

gnattu commented Jun 25, 2024 • edited Loading

nyanmisaka commented Jun 25, 2024

gnattu commented Jun 25, 2024

nyanmisaka commented Jun 25, 2024

gnattu commented Jun 26, 2024

nyanmisaka commented Jun 26, 2024 • edited Loading

gnattu commented Jun 26, 2024

gnattu commented Jun 26, 2024

nyanmisaka commented Jun 26, 2024

gnattu commented Jun 26, 2024

nyanmisaka commented Jun 26, 2024

gnattu commented Jun 26, 2024

nyanmisaka commented Jun 26, 2024

gnattu commented Jun 26, 2024

gnattu commented Jun 26, 2024

nyanmisaka commented Jun 26, 2024

gnattu commented Jun 26, 2024 • edited Loading

nyanmisaka commented Jun 26, 2024

gnattu commented Jun 26, 2024 • edited Loading

nyanmisaka commented Jun 26, 2024 • edited Loading

gnattu commented Jun 26, 2024 • edited Loading

nyanmisaka commented Jun 26, 2024

gnattu commented Jun 26, 2024

gnattu commented Jun 26, 2024

gnattu commented Jun 28, 2024

gnattu commented Jun 28, 2024

nyanmisaka commented Jun 30, 2024

nyanmisaka commented Jun 30, 2024 • edited Loading

gnattu commented Jun 30, 2024

nyanmisaka commented Jun 30, 2024

gnattu commented Jun 30, 2024

gnattu commented Jul 1, 2024 • edited Loading

nyanmisaka commented Jul 2, 2024

gnattu commented Jul 2, 2024

nyanmisaka commented Jul 3, 2024

gnattu commented Jun 25, 2024 •

edited

Loading

nyanmisaka commented Jun 26, 2024 •

edited

Loading

gnattu commented Jun 26, 2024 •

edited

Loading

gnattu commented Jun 26, 2024 •

edited

Loading

nyanmisaka commented Jun 26, 2024 •

edited

Loading

gnattu commented Jun 26, 2024 •

edited

Loading

nyanmisaka commented Jun 30, 2024 •

edited

Loading

gnattu commented Jul 1, 2024 •

edited

Loading