Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

avfilter/tonemapx: add simd optimized tonemapx #407

Merged
merged 27 commits into from
Jul 7, 2024
Merged

Conversation

gnattu
Copy link
Member

@gnattu gnattu commented Jun 25, 2024

This includes NEON for ARMv8, SSE for x86-64-v2 and AVX+FMA for x86-64-v3

Test result with 4K HEVC 10bit HLG input, encoding with libx264 veryfast using bt2390:

Intel Core i9-12900:

tonemapx.c: 57fps
tonemapx.sse: 74fps
tonemapx.avx: 77fps

Apple M1 Max:

tonemapx.c:43fps
tonemapx.neon: 57fps

For comparison, original zscale+tonemap simd results:

Intel Core i9-12900:

tonemap.avx: 40fps
tonemap.sse: 40fps
tonemap.c: 32fps

Apple M1 Max:

tonemap.neon: 44fps
tonemap.c: 35fps

The original implementation is too memory heavy that dual-channel desktop CPUs are easily memory bounded due to the intermediate RGBF32 framebuffer sharing with zscale. Tonemapx lowered the the bandwidth requirement which brings significant performance gain to bandwidth limited platforms. Even for bandwidth-rich M1 Max it still provides significant performance boost due to better cache hitrate.

Changes

  • Fixed an issue that the input parameter is not really initialized
  • Added neon, avx and sse code path
  • Use function pointer instead of branching in each thread for tonemap function selection

Issues

Replaces #401

@nyanmisaka
Copy link
Member

Got 74fps with the tonemapx.avx on 5950x on Windows. I'm quite pleased with the performance.

Further adding yuv420p10 input and yuv420p output support should squeeze out more FPS, because IIRC yuv420p1x to p01x conversion in FFmpeg is written in C, and other encoders such as libx265 and svt-av1 do not support nv12 input.

@gnattu
Copy link
Member Author

gnattu commented Jun 25, 2024

Further adding yuv420p10 input and yuv420p output support should squeeze out more FPS, because IIRC yuv420p1x to p01x conversion in FFmpeg is written in C, and other encoders such as libx265 and svt-av1 do not support nv12 input.

Probably not that much for bandwidth rich systems as this conversion is mainly memory operation not compute operation, could be useful for bandwidth constrained platforms if we process yuv420p directly as the frame copy is further reduced and our avx implementation could see more improvements.

@nyanmisaka
Copy link
Member

Further adding yuv420p10 input and yuv420p output support should squeeze out more FPS, because IIRC yuv420p1x to p01x conversion in FFmpeg is written in C, and other encoders such as libx265 and svt-av1 do not support nv12 input.

Probably not that much for bandwidth rich systems as this conversion is mainly memory operation not compute operation, could be useful for bandwidth constrained platforms if we process yuv420p directly as the frame copy is further reduced and our avx implementation could see more improvements.

We really can't expect all our users to have i9/R9/Apple Silicon Max. FWIW in this simple test it already clearly hurts performance, although it wouldn't be so noticeable in a real world testing because this version is already much better than the zscale+tonemap.orig.

./ffmpeg -threads 16 -i 4K_HLG.MP4 -an -sn -vf format=yuv420p10 -vframes 1000 -f null -

frame= 1000 fps=251 q=-0.0 Lsize=N/A time=00:00:16.71 bitrate=N/A speed=4.19x

./ffmpeg -threads 16 -i 4K_HLG.MP4 -an -sn -vf format=yuv420p10,format=p010 -vframes 1000 -f null -

frame= 1000 fps=168 q=-0.0 Lsize=N/A time=00:00:16.71 bitrate=N/A speed=2.81x

@gnattu
Copy link
Member Author

gnattu commented Jun 25, 2024

We really can't expect all our users to have i9/R9/Apple Silicon Max. FWIW in this simple test it already clearly hurts performance, although it wouldn't be so noticeable in a real world testing because this version is already much better than the zscale+tonemap.orig.

./ffmpeg -threads 16 -i 4K_HLG.MP4 -an -sn -vf format=yuv420p10 -vframes 1000 -f null -

frame= 1000 fps=251 q=-0.0 Lsize=N/A time=00:00:16.71 bitrate=N/A speed=4.19x

./ffmpeg -threads 16 -i 4K_HLG.MP4 -an -sn -vf format=yuv420p10,format=p010 -vframes 1000 -f null -

frame= 1000 fps=168 q=-0.0 Lsize=N/A time=00:00:16.71 bitrate=N/A speed=2.81x

Do you have time for a reference c implementation of yuv420p? I can port SIMD version to it.

@nyanmisaka
Copy link
Member

We really can't expect all our users to have i9/R9/Apple Silicon Max. FWIW in this simple test it already clearly hurts performance, although it wouldn't be so noticeable in a real world testing because this version is already much better than the zscale+tonemap.orig.

./ffmpeg -threads 16 -i 4K_HLG.MP4 -an -sn -vf format=yuv420p10 -vframes 1000 -f null -

frame= 1000 fps=251 q=-0.0 Lsize=N/A time=00:00:16.71 bitrate=N/A speed=4.19x

./ffmpeg -threads 16 -i 4K_HLG.MP4 -an -sn -vf format=yuv420p10,format=p010 -vframes 1000 -f null -

frame= 1000 fps=168 q=-0.0 Lsize=N/A time=00:00:16.71 bitrate=N/A speed=2.81x

Do you have time for a reference c implementation of yuv420p? I can port SIMD version to it.

Not yet. But here are some examples. Just convert it to use in a for loop. The difference between nv12/p01x and yuv420p/yuv420p1x is only whether U and V are interleaved.

+static __inline__ __device__ T read_sample(const FFCUDAFrame& frame, int x, int y)
+{
+ T* ptr = (T*)(frame.data[p] + (y * frame.linesize[p]));
+ return ptr[x];
+}
+
+// Per-format read functions
+static __inline__ __device__ ushort3 read_p016(const FFCUDAFrame& frame, int x, int y)
+{
+ return make_ushort3(read_sample<unsigned short, 0>(frame, x, y),
+ read_sample<unsigned short, 1>(frame, (x & ~1), y / 2),
+ read_sample<unsigned short, 1>(frame, (x & ~1) + 1, y / 2));
+}
+
+static __inline__ __device__ ushort3 read_p010(const FFCUDAFrame& frame, int x, int y)
+{
+ ushort3 val = read_p016(frame, x, y);
+ return make_ushort3(val.x >> 6,
+ val.y >> 6,
+ val.z >> 6);
+}
+
+static __inline__ __device__ ushort3 read_yuv420p16(const FFCUDAFrame& frame, int x, int y)
+{
+ return make_ushort3(read_sample<unsigned short, 0>(frame, x, y),
+ read_sample<unsigned short, 1>(frame, x / 2, y / 2),
+ read_sample<unsigned short, 2>(frame, x / 2, y / 2));
+}
+
+static __inline__ __device__ ushort3 read_yuv420p10(const FFCUDAFrame& frame, int x, int y)
+{
+ ushort3 val = read_yuv420p16(frame, x, y);
+ return make_ushort3(val.x >> 6,
+ val.y >> 6,
+ val.z >> 6);
+}
+
+// Generic read functions
+static __inline__ __device__ ushort3 read_px(const FFCUDAFrame& frame, int x, int y)
+{
+ if (fmt_src == AV_PIX_FMT_P010)
+ return read_p010(frame, x, y);
+ else if (fmt_src == AV_PIX_FMT_P016)
+ return read_p016(frame, x, y);
+ else
+ return make_ushort3(0, 0, 0);
+}
+
+static __inline__ __device__ float sample_to_float(unsigned short i)
+{
+ return (float)i / ((1 << depth_src) - 1);
+}
+
+static __inline__ __device__ float3 pixel_to_float3(ushort3 flt)
+{
+ return make_float3(sample_to_float(flt.x),
+ sample_to_float(flt.y),
+ sample_to_float(flt.z));
+}
+
+static __inline__ __device__ float3 read_px_flt(const FFCUDAFrame& frame, int x, int y)
+{
+ return pixel_to_float3(read_px(frame, x, y));
+}
+
+// Single-sample write function
+template<int p, class T>
+static __inline__ __device__ void write_sample(const FFCUDAFrame& frame, int x, int y, T sample)
+{
+ T* ptr = (T*)(frame.data[p] + (y * frame.linesize[p]));
+ ptr[x] = sample;
+}
+
+// Per-format write functions
+static __inline__ __device__ void write_nv12_2x2(const FFCUDAFrame& frame, int x, int y, ushort3 a, ushort3 b, ushort3 c, ushort3 d, ushort3 chroma)
+{
+ write_sample<0>(frame, x, y, (unsigned char)a.x);
+ write_sample<0>(frame, x + 1, y, (unsigned char)b.x);
+ write_sample<0>(frame, x, y + 1, (unsigned char)c.x);
+ write_sample<0>(frame, x + 1, y + 1, (unsigned char)d.x);
+
+ write_sample<1>(frame, (x & ~1), y / 2, (unsigned char)chroma.y);
+ write_sample<1>(frame, (x & ~1) + 1, y / 2, (unsigned char)chroma.z);
+}
+
+static __inline__ __device__ void write_yuv420p_2x2(const FFCUDAFrame& frame, int x, int y, ushort3 a, ushort3 b, ushort3 c, ushort3 d, ushort3 chroma)
+{
+ write_sample<0>(frame, x, y, (unsigned char)a.x);
+ write_sample<0>(frame, x + 1, y, (unsigned char)b.x);
+ write_sample<0>(frame, x, y + 1, (unsigned char)c.x);
+ write_sample<0>(frame, x + 1, y + 1, (unsigned char)d.x);
+
+ write_sample<1>(frame, x / 2, y / 2, (unsigned char)chroma.y);
+ write_sample<2>(frame, x / 2, y / 2, (unsigned char)chroma.z);
+}
+
+static __inline__ __device__ void write_p016_2x2(const FFCUDAFrame& frame, int x, int y, ushort3 a, ushort3 b, ushort3 c, ushort3 d, ushort3 chroma)
+{
+ write_sample<0>(frame, x, y, (unsigned short)a.x);
+ write_sample<0>(frame, x + 1, y, (unsigned short)b.x);
+ write_sample<0>(frame, x, y + 1, (unsigned short)c.x);
+ write_sample<0>(frame, x + 1, y + 1, (unsigned short)d.x);
+
+ write_sample<1>(frame, (x & ~1), y / 2, (unsigned short)chroma.y);
+ write_sample<1>(frame, (x & ~1) + 1, y / 2, (unsigned short)chroma.z);
+}
+
+static __inline__ __device__ void write_p010_2x2(const FFCUDAFrame& frame, int x, int y, ushort3 a, ushort3 b, ushort3 c, ushort3 d, ushort3 chroma)
+{
+ write_sample<0>(frame, x, y, (unsigned short)(a.x << 6));
+ write_sample<0>(frame, x + 1, y, (unsigned short)(b.x << 6));
+ write_sample<0>(frame, x, y + 1, (unsigned short)(c.x << 6));
+ write_sample<0>(frame, x + 1, y + 1, (unsigned short)(d.x << 6));
+
+ write_sample<1>(frame, (x & ~1), y / 2, (unsigned short)(chroma.y << 6));
+ write_sample<1>(frame, (x & ~1) + 1, y / 2, (unsigned short)(chroma.z << 6));
+}
+
+static __inline__ __device__ void write_yuv420p16_2x2(const FFCUDAFrame& frame, int x, int y, ushort3 a, ushort3 b, ushort3 c, ushort3 d, ushort3 chroma)
+{
+ write_sample<0>(frame, x, y, (unsigned short)a.x);
+ write_sample<0>(frame, x + 1, y, (unsigned short)b.x);
+ write_sample<0>(frame, x, y + 1, (unsigned short)c.x);
+ write_sample<0>(frame, x + 1, y + 1, (unsigned short)d.x);
+
+ write_sample<1>(frame, x / 2, y / 2, (unsigned short)chroma.y);
+ write_sample<2>(frame, x / 2, y / 2, (unsigned short)chroma.z);
+}
+
+static __inline__ __device__ void write_yuv420p10_2x2(const FFCUDAFrame& frame, int x, int y, ushort3 a, ushort3 b, ushort3 c, ushort3 d, ushort3 chroma)
+{
+ write_sample<0>(frame, x, y, (unsigned short)(a.x << 6));
+ write_sample<0>(frame, x + 1, y, (unsigned short)(b.x << 6));
+ write_sample<0>(frame, x, y + 1, (unsigned short)(c.x << 6));
+ write_sample<0>(frame, x + 1, y + 1, (unsigned short)(d.x << 6));
+
+ write_sample<1>(frame, x / 2, y / 2, (unsigned short)(chroma.y << 6));
+ write_sample<2>(frame, x / 2, y / 2, (unsigned short)(chroma.z << 6));
+}

@gnattu
Copy link
Member Author

gnattu commented Jun 26, 2024

gnu.org is down for 9 hours now and failing the pipeline

@nyanmisaka
Copy link
Member

nyanmisaka commented Jun 26, 2024

ffmpeg crashes after inserting a downscaling filter and running for a while.
scale=s=1920x1080:flags=fast_bilinear,tonemapx=...

For example, these resolutions.
1280x720
1920x1080
1920x1088
2560x1440

It can't be reproduced in tonemapx.c. I guess it has something to do with the edges of the image that cannot be accelerated by SIMD.

- <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
- <System>
  <Provider Name="Application Error" /> 
  <EventID Qualifiers="0">1000</EventID> 
  <Version>0</Version> 
  <Level>2</Level> 
  <Task>100</Task> 
  <Opcode>0</Opcode> 
  <Keywords>0x80000000000000</Keywords> 
  <TimeCreated SystemTime="2024-06-26T02:48:40.9359314Z" /> 
  <EventRecordID>216441</EventRecordID> 
  <Correlation /> 
  <Execution ProcessID="0" ThreadID="0" /> 
  <Channel>Application</Channel> 
  <Computer>pc</Computer> 
  <Security /> 
  </System>
- <EventData>
  <Data>ffmpeg.exe</Data> 
  <Data>0.0.0.0</Data> 
  <Data>667ad11b</Data> 
  <Data>avfilter-9.dll</Data> 
  <Data>9.3.100.0</Data> 
  <Data>667ad11b</Data> 
  <Data>c0000005</Data> 
  <Data>000000000032dab4</Data> 
  <Data>3818</Data> 
  <Data>01dac77359173dbe</Data> 
  <Data>C:\Users\usr\Desktop\jellyfin-ffmpeg_6.0.1-7-portable_win64\ffmpeg.exe</Data> 
  <Data>C:\Users\usr\Desktop\jellyfin-ffmpeg_6.0.1-7-portable_win64\avfilter-9.dll</Data> 
  <Data>7b50b856-5989-4ef4-ad9f-03ded7842dfe</Data> 
  <Data /> 
  <Data /> 
  </EventData>
  </Event>

@gnattu
Copy link
Member Author

gnattu commented Jun 26, 2024

How long did you let it run? I cannot reproduce on my machine.

@gnattu
Copy link
Member Author

gnattu commented Jun 26, 2024

interesting, only occurs with windows builds

@nyanmisaka
Copy link
Member

It happens randomly. Not sure if this has to do with gcc options and versions.

https://github.com/jellyfin/jellyfin-ffmpeg/blob/jellyfin/Dockerfile.win64.in#L20
https://github.com/jellyfin/jellyfin-ffmpeg/blob/jellyfin/builder/images/base-win64/Dockerfile#L43

Maybe try another builder? ./builder/makeimage.sh win64 gpl && ./builder/build.sh win64 gpl

@gnattu
Copy link
Member Author

gnattu commented Jun 26, 2024

It happens randomly.

The issue occurs frequently on Windows but never on Linux. Windows reports an access violation, which doesn't make sense to me if all memory accesses are to legal locations on Linux.

@nyanmisaka
Copy link
Member

It happens randomly.

The issue occurs frequently on Windows but never on Linux. Windows reports an access violation, which doesn't make sense to me if all memory accesses are to legal locations on Linux.

Does the compiler emit the same assembly code from the intrinsics?

@gnattu
Copy link
Member Author

gnattu commented Jun 26, 2024

./builder/makeimage.sh win64 gpl gives me 18.92 ../src/meson.build:28:12: ERROR: Program 'llvm-dlltool dlltool' not found or not executable when building libplacebo...

@nyanmisaka
Copy link
Member

./builder/makeimage.sh win64 gpl gives me 18.92 ../src/meson.build:28:12: ERROR: Program 'llvm-dlltool dlltool' not found or not executable when building libplacebo...

I'll take a look at it later. You can drop the vulkan related scripts first.

@gnattu
Copy link
Member Author

gnattu commented Jun 26, 2024

The problem is now super stupid to me. I commented out both read and write to/from the framebuffer operation and it is still telling me access violation.

@gnattu
Copy link
Member Author

gnattu commented Jun 26, 2024

Well I found the cause. GCC generated code sequences from _mm256_extract_epi32 and _mm_extract_epi32 intrinsics includes optimizations that do not play well with windows and will cause access violation. By storing the whole register to memory and read from there workarounds the access violation on windows.

@nyanmisaka
Copy link
Member

Well I found the cause. GCC generated code sequences from _mm256_extract_epi32 and _mm_extract_epi32 intrinsics includes optimizations that do not play well with windows and will cause access violation. By storing the whole register to memory and read from there workarounds the access violation on windows.

I guess this uncertainty from the compiler is one of the reasons why upstream FFmpeg only accepts assembly code.

@gnattu
Copy link
Member Author

gnattu commented Jun 26, 2024

Well it turns out that it is not that simple.

Now I really believe it is due to gcc+windows. I can workaround this issue by reducing the -filter_threads to a lower value like 8. When this value is set to something like 16 or higher, it is super easy to trigger the access violation again.

To make it even worse, it seems like only cap the concurrency in tonemapx is not enough and this has to be a global option which means all filters in the chain has to be concurrency capped.

With a global concurrency of 24 and we only spawn 1 job for tonemapx, the access violation still happens after a few moments.

@nyanmisaka
Copy link
Member

Well it turns out that it is not that simple.

Now I really believe it is due to gcc+windows. I can workaround this issue by reducing the -filter_threads to a lower value like 8. When this value is set to something like 16 or higher, it is super easy to trigger the access violation again.

To make it even worse, it seems like only cap the concurrency in tonemapx is not enough and this has to be a global option which means all filters in the chain has to be concurrency capped.

With a global concurrency of 24 and we only spawn 1 job for tonemapx, the access violation still happens after a few moments.

Take it easy. We still have several months to investigate before JF 10.10.

Could it be related to LTO and GCC auto-vectorization?
https://github.com/jellyfin/jellyfin-ffmpeg/blob/jellyfin/debian/patches/0035-enable-gcc-vectorization-and-lto-auto.patch

@gnattu
Copy link
Member Author

gnattu commented Jun 26, 2024

__attribute__((optimize("-fno-tree-vectorize"))) on simd functions made little to no difference

What works is that by implementing the write whole block back to memory and read from there(no usage of _mm256_extract_epi32 ) plus a global threads cap of 8 and a normal video resolution (like 1920x1080), and an abnormal output of 1928x1080 requires even lower thread limit to 4.

I however observed that zscale can support a huge amount of concurrency (with 1920x1080 works with 24 and 1928x1080 works with 16) after the _mm256_extract_epi32 workarounds being implemented. Maybe ffmpeg's native scaling filter is having some problem?

Edit: it seems like zscale works even without modification? At least it works with -filter_threads 16 and all the resolution combination you mentioned.

@nyanmisaka
Copy link
Member

nyanmisaka commented Jun 26, 2024

@gnattu
Copy link
Member Author

gnattu commented Jun 26, 2024

Have you seen these? MingGW-W64 seems to be quite fragile in handling AVX.

https://stackoverflow.com/questions/71859992/what-is-causing-this-memory-access-violation-error-0xc0000005-when-using-eigen

https://stackoverflow.com/questions/30928265/mingw64-is-incapable-of-32-byte-stack-alignment-required-for-avx-on-windows-x64

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54412

Well it is worse than I think. The problem now is that the sse version also has such problems and I have no idea why. It does support higher concurrency though, but when the threads is too much the access violation will eventually come even with only sse. Still, zscale works better.

@nyanmisaka
Copy link
Member

Have you seen these? MingGW-W64 seems to be quite fragile in handling AVX.
https://stackoverflow.com/questions/71859992/what-is-causing-this-memory-access-violation-error-0xc0000005-when-using-eigen
https://stackoverflow.com/questions/30928265/mingw64-is-incapable-of-32-byte-stack-alignment-required-for-avx-on-windows-x64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54412

Well it is worse than I think. The problem now is that the sse version also has such problems and I have no idea why. It does support higher concurrency though, but when the threads is too much the access violation will eventually come even with only sse. Still, zscale works better.

Perhaps you should try compiling with MSVC to see if this is just another mingw gcc issue.
https://github.com/ShiftMediaProject/FFVS-Project-Generator

@gnattu
Copy link
Member Author

gnattu commented Jun 26, 2024

Perhaps you should try compiling with MSVC to see if this is just another mingw gcc issue. https://github.com/ShiftMediaProject/FFVS-Project-Generator

Will figure it out later. -Wa,-muse-unaligned-vector-move does not help so this is another issue i think. But I still removed avx on windows due to its known bug.

@gnattu
Copy link
Member Author

gnattu commented Jun 26, 2024

The access violation with many threads even happens with msvc(btw our codebase is not really compatible with msvc and have to comment out a lot of things to make it compile). Perhaps I have to try that project generator to debug with visual studio to see what happens...

@gnattu
Copy link
Member Author

gnattu commented Jun 28, 2024

I "fixed" the access violation on Windows by refactoring the memory store logic and using more stable range clipping.

The new memory store logic improved performance for all platforms:

i9-12900 with AVX2: 77fps->87fps (Linux)
M1 Max with NEON: 57fps->60fps

Now AVX is usable on Windows with compiler flag -Wa,-muse-unaligned-vector-move. The performance impact of this flag is around 5%. i9-12900 got 82fps on Windows.

@gnattu
Copy link
Member Author

gnattu commented Jun 28, 2024

A question about the yuv420p implementation: If most of the software decoder and encoders are not expecting p01x frames, isn't it safe to just drop the support for such frames and supports yuv420 exclusively? For what use case the p01x is preferred?

@gnattu gnattu mentioned this pull request Jun 28, 2024
debian/patches/0080-add-tonemapx-filter.patch Outdated Show resolved Hide resolved
debian/patches/0080-add-tonemapx-filter.patch Outdated Show resolved Hide resolved
debian/patches/0080-add-tonemapx-filter.patch Outdated Show resolved Hide resolved
debian/patches/0080-add-tonemapx-filter.patch Outdated Show resolved Hide resolved
debian/patches/0080-add-tonemapx-filter.patch Outdated Show resolved Hide resolved
debian/patches/0080-add-tonemapx-filter.patch Outdated Show resolved Hide resolved
debian/patches/0080-add-tonemapx-filter.patch Outdated Show resolved Hide resolved
debian/patches/0080-add-tonemapx-filter.patch Outdated Show resolved Hide resolved
debian/patches/0080-add-tonemapx-filter.patch Outdated Show resolved Hide resolved
debian/patches/0080-add-tonemapx-filter.patch Outdated Show resolved Hide resolved
@nyanmisaka
Copy link
Member

There are also some warnings that can be eliminated before merging. There should be more in the GH actions log.

For example:

libavfilter/vf_tonemapx.c: In function ‘tonemap_frame_p016_p010_2_nv12’:
libavfilter/vf_tonemapx.c:495:13: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
  495 |             int r00 = r[0], g00 = g[0], b00 = b[0];
      |             ^~~
libavfilter/vf_tonemapx.c: In function ‘tonemap_frame_420p10_2_420p’:
libavfilter/vf_tonemapx.c:585:13: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
  585 |             int r00 = r[0], g00 = g[0], b00 = b[0];
      |             ^~~
libavfilter/vf_tonemapx.c: In function ‘tonemap_frame_420p10_2_420p10’:
libavfilter/vf_tonemapx.c:675:13: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
  675 |             int r00 = r[0], g00 = g[0], b00 = b[0];
      |             ^~~
libavfilter/vf_tonemapx.c: In function ‘tonemap_frame_p016_p010_2_p016_p010’:
libavfilter/vf_tonemapx.c:767:13: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
  767 |             int r00 = r[0], g00 = g[0], b00 = b[0];
      |             ^~~
libavfilter/vf_tonemapx.c: In function ‘filter_slice_planar10’:
libavfilter/vf_tonemapx.c:851:43: warning: passing argument 1 of ‘s->tonemap_func_planar10’ from incompatible pointer type [-Wincompatible-pointer-types]
  851 |     s->tonemap_func_planar10(out->data[0] + out->linesize[0] * slice_start,
      |                              ~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                           |
      |                                           uint8_t * {aka unsigned char *}
libavfilter/vf_tonemapx.c:851:43: note: expected ‘uint16_t *’ {aka ‘short unsigned int *’} but argument is of type ‘uint8_t *’ {aka ‘unsigned char *’}
libavfilter/vf_tonemapx.c:852:43: warning: passing argument 2 of ‘s->tonemap_func_planar10’ from incompatible pointer type [-Wincompatible-pointer-types]
  852 |                              out->data[1] + out->linesize[1] * AV_CEIL_RSHIFT(slice_start, desc->log2_chroma_h),
libavfilter/vf_tonemapx.c:852:43: note: expected ‘uint16_t *’ {aka ‘short unsigned int *’} but argument is of type ‘uint8_t *’ {aka ‘unsigned char *’}
libavfilter/vf_tonemapx.c:853:43: warning: passing argument 3 of ‘s->tonemap_func_planar10’ from incompatible pointer type [-Wincompatible-pointer-types]
  853 |                              out->data[2] + out->linesize[2] * AV_CEIL_RSHIFT(slice_start, desc->log2_chroma_h),
libavfilter/vf_tonemapx.c:853:43: note: expected ‘uint16_t *’ {aka ‘short unsigned int *’} but argument is of type ‘uint8_t *’ {aka ‘unsigned char *’}
libavfilter/vf_tonemapx.c: In function ‘filter_slice_biplanar10’:
libavfilter/vf_tonemapx.c:870:45: warning: passing argument 1 of ‘s->tonemap_func_biplanar10’ from incompatible pointer type [-Wincompatible-pointer-types]
  870 |     s->tonemap_func_biplanar10(out->data[0] + out->linesize[0] * slice_start,
      |                                ~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                             |
      |                                             uint8_t * {aka unsigned char *}
libavfilter/vf_tonemapx.c:870:45: note: expected ‘uint16_t *’ {aka ‘short unsigned int *’} but argument is of type ‘uint8_t *’ {aka ‘unsigned char *’}
libavfilter/vf_tonemapx.c:871:45: warning: passing argument 2 of ‘s->tonemap_func_biplanar10’ from incompatible pointer type [-Wincompatible-pointer-types]
  871 |                                out->data[1] + out->linesize[1] * AV_CEIL_RSHIFT(slice_start, desc->log2_chroma_h),
libavfilter/vf_tonemapx.c:871:45: note: expected ‘uint16_t *’ {aka ‘short unsigned int *’} but argument is of type ‘uint8_t *’ {aka ‘unsigned char *’}

@nyanmisaka
Copy link
Member

nyanmisaka commented Jun 30, 2024

It seems that ffmpeg has a HAVE_INTRINSICS_NEON generated from configure. Is it necessary to add one for SSE and AVX as well? Just for completeness, in case our downstream users (e.g. SynoCommunity) are using legacy compilers and resulting in breakages.

edit:

  • Disable corresponding SIMD code when header files or funcs are not available
  • When disabling SIMD using ffmpeg flag, the corresponding code should be excluded --disable-neon
  • Runtime SIMD check

@gnattu
Copy link
Member Author

gnattu commented Jun 30, 2024

It seems that ffmpeg has a HAVE_INTRINSICS_NEON generated from configure. Is it necessary to add one for SSE and AVX as well? Just for completeness, in case our downstream users (e.g. SynoCommunity) are using legacy compilers and resulting in breakages.

edit:

  • Disable corresponding SIMD code when header files or funcs are not available
  • When disabling SIMD using ffmpeg flag, the corresponding code should be excluded --disable-neon
  • Runtime SIMD check

I couldn't care less about those ancient compilers. I even dropped debian buster support due to its ancient gcc. Extend the macro checking for configure flag and disable with --disable-neon is probably all I'm going to do. Check for header existence is out of scope at least to me, and checking for function availability is almost impossible unless you write a super long configure script to test compile the whole file and then define the disable intrinsics macro, but I'm not going to do that

@nyanmisaka
Copy link
Member

It seems that ffmpeg has a HAVE_INTRINSICS_NEON generated from configure. Is it necessary to add one for SSE and AVX as well? Just for completeness, in case our downstream users (e.g. SynoCommunity) are using legacy compilers and resulting in breakages.
edit:

  • Disable corresponding SIMD code when header files or funcs are not available
  • When disabling SIMD using ffmpeg flag, the corresponding code should be excluded --disable-neon
  • Runtime SIMD check

I couldn't care less about those ancient compilers. I even dropped debian buster support due to its ancient gcc. Extend the macro checking for configure flag and disable with --disable-neon is probably all I'm going to do. Check for header existence is out of scope at least to me, and checking for function availability is almost impossible unless you write a super long configure script to test compile the whole file and then define the disable intrinsics macro, but I'm not going to do that

Indeed, it is not practical to test all functions. How about checking with compiler version? When the gcc/clang version is lower than required, disable SIMD. Other compilers can be reasonably ignored.

@gnattu
Copy link
Member Author

gnattu commented Jun 30, 2024

Indeed, it is not practical to test all functions. How about checking with compiler version? When the gcc/clang version is lower than required, disable SIMD. Other compilers can be reasonably ignored.

Done

docker-build-win64.sh Outdated Show resolved Hide resolved
debian/patches/0080-add-tonemapx-filter.patch Outdated Show resolved Hide resolved
@gnattu
Copy link
Member Author

gnattu commented Jul 1, 2024

I made a windows clang build for testing: https://github.com/gnattu/jellyfin-ffmpeg/releases/tag/win64-clang

This performs faster than the gcc version(at least on my own machine), but more testing is needed.

@nyanmisaka
Copy link
Member

0080-add-simd-optimized-tonemapx-filter.patch

I made some minor improvements:

  • Briefly test immtrin.h availability instead of relying entirely on compiler version
  • Comment out intrin code when disabling intrinsics, to prevent them from still being compiled and possibly throwing errors
  • Fix warning: unused variable 'cpu_flags' when intrinsics is completely disabled

Two Q:

  • Any chance to re-enable intrinsics for gcc 9? Our focal/20.04 build still uses it.
  • Seems that immintrin.h already includes emmintrin.h + smmintrin.h, maybe just including immintrin.h is enough?

@gnattu
Copy link
Member Author

gnattu commented Jul 2, 2024

Briefly test immtrin.h availability instead of relying entirely on compiler version

This seems unnecessary because you have to test compiler version after all. GCC has immtrin.h since 4.x (forget which precisely), but what is available in that header file varies largely which means intrinsics will be unavailable even with the presence of this file and you cannot test all instructions. Unless there is a stupid compiler that lies about itself and pretend to be a modern gcc/clang and does not implement x86 intrinsics.

Any chance to re-enable intrinsics for gcc 9? Our focal/20.04 build still uses it.

It should compile, though I cannot guarantee if it really works.

Seems that immintrin.h already includes emmintrin.h + smmintrin.h, maybe just including immintrin.h is enough?

This should work on all modern compilers, not sure for ancient gcc

@nyanmisaka
Copy link
Member

This seems unnecessary because you have to test compiler version after all. GCC has immtrin.h since 4.x (forget which precisely), but what is available in that header file varies largely which means intrinsics will be unavailable even with the presence of this file and you cannot test all instructions. Unless there is a stupid compiler that lies about itself and pretend to be a modern gcc/clang and does not implement x86 intrinsics.

At least this ensures that the current gcc/clang environment is capable of handling some kind of intrinsics before actually building ffmpeg, no?

If immintrin.h or one of its included headers is missing, or the package containing it is not installed or is corrupted, then simply checking the version of gcc/clang is not sufficient.

As for whether the contents of the headers are out of date, I think that is beyond our scope. If it's really necessary, we can add them manually, just like you did for _mm_storeu_si32() in gcc10-.

It should compile, though I cannot guarantee if it really works.

20.04 will reach EOL in April 2025. We can try enabling it and see.

This should work on all modern compilers, not sure for ancient gcc

I think we don't care about those ancient compilers.

@gnattu gnattu merged commit b28524f into jellyfin Jul 7, 2024
25 checks passed
@gnattu gnattu deleted the simd-tonemapx branch July 7, 2024 14:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants