avfilter/tonemap: add simd implementation for sse and neon #401

gnattu · 2024-06-19T14:17:21Z

Currently only reinhard, linear and none have SIMD implementations; all other methods fall back to the scalar implementation.

Reinhard is the preferred method on CPU because it is fast and produces subjectively satisfactory output, as the results tend to look brighter.
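To illustrate why this curve is cheap, here is a minimal scalar sketch of a Reinhard-style operator. It assumes the simplified form `out = x / (x + k) * peak` applied per sample; the names `reinhard_scalar`, `reinhard_plane`, `k` and `peak` are hypothetical and are not taken from this PR's code, which may differ (for example by working on a per-pixel signal level rather than on each sample).

```c
#include <stddef.h>

/* Hypothetical helper: simplified Reinhard-style curve for one sample.
 * k controls how quickly highlights roll off; peak rescales the result.
 * This is an assumed form, not the exact code from the PR. */
static float reinhard_scalar(float x, float k, float peak)
{
    return x / (x + k) * peak;
}

/* Apply the curve to n linear-light float samples. */
static void reinhard_plane(float *dst, const float *src, size_t n,
                           float k, float peak)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = reinhard_scalar(src[i], k, peak);
}
```

Each sample costs one add, one divide and one multiply, with no `powf` or `logf` calls, which is what makes this kind of curve a good fit for a straightforward SIMD port.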

Test results with 4K HEVC 10-bit HLG input, encoding with libx264 veryfast and the reinhard method:

Apple M1 Max:

tonemap.neon: 44fps
tonemap.c: 35fps

Intel Core i9-12900:

tonemap.sse: 40fps
tonemap.c: 32fps

Both resulted in ~25% perf gain.



gnattu · 2024-06-19T14:18:29Z

An AVX implementation was also attempted, but there was no measurable perf gain. I dropped that draft to simplify the logic.

These intrinsics require an ARMv8 CPU

nyanmisaka · 2024-06-19T15:26:15Z

I have a draft of an improved sw tonemap filter, but it doesn't have intrinsics/assembly support yet. If you're interested you can test it out and see how it performs now vs zscale+tonemap combo.

gnattu · 2024-06-19T15:53:54Z

> I have a draft of an improved sw tonemap filter, but it doesn't have intrinsics/assembly support yet. If you're interested you can test it out and see how it performs now vs zscale+tonemap combo.

zscale does color space conversion and linearization very fast because it already uses a SIMD-optimized LUT, so the scalar filter can hardly beat it. What we can do with that draft is implement dovi reshaping and use it for dovi inputs. We could even implement only the reshaping part, pipe its output into zscale for linearization, and then tonemap with this filter.

The dovi reshaping part has many SIMD optimization opportunities because it involves a lot of matrix operations. Computing powers of floats is also time-consuming, which means a SIMD-optimized LUT is a must on CPU. This is also why BT2390 is not an easy task on CPU.
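As a rough illustration of the LUT idea, the sketch below precomputes an expensive per-sample curve once and replaces it with a table lookup plus linear interpolation. Everything here is an assumption for illustration: the `CurveLut` type, the table size of 1024, the input range of [0, 1], and the `example_curve` stand-in (a plain square, used so the block needs no math library). It is scalar code and does not reflect how zscale builds or indexes its tables.

```c
#include <stddef.h>

#define CURVE_LUT_SIZE 1024  /* assumed resolution, chosen for illustration */

typedef struct CurveLut {
    float table[CURVE_LUT_SIZE + 1];
} CurveLut;

/* Stand-in for an expensive transfer function. A real filter would
 * evaluate something powf-based here; x*x keeps the sketch libm-free. */
static float example_curve(float x)
{
    return x * x;
}

/* Precompute fn(x) for evenly spaced x in [0, 1]. */
static void curve_lut_init(CurveLut *lut, float (*fn)(float))
{
    for (int i = 0; i <= CURVE_LUT_SIZE; i++)
        lut->table[i] = fn((float)i / CURVE_LUT_SIZE);
}

/* Approximate fn(x) by linear interpolation between table entries.
 * Inputs outside [0, 1] are clamped to the table ends. */
static float curve_lut_eval(const CurveLut *lut, float x)
{
    if (x <= 0.0f)
        return lut->table[0];
    if (x >= 1.0f)
        return lut->table[CURVE_LUT_SIZE];

    float pos  = x * CURVE_LUT_SIZE;
    int   idx  = (int)pos;            /* lower table entry */
    float frac = pos - (float)idx;    /* position between entries */
    return lut->table[idx] + frac * (lut->table[idx + 1] - lut->table[idx]);
}
```

The per-sample cost becomes one multiply, two loads and a short interpolation regardless of how expensive the original function is. The table loads are the part a SIMD version has to handle carefully, since they are data-dependent.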

nyanmisaka · 2024-06-20T08:13:19Z

> > I have a draft of an improved sw tonemap filter, but it doesn't have intrinsics/assembly support yet. If you're interested you can test it out and see how it performs now vs zscale+tonemap combo.
>
> zscale does color space conversion and linearization very fast because it already uses a SIMD-optimized LUT, so the scalar filter can hardly beat it. What we can do with that draft is implement dovi reshaping and use it for dovi inputs. We could even implement only the reshaping part, pipe its output into zscale for linearization, and then tonemap with this filter.
>
> The dovi reshaping part has many SIMD optimization opportunities because it involves a lot of matrix operations. Computing powers of floats is also time-consuming, which means a SIMD-optimized LUT is a must on CPU. This is also why BT2390 is not an easy task on CPU.

What ffmpeg command did you use to test zscale+tonemap?
It seems difficult to add LUT support for dovi reshaping, and libplacebo doesn't do it either.

gnattu · 2024-06-20T11:05:15Z

> What ffmpeg command did you use to test zscale+tonemap?

Full command:

/path/to/ffmpeg -noautorotate -i file:"/path/to/input.mp4" -codec:v:0 libx264 -preset veryfast -crf 23 -maxrate 9871252 -bufsize 19742504 -x264opts:0 subme=0:me_range=4:rc_lookahead=10:me=dia:no_chroma_me:8x8dct=0:partitions=none -force_key_frames:0 "expr:gte(t,n_forced*3)" -sc_threshold:v:0 0 -vf "setparams=color_primaries=bt2020:color_trc=smpte2084:colorspace=bt2020nc,format=yuv420p,zscale=t=linear:npl=100,format=gbrpf32le,zscale=p=bt709,tonemap=tonemap=reinhard:desat=0:peak=100,zscale=t=bt709:m=bt709,format=yuv420p" -codec:a:0 copy -copyts -avoid_negative_ts disabled test.mp4

On some processor and input video combinations, you need to reduce -threads to a lower number such as 1 to see the actual perf improvement from the SIMD optimization introduced in this PR. My guess is that high thread pressure lowers the cache hit rate enough to mask the gain.

> It seems difficult to add LUT support for dovi reshaping, and libplacebo doesn't do it either.

That is fine too. This PR does not add a LUT either; it just computes multiple pixels at once with SIMD, which is why reinhard is used. We can do the same with dovi reshaping.
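A minimal sketch of the "multiple pixels at once" approach, assuming a simplified per-sample Reinhard-style curve `x / (x + k) * peak` and SSE on x86. The function names and parameters are hypothetical and are not the PR's actual code; on targets without SSE the sketch simply falls back to the scalar loop.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Scalar reference: simplified Reinhard-style curve (assumed form). */
static void reinhard_c(float *dst, const float *src, size_t n,
                       float k, float peak)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] / (src[i] + k) * peak;
}

/* Same curve, four float samples per iteration when SSE is available. */
static void reinhard_simd(float *dst, const float *src, size_t n,
                          float k, float peak)
{
    size_t i = 0;
#if defined(__SSE__)
    const __m128 vk    = _mm_set1_ps(k);     /* broadcast k to 4 lanes */
    const __m128 vpeak = _mm_set1_ps(peak);  /* broadcast peak to 4 lanes */
    for (; i + 4 <= n; i += 4) {
        __m128 x = _mm_loadu_ps(src + i);    /* unaligned load of 4 samples */
        __m128 y = _mm_mul_ps(_mm_div_ps(x, _mm_add_ps(x, vk)), vpeak);
        _mm_storeu_ps(dst + i, y);
    }
#endif
    /* Scalar tail, and the whole buffer on targets without SSE. */
    reinhard_c(dst + i, src + i, n - i, k, peak);
}
```

Because the curve needs only add, divide and multiply, each step maps directly onto one vector instruction and no table lookup is required. A NEON version would follow the same shape with the corresponding 4-lane float intrinsics.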

gnattu · 2024-06-25T13:07:56Z

Closed in favor of #407

gnattu requested a review from a team June 19, 2024 14:17

Shadowghost approved these changes Jun 19, 2024


avfilter/tonemap: remove armv7 as neon targets

2eb8f96

These intrinsics require an ARMv8 CPU

gnattu mentioned this pull request Jun 25, 2024

avfilter/tonemapx: add simd optimized tonemapx #407

Merged

gnattu closed this Jun 25, 2024

gnattu deleted the tonemap-simd branch July 7, 2024 17:10
