avfilter/tonemap: add simd implementation for sse and neon #401

Closed
wants to merge 2 commits into from

Member

@gnattu gnattu commented Jun 19, 2024

Currently only reinhard, linear and none have SIMD implementations; all other methods fall back to the scalar implementation.

Reinhard is the preferred method on CPU because it is fast and produces subjectively satisfactory output, as the result tends to look brighter.

Test results with a 4K HEVC 10-bit HLG input, encoding with libx264 veryfast and the reinhard method:

Apple M1 Max:

tonemap.neon: 44fps
tonemap.c: 35fps

Intel Core i9-12900:

tonemap.sse: 40fps
tonemap.c: 32fps

Both resulted in ~25% perf gain.

@gnattu gnattu requested a review from a team June 19, 2024 14:17
@gnattu
Member Author

gnattu commented Jun 19, 2024

An AVX implementation was also attempted, but there was no measurable perf gain. I dropped that draft to simplify the logic.

These intrinsics requires armv8 cpu
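
For context on the commit note above: AArch64 (ARMv8) NEON provides a vector floating-point divide, `vdivq_f32`, which 32-bit ARMv7 NEON lacks. Whether this PR depends on that particular intrinsic is an assumption; the sketch below only illustrates how a Reinhard-style curve could be written with it. The function name `reinhard_neon_sketch`, the `param`/`peak` arguments, and the curve form `sig / (sig + param) * peak` are illustrative, not taken from the PR.

```c
#include <stddef.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical sketch: apply sig / (sig + param) * peak to n floats.
 * On AArch64 four values are processed per iteration; elsewhere (and
 * for the leftover tail) the plain scalar loop is used. */
static void reinhard_neon_sketch(float *dst, const float *src, size_t n,
                                 float param, float peak)
{
    size_t i = 0;
#if defined(__aarch64__)
    const float32x4_t vparam = vdupq_n_f32(param);
    const float32x4_t vpeak  = vdupq_n_f32(peak);
    for (; i + 4 <= n; i += 4) {
        float32x4_t sig = vld1q_f32(src + i);
        /* vdivq_f32 is only available on AArch64 */
        float32x4_t out = vmulq_f32(vdivq_f32(sig, vaddq_f32(sig, vparam)), vpeak);
        vst1q_f32(dst + i, out);
    }
#endif
    /* Scalar tail, and the full fallback on non-AArch64 targets */
    for (; i < n; i++)
        dst[i] = src[i] / (src[i] + param) * peak;
}
```

On an ARMv7 target the same curve would need a reciprocal estimate plus refinement steps instead of a direct divide, which is one possible reason to restrict a NEON path to ARMv8.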
@nyanmisaka
Member

I have a draft of an improved sw tonemap filter, but it doesn't have intrinsics/assembly support yet. If you're interested you can test it out and see how it performs now vs zscale+tonemap combo.

@gnattu
Member Author

gnattu commented Jun 19, 2024

> I have a draft of an improved sw tonemap filter, but it doesn't have intrinsics/assembly support yet. If you're interested you can test it out and see how it performs now vs zscale+tonemap combo.

zscale does color space conversion and linearization very fast because it already uses a SIMD-optimized LUT, so a scalar filter can hardly beat that. What we can do with that draft is implement dovi reshaping and use it for dovi inputs. We may even implement only the reshaping part, so that we can pipe its output into zscale for linearization and then tonemap with this filter.

The dovi reshaping part has a lot of SIMD optimization opportunities, as it involves many matrix operations. Computing powers of floats is also time-consuming, which means a SIMD-optimized LUT is a must on CPU. This is also the reason why BT2390 is not an easy task on CPU.

@nyanmisaka
Member

>> I have a draft of an improved sw tonemap filter, but it doesn't have intrinsics/assembly support yet. If you're interested you can test it out and see how it performs now vs zscale+tonemap combo.

> zscale does color space conversion and linearization very fast because it already uses a SIMD-optimized LUT, so a scalar filter can hardly beat that. What we can do with that draft is implement dovi reshaping and use it for dovi inputs. We may even implement only the reshaping part, so that we can pipe its output into zscale for linearization and then tonemap with this filter.

> The dovi reshaping part has a lot of SIMD optimization opportunities, as it involves many matrix operations. Computing powers of floats is also time-consuming, which means a SIMD-optimized LUT is a must on CPU. This is also the reason why BT2390 is not an easy task on CPU.

What ffmpeg command did you use to test zscale+tonemap?
It seems difficult to add LUT support for dovi reshaping, and libplacebo doesn't do it either.

@gnattu
Member Author

gnattu commented Jun 20, 2024

> What ffmpeg command did you use to test zscale+tonemap?

Full command:

/path/to/ffmpeg -noautorotate -i file:"/path/to/input.mp4" -codec:v:0 libx264 -preset veryfast -crf 23 -maxrate 9871252 -bufsize 19742504 -x264opts:0 subme=0:me_range=4:rc_lookahead=10:me=dia:no_chroma_me:8x8dct=0:partitions=none -force_key_frames:0 "expr:gte(t,n_forced*3)" -sc_threshold:v:0 0 -vf "setparams=color_primaries=bt2020:color_trc=smpte2084:colorspace=bt2020nc,format=yuv420p,zscale=t=linear:npl=100,format=gbrpf32le,zscale=p=bt709,tonemap=tonemap=reinhard:desat=0:peak=100,zscale=t=bt709:m=bt709,format=yuv420p" -codec:a:0 copy -copyts -avoid_negative_ts disabled test.mp4

On some processor and input video combinations, you need to reduce -threads to a lower number, such as 1, to see the actual perf improvement from the SIMD optimization introduced in this PR. My guess is that high thread pressure lowers the cache hit rate enough to hide the gain.

> It seems difficult to add LUT support for dovi reshaping, and libplacebo doesn't do it either.

That is also fine. This PR does not add a LUT either; it just computes multiple pixels at the same time with SIMD, which is why reinhard is used. We can do the same with dovi reshaping.
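
As a rough illustration of "multiple pixels at the same time", here is a minimal sketch of a Reinhard-style curve evaluated four floats per iteration with SSE intrinsics. It is not the PR's code: the function names, the `param`/`peak` arguments, and the curve form `sig / (sig + param) * peak` are assumptions made for this example, and the real filter does more work per pixel than this.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Scalar reference: one value per iteration. */
static void reinhard_scalar(float *dst, const float *src, size_t n,
                            float param, float peak)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] / (src[i] + param) * peak;
}

/* Hypothetical SIMD variant: four values per iteration when SSE is
 * available, with a scalar loop for the tail (or for everything on
 * targets without SSE). No lookup table is involved. */
static void reinhard_simd(float *dst, const float *src, size_t n,
                          float param, float peak)
{
    size_t i = 0;
#if defined(__SSE__)
    const __m128 vparam = _mm_set1_ps(param);
    const __m128 vpeak  = _mm_set1_ps(peak);
    for (; i + 4 <= n; i += 4) {
        __m128 sig = _mm_loadu_ps(src + i);            /* unaligned load */
        __m128 out = _mm_mul_ps(_mm_div_ps(sig, _mm_add_ps(sig, vparam)), vpeak);
        _mm_storeu_ps(dst + i, out);
    }
#endif
    for (; i < n; i++)
        dst[i] = src[i] / (src[i] + param) * peak;
}
```

The curve needs only an add, a divide and a multiply per value, all of which map directly onto vector instructions. That is consistent with the point above that reinhard suits this approach, whereas curves built on float powers would need a LUT or a vectorized approximation.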

@gnattu
Member Author

gnattu commented Jun 25, 2024

Closed in favor of #407

@gnattu gnattu closed this Jun 25, 2024
@gnattu gnattu deleted the tonemap-simd branch July 7, 2024 17:10