Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jpeg encoding code optimization #1632

Merged

Conversation

br3aker
Copy link
Contributor

@br3aker br3aker commented May 22, 2021

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

This is a draft due to 4 and some small internal jpeg changes I will adress in a separate comment. Want to hear your thoughts on this.

Jpeg format encoding optimization & code cleanup.

  1. Moved scan encoding logic to a separate class HuffmanScanEncoder following logic of HuffmanScanDecoder in jpeg decoding code (did todo comment from Encode444/Encode420, basically).
  2. Previously, encoder was writing encoded scan byte-by-byte, buffer logic existed but it wasn't used properly. This PR fixes it and introduces new buffer size - 1024, Looks like it's optimal for general use case, lesser size would lead to worse performance, bigger size doesn't bring any performance boost at all (on my machine atleast). Right now it is defined by constant at HuffmanScanEncoder class, it can be injected from the callee/global config but I'm not sure it is really needed.
  3. Introduced 256bit 8x8 matrix forward/inverse DC transforms via avx instruction set.
  4. (This is questionable) Introduced explicit layout for Block8x8F. Most, if not all, avx-intrinsic code use a lot of Unsafe.As casts like here:
var valueVec = Vector256.Create(value);
Unsafe.As<Vector4, Vector256<float>>(ref this.V0L) = Avx.Multiply(Unsafe.As<Vector4, Vector256<float>>(ref this.V0L), valueVec);

With new layout it can be simplified to:

var valueVec = Vector256.Create(value);
this.V0 = Avx.Multiply(this.V0, valueVec);

There's also a performance gain but it's too small to really affect any heavy algorithms like jpeg decoding/encoding.
Note: This PR does not change any existing code with Unsafe.As calls. If it gets merged - this can be a separate codequality issue (good-first-issue maybe?)

Stats

Benchmarks are based on this noise image with different sizes:
200x200

Also tested some random images from beeheads demo but they are too large for gaining stable results in a reasonable amount of time :P
Even so, got ~100-200ms performance gain on those.

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-6700K CPU 4.00GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=6.0.100-preview.3.21202.5

Jpeg encoding to MemoryStream (disk writes are less stable):

Method Mean Size Iterations
Current 3 ms 200x200 250
Current 11 ms 400x400 250
Current 44 ms 800x800 250
Current 177 ms 1600x1600 250
New 2 ms 200x200 250
New 8 ms 400x400 250
New 32 ms 800x800 250
New 133 ms 1600x1600 250

Avx accelerated cosine transforms:

Method Mean Error StdDev Ratio
TransformFDCT 64.43 ns 0.541 ns 0.506 ns 1.00
TransformFDCT_intrinsics 50.24 ns 0.361 ns 0.320 ns 0.78
Method Mean Error StdDev Ratio
TransformIDCT 75.21 ns 0.382 ns 0.339 ns 1.00
TransformIDCT_intrinsics 58.96 ns 0.306 ns 0.271 ns 0.78

Dmitry Pentin added 30 commits May 18, 2021 12:03
Improved performance, no need for Unsafe calls.
…bility for FDCT8x4 calls using FDCT8x8(ref Block8x8F, ref Block8x8F) method
Copy link
Contributor

@saucecontrol saucecontrol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice improvements! Left a comment on the SIMD 2:1 box filter if you want to push it a bit further

@br3aker
Copy link
Contributor Author

br3aker commented Jun 8, 2021

@saucecontrol did a little research about 8x4 -> 4x2 downscaling based on your ideas.

  1. A lot of total method time was spent on the final permutation. I guess line crossing was a thing, your approach with vpermq works 25% faster:
Method Mean Error StdDev Ratio
Shuffle_vec 1.962 ns 0.0146 ns 0.0136 ns 1.00
Shuffle_qword 1.485 ns 0.0124 ns 0.0110 ns 0.76

Note that this benchmark uses current add(shuffle, shuffle) implementation, not horizontal add.


  1. vhaddps doesn't really give any performance boost:
Method Mean Error StdDev Ratio
Shuffle_vec 1.962 ns 0.0146 ns 0.0136 ns 1.00
HorizontalAdd_vec 1.863 ns 0.0148 ns 0.0131 ns 0.95
Shuffle_qword 1.485 ns 0.0124 ns 0.0110 ns 0.76
HorizontalAdd_qword 1.497 ns 0.0114 ns 0.0101 ns 0.76

vpermq gives a little to no performance boost but it reduces code by a lot, it completely eliminates 2 custom methods and looks cleaner!


  1. Horizontal adds are slower than usual adds as far as I know. Your implementation uses 2 hadd & 1 add, we can change it to 1 hadd and 2 add:
//v0 = 0 | 1 | 2 | 3 || 4 | 5 | 6 | 7
//v1 = 8 | 9 | a | b || c | d | e | f
//v2 = g | h | i | j || k | l | m | n
//v3 = o | p | q | r || s | t | u | v

v0 = vaddps v0 ,v2
v1 = vaddps v1, v3
//v0 = 0g | 1h | 2i | 3j || 4k | 5l | 6m | 7n
//v1 = 8o | 9p | aq | br || cs | dt | eu | fv

v0 = vhaddps v0, v1
v0 = vmulps v0, 0.25f
//v0 = 01gh|23ij|89op|abqr||45kl|67mn|cdst|efuv

vpermq v0, mask 0,2,1,3
//v0 = 01gh|23ij|45kl|67mn||89op|abqr|cdst|efuv

And it runs almost 50% faster:

Method Mean Error StdDev Ratio
Current 1.962 ns 0.0146 ns 0.0136 ns 1.00
HorizontalAdd_qword 1.497 ns 0.0114 ns 0.0101 ns 0.76
NewImpl 1.018 ns 0.0151 ns 0.0134 ns 0.52

The only thing that bothers me is quad word permutation:

...
Vector256<float> avg2x2 = ...; // average for 4 values
...
var per = Avx2.Permute4x64(Unsafe.As<Vector256<float>, Vector256<double>>(ref avg2x2), 0b11_01_10_00);
return Unsafe.As<Vector256<double>, Vector256<float>>(ref per);

I guess without manual IL code injection (which is not that easy by itself) it's impossible to write it better?

@saucecontrol
Copy link
Contributor

saucecontrol commented Jun 8, 2021

That's great! I had the same thought about the horizontal adds late last night but was too lazy to go back upstairs to my computer 😆

Horizontal add is implemented in microcode in the processor, and it's effectively broken down into a shift (with an order that doesn't map to an instruction exposed publicly), an add, and an unpack. So it's actually only one cycle faster than the current code that does 3 shifts and an add. But yeah, doing the vertical portion first really cleans it up.

And hopefully the smaller code makes it possible to get rid of the code in the outer method that spills those to the stack. (I didn't spend enough time looking at that to figure out what it's doing entirely).

Oh, and you don't need Unsafe.As to cast between vector types. avg2x2.AsDouble() is the same as Unsafe.As<Vector256<float>, Vector256<double>>(ref avg2x2). All the AsXXX methods on the vector types are intrinsics that compile to a no-op.

@br3aker
Copy link
Contributor Author

br3aker commented Jun 8, 2021

Yep, code is much cleaner and smaller with your additions. Didn't know about no-op for those casts.

Thank you for your contribution to this pr, getting smarter everyday! 😄

@br3aker
Copy link
Contributor Author

br3aker commented Jun 10, 2021

@JimBobSquarePants waiting for new commits to be built & checked and we are ready to merge I guess!
If fresh pre/post full encoding benchmarks are needed - will gladly execute and provide them tomorrow.

@JimBobSquarePants
Copy link
Member

JimBobSquarePants commented Jun 10, 2021

@br3aker Looks like you've got a PlatformNotSupportedException around System.Runtime.Intrinsics.X86.Avx2.Permute4x64 on macOS .NET 5.

https://github.com/SixLabors/ImageSharp/pull/1632/checks?check_run_id=2789197791#step:10:256

Edit.
Are we looking at a potential regression in the .NET runtime? Our usage of Permute4x64 appears to be protected by
Avx2.IsSupported and we're building and testing against 5.300 version of the SDK.

dotnet/runtime#43812

@br3aker
Copy link
Contributor Author

br3aker commented Jun 10, 2021

@JimBobSquarePants my bad, did check for Avx, not Avx2. Pushed fixes for this and your comment about SimdUtils class.

Copy link
Member

@JimBobSquarePants JimBobSquarePants left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't see any issues here. @br3aker really, really great work!!

I'll leave it one more day so others can have a final look though before merging though s this is tricky stuff and I might have missed something.

@br3aker
Copy link
Contributor Author

br3aker commented Jun 10, 2021

@JimBobSquarePants fixed disposable thingy in benchmarks & updated them:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-6700K CPU 4.00GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=6.0.100-preview.3.21202.5
  [Host]     : .NET Core 3.1.13 (CoreCLR 4.700.21.11102, CoreFX 4.700.21.11602), X64 RyuJIT  [AttachedDebugger]
  DefaultJob : .NET Core 3.1.13 (CoreCLR 4.700.21.11102, CoreFX 4.700.21.11602), X64 RyuJIT
Method Quality Mean Error StdDev Ratio RatioSD
'System.Drawing Jpeg 4:2:0' 75 30.60 ms 0.496 ms 0.464 ms 1.00 0.00
'ImageSharp Jpeg 4:2:0' 75 29.86 ms 0.350 ms 0.311 ms 0.98 0.02
'ImageSharp Jpeg 4:4:4' 75 45.36 ms 0.899 ms 1.036 ms 1.48 0.05
'System.Drawing Jpeg 4:2:0' 90 34.05 ms 0.669 ms 0.687 ms 1.00 0.00
'ImageSharp Jpeg 4:2:0' 90 37.26 ms 0.706 ms 0.660 ms 1.10 0.03
'ImageSharp Jpeg 4:4:4' 90 52.54 ms 0.579 ms 0.514 ms 1.55 0.04
'System.Drawing Jpeg 4:2:0' 100 39.36 ms 0.267 ms 0.237 ms 1.00 0.00
'ImageSharp Jpeg 4:2:0' 100 42.44 ms 0.410 ms 0.383 ms 1.08 0.01
'ImageSharp Jpeg 4:4:4' 100 70.88 ms 0.508 ms 0.450 ms 1.80 0.02

Already beating System.Drawing in the 75 quality scenario. Looks like 90-100 have a constant delta betweem them. Will work on this on a separate PR. I think this is doable 😄

@br3aker
Copy link
Contributor Author

br3aker commented Jun 10, 2021

Waaaat, first time ever seeing OutOfMemory exception in tests in this PR for... decoding?

@antonfirsov
Copy link
Member

antonfirsov commented Jun 10, 2021

Looks like the ImageMagick reference decoder is also very greedy on memory, don't mind it, nothing to do with your code.

Copy link
Member

@antonfirsov antonfirsov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, nice work!

@JimBobSquarePants
Copy link
Member

Lets get this merged!! 🎉

@JimBobSquarePants JimBobSquarePants merged commit 11e31ed into SixLabors:master Jun 10, 2021
@JimBobSquarePants JimBobSquarePants added this to the 1.1.0 milestone Jun 10, 2021
@br3aker br3aker deleted the jpeg-decode-encode-optimization branch June 15, 2021 08:40
@JimBobSquarePants JimBobSquarePants mentioned this pull request Sep 24, 2021
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants