Jpeg encoding code optimization #1632

br3aker · 2021-05-22T11:23:39Z

Prerequisites

I have written a descriptive pull-request title
I have verified that there are no overlapping pull-requests open
I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
I have provided test coverage for my change (where applicable)

Description

This is a draft due to 4 and some small internal jpeg changes I will adress in a separate comment. Want to hear your thoughts on this.

Jpeg format encoding optimization & code cleanup.

Moved scan encoding logic to a separate class HuffmanScanEncoder following logic of HuffmanScanDecoder in jpeg decoding code (did todo comment from Encode444/Encode420, basically).
Previously, encoder was writing encoded scan byte-by-byte, buffer logic existed but it wasn't used properly. This PR fixes it and introduces new buffer size - 1024, Looks like it's optimal for general use case, lesser size would lead to worse performance, bigger size doesn't bring any performance boost at all (on my machine atleast). Right now it is defined by constant at HuffmanScanEncoder class, it can be injected from the callee/global config but I'm not sure it is really needed.
Introduced 256bit 8x8 matrix forward/inverse DC transforms via avx instruction set.
(This is questionable) Introduced explicit layout for Block8x8F. Most, if not all, avx-intrinsic code use a lot of Unsafe.As casts like here:

var valueVec = Vector256.Create(value);
Unsafe.As<Vector4, Vector256<float>>(ref this.V0L) = Avx.Multiply(Unsafe.As<Vector4, Vector256<float>>(ref this.V0L), valueVec);

With new layout it can be simplified to:

var valueVec = Vector256.Create(value);
this.V0 = Avx.Multiply(this.V0, valueVec);

There's also a performance gain but it's too small to really affect any heavy algorithms like jpeg decoding/encoding.
Note: This PR does not change any existing code with Unsafe.As calls. If it gets merged - this can be a separate codequality issue (good-first-issue maybe?)

Stats

Benchmarks are based on this noise image with different sizes:

Also tested some random images from beeheads demo but they are too large for gaining stable results in a reasonable amount of time :P
Even so, got ~100-200ms performance gain on those.

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-6700K CPU 4.00GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=6.0.100-preview.3.21202.5

Jpeg encoding to MemoryStream (disk writes are less stable):

Method	Mean	Size	Iterations
Current	3 ms	200x200	250
Current	11 ms	400x400	250
Current	44 ms	800x800	250
Current	177 ms	1600x1600	250
New	2 ms	200x200	250
New	8 ms	400x400	250
New	32 ms	800x800	250
New	133 ms	1600x1600	250

Avx accelerated cosine transforms:

Method	Mean	Error	StdDev	Ratio
TransformFDCT	64.43 ns	0.541 ns	0.506 ns	1.00
TransformFDCT_intrinsics	50.24 ns	0.361 ns	0.320 ns	0.78

Method	Mean	Error	StdDev	Ratio
TransformIDCT	75.21 ns	0.382 ns	0.339 ns	1.00
TransformIDCT_intrinsics	58.96 ns	0.306 ns	0.271 ns	0.78

Improved performance, no need for Unsafe calls.

…bility for FDCT8x4 calls using FDCT8x8(ref Block8x8F, ref Block8x8F) method

…t file

…ant instead of a 'magic value'

…nstant instead of a 'magic value'

… stream writes

…ce gain

…ations

This reverts commit fbf0ff1.

This reverts commit 20236b8.

…ds, added debug assertion

saucecontrol

Really nice improvements! Left a comment on the SIMD 2:1 box filter if you want to push it a bit further

src/ImageSharp/Formats/Jpeg/Components/Encoder/RgbToYCbCrConverterVectorized.cs

br3aker · 2021-06-08T16:25:53Z

@saucecontrol did a little research about 8x4 -> 4x2 downscaling based on your ideas.

A lot of total method time was spent on the final permutation. I guess line crossing was a thing, your approach with vpermq works 25% faster:

Method	Mean	Error	StdDev	Ratio
Shuffle_vec	1.962 ns	0.0146 ns	0.0136 ns	1.00
Shuffle_qword	1.485 ns	0.0124 ns	0.0110 ns	0.76

Note that this benchmark uses current add(shuffle, shuffle) implementation, not horizontal add.

vhaddps doesn't really give any performance boost:

Method	Mean	Error	StdDev	Ratio
Shuffle_vec	1.962 ns	0.0146 ns	0.0136 ns	1.00
HorizontalAdd_vec	1.863 ns	0.0148 ns	0.0131 ns	0.95
Shuffle_qword	1.485 ns	0.0124 ns	0.0110 ns	0.76
HorizontalAdd_qword	1.497 ns	0.0114 ns	0.0101 ns	0.76

vpermq gives a little to no performance boost but it reduces code by a lot, it completely eliminates 2 custom methods and looks cleaner!

Horizontal adds are slower than usual adds as far as I know. Your implementation uses 2 hadd & 1 add, we can change it to 1 hadd and 2 add:

//v0 = 0 | 1 | 2 | 3 || 4 | 5 | 6 | 7
//v1 = 8 | 9 | a | b || c | d | e | f
//v2 = g | h | i | j || k | l | m | n
//v3 = o | p | q | r || s | t | u | v

v0 = vaddps v0 ,v2
v1 = vaddps v1, v3
//v0 = 0g | 1h | 2i | 3j || 4k | 5l | 6m | 7n
//v1 = 8o | 9p | aq | br || cs | dt | eu | fv

v0 = vhaddps v0, v1
v0 = vmulps v0, 0.25f
//v0 = 01gh|23ij|89op|abqr||45kl|67mn|cdst|efuv

vpermq v0, mask 0,2,1,3
//v0 = 01gh|23ij|45kl|67mn||89op|abqr|cdst|efuv

And it runs almost 50% faster:

Method	Mean	Error	StdDev	Ratio
Current	1.962 ns	0.0146 ns	0.0136 ns	1.00
HorizontalAdd_qword	1.497 ns	0.0114 ns	0.0101 ns	0.76
NewImpl	1.018 ns	0.0151 ns	0.0134 ns	0.52

The only thing that bothers me is quad word permutation:

...
Vector256<float> avg2x2 = ...; // average for 4 values
...
var per = Avx2.Permute4x64(Unsafe.As<Vector256<float>, Vector256<double>>(ref avg2x2), 0b11_01_10_00);
return Unsafe.As<Vector256<double>, Vector256<float>>(ref per);

I guess without manual IL code injection (which is not that easy by itself) it's impossible to write it better?

saucecontrol · 2021-06-08T18:27:03Z

That's great! I had the same thought about the horizontal adds late last night but was too lazy to go back upstairs to my computer 😆

Horizontal add is implemented in microcode in the processor, and it's effectively broken down into a shift (with an order that doesn't map to an instruction exposed publicly), an add, and an unpack. So it's actually only one cycle faster than the current code that does 3 shifts and an add. But yeah, doing the vertical portion first really cleans it up.

And hopefully the smaller code makes it possible to get rid of the code in the outer method that spills those to the stack. (I didn't spend enough time looking at that to figure out what it's doing entirely).

Oh, and you don't need Unsafe.As to cast between vector types. avg2x2.AsDouble() is the same as Unsafe.As<Vector256<float>, Vector256<double>>(ref avg2x2). All the AsXXX methods on the vector types are intrinsics that compile to a no-op.

br3aker · 2021-06-08T20:04:12Z

Yep, code is much cleaner and smaller with your additions. Didn't know about no-op for those casts.

Thank you for your contribution to this pr, getting smarter everyday! 😄

br3aker · 2021-06-10T01:10:50Z

@JimBobSquarePants waiting for new commits to be built & checked and we are ready to merge I guess!
If fresh pre/post full encoding benchmarks are needed - will gladly execute and provide them tomorrow.

JimBobSquarePants · 2021-06-10T01:32:32Z

@br3aker Looks like you've got a PlatformNotSupportedException around System.Runtime.Intrinsics.X86.Avx2.Permute4x64 on macOS .NET 5.

https://github.com/SixLabors/ImageSharp/pull/1632/checks?check_run_id=2789197791#step:10:256

Edit.
Are we looking at a potential regression in the .NET runtime? Our usage of Permute4x64 appears to be protected by
Avx2.IsSupported and we're building and testing against 5.300 version of the SDK.

dotnet/runtime#43812

src/ImageSharp/Common/Helpers/SimdUtils.HwIntrinsics.cs

br3aker · 2021-06-10T02:20:09Z

@JimBobSquarePants my bad, did check for Avx, not Avx2. Pushed fixes for this and your comment about SimdUtils class.

JimBobSquarePants

I can't see any issues here. @br3aker really, really great work!!

I'll leave it one more day so others can have a final look though before merging though s this is tricky stuff and I might have missed something.

tests/ImageSharp.Benchmarks/Codecs/Jpeg/EncodeJpeg.cs

br3aker · 2021-06-10T14:30:03Z

@JimBobSquarePants fixed disposable thingy in benchmarks & updated them:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-6700K CPU 4.00GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=6.0.100-preview.3.21202.5
  [Host]     : .NET Core 3.1.13 (CoreCLR 4.700.21.11102, CoreFX 4.700.21.11602), X64 RyuJIT  [AttachedDebugger]
  DefaultJob : .NET Core 3.1.13 (CoreCLR 4.700.21.11102, CoreFX 4.700.21.11602), X64 RyuJIT

Method	Quality	Mean	Error	StdDev	Ratio	RatioSD
'System.Drawing Jpeg 4:2:0'	75	30.60 ms	0.496 ms	0.464 ms	1.00	0.00
'ImageSharp Jpeg 4:2:0'	75	29.86 ms	0.350 ms	0.311 ms	0.98	0.02
'ImageSharp Jpeg 4:4:4'	75	45.36 ms	0.899 ms	1.036 ms	1.48	0.05

'System.Drawing Jpeg 4:2:0'	90	34.05 ms	0.669 ms	0.687 ms	1.00	0.00
'ImageSharp Jpeg 4:2:0'	90	37.26 ms	0.706 ms	0.660 ms	1.10	0.03
'ImageSharp Jpeg 4:4:4'	90	52.54 ms	0.579 ms	0.514 ms	1.55	0.04

'System.Drawing Jpeg 4:2:0'	100	39.36 ms	0.267 ms	0.237 ms	1.00	0.00
'ImageSharp Jpeg 4:2:0'	100	42.44 ms	0.410 ms	0.383 ms	1.08	0.01
'ImageSharp Jpeg 4:4:4'	100	70.88 ms	0.508 ms	0.450 ms	1.80	0.02

Already beating System.Drawing in the 75 quality scenario. Looks like 90-100 have a constant delta betweem them. Will work on this on a separate PR. I think this is doable 😄

br3aker · 2021-06-10T14:42:57Z

Waaaat, first time ever seeing OutOfMemory exception in tests in this PR for... decoding?

antonfirsov · 2021-06-10T14:48:23Z

Looks like the ImageMagick reference decoder is also very greedy on memory, don't mind it, nothing to do with your code.

antonfirsov

LGTM, nice work!

JimBobSquarePants · 2021-06-10T16:20:58Z

Lets get this merged!! 🎉

Dmitry Pentin added 30 commits May 18, 2021 12:03

Block8x8F explicit layout & 256bit rows support

7229dbf

Block8x8F.MultiplyInPlace no longer use unsafe casts

fbf0ff1

Improved performance, no need for Unsafe calls.

Block8x8F.TransposeInto no longer uses unsafe casts (partially)

20236b8

Implemented FDCT8x8 using avx instruction set, added backward compati…

e5188fe

…bility for FDCT8x4 calls using FDCT8x8(ref Block8x8F, ref Block8x8F) method

Implemented IDCT algorithm with avx/fma, move IDCT code to a differen…

513e86a

…t file

Fixed "constant" vectors naming

81c21e5

RgbToYCbCrConverterLut.Convert main loop routine now uses named const…

9bf9644

…ant instead of a 'magic value'

LuminanceForwardConverter.Convert main loop routine now uses named co…

347ac36

…nstant instead of a 'magic value'

WriteDefineHuffmanTables(...) no longer relies on external buffer for…

86a6d8b

… stream writes

[WIP] Partially moved encoding logic to a separate class

f001755

Removed write buffer parameter injection

d91fc40

[WIP] Moved SOS writing logic to separate class

66b5a8d

Removed unrelevant code from JpegDecoderCore

0d7e4b1

Removed remaining unrelevant code from JpegEncoderCore

d593479

Optimized jpeg encoder stream Write calls but a lot -> huge performan…

296ee10

…ce gain

Removed obsolete parameter config from various methods

56822d1

YCbCrEncoder now has builtin temporal 8x8F blocks for internal calcul…

690e80c

…ations

Updated & fixed xml documentation

b3a9938

Small QoL fixes

4e73471

Moved quantization table initialization logic to JpegEncoderCore

368f89e

Fixed comments

9d7adb6

Renamed YCbCrEncoder to HuffmanScanEncoder as it is in decoding logic

3380bdf

Moved encode method choice to the JpegEncoderCore

7e0a317

Fixed unresolved reference this.colorType

1b1d136

Added QoL throw helper method for jpeg w/h size check before encoding

5b05a0a

Moved end of image marker writing code to a separate method

84a143d

Rolled back to initial JpegEncoderCore options implementation.

d4fa8b2

Revert "Block8x8F.MultiplyInPlace no longer use unsafe casts"

980f2d2

This reverts commit fbf0ff1.

Revert "Block8x8F.TransposeInto no longer uses unsafe casts (partially)"

f1886ad

This reverts commit 20236b8.

Made DCT code prettier with SimdUtils, added summary to 8x8 dct metho…

a8f717d

…ds, added debug assertion

Removed obsolete code

0e07a8e

saucecontrol reviewed Jun 8, 2021

View reviewed changes

src/ImageSharp/Formats/Jpeg/Components/Encoder/RgbToYCbCrConverterVectorized.cs Outdated Show resolved Hide resolved

Dmitry Pentin added 2 commits June 9, 2021 15:49

Optimized vector rgb pixel matrix scaling

0013c54

Added tests for vector rgb pixel matrix scaling

35daf21

Fixed build error due to invalid using

121d1fa

JimBobSquarePants reviewed Jun 10, 2021

View reviewed changes

src/ImageSharp/Common/Helpers/SimdUtils.HwIntrinsics.cs Outdated Show resolved Hide resolved

Dmitry Pentin added 3 commits June 10, 2021 05:09

Moved jpeg matrix scaler to jpeg converter

20a0d84

Moved jpeg converter scaler tests to to jpeg converter tests

6d4e2ee

Fixed invalid curly braces, added debug Avx2 check

ce1d992

JimBobSquarePants approved these changes Jun 10, 2021

View reviewed changes

tests/ImageSharp.Benchmarks/Codecs/Jpeg/EncodeJpeg.cs Show resolved Hide resolved

Dmitry Pentin added 2 commits June 10, 2021 17:26

Improved benchmark for jpeg encoder

8bbcd65

Updated benchmark results

ab8ed08

antonfirsov approved these changes Jun 10, 2021

View reviewed changes

JimBobSquarePants merged commit 11e31ed into SixLabors:master Jun 10, 2021

JimBobSquarePants added this to the 1.1.0 milestone Jun 10, 2021

antonfirsov mentioned this pull request Jun 13, 2021

StyleCop warnings no longer fail the build #1658

Closed

4 tasks

br3aker deleted the jpeg-decode-encode-optimization branch June 15, 2021 08:40

JimBobSquarePants mentioned this pull request Sep 24, 2021

Jpeg encoder optimization #1761

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Jpeg encoding code optimization #1632

Jpeg encoding code optimization #1632

br3aker commented May 22, 2021 •

edited

Loading

saucecontrol left a comment

br3aker commented Jun 8, 2021 •

edited

Loading

saucecontrol commented Jun 8, 2021 •

edited

Loading

br3aker commented Jun 8, 2021 •

edited

Loading

br3aker commented Jun 10, 2021

JimBobSquarePants commented Jun 10, 2021 •

edited

Loading

br3aker commented Jun 10, 2021

JimBobSquarePants left a comment

br3aker commented Jun 10, 2021 •

edited

Loading

br3aker commented Jun 10, 2021

antonfirsov commented Jun 10, 2021 •

edited

Loading

antonfirsov left a comment

JimBobSquarePants commented Jun 10, 2021

Jpeg encoding code optimization #1632

Jpeg encoding code optimization #1632

Conversation

br3aker commented May 22, 2021 • edited Loading

Prerequisites

Description

Stats

saucecontrol left a comment

Choose a reason for hiding this comment

br3aker commented Jun 8, 2021 • edited Loading

saucecontrol commented Jun 8, 2021 • edited Loading

br3aker commented Jun 8, 2021 • edited Loading

br3aker commented Jun 10, 2021

JimBobSquarePants commented Jun 10, 2021 • edited Loading

br3aker commented Jun 10, 2021

JimBobSquarePants left a comment

Choose a reason for hiding this comment

br3aker commented Jun 10, 2021 • edited Loading

br3aker commented Jun 10, 2021

antonfirsov commented Jun 10, 2021 • edited Loading

antonfirsov left a comment

Choose a reason for hiding this comment

JimBobSquarePants commented Jun 10, 2021

br3aker commented May 22, 2021 •

edited

Loading

br3aker commented Jun 8, 2021 •

edited

Loading

saucecontrol commented Jun 8, 2021 •

edited

Loading

br3aker commented Jun 8, 2021 •

edited

Loading

JimBobSquarePants commented Jun 10, 2021 •

edited

Loading

br3aker commented Jun 10, 2021 •

edited

Loading

antonfirsov commented Jun 10, 2021 •

edited

Loading