Jpeg encoder optimization #1761

br3aker · 2021-09-12T22:20:25Z

Prerequisites

I have written a descriptive pull-request title
I have verified that there are no overlapping pull-requests open
I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
I have provided test coverage for my change (where applicable)

This PR has some performance tweaks & optimizations:

Quantization

Fixed Spectral blocks rounding errors in jpeg encoder #1751
Added SIMD support for zig-zag ordering (encoding only)
Quantization now uses reciprocal tables for a multiplication op instead of a division

Benchmark (it's included in the PR):

Method	Job	Mean	Error	StdDev	Ratio
Quantize	No HwIntrinsics	73.34 ns	1.081 ns	1.011 ns	1.00
Quantize	SSE	24.11 ns	0.298 ns	0.279 ns	0.33
Quantize	AVX	15.90 ns	0.074 ns	0.065 ns	0.22
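
For readers unfamiliar with the reciprocal trick mentioned in the list above, here is a minimal scalar sketch. It is not the code from this PR: the class and method names are hypothetical, the rounding mode is an assumption, and the real implementation works on Block8x8F with SIMD. The idea is only that 1/q is computed once per table, so each coefficient costs a multiplication instead of a division.

```csharp
using System;

// Hypothetical illustration only; names are not taken from ImageSharp.
public static class ReciprocalQuantizationSketch
{
    // Precompute 1/q for every table entry. This runs once per quantization table.
    public static float[] BuildReciprocals(ReadOnlySpan<float> table)
    {
        var reciprocals = new float[table.Length];
        for (int i = 0; i < table.Length; i++)
        {
            reciprocals[i] = 1f / table[i];
        }

        return reciprocals;
    }

    // Quantize by multiplying with the precomputed reciprocals.
    // Rounding half away from zero is an assumption made for this sketch.
    public static void Quantize(
        ReadOnlySpan<float> coefficients,
        ReadOnlySpan<float> reciprocals,
        Span<short> destination)
    {
        for (int i = 0; i < coefficients.Length; i++)
        {
            float scaled = coefficients[i] * reciprocals[i];
            destination[i] = (short)MathF.Round(scaled, MidpointRounding.AwayFromZero);
        }
    }
}
```

Multiplying by a rounded reciprocal can differ from a true division in the last bit of the float, which only matters when the quotient sits almost exactly on a rounding boundary.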

FDCT

Replaced the current implementation with a port of the libjpeg-turbo scalar implementation
Added a SIMD version of that scalar implementation

Benchmark:

Method	Mean	Error	StdDev	Ratio
Master	36.27 ns	0.255 ns	0.226 ns	1.00
PR	30.32 ns	0.115 ns	0.108 ns	0.84

Huffman Encoding

Completely reworked the encoding logic: fewer if checks and fewer bit shifts and bitwise ANDs
Small fixes here and there
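
To give reviewers a picture of what entropy-coded bit emission involves, here is a generic sketch of an MSB-first bit accumulator with JPEG byte stuffing (a 0x00 after every 0xFF) and 1-bit padding on flush. It is an assumption-laden illustration, not the encoder from this PR: the class name, the 24-bit limit per call, and the List<byte> sink are all hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only; not the PR's HuffmanScanEncoder.
public sealed class BitWriterSketch
{
    private readonly List<byte> output = new List<byte>();
    private uint accumulator; // pending bits, left-aligned so the oldest bit is the MSB
    private int bitCount;     // number of pending bits, always below 8 between calls

    // Appends the lowest 'length' bits of 'bits'. Assumes higher bits of 'bits' are zero.
    public void Emit(uint bits, int length)
    {
        if (length < 1 || length > 24)
        {
            throw new ArgumentOutOfRangeException(nameof(length));
        }

        // Place the new bits directly below the bits that are already pending.
        this.accumulator |= bits << (32 - this.bitCount - length);
        this.bitCount += length;

        // Write out every complete byte, most significant byte first.
        while (this.bitCount >= 8)
        {
            byte value = (byte)(this.accumulator >> 24);
            this.output.Add(value);
            if (value == 0xFF)
            {
                // JPEG byte stuffing: 0xFF in entropy-coded data is followed by 0x00.
                this.output.Add(0x00);
            }

            this.accumulator <<= 8;
            this.bitCount -= 8;
        }
    }

    // Pads the last partial byte with 1 bits and returns everything written so far.
    public byte[] Flush()
    {
        if (this.bitCount > 0)
        {
            int padding = 8 - this.bitCount;
            this.Emit((1u << padding) - 1, padding);
        }

        return this.output.ToArray();
    }
}
```

Keeping the pending bits left-aligned means each call needs one shift and one OR before the flush loop, which is the kind of saving the first bullet above refers to in spirit.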

Benchmark

It's really hard to benchmark end-to-end image encoding/decoding with BenchmarkDotNet, so I wrote custom code that encodes a JPEG into a MemoryStream for a fixed number of iterations (300 for the results below):

// ycbcr 4:4:4
q=100
Master: 26,56ms
PR:     20,22ms
q=90
Master: 19,27ms
PR:     13,95ms
q=75
Master: 18,38ms
PR:     13,45ms
q=50
Master: 16,86ms
PR:     12,56ms

// ycbcr 4:2:0
q=100
Master: 19,41ms
PR:     15,07ms
q=90
Master: 14,56ms
PR:     11,07ms
q=75
Master: 14,05ms
PR:     10,76ms
q=50
Master: 12,67ms
PR:     9,8ms

// luminance only
q=100
Master: 18,59ms
PR:     14,79ms
q=90
Master: 14,62ms
PR:     11,66ms
q=75
Master: 14,23ms
PR:     11,41ms
q=50
Master: 12,98ms
PR:     10,39ms

JimBobSquarePants · 2021-09-27T11:03:02Z

@br3aker Thanks for the excellent explanation. I forgot we were dealing with bits not bytes! Agreed re comments, please add more.

src/ImageSharp/Formats/Jpeg/Components/FastFloatingPointDCT.Intrinsic.cs

br3aker · 2021-09-28T20:46:42Z

Hope this passes all tests. I fixed almost everything except the comments, so you can already review the Vector4 FDCT code; I will add comments tomorrow.

Good news: the new scalar transpose implementation is faster than the current one and does not rely on the Vector4 API:

Method	Job	Mean	Error	StdDev	Ratio
OLD TransposeInto	No HwIntrinsics	14.558 ns	0.0834 ns	0.0739 ns	1.00
NEW TransposeInplace	No HwIntrinsics	12.531 ns	0.0637 ns	0.0565 ns	0.86
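
As a rough picture of what a scalar in-place transpose looks like, here is a minimal sketch over a flat 64-element span. It is only an assumption about the general approach (swapping mirrored elements across the diagonal); the actual TransposeInplace in the PR operates on Block8x8F and may be unrolled or written differently. All names below are hypothetical.

```csharp
using System;

// Hypothetical illustration only; not the Block8x8F method from the PR.
public static class TransposeSketch
{
    private const int Size = 8;

    // Transposes a row-major 8x8 matrix stored in 64 consecutive floats.
    public static void TransposeInPlace(Span<float> block)
    {
        if (block.Length != Size * Size)
        {
            throw new ArgumentException("Expected 64 elements.", nameof(block));
        }

        // Visit only the elements below the diagonal and swap each one
        // with its mirror above the diagonal. The diagonal stays as is.
        for (int row = 1; row < Size; row++)
        {
            for (int col = 0; col < row; col++)
            {
                int lower = (row * Size) + col;
                int upper = (col * Size) + row;
                float temp = block[lower];
                block[lower] = block[upper];
                block[upper] = temp;
            }
        }
    }
}
```

Working in place needs no second block as a destination, and the loop uses only scalar loads and stores, so it does not depend on Vector4.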

antonfirsov

A few nits left, otherwise this is good to merge:

Address this point
Add some comments.
Figure out what to do with the underscores in the names.

antonfirsov · 2021-09-29T12:40:04Z

src/ImageSharp/Formats/Jpeg/Components/FastFloatingPointDCT.Intrinsic.cs

+ /// Requires Avx support.
+ /// </remarks>
+ /// <param name="block">Input matrix.</param>
+ public static void FDCT8x8_Avx(ref Block8x8F block)


@JimBobSquarePants I don't want to be the bad cop blocking the PR on the underscore stuff in these names, because I find it more readable in situations like this, but I think some StyleCop analyzer fails to kick in here.

What are your recommendations to proceed?

Underscores are ok in tests, lowercase, never.

I thought an underscore followed by an uppercase letter was the answer? FDCT8x8Avx is unreadable imo. This is internal stuff only used inside the 'main' FDCT method, so an underscore may be a good separator for SIMD implementations. Anyway, underscores in DCT method names existed long before this PR:

ImageSharp/src/ImageSharp/Formats/Jpeg/Components/FastFloatingPointDCT.cs

Line 386 in 7506c9d

public static void IDCT8x4_RightPart(ref Block8x8F s, ref Block8x8F d)

Removing 8x8 from the name to get FdctAvx is not an option, because we already have an 8x4 FDCT for SSE, and a possible future JPEG XL implementation would have variable-size FDCTs.

There's no way I'm blocking this on naming. Happy to make an exception.

antonfirsov

OK, I think we can keep this open for a few more days, so @br3aker if you have the time and interest to address the remaining nits, good; if not, we'll merge as is :)

br3aker · 2021-09-29T21:17:35Z

@antonfirsov I will fix everything you pointed out, please don't merge before that! :)
Just had a very busy day today.

saucecontrol · 2021-09-30T05:15:28Z

I'm late to the party here, but I just want to say this is really great work @br3aker!

br3aker · 2021-10-01T19:59:10Z

src/ImageSharp/Formats/Jpeg/Components/FastFloatingPointDCT.cs

+ if (Vector.IsHardwareAccelerated)
+ {
+ ForwardTransform_Vector4(ref block);
+ }


Note: code coverage is worse than before because of this path; it simply can't be checked against the SSE call from the remote executor.
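
To illustrate why a single test run cannot cover every branch of a hardware-dependent method, here is a small sketch of such a dispatch. The branch order and the returned labels are assumptions made for illustration and do not mirror the PR's FDCT method; only Avx.IsSupported and Vector.IsHardwareAccelerated are real .NET properties.

```csharp
using System.Numerics;
using System.Runtime.Intrinsics.X86;

// Hypothetical illustration only; the real dispatch lives in FastFloatingPointDCT.
public static class FdctDispatchSketch
{
    // Returns a label for the path that would run on the current machine.
    // On any given machine and runtime configuration exactly one branch is taken,
    // so the other branches stay uncovered in that run.
    public static string SelectPath()
    {
        if (Avx.IsSupported)
        {
            return "avx";
        }

        if (Vector.IsHardwareAccelerated)
        {
            return "vector4";
        }

        return "scalar";
    }
}
```

Each implementation can still be called directly from its own test, but the selecting method itself only ever exercises the branch that matches the host.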

antonfirsov

Latest changes looking good. Anything else left or shall we merge?

JimBobSquarePants · 2021-10-02T00:51:28Z

I’m happy if you’re happy!

antonfirsov · 2021-10-02T01:01:38Z

I meant whether @br3aker has something else in mind. (Most likely not, but I want to be sure.)

br3aker · 2021-10-02T09:49:22Z

I guess it's done. Thanks everyone for the contribution to this!

antonfirsov · 2021-10-02T11:24:01Z

@br3aker thanks again for the great work!

antonfirsov · 2021-10-02T11:24:24Z

@JimBobSquarePants I'm not a big fan of the following:

Requires us to use admin rights to merge the PR, although there is nothing wrong with the test coverage in reality.

JimBobSquarePants · 2021-10-02T11:27:19Z

Admin rights are fine IMO. It’s rare that coverage is an issue and forces us to sense check. We’re already really disciplined but I still feel that we should have rules in place.

antonfirsov · 2021-10-02T11:31:10Z

Fine then. I just don't want to get into a position like Hungary's ruling party, which changes the Constitution every time some little thing is in its way.

br3aker · 2021-10-02T11:32:26Z

I'm not sure the coverage regression can be fixed: we can test the different implementations explicitly in separate tests, but the main FDCT method with its hardware-dependent paths won't be fully covered.

JimBobSquarePants · 2021-10-02T11:35:31Z

I tend to take the coverage report as a guide not an absolute since it at times seems wildly inaccurate. We should never chase exact coverage anyway just be aware of serious regression. The manual step helps this for me (since I get excited about perf)

antonfirsov · 2021-10-02T11:47:49Z

My main concern is that with this level of inaccuracy, we can get used to ignoring the check-in gates, which we should never do in case of real issues like test failures.

Dmitry Pentin added 30 commits August 17, 2021 12:24

Moved stuff bytes injection to outer method

e83cb95

Optimized byte emition, ouput images are corrupted due to msb-lsb invalid order

739f520

Fixed byte flush order, fixed last byte padding

8a08259

Greatly reduced operations per emit call

4c14c57

Merged huffman prefix & value Emit() calls

c39a203

Sandbox code & results

93044e4

Fixed last valuable index logic

cc45eed

Optimized lvi calculation via lzcnt intrinsic

937a868

Sandbox code & results

f9b36e7

Removed unused methods & constructor, fixed warnings

787ffa5

Added sse/avx vector fields to the Block8x8, small QOL fixes

a75d6e6

8x8 matrices small fixes

2bccda8

Fixed last stream flush

8098e8e

Fixed lvi

e5fec97

Docs, fixes, added support for other subsamples/color types

81349f2

New zig-zag implementation

6c5cf28

Removed obsolete code, tests cleanup

a220b3d

Added DCT in place

cc99da3

Update sandbox

839da83

1

e3d3280

Merge branch 'master' into jpeg-encoder-optimization

7e4aa46

Fixed switch for color type

81204d3

Fixed failing tests

7a21a88

Fixed sandbox

4d58866

Slightly improved tiff decoding with jpeg data, removed unnecessary GC pressure

0b55bed

Fixed sandbox

17ca003

Rolled back to original implementation for rounding via scalar code

ea09d59

New FDCT method, reciprocal quantization

2f143bf

Tidied up DCT code

fb038aa

Removed excess code, added benchmarks

9973e8d

antonfirsov reviewed Sep 27, 2021

src/ImageSharp/Formats/Jpeg/Components/FastFloatingPointDCT.Intrinsic.cs

Dmitry Pentin added 6 commits September 28, 2021 18:54

Naming fix & simd else if branch

6532552

DCT fixes, ifdef & accessor

7831caa

Naming fix

dce87fe

Improved scalar transpose implementation

e4b32db

FDCT sse path via Vector4

bd9f06f

FDCT fma usage

e9eaa52

antonfirsov reviewed Sep 29, 2021

antonfirsov approved these changes Sep 29, 2021

Dmitry Pentin added 3 commits October 1, 2021 22:35

Docs

4ff2984

Quant table adjustment method

aae451c

Access modifier fix

2dfbff5

br3aker commented Oct 1, 2021

antonfirsov approved these changes Oct 2, 2021

antonfirsov merged commit 2f903c7 into SixLabors:master Oct 2, 2021

br3aker deleted the jpeg-encoder-optimization branch October 5, 2021 12:39
