Jpeg encoder optimization #1761

Merged: 54 commits, Oct 2, 2021

Contributor

@br3aker br3aker commented Sep 12, 2021

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
  • I have provided test coverage for my change (where applicable)

This PR has some performance tweaks & optimizations:

Quantization

  1. Fixed Spectral blocks rounding errors in jpeg encoder #1751
  2. Added SIMD support for zig-zag ordering (encoding only)
  3. Quantization now multiplies by precomputed reciprocal tables instead of dividing

Benchmark (it's included in the PR):

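| Method   | Job             | Mean     | Error    | StdDev   | Ratio |
|----------|-----------------|----------|----------|----------|-------|
| Quantize | No HwIntrinsics | 73.34 ns | 1.081 ns | 1.011 ns | 1.00  |
| Quantize | SSE             | 24.11 ns | 0.298 ns | 0.279 ns | 0.33  |
| Quantize | AVX             | 15.90 ns | 0.074 ns | 0.065 ns | 0.22  |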
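
For readers unfamiliar with the reciprocal-table idea in item 3, here is a minimal scalar sketch. It is not the PR's code: the class name, method names, table layout, and rounding mode are all assumptions chosen for illustration. The point is only that `1/q` is computed once per table, so the per-coefficient work becomes a multiplication.

```csharp
using System;

// Hypothetical illustration only; names and rounding mode are assumptions,
// not the API or behaviour of the PR.
public static class ReciprocalQuantizer
{
    // Compute 1/q once per quantization table.
    public static float[] BuildReciprocals(ReadOnlySpan<byte> quantTable)
    {
        var reciprocals = new float[quantTable.Length];
        for (int i = 0; i < quantTable.Length; i++)
        {
            reciprocals[i] = 1f / quantTable[i];
        }

        return reciprocals;
    }

    // Quantize by multiplying with the reciprocal instead of dividing by q.
    public static void Quantize(
        ReadOnlySpan<float> coefficients,
        ReadOnlySpan<float> reciprocals,
        Span<short> destination)
    {
        for (int i = 0; i < coefficients.Length; i++)
        {
            float scaled = coefficients[i] * reciprocals[i];

            // Round half away from zero so positive and negative values are treated symmetrically.
            destination[i] = (short)MathF.Round(scaled, MidpointRounding.AwayFromZero);
        }
    }
}
```

Note that multiplying by a floating-point reciprocal is not always bit-identical to dividing, so a change like this is worth validating against the division-based result.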

FDCT

  1. Replaced the current implementation with a port of the libjpeg-turbo scalar implementation
  2. Added a SIMD version of that scalar implementation

Benchmark:

| Method | Mean     | Error    | StdDev   | Ratio |
|--------|----------|----------|----------|-------|
| Master | 36.27 ns | 0.255 ns | 0.226 ns | 1.00  |
| PR     | 30.32 ns | 0.115 ns | 0.108 ns | 0.84  |

Huffman Encoding

  1. Completely reworked the encoding logic: fewer if checks, fewer bit shifts, and fewer bitwise ANDs
  2. Small fixes here and there

Benchmark

It's really hard to benchmark end-to-end image encoding/decoding with BenchmarkDotNet, so I wrote some custom code that encodes a JPEG into a MemoryStream for a fixed number of iterations (300 for the results below):

// ycbcr 4:4:4
q=100
Master: 26,56ms
PR:     20,22ms
q=90
Master: 19,27ms
PR:     13,95ms
q=75
Master: 18,38ms
PR:     13,45ms
q=50
Master: 16,86ms
PR:     12,56ms

// ycbcr 4:2:0
q=100
Master: 19,41ms
PR:     15,07ms
q=90
Master: 14,56ms
PR:     11,07ms
q=75
Master: 14,05ms
PR:     10,76ms
q=50
Master: 12,67ms
PR:     9,8ms

// luminance only
q=100
Master: 18,59ms
PR:     14,79ms
q=90
Master: 14,62ms
PR:     11,66ms
q=75
Master: 14,23ms
PR:     11,41ms
q=50
Master: 12,98ms
PR:     10,39ms
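
The custom benchmark code itself is not shown in this conversation. A fixed-iteration timing loop of the kind described could look roughly like the sketch below. Everything in it is an assumption: the `EncodeTimer` name, the warm-up call, the reuse of one `MemoryStream`, and reporting a mean per iteration (the thread does not say whether the figures above are per-iteration or total times).

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical harness; not the code used to produce the numbers above.
public static class EncodeTimer
{
    // Calls `encode` once as a warm-up, then `iterations` more times,
    // and returns the mean wall-clock time per timed call in milliseconds.
    public static double MeasureMeanMilliseconds(Action<Stream> encode, int iterations)
    {
        if (iterations <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(iterations));
        }

        using var stream = new MemoryStream();

        // Warm-up call so one-time JIT compilation is not included in the timing.
        encode(stream);

        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            // Rewind and truncate so every iteration writes into an empty stream.
            stream.Position = 0;
            stream.SetLength(0);
            encode(stream);
        }

        stopwatch.Stop();
        return stopwatch.Elapsed.TotalMilliseconds / iterations;
    }
}
```

The `encode` delegate would wrap whatever call saves the image as JPEG into the given stream, with the quality and subsampling settings under test.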

@JimBobSquarePants
Member

@br3aker Thanks for the excellent explanation. I forgot we were dealing with bits not bytes! Agreed re comments, please add more.

@br3aker
Contributor Author

br3aker commented Sep 28, 2021

Hope this passes all tests. I fixed almost everything except adding comments, so you can review the Vector4 FDCT code; I will add comments tomorrow.

Good news: the new scalar transpose implementation is faster than the current one and does not rely on the Vector4 API:

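| Method               | Job             | Mean      | Error     | StdDev    | Ratio |
|----------------------|-----------------|-----------|-----------|-----------|-------|
| OLD TransposeInto    | No HwIntrinsics | 14.558 ns | 0.0834 ns | 0.0739 ns | 1.00  |
| NEW TransposeInplace | No HwIntrinsics | 12.531 ns | 0.0637 ns | 0.0565 ns | 0.86  |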
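
As a point of reference, a plain scalar in-place transpose of an 8x8 block can be sketched as below. This is a generic textbook version over a flat, row-major 64-element span, not the PR's `TransposeInplace`; the class name and the span-based signature are assumptions.

```csharp
using System;

// Hypothetical illustration; the PR operates on its own block type, not on a span.
public static class Transpose8x8
{
    private const int Size = 8;

    // Swaps each element above the main diagonal with its mirror below it.
    public static void TransposeInPlace(Span<float> block)
    {
        if (block.Length != Size * Size)
        {
            throw new ArgumentException("Expected 64 elements in row-major order.", nameof(block));
        }

        for (int row = 0; row < Size; row++)
        {
            for (int col = row + 1; col < Size; col++)
            {
                int upper = (row * Size) + col;
                int lower = (col * Size) + row;

                float temp = block[upper];
                block[upper] = block[lower];
                block[lower] = temp;
            }
        }
    }
}
```

Working in place avoids a second destination block, at the cost of overwriting the input.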

Member

@antonfirsov antonfirsov left a comment

A few nits left, otherwise this is good to merge:

  • Address this point
  • Add some comments.
  • Figure out what to do with the underscores in the names.

/// Requires Avx support.
/// </remarks>
/// <param name="block">Input matrix.</param>
public static void FDCT8x8_Avx(ref Block8x8F block)
Member

@JimBobSquarePants I don't want to be the bad cop blocking the PR on the underscore stuff in these names, because I find it more readable in situations like this, but I think some StyleCop analyzer fails to kick in here.

What are your recommendations to proceed?

Contributor Author

@br3aker br3aker Sep 29, 2021

Underscores are ok in tests, lowercase, never.

I thought an underscore followed by an uppercase letter was the answer? FDCT8x8Avx is unreadable IMO. This is internal code used only inside the 'main' FDCT method, so an underscore may be a good separator for SIMD implementations. Anyway, underscores in DCT method names were there long before this PR:

public static void IDCT8x4_RightPart(ref Block8x8F s, ref Block8x8F d)

Removing 8x8 from the name to get FdctAvx is not an option, because we already have an 8x4 FDCT for SSE, and a possible future JPEG XL implementation would have variable-size FDCTs.

Member

:shipit:

Member

There's no way I'm blocking this on naming. Happy to make an exception.

Member

@antonfirsov antonfirsov left a comment

OK, I think we can keep this open for a few more days, so @br3aker if you have the time and interest to address the remaining nits, good, if not we'll :shipit: as is :)

@br3aker
Contributor Author

br3aker commented Sep 29, 2021

@antonfirsov I will fix everything you pointed out, so please don't merge before then! :)
Just had a very busy day today.

@saucecontrol
Contributor

I'm late to the party here, but I just want to say this is really great work @br3aker!

Comment on lines +107 to +110
if (Vector.IsHardwareAccelerated)
{
    ForwardTransform_Vector4(ref block);
}
Contributor Author

Note: code coverage is worse than before because of this path, it simply can't check it vs sse call from remote executor.

Member

@antonfirsov antonfirsov left a comment

Latest changes looking good. Anything else left or shall we merge?

@JimBobSquarePants
Member

I’m happy if you’re happy!

@antonfirsov
Member

I meant to ask whether @br3aker has anything else in mind. (Most likely not, but I want to be sure.)

@br3aker
Contributor Author

br3aker commented Oct 2, 2021

I guess it's done. Thanks everyone for the contribution to this!

@antonfirsov
Member

@br3aker thanks again for the great work!

@antonfirsov antonfirsov merged commit 2f903c7 into SixLabors:master Oct 2, 2021
@antonfirsov
Member

@JimBobSquarePants I'm not a big fan of the following:
[image]

It requires us to use admin rights to merge the PR, although in reality there is nothing wrong with the test coverage.

@JimBobSquarePants
Member

Admin rights are fine IMO. It’s rare that coverage is an issue and forces us to sense check. We’re already really disciplined but I still feel that we should have rules in place.

@antonfirsov
Member

Fine then. Just don't want to get into position like Hungary's ruling party that changes the Constitution every time some little thing is in their way.

@br3aker
Contributor Author

br3aker commented Oct 2, 2021

I'm not sure the coverage regression can be fixed. We can test the different implementations explicitly in separate tests, but the main FDCT method, with its hardware-dependent paths, won't be covered.
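
To illustrate the trade-off, the sketch below uses a deliberately trivial operation (scaling a buffer) in place of the FDCT. All names are hypothetical. The scalar and vector implementations are public and can each be tested directly against the other, while the dispatching `Apply` method takes only one branch on any given machine, so a coverage tool will report the other branch as unvisited.

```csharp
using System;
using System.Numerics;

// Hypothetical stand-in for a hardware-dispatched routine; not code from the PR.
public static class ScaleBlock
{
    // Entry point: which branch runs depends on the machine executing the tests.
    public static void Apply(Span<float> values, float factor)
    {
        if (Vector.IsHardwareAccelerated)
        {
            ApplyVector(values, factor);
        }
        else
        {
            ApplyScalar(values, factor);
        }
    }

    // Reference implementation, one element at a time.
    public static void ApplyScalar(Span<float> values, float factor)
    {
        for (int i = 0; i < values.Length; i++)
        {
            values[i] *= factor;
        }
    }

    // Vector<float> implementation; it also runs (more slowly) without hardware acceleration,
    // so it can be tested explicitly on any machine.
    public static void ApplyVector(Span<float> values, float factor)
    {
        int width = Vector<float>.Count;
        int i = 0;
        for (; i <= values.Length - width; i += width)
        {
            Span<float> chunk = values.Slice(i, width);
            var vector = new Vector<float>(chunk);
            (vector * factor).CopyTo(chunk);
        }

        // Handle the remaining elements that do not fill a whole vector.
        for (; i < values.Length; i++)
        {
            values[i] *= factor;
        }
    }
}
```

A test can call `ApplyScalar` and `ApplyVector` on copies of the same input and compare the results, which verifies both implementations even though the dispatcher's untaken branch stays uncovered.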

@JimBobSquarePants
Member

I tend to take the coverage report as a guide, not an absolute, since at times it seems wildly inaccurate. We should never chase exact coverage anyway, just be aware of serious regressions. The manual step helps with this for me (since I get excited about perf).

@antonfirsov
Member

antonfirsov commented Oct 2, 2021

My main concern is that with this level of inaccuracy, we can get used to ignoring the check-in gates, which we should never do in case of real issues like test failures.

@br3aker br3aker deleted the jpeg-encoder-optimization branch October 5, 2021 12:39
Successfully merging this pull request may close these issues.

Spectral blocks rounding errors in jpeg encoder
5 participants