Simple performance improvements for ldm #2464

Merged: 4 commits merged into facebook:dev from ldmfixes on Jan 22, 2021

Conversation

@mpu (Contributor) commented on Jan 11, 2021:

Description

This PR contains some simple performance improvements to the ldm matcher, together with some other cosmetic changes that I made "en passant" (variable renames, typo fixes, ...).

The changes to the ldm matcher only affect performance and the new code should produce precisely the same output as the baseline. Two changes helped improve the performance relatively uniformly:

  • Instead of computing the tag with a branch and a fair number of arithmetic operations in the hot loop of ZSTD_ldm_generateSequences, we now compute a tag mask once before entering the loop and use it to check whether the rolling hash satisfies the "tag condition". The tag mask is computed in such a way that the new tag condition is equivalent to the original one (see the sketch after this list).
  • When the tag condition is met but we did not find a match in the ldm hash table (the common case), we insert a new entry into the hash table using ZSTD_ldm_insertEntry instead of ZSTD_ldm_makeEntryAndInsertByTag, which performs some redundant checks.
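
For illustration, here is a minimal sketch of the tag-mask idea, under the assumption that a position qualifies when the hashRateLog low-order bits of the rolling hash are all set. The helper names are invented for the example; this is not the PR's actual code:

```c
/* Minimal sketch (illustrative names; not zstd's actual implementation).
 * Assumption: the "tag condition" holds when the hashRateLog low-order bits
 * of the rolling hash are all set. Precomputing the mask turns the
 * per-iteration tag computation into a single AND plus compare. */
#include <stdint.h>

typedef uint64_t U64;
typedef uint32_t U32;

/* Computed once, before entering the matcher's main loop. */
U64 makeTagMask(U32 hashRateLog)
{
    return ((U64)1 << hashRateLog) - 1;
}

/* Evaluated inside the hot loop: one AND and one compare. */
int tagConditionMet(U64 rollingHash, U64 tagMask)
{
    return (rollingHash & tagMask) == tagMask;
}
```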

Benchmarks

The following files were used to assess the impact of the change:

  • hhvm-rt.tar (5.9G) - an image of a production version of HHVM
  • l1m.tar (2.0G) - two snapshots of the linux kernel source code taken a month apart
  • l1y.tar (1.9G) - two snapshots of the linux kernel source code taken a year apart
  • l5.tar (544M) - the first 544M of an archive containing the linux kernel source code

The results are compiled in the table below. All times are computed as the best over a set of 5 runs.

| FILE | CONFIG | DEFLATE Δ | TIME Δ |
| --- | --- | --- | --- |
| data/hhvm-rt.tar | --long=27 -1 | = | -7.69% |
| data/l1m.tar | --long=27 -1 | = | -7.15% |
| data/l1y.tar | --long=27 -1 | = | -5.76% |
| data/l5.tar | --long=27 -1 | = | -1.63% |
| data/hhvm-rt.tar | --long=30 -1 | = | -5.23% |
| data/l1m.tar | --long=30 -1 | = | -5.97% |
| data/l1y.tar | --long=30 -1 | = | -7.31% |
| data/l5.tar | --long=30 -1 | = | -4.13% |
| data/hhvm-rt.tar | --long=27 -8 | = | -7.26% |
| data/l1m.tar | --long=27 -8 | = | -10.21% |
| data/l1y.tar | --long=27 -8 | = | -5.42% |
| data/l5.tar | --long=27 -8 | = | -6.97% |
| data/hhvm-rt.tar | --long=30 -8 | = | -4.97% |
| data/l1m.tar | --long=30 -8 | = | -6.05% |
| data/l1y.tar | --long=30 -8 | = | -9.55% |
| data/l5.tar | --long=30 -8 | = | -1.70% |
| data/hhvm-rt.tar | --long=27 -16 | = | +1.20% |
| data/l1m.tar | --long=27 -16 | = | -0.55% |
| data/l1y.tar | --long=27 -16 | = | +6.68% |
| data/l5.tar | --long=27 -16 | = | +1.80% |
| data/hhvm-rt.tar | --long=30 -16 | = | +2.35% |
| data/l1m.tar | --long=30 -16 | = | +2.68% |
| data/l1y.tar | --long=30 -16 | = | +1.26% |
| data/l5.tar | --long=30 -16 | = | +0.35% |
| data/hhvm-rt.tar | --long=27 -19 | = | -0.09% |
| data/l1m.tar | --long=27 -19 | = | +1.43% |
| data/l1y.tar | --long=27 -19 | = | -0.52% |
| data/l5.tar | --long=27 -19 | = | +2.94% |
| data/hhvm-rt.tar | --long=30 -19 | = | +2.88% |
| data/l1m.tar | --long=30 -19 | = | -0.51% |
| data/l1y.tar | --long=30 -19 | = | +1.14% |
| data/l5.tar | --long=30 -19 | = | -2.13% |

```c
unsigned const offset = *pOffset;

*(ZSTD_ldm_getBucket(ldmState, hash, ldmParams) + offset) = entry;
*pOffset = (offset + 1) & (((U32)1 << ldmParams.bucketSizeLog) - 1);
```

Contributor:

nit: I believe the right-hand side needs an explicit cast to appease the warning.

Author (@mpu):

Thanks for the hint. This should be fixed by my latest commit.

@senhuang42 (Contributor) commented:

Thanks for the improvements, this generally looks pretty good to me!

The CircleCI failure can probably be fixed with a rebase on top of the latest dev branch.

@terrelln (Contributor) commented:

I've measured a 2.5% - 4.5% speed improvement on several different files, which is in line with the measurements you've provided.

There are remaining calls to ZSTD_ldm_makeEntryAndInsertByTag() when we do find a match, and in ZSTD_ldm_fillLdmHashTable(). Do you think that it would speed up LDM to replace it with ZSTD_ldm_insertEntry() in ZSTD_ldm_fillLdmHashTable()?

Comment on lines 379 to 383:

```c
ldmEntry_t entry;

entry.offset = curr;
entry.checksum = checksum;
ZSTD_ldm_insertEntry(ldmState, hash, entry, *params);
```

Contributor:

We also insert when we find an entry, right? Could we use ZSTD_ldm_insertEntry() in that case too?

Author (@mpu):

Yes. I thought it would not bring much of a win, but I updated it for the sake of consistency.

@mpu (Contributor, Author) commented on Jan 20, 2021:

Thanks for your comments! I rebased on top of a recent dev branch and added one commit to address the feedback.
The performance characteristics should not have changed much. Nonetheless, I will run a final benchmark to make sure that is the case.

> Do you think that it would speed up LDM to replace it with ZSTD_ldm_insertEntry() in ZSTD_ldm_fillLdmHashTable()?

In fillLdmHashTable the calls to makeEntryAndInsertByTag are not doing any superfluous work and likely get inlined, so I do not think we would gain anything by removing them.

@terrelln (Contributor) left a review:

This looks great! I just have one minor nit about a renaming, and then it looks good to me.

@@ -299,10 +295,13 @@ static size_t ZSTD_ldm_generateSequences_internal(

```c
U64 rollingHash = 0;

while (ip <= ilimit) {
    U32 const currentOffset = (U32)(ip - base);
```

Contributor:

nit: this is not an offset, it is an "index". Either keep `curr` or use `currentIndex` (note: `current` cannot be used because it conflicts with a macro in the Linux kernel).

I know that it is called `offset` in the `ldmEntry_t` struct, but that is a bad name; it should be renamed to `index`. If you want to include that renaming in this diff or a follow-up, that would be great.

@mpu (Contributor, Author) commented on Jan 22, 2021:

I renamed the variable to `currentIndex`. I also quickly re-ran parts of the benchmarks on a relatively small but stable machine and got the following encouraging results.

| FILE | CONFIG | DEFLATE Δ | TIME Δ |
| --- | --- | --- | --- |
| data/l5.tar | --long=27 -1 | = | -7.23% |
| data/l5.tar | --long=27 -8 | = | -1.65% |
| data/l5.tar | --long=27 -16 | = | -0.80% |
| data/l5.tar | --long=27 -19 | = | +0.12% |
| data/l5.tar | --long=30 -1 | = | -9.20% |
| data/l5.tar | --long=30 -8 | = | +0.24% |
| data/l5.tar | --long=30 -16 | = | -0.78% |
| data/l5.tar | --long=30 -19 | = | +0.17% |

@terrelln (Contributor) left a review:

Awesome, thanks for the PR! Looking forward to the next one :)

@terrelln merged commit f5b3f64 into facebook:dev on Jan 22, 2021.

@Cyan4973 (Contributor) commented:

Thanks @mpu!

I tested your PR on a (fairly stable) desktop-class workstation and measured an ~8-9% improvement in LDM top speed (combined with multithreaded compression at low levels, <= 4), in line with the numbers announced in this PR.

@mpu deleted the ldmfixes branch on January 25, 2021 at 08:51.