[WIP] Remove tag space initialization for rowHash #3426
Conversation
Clever!
I'd recommend initializing the memory once, when it is allocated, but then not re-initializing it when the memory is reused. This matches the approach we take with our tables, and will avoid all uninitialized memory accesses. You should be able to achieve that using the cwksp. You'd probably want to move the allocation above the opt parser space, but below the table space. You could add another "phase" for this. CC @felixhandte
I agree that this is probably a better approach than always resetting the space, but it can still have a performance penalty.
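As a rough illustration of the suggestion (the new phase name and its position below are assumptions for this sketch, not the PR's final API), the idea maps onto the existing cwksp phase enum in zstd_cwksp.h:

typedef enum {
    ZSTD_cwksp_alloc_objects,
    ZSTD_cwksp_alloc_init_once,   /* hypothetical phase: zeroed when first carved out, then kept across reuse */
    ZSTD_cwksp_alloc_aligned,
    ZSTD_cwksp_alloc_buffers
} ZSTD_cwksp_alloc_phase_e;

Allocations made in such a phase would be initialized exactly once, when the workspace hands them out, and left untouched when the context is reused.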
Force-pushed from 6055bef to 8ad4b88.
Yeah, but in this case we are already zeroing the hash table, which is 4x larger than the tag space. And generally, I'm more concerned about context-reuse performance.
Are you talking about the indices? There's no real reason to zero them out either.
Force-pushed from 53f926f to ee15d46.
/* ZSTD_wildcopy() is used to copy into the literals buffer,
 * so we have to oversize the buffer by WILDCOPY_OVERLENGTH bytes.
 */
zc->seqStore.litStart = ZSTD_cwksp_reserve_buffer(ws, blockSize + WILDCOPY_OVERLENGTH);
Can we just continue to call _reserve_buffer() in all these call sites, to distinguish them from allocations that actually require aligned memory? And then in the cwksp implementation, we can have _reserve_buffer() just call _reserve_aligned().
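A minimal sketch of that idea, assuming the existing cwksp helper names (signatures may differ from the PR):

/* Keep _reserve_buffer() at call sites as documentation that the caller needs
 * no particular alignment, but let it defer to the aligned allocator. */
MEM_STATIC BYTE* ZSTD_cwksp_reserve_buffer(ZSTD_cwksp* ws, size_t bytes)
{
    return (BYTE*)ZSTD_cwksp_reserve_aligned(ws, bytes);
}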
lib/compress/zstd_compress.c (outdated)
int needTagTableInit = 1;
#ifdef HAS_SECURE_RANDOM
if(forWho == ZSTD_resetTarget_CCtx) {
    size_t randomGenerated = getSecureRandom(&ms->hashSalt, sizeof(ms->hashSalt));
I continue to think that you don't need a secure random on each reset (if ever), and instead you just need a nonce that can be incremented on each reset (maybe initialized as a secure random on context creation). As discussed, the speed of small compressions matters.
Have you benchmarked the cost of this call yet?
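To make the nonce suggestion concrete, a hedged sketch (getSecureRandom() and hashSalt come from this PR; the helper names, return-value handling, and increment-per-reset scheme are assumptions, not the final implementation):

/* Pay for the secure random once, at context creation... */
static void ZSTD_initHashSalt(ZSTD_matchState_t* ms)
{
    /* assumption: getSecureRandom() returns the number of bytes it produced;
     * fall back to an unsalted hash if it fails */
    if (getSecureRandom(&ms->hashSalt, sizeof(ms->hashSalt)) != sizeof(ms->hashSalt))
        ms->hashSalt = 0;
}

/* ...then make each reset cost a single increment instead of a syscall. */
static void ZSTD_bumpHashSalt(ZSTD_matchState_t* ms)
{
    ms->hashSalt += 1;   /* any change is enough to invalidate stale matches */
}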
lib/compress/zstd_cwksp.h (outdated)
@@ -556,10 +570,11 @@ MEM_STATIC void ZSTD_cwksp_clear(ZSTD_cwksp* ws) {
 #endif

     ws->tableEnd = ws->objectEnd;
-    ws->allocStart = ws->workspaceEnd;
+    ws->allocStart = (void*)((size_t)ws->workspaceEnd & ~(ZSTD_CWKSP_ALIGNMENT_BYTES-1));
+    ws->initOnceStart = ws->workspaceEnd;
Doesn't this mean that you are re-init'ing the memory on every compression? The workspace is cleared on every ctx reset, IIRC.
Yup, this was the wrong fix on my part, I've updated the PR with a better solution which is to not msan poison the initOnce memory.
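Roughly what "don't MSAN-poison the init-once memory" looks like in cwksp terms (field names follow the diff above; this is a sketch of the approach, not the PR's exact code):

#if ZSTD_MEMORY_SANITIZER && !defined(ZSTD_MSAN_DONT_POISON_WORKSPACE)
    /* When the workspace is cleared for reuse, re-poison only the regions that
     * really are dead; memory from initOnceStart up to the end of the workspace
     * keeps its contents across compressions and therefore stays un-poisoned. */
    {
        size_t const size = (size_t)((BYTE*)ws->initOnceStart - (BYTE*)ws->objectEnd);
        __msan_poison(ws->objectEnd, size);
    }
#endif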
 * - Aligned: these buffers are used for various purposes that require 4 byte
 *   alignment, but don't require any initialization before they're used. These
 *   buffers are each aligned to 64 bytes.
 * - Init once: these buffers require to be initialized at least once before
It's not clear to me why introducing the init-once region requires the removal of the unaligned buffer region. I mean, sure, the buffer region would no longer be attached to the unaligned end of the workspace, and those allocations would now sit between two aligned regions. But it would be more compact to pad the alignment of the buffers once at the edges, rather than round each buffer up to 64 byte alignment.
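For context, roughly how the regions line up under the PR's design (an approximation based on the diff, not the exact layout comment in zstd_cwksp.h):

/* growing up from the start:          growing down from the end:
 *   [objects][tables][ ... free ... ][buffers & aligned][init once]
 *
 * ws->initOnceStart marks the low edge of the init-once region; everything at
 * or above it keeps its contents across workspace reuse, so the tag space only
 * needs to be cleared the first time it is carved out of the workspace. */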
@@ -237,7 +243,7 @@ MEM_STATIC size_t ZSTD_cwksp_bytes_to_align_ptr(void* ptr, const size_t alignBytes)
     size_t const alignBytesMask = alignBytes - 1;
     size_t const bytes = (alignBytes - ((size_t)ptr & (alignBytesMask))) & alignBytesMask;
     assert((alignBytes & alignBytesMask) == 0);
-    assert(bytes != ZSTD_CWKSP_ALIGNMENT_BYTES);
+    assert(bytes < alignBytes);
👍
Force-pushed from 0a759ed to 0014780.
…tag space initialization. Add salting to the hash to reduce collisions when re-using the hash table across multiple compressions. Salting the hash makes it so hashes from previous compressions won't match hashes of similar data in the current compression.
1. Converted all unaligned buffer allocations to aligned buffer allocations
2. Added init-once aligned memory buffers
   - Moved the tag table to an init-once allocation when a strong random is available
   - Bugfix in hash salting
- Fix off-by-one bug in `ZSTD_cwksp_owns_buffer`
- Better handle MSAN for init-once memory
- Allow passing custom MOREFLAGS into msan-% targets in the Makefile
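The salting idea from those commit messages, in minimal sketch form (ZSTD_hashPtr() is the existing helper; the salted variant's name and the simple XOR mixing are illustrative assumptions):

/* Mix a per-context salt into the row hash so that table entries written by a
 * previous compression map to different rows/tags than the same bytes do in the
 * current compression, and therefore never surface as stale matches. */
static size_t ZSTD_rowHashSalted(const void* p, U32 hBits, U32 mls, U64 salt)
{
    return ZSTD_hashPtr(p, hBits, mls) ^ (size_t)salt;
}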
Force-pushed from ce9bc85 to dd56df3.
Due to complexity vs. added benefit, it has been decided to put this PR on hold.
Curious if there are any plans for an alternate approach. I was playing a bit with updating the version of zstd in OpenZFS, and it seems like this regression might be why I'm seeing a really terrible performance regression at levels 9 and 12, to the point that using 15 was twice as fast as 12 in my early tests. I'm going to rework the code to be cleaner so I can post things for people to see and experiment with, and confirm I didn't replace memcpy with a small woodland creature hand-copying bytes; I just wanted to ask whether there are plans for this that I should wait on, or whether I should try to come up with another solution that doesn't regress performance that badly.
@rincebrain - I doubt it, unless you are using streaming compression of small data without specifying an end directive (I haven't found this kind of usage in OpenZFS).
I did; it gets better, but still not the same. OpenZFS hands zstd power-of-two records between, let's say, 4k and 16M, not using the streaming interface, always with independent startup/teardown. I can, and will, bisect between 1.4.5 and now to confirm which version this changed in, but since it's spending all its time in ZSTD_RowFindBestMatch, it seemed a reasonable guess, and I figured I'd ask.
How much better for this one change?
Are you seeing the improvements across the different sizes or only for specific size ranges?
RowHash didn't exist in 1.4.5, so it'd make sense that you'd see a big change when it was introduced in 1.5.0. Finally, it sounds like there might be a bigger issue here; it'd be a good idea to open an issue with some more information so it can be tracked properly.
Hi @rincebrain,
Even if you had one, it wouldn't help: OpenZFS ships 1.4.5, so you'd need to slam it in there. No, I had been looking at other things recently; I'll prioritize getting back to this.
I meant a dev setup.
I've got it reproducing at the moment; I'm working on narrowing down a more useful test case than "feed 30 GB in and notice it takes markedly longer". Though I will say, in my testing, flipping ZSTD_c_useRowMatchFinder to 0 removes the difference above the noise threshold...
That's great; it means we are certain this is where the issue is. Can you also post your compile flags?
I'll do you one better. (All of these were built purely by just running "make -j" on a vanilla checkout, on a Ryzen 5900X. --single-thread and -B1048576 are because ZFS chunks things up into fixed records that it compresses independently, and each compression run is a single thread for it.)
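(For reference, those flags correspond to an invocation along the lines of zstd -b9e12 --single-thread -B1048576 somefile, i.e. benchmark levels 9 through 12, single-threaded, on independent 1 MiB blocks; the exact levels and input file here are placeholders, not taken from the thread.)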
Thank you for reporting; we have reproduced the issue and have pinpointed its origin.
This PR is deprecated; other PRs have been put in its place.
Based on #2971, with an added modification that solves the regression in zstd -b5e7 enwik8 -B128K runs. This is still a WIP and is mostly up to get tested in CI and so other people can review the approach.
The objective here is to remove the initialization of the tag space, as it's costly when dealing with small data (a rough sketch of the cost is at the end of this description).
However, there are two downsides to doing so; one of them is dealt with here:
Benchmarks in different scenarios are available in this spreadsheet.
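A rough sketch of the cost referenced above (field names mirror the row-based match finder's match state; the exact reset code in zstd_compress.c differs):

/* Before this change, every (re)initialization of the match state cleared the
 * whole row-hash tag space up front, e.g. something like:
 *
 *     ZSTD_memset(ms->tagTable, 0, tagTableSize);
 *
 * For small inputs that memset can dominate the total compression time, which
 * is what making the tag space "init once" memory is meant to avoid. */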