[WIP] Remove tag space initialization for rowHash #3426
Conversation
Clever!
I'd recommend initializing the memory once, when it is allocated, but then not re-initializing it when the memory is reused. This matches the approach we take with our tables, and will avoid all uninitialized memory accesses. You should be able to achieve that using the cwksp. You'd probably want to move the allocation above the opt parser space, but below the table space. You could add another "phase" for this. CC @felixhandte
I agree that this is probably a better approach than always resetting the space, but it can still have a performance penalty.
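As a rough illustration of the suggestion (the new phase name and its position below are assumptions for this sketch, not the PR's final API), the idea maps onto the existing cwksp phase enum in zstd_cwksp.h:

typedef enum {
    ZSTD_cwksp_alloc_objects,
    ZSTD_cwksp_alloc_init_once,   /* hypothetical phase: zeroed when first carved out, then kept across reuse */
    ZSTD_cwksp_alloc_aligned,
    ZSTD_cwksp_alloc_buffers
} ZSTD_cwksp_alloc_phase_e;

Allocations made in such a phase would be initialized exactly once, when the workspace hands them out, and left untouched when the context is reused.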
Force-pushed from 6055bef to 8ad4b88.
Yeah, but in this case we are already zeroing the hash table, which is 4x larger than the tag space. And generally, I'm more concerned about context-reuse performance.
Are you talking about the indices? There's no real reason to zero them out either.
Force-pushed from 53f926f to ee15d46.
/* ZSTD_wildcopy() is used to copy into the literals buffer,
 * so we have to oversize the buffer by WILDCOPY_OVERLENGTH bytes.
 */
zc->seqStore.litStart = ZSTD_cwksp_reserve_buffer(ws, blockSize + WILDCOPY_OVERLENGTH);
Can we just continue to call _reserve_buffer() in all these call sites, to distinguish them from allocations that actually require aligned memory? And then in the cwksp implementation, we can have _reserve_buffer() just call _reserve_aligned().
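A minimal sketch of that idea, assuming the existing cwksp helper names (signatures may differ from the PR):

/* Keep _reserve_buffer() at call sites as documentation that the caller needs
 * no particular alignment, but let it defer to the aligned allocator. */
MEM_STATIC BYTE* ZSTD_cwksp_reserve_buffer(ZSTD_cwksp* ws, size_t bytes)
{
    return (BYTE*)ZSTD_cwksp_reserve_aligned(ws, bytes);
}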
lib/compress/zstd_compress.c (outdated)
int needTagTableInit = 1;
#ifdef HAS_SECURE_RANDOM
if(forWho == ZSTD_resetTarget_CCtx) {
    size_t randomGenerated = getSecureRandom(&ms->hashSalt, sizeof(ms->hashSalt));
I continue to think that you don't need a secure random on each reset (if ever), and instead you just need a nonce that can be incremented on each reset (maybe initialized as a secure random on context creation). As discussed, the speed of small compressions matters.
Have you benchmarked the cost of this call yet?
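To make the nonce suggestion concrete, a hedged sketch (getSecureRandom() and hashSalt come from this PR; the helper names, return-value handling, and increment-per-reset scheme are assumptions, not the final implementation):

/* Pay for the secure random once, at context creation... */
static void ZSTD_initHashSalt(ZSTD_matchState_t* ms)
{
    /* assumption: getSecureRandom() returns the number of bytes it produced;
     * fall back to an unsalted hash if it fails */
    if (getSecureRandom(&ms->hashSalt, sizeof(ms->hashSalt)) != sizeof(ms->hashSalt))
        ms->hashSalt = 0;
}

/* ...then make each reset cost a single increment instead of a syscall. */
static void ZSTD_bumpHashSalt(ZSTD_matchState_t* ms)
{
    ms->hashSalt += 1;   /* any change is enough to invalidate stale matches */
}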
lib/compress/zstd_cwksp.h (outdated)
@@ -556,10 +570,11 @@ MEM_STATIC void ZSTD_cwksp_clear(ZSTD_cwksp* ws) {
 #endif

     ws->tableEnd = ws->objectEnd;
-    ws->allocStart = ws->workspaceEnd;
+    ws->allocStart = (void*)((size_t)ws->workspaceEnd & ~(ZSTD_CWKSP_ALIGNMENT_BYTES-1));
+    ws->initOnceStart = ws->workspaceEnd;
Doesn't this mean that you are re-init'ing the memory on every compression? The workspace is cleared on every ctx reset, IIRC.
Yup, this was the wrong fix on my part, I've updated the PR with a better solution which is to not msan poison the initOnce memory.
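Roughly what "don't MSAN-poison the init-once memory" looks like in cwksp terms (field names follow the diff above; this is a sketch of the approach, not the PR's exact code):

#if ZSTD_MEMORY_SANITIZER && !defined(ZSTD_MSAN_DONT_POISON_WORKSPACE)
    /* When the workspace is cleared for reuse, re-poison only the regions that
     * really are dead; memory from initOnceStart up to the end of the workspace
     * keeps its contents across compressions and therefore stays un-poisoned. */
    {
        size_t const size = (size_t)((BYTE*)ws->initOnceStart - (BYTE*)ws->objectEnd);
        __msan_poison(ws->objectEnd, size);
    }
#endif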
 * - Aligned: these buffers are used for various purposes that require 4 byte
 *   alignment, but don't require any initialization before they're used. These
 *   buffers are each aligned to 64 bytes.
 * - Init once: these buffers require to be initialized at least once before
It's not clear to me why introducing the init-once region requires the removal of the unaligned buffer region. I mean, sure, the buffer region would no longer be attached to the unaligned end of the workspace, and those allocations would now sit between two aligned regions. But it would be more compact to pad the alignment of the buffers once at the edges, rather than round each buffer up to 64 byte alignment.
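For context, roughly how the regions line up under the PR's design (an approximation based on the diff, not the exact layout comment in zstd_cwksp.h):

/* growing up from the start:          growing down from the end:
 *   [objects][tables][ ... free ... ][buffers & aligned][init once]
 *
 * ws->initOnceStart marks the low edge of the init-once region; everything at
 * or above it keeps its contents across workspace reuse, so the tag space only
 * needs to be cleared the first time it is carved out of the workspace. */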
@@ -237,7 +243,7 @@ MEM_STATIC size_t ZSTD_cwksp_bytes_to_align_ptr(void* ptr, const size_t alignBytes)
     size_t const alignBytesMask = alignBytes - 1;
     size_t const bytes = (alignBytes - ((size_t)ptr & (alignBytesMask))) & alignBytesMask;
     assert((alignBytes & alignBytesMask) == 0);
-    assert(bytes != ZSTD_CWKSP_ALIGNMENT_BYTES);
+    assert(bytes < alignBytes);
👍
Force-pushed from 0a759ed to 0014780.
…tag space initialization. Add salting to the hash to reduce collisions when re-using the hash table across multiple compressions. Salting the hash makes it so hashes from previous compressions won't match hashes of similar data in the current compression.
1. Converted all unaligned buffer allocations to aligned buffer allocations
2. Added init-once aligned memory buffers
   - Moved the tag table to an init-once allocation when a strong random is available
   - Bugfix in hash salting
- Fix off-by-one bug in `ZSTD_cwksp_owns_buffer`
- Better handle MSAN for init-once memory
- Allow passing custom MOREFLAGS into msan-% targets in the Makefile
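The salting idea from those commit messages, in minimal sketch form (ZSTD_hashPtr() is the existing helper; the salted variant's name and the simple XOR mixing are illustrative assumptions):

/* Mix a per-context salt into the row hash so that table entries written by a
 * previous compression map to different rows/tags than the same bytes do in the
 * current compression, and therefore never surface as stale matches. */
static size_t ZSTD_rowHashSalted(const void* p, U32 hBits, U32 mls, U64 salt)
{
    return ZSTD_hashPtr(p, hBits, mls) ^ (size_t)salt;
}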
Force-pushed from ce9bc85 to dd56df3.
Due to complexity vs. added benefit, it has been decided to put this PR on hold.
Curious if there are any plans for an alternate approach. I was playing a bit with updating the version of zstd in OpenZFS, and it seems like this regression might be why I'm seeing a really terrible performance regression at levels 9 and 12, to the point that using 15 was twice as fast as 12 in my early tests. I'm going to rework the code to be cleaner so I can post things for people to see and experiment with, and confirm I didn't replace memcpy with a small woodland creature hand-copying bytes; I just wanted to ask whether there are plans for this that I should wait on, or whether I should try to come up with another solution that doesn't regress performance that badly.
@rincebrain - I doubt it, unless you are using streaming compression of small data without specifying an end directive (I haven't found this kind of usage in OpenZFS).
I did; it gets better, but still not the same. OpenZFS hands zstd power-of-two records between, let's say, 4k and 16M, not using the streaming interface, always with independent startup/teardown. I can, and will, bisect between 1.4.5 and now to confirm which version this changed in, but since it's spending all its time in ZSTD_RowFindBestMatch, it seemed a reasonable guess, and I figured I'd ask.
How much better for this one change?
Are you seeing the improvements across the different sizes or only for specific size ranges?
RowHash didn't exist in 1.4.5, so it'd make sense that you'd see a big change when it was introduced in 1.5.0. Finally, it sounds like there might be a bigger issue here; it'd be a good idea to open an issue with some more information so it can be tracked properly.
Hi @rincebrain,
Even if you had one, it wouldn't help: OpenZFS ships 1.4.5, so you'd need to slam it in there. No, I had been looking at other things recently; I'll prioritize getting back to this.
I meant a dev setup.
I've got it reproducing at the moment; I'm working on narrowing down a more useful test case than "feed 30 GB in and notice it takes markedly longer". Though I will say, in my testing, flipping ZSTD_c_useRowMatchFinder to 0 removes the difference above the noise threshold...
That's great; it means we are certain this is where the issue is. Can you also post your compile flags?
I'll do you one better. (All of these were built purely by just running "make -j" on a vanilla checkout, on a Ryzen 5900X. --single-thread and -B1048576 are because ZFS chunks things up into fixed records that it compresses independently, and each compression run is a single thread for it.)
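(For reference, those flags correspond to an invocation along the lines of zstd -b9e12 --single-thread -B1048576 somefile, i.e. benchmark levels 9 through 12, single-threaded, on independent 1 MiB blocks; the exact levels and input file here are placeholders, not taken from the thread.)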
Thank you for reporting; we have reproduced the issue and have pinpointed its origin.
This PR is deprecated; other PRs have been put in its place.
Based on #2971, with an added modification that solves the regression in zstd -b5e7 enwik8 -B128K runs. This is still a WIP and is mostly up to get tested in CI and so other people can review the approach.
The objective here is to remove the initialization of the tag space, as it's costly when dealing with small data (a rough sketch of the cost is at the end of this description).
However, there are two downsides to doing so; one of them is dealt with here:
Benchmarks in different scenarios are available in this spreadsheet.
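A rough sketch of the cost referenced above (field names mirror the row-based match finder's match state; the exact reset code in zstd_compress.c differs):

/* Before this change, every (re)initialization of the match state cleared the
 * whole row-hash tag space up front, e.g. something like:
 *
 *     ZSTD_memset(ms->tagTable, 0, tagTableSize);
 *
 * For small inputs that memset can dominate the total compression time, which
 * is what making the tag space "init once" memory is meant to avoid. */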