-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Zstd CPU overhead optimizations #11709
base: master
Are you sure you want to change the base?
Conversation
Compression results should be binary-identical, for the purposes of dedupe, decompressed ARC, L2ARC, encryption validation, etc. But I'm not sure how to check/demonstrate that; suggestions welcome. |
Oh yeah, probably only builds on Linux right now because of some dumb unecessary linux-specific debugging stuff. |
This looks interesting! I've been having performance issues with my gaming vm if im writing to a separate test pool with zstd3, even though it only uses 5-10% of the cpu |
I'm going to try to get this building on FreeBSD tonight or tomorrow if I get some time. |
Cool - should work |
I look forward to seeing if it helps! I have other patches that help with similar use-cases but they're probably not great for a server workload. |
Successfully built and booted 8b00ce6 on FreeBSD. I'm going to try your latest changes, but I think I'll probably still need to work around printk. |
Shouldn't matter any more - it'll just use the usual VERIFY() macros mostly, and the less-important logging is stubbed-out. |
It is nice to see ZSTD performance receiving some love :) I had terrible interactive performance with ZFS for my VM gaming until I enabled CONFIG_PREEMPT=y. |
FWIW, cond_resched() is compiled or patched away to nothing in CONFIG_PREEMPTION kernels.
What do you expect won't work? :) |
@adamdmoss Interesting changes! :) That being said: Would this affect the checksumming? If it does this would hit the same roadblocks as updating ZSTD currently does, with no one actively (outside of empty promises) working on solving those. |
It should be bit-identical unless I buggered up the flushing. Not really sure how to verify that though. |
Thats super awesome, getting some improvements in without needing versioning on the ZSTD stack (assuming you didn't bugger up ;-) ) |
@c0d3z3r0 You might want to check this great work out! |
Squeezed a few more % of decompression speed out of this branch by approximately cherrypicking some upstream zstd optimizations. Probably not something I'd want to submit with this PR but I thought that anyone who was testing this branch might enjoy it anyway. :) (edit: I have included this as part of this PR.) |
@adamdmoss those are some impressive improvements! To make this easier to get reviewed and tested it would be helpful if you could propose a patch stack with each of the major logical changes broken in to its own commit. For example, one commit which makes the recycle+reset ZSTD cctx objects change, another which switches to the streaming API, etc. |
Definitely - was actually thinking of splitting it into separate PRs where possible but that's probably overkill. |
@adamdmoss Small note on one of your statements:
This might be due to the fact ZSTD decompression is already a few times faster than compression per thread, so it is significantly less likely to run into performance issues. |
Separate PRs would be fine. In fact, they may be preferable since it'll let us more easily merge this work in chunks. Smaller changes are always easier to review and integrate! |
Also: it would be nice to have some actual speed numbers to compare ourselves, to get an eye on expected performance changes of the respective changes :) Basically something like we did with the original PR and which is also done by ZSTD upstream themselves: "These numbers are for reference only and do not reflect real-world performance" |
I'm trying to whip up the enthusiasm to make a benchmark that anyone else could reproduce. I could do some dumb shell-script stuff but I'd love to get fancy graphs and what-have-you; do we have any recipes for such a thing? |
Yeah, I guess I'll script something local unless someone already has a recipe, it can spit out numbers which I can just worry about graphing later. |
My original benchmark script (co-authored with Allan) which is link on the original ZSTD PR, should still work for raw test output :) Those graphs however, required some manual labor, though the Excel format is also included in the original PR ^^ |
I reckon this is ready for review. |
I took out the streaming API change from the original versions of this PR; it needs more iteration and can be added later. |
Signed-off-by: Adam Moss <c@yotes.com>
has the compression ratio comparisatoin been tested in this latest version? havent seen any update on this topic |
side note, this behavior is problematic with paradigm of "idempotent compression results", in same ways, see #12840 |
yeah this is kinda the big "if", we've seen better performance before in the PoC code by Allan, which in fact just bypassed the compressor for significant amounts of data. |
thats my concern. because if the compression results have lower ratio, it means that we have a higher amount of allocation fails which renders the whole patch here pointless since it basicly does not optimize anything usefull |
It's everyone's favorite time again, benchmark o'clock. You can find many numbers here, but over 5 runs with each of (git master as of a few hours ago, this branch as of a few hours ago):
(See discussion in #12978 for explanations about what numbers are what. If you're wondering why lz4 is in there, I was curious what its variance across runs would be, nothing else.) |
can't see the compression ratio in your sheet. i just see performance values and performance deltas |
Counterintuitively, it's the one labeled "ratio". On that tab, it's always 0 because there was no difference, but if you look in the tabs it sources data from, the actual values are present. |
so the ratio column is a delta value. i mean your left tab shows values and the right tab delta normally. but on the ratio column this makes no sense since both sides are zero. |
No, that's not what my previous or current charts were. The left table is average(one source) - avg(other source), the right is (avg(one source) - avg(other source))/avg(other source). |
mmh if i just read the zstd values. we see that write 6 9 and 12 has identical performance. so i assume all other value differences are just background noise and the performance gain is zero nor measureable in a significant way. anyway. i will run the same test using my bare metal 16 core ryzen and on a core i7 quadcore. so we get maybe some more accurate results and we can also check if the cpu amd vs intel has a change here |
...what chart are you reading? 6, 9, and 12 all had > 5% performance differences between the current code and this PR, on average. |
using the real data from your dropbox. just write speeds of dd and fio |
alot of values are very identical or similar others are dramatic different. that doesnt fit together. i'm current setting up a test in background to verify it on a zero load non vm system |
This was on a baremetal system that was otherwise idle, my 8700k. |
then it doesnt explain the very noisy results. i just do a retest and will see what comes out on my system. but i will only test zstd algos |
okay. this PR here cannot be merged using git apply. too many merge errors. i skip it and follow up with your PR. no time for this |
You can just clone people's branches from Github, rather than trying to re-apply it, if that's the primary difficulty. |
and you think its clever to compare a old codebase with the current upstream master? its about comparing the PR. not about comparing different openzfs versions. in your PR i did manually patch your PR into the upstream version (just one small merge error occured) |
Very cool, thanks for testing! Note, I don't expect significant improvements to throughput unless you're mostly-CPU-limited rather than I/O limited, since this is just a code optimization and compression usually pipelines perfectly with async I/O (ZFS FTW). In this benchmark I'm guessing seq zstd-9,12,15 were pretty CPU-limited. Seq at smaller recordsizes would probably also benefit particularly. |
Comparing upstream at the most recent rebase to the current branch seems like a reasonable thing to do if your goal is to test the changes in behavior versus unmodified master. Testing rebased against latest master doesn't have to be the same step, particularly when your goal seemed to be to verify whether the changes did more than noise. |
i just call it noise. a difference of +. 50 - 100 mbit/s depending in which mood your system is at the current time is not just noise. i can run a test with the original code vs your versoin and the original code is faster than yours. and then i do it again and the effect is vice versa. whats faster now? |
the benchmark test is running on a ramdisk drive. io isnt the barrier here. no real hdd or ssd was involved |
I wasn't claiming I thought it was just noise, just that that was my impression of your assertion - that you thought the benchmarking we were doing was unreliable, and that you would go run benchmarks with lower variance and see whether any differences arose there. |
The performance results above are encouraging, more benchmarks are of course welcome, but this sure looks like something we should try and move forward with. @adamdmoss if you have the time, I think breaking this in to multiple commits (or PRs) would be the way to proceed. It looks like what still needs to be done here is sort out the FreeBSD build issue and the comment pertaining to the user of |
I agree here, performance gains without ratio loss seems enough to put some pressure behind this to get this merged :) |
@adamdmoss Any chance of this going to be made into something that's mergeable? |
#define ZSTD_memmove(d, s, l) memmove((d), (s), (l)) | ||
#define ZSTD_memset(p, v, l) memset((p), (v), (l)) | ||
#endif | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have to ask what was done here to confirm compiler builtins are not used as is. Note general kernel code in both FreeBSD and Linux is using them, so something special would have to be happening in zstd code for that to not happen.
I did a quick check by adding this on FreeBSD:
diff --git a/sys/contrib/openzfs/module/zstd/lib/common/xxhash.c b/sys/contrib/openzfs/module/zstd/lib/common/xxhash.c
index 597de18fc895..c3254a1ce646 100644
--- a/sys/contrib/openzfs/module/zstd/lib/common/xxhash.c
+++ b/sys/contrib/openzfs/module/zstd/lib/common/xxhash.c
@@ -86,6 +86,9 @@ static void XXH_free (void* p) { free(p); }
#include <string.h>
static void* XXH_memcpy(void* dest, const void* src, size_t size) { return memcpy(dest,src,size); }+void foomemset(char *buf);
+void foomemset(char *buf) { memset(buf, 0, 20); }
+
... and confirmed foomemset is using the builtin at least on that platform, I don't have handy manner to check Linux right now.
Regardless of what's going on, replacing memcpy et al uses with ZSTD_ prefixed variants is in my opinion too problematic -- for one one will have to keep a look out on each update.
In absolute worst case this can redefine these routines, but as noted previously this should not even be necessary. If things fail on Linux, there is probably a missing include.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have to ask what was done here to confirm compiler builtins are not used as is.
The kernel's out-of-line memcpy() etc were hot on perftop; their being visible at all there means they're non-builtins.
replacing memcpy et al uses with ZSTD_ prefixed variants is in my opinion too problematic -- for one one will have to keep a look out on each update.
Upstream zstd already did this
Motivation and Context
This PR improves CPU overhead in Zstd compression and (to a lesser extent) decompression especially at high (zstd-5..zstd-9+) compression levels and smaller recordsizes.
Description
I replaced the zzstd module's memory pool with an object pool, to permit recycle+reset of Zstd cctx objects. This in turn allows partial-reinitialization of these context objects; fully initializing these objects can become a huge overhead (i.e. up to ~50% of the entire CPU cost at high compression levels and low block sizes!). Then I applied the same optimization to decompression, though the overhead/benefit is much smaller there - still measurable.
I also bulk-replaced memcpy/memmove/memset calls inside the Zstd library itself, permitting the compiler to inline some important inner-loop calls (which it wouldn't otherwise know how to do for non-libc string functions). This may look like a pretty invasive change to what should be pristine upstream code, but upstream Zstd has made the same changes for the same reasons since we introduced this code, so any future upgrade won't need to reproduce this change.
Please note, the compressed data should remain 100% binary-identical, so this doesn't affect dedup/l2arc/encryption-validation etc.
How Has This Been Tested?
ZTS and lots of local data shunting.
Various duct-taped benchmarks copying data of various compressibility to and from datasets with varying recordsizes (mostly 4k, 8k, 16k and 1M) and compression levels (mostly zstd-3, zstd-8, zstd-19). Not chosen to demonstrate PR wins, just seemed like a fairly objective mix.
Raw timing results are here sort.xls - after a few hours I failed to get LibreOffice to graph this in a readable format, so maybe someone else can. :) Below is a table of the average performance per test and the percent improvement.
FWIW I also simulated NOSLEEP allocations failing 50% of the time to robustify against this case. (It was so useful that it'd be cool if there was a ZFS-wide random-nosleep-fails as part of (fuzz?) testing. Or maybe there is already. :) )
[edit] Added performance table from raw data.
Types of changes
Checklist:
Signed-off-by
.