-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pipelined Implementation of ZSTD_dfast #2774
Pipelined Implementation of ZSTD_dfast #2774
Conversation
lib/compress/zstd_double_fast.c
Outdated
if (offset_1 > maxRep) offsetSaved = offset_1, offset_1 = 0; | ||
} | ||
|
||
_start: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could it be written as a loop, rather than a goto
?
Note : this may impact some variables' scope.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I thought about it. To a rough approximation, the original implementation looks like this:
size_t dfast() {
while (ip < ilimit) {
if (search()) goto _match;
continue;
_match:
store();
}
return;
}
This PR changes it to something like this:
size_t dfast() {
_start:
if (ip >= ilimit) goto _cleanup;
init();
do {
if (search()) goto _match;
} while (ip < ilimit);
_cleanup:
return;
_match:
store();
goto _start;
}
Admittedly, this abuses goto
s a bunch. It is quite close to the assembly that gets generated though, which helps me think about the flow.
I believe it could be rewritten to instead look like this:
size_t dfast() {
if (ip < ilimit) {
init();
do {
if (search()) goto _match;
continue;
_match:
store();
init();
} while (ip < ilimit);
}
return;
}
This solves the abuse of goto
, but has two init()
blocks, which I don't like. I'm also somewhat attracted to moving the match code outside of the tight loop because it's then clearer what the hot loop is.
Alternatively, it could maybe be structured like this, which only has one init()
block, but has two loops nested.
I think this would improve the variable scoping concerns.
size_t dfast() {
while (ip < ilimit) {
init();
do {
if (search()) goto _match;
} while (ip < ilimit);
break;
_match:
store();
}
return;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a version of this last approach here. Performance looks good on gcc-10, but it's slower on clang-12. :/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without modifying the general structure of the current code,
could the _start:
/ goto _start
(and just this one) be converted into a loop ?
It seems it wouldn't impact the code structure, hence should be essentially equivalent for the compiler,
yet it would reduce the nb of goto
and allows a (slight) reduction in scope of several variables
that don't need to retain their values between loop iterations.
The algorithm itself looks fine. In term of coding style, there is a heavy reliance on I wonder if that's always necessary. An improvement could be to convert "some" of these |
Aside from maybe a latency win in the loop, this means that when we find a short match, we've already done the hash we need to check the next long match.
This costs a little ratio, unfortunately.
Since we're now hashing the position ahead even if we find a long match and don't search that next position, we can write it back into the hashtable even in long matches. This seems to cost us no speed, and improves compression ratio slightly!
This lookup can be advanced to before the short match check because either way we will use it (in the next loop iter or in `_search_next_long`).
This test depended on `_extDict` and `_noDict` compressing identically, which is not a guarantee we make, AFAIK.
9c6e56f
to
79ca830
Compare
benchmark feedback : Anyway, both impacts are fairly small, and both are positive. So, from a measurement perspective, it looks like a pure improvement. |
It's a pity that you could not use the new loop to reduce the scope of some local variables, but as you mentioned that it does negatively impact performance, I guess we'll leave it there. Another small comments is that I noticed in |
This PR takes the ideas from #2749 and applies them to the double-fast implementation.
Description
We start by pulling a single-segment copy out so that we can work on it separately from the DMS implementation.
This implementation makes two changes to how the input is parsed:
ip + 1
when we find a short match, we checkip + step
. This is a pretty minimal change to the parsing behavior, sincestep
is almost always 1.ip + 1
into the hash table even when we take a long match atip
(instead of only in the short match path). It costs us basically nothing to do this because we've already hashed it. This improves compression ratio.Unlike the fast implementation, whose pipelining includes speculative work that we might throw away, this implementation doesn't do any additional work. It just moves some of it earlier. In particular, the crucial observation is that when we do not take a long match at the current position, we are guaranteed to inspect the next long position, either by taking a short match and checking the next one or by not taking the short match and moving on to the next position. So we can frontload that loading work some.
Benchmarks
Silesia Results Table
Benchmarked on, as usual, an Intel Xeon E5-2680 v4 @ 2.40GHz.
On the whole we see improvements in ratio, and improvements on speed on less-compressible inputs. It seems like very compressible inputs are neutral on speed of even maybe slightly slower.
Status
This PR is believed to be speed-positive, ratio-positive, and correct.
To-Do: