Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimization: Removed unnecessary divides #306

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

Sc00bz
Copy link
Contributor

@Sc00bz Sc00bz commented Oct 21, 2020

No description provided.

@codecov-io
Copy link

codecov-io commented Oct 21, 2020

Codecov Report

Merging #306 (05a8c29) into master (440ceb9) will decrease coverage by 0.14%.
The diff coverage is 75.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #306      +/-   ##
==========================================
- Coverage   77.34%   77.19%   -0.15%     
==========================================
  Files           9        9              
  Lines         971      978       +7     
==========================================
+ Hits          751      755       +4     
- Misses        220      223       +3     
Impacted Files Coverage Δ
src/opt.c 95.50% <75.00%> (-3.28%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 440ceb9...28cccd1. Read the comment docs.

@Sc00bz
Copy link
Contributor Author

Sc00bz commented Oct 21, 2020

I think the only thing that failed was coverage because it doesn't test for lanes=3 (or in general not 2**N).

@sneves
Copy link
Contributor

sneves commented Nov 3, 2020

Do you have some performance numbers from before/after to show that this is worth doing?

I would personally hoist the divisionless divisions to a set of inline functions, to improve readability. As it stands it looks like magic.

@solardiz
Copy link

solardiz commented Jan 21, 2021

BTW, I think this getting rid of divides is desirable not only for performance, but also for side-channel safety.

Edit: strike out "this" because of the data-dependent branching introduced with this PR as it currently is. Desirable is getting rid of the divides, but not along with introduction of that other issue.

src/opt.c Outdated
ref_lane = (pseudo_rand >> 32) - (((pseudo_rand >> 32) * lanes_reciprocal) >> 32) * instance->lanes;
if (ref_lane >= instance->lanes) {
ref_lane -= instance->lanes;
}
Copy link

@solardiz solardiz Jan 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understand, when running with data dependent addressing the ref_lane >= instance->lane check makes things worse by also introducing data dependent branching. Can we avoid that?

Edit: technically, this check and subtraction can be replaced e.g. with signed subtraction, arithmetic right shift to produce a mask, AND, and subtraction. However, perhaps this is worth a comment on why this condition is even possible (is it?) and why at most one subtraction is enough.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why at most one subtraction is enough

When you do reciprocal = floor(1 / n * 2**32) and floor(x * reciprocal / 2**32) (ie floor(x * floor(1 / n * 2**32) / 2**32)) where x is a 32 bit integer:

x / n
= x * 1 / n
= x * (1 / n * 2**32) / 2**32
< x * floor(1 / n * 2**32) / 2**32  // the thing
< x * (1 / n * 2**32 + 1) / 2**32
= (x / n * 2**32 + x) / 2**32
= x / n + x / 2**32
< x / n + 1  // note x < 2**32 thus x / 2**32 < 1
so
x / n < x * floor(1 / n * 2**32) / 2**32 < x / n + 1
thus
floor(x / n) ≤ floor(x * floor(1 / n * 2**32) / 2**32) ≤ floor(x / n) + 1

Sorry I've been a lazy piece of shit. I don't know what's wrong with me. I need to like do stuff. Anyway hopefully I didn't fuck up that proof since I'm quite drunk but sort of sobered up in the middle it. (I've proved it to myself a few times but I think this is the first time I've been like oh yeah that's like a real proof.)

@solardiz
Copy link

The change made in the "Removed unnecessary divides" commit is also a reminder that we have several checks of the form "0 == a && 0 == b" (and that commit adds one more of those). Checks like this can also be written as "0 == (a | b)", so that the code would typically compile to only one conditional branch instead of two. This is generally beneficial, except where the "0 == a" condition is almost always false (thus, almost always triggering short-circuit evaluation and not reaching the second check).

@Sc00bz
Copy link
Contributor Author

Sc00bz commented Feb 21, 2021

... the data-dependent branching introduced with this PR as it currently is. Desirable is getting rid of the divides, but not along with introduction of that other issue.

When I wrote this I was thinking about this but it's either public if it's doing data independent look ups or doesn't matter if it's doing data dependent look ups.

... also a reminder that we have several checks of the form "0 == a && 0 == b" (and that commit adds one more of those). Checks like this can also be written as "0 == (a | b)", so that the code would typically compile to only one conditional branch instead of two. This is generally beneficial, except where the "0 == a" condition is almost always false (thus, almost always triggering short-circuit evaluation and not reaching the second check).

This would be better but it shouldn't matter too much since this only runs 4 times per pass over memory. Also I was going for less edits.

    starting_index = 0;

    /* Offset of the current block */
    curr_offset = position.lane * instance->lane_length +
                  position.slice * instance->segment_length;
    prev_offset = curr_offset - 1;

    if (0 == position.slice) {
        prev_offset += instance->lane_length;
        if (0 == position.pass) {
            /* we have already generated the first two blocks */
            starting_index = 2;
            curr_offset += 2;
            prev_offset = curr_offset - 1;

            /* Don't forget to generate the first block of addresses: */
            if (data_independent_addressing) {
                next_addresses(&address_block, &input_block);
            }
        }
    }

@solardiz
Copy link

it's either public if it's doing data independent look ups

Right.

or doesn't matter if it's doing data dependent look ups.

This depends on whether you're only thinking in coarse categories such as "side-channel safe vs. unsafe" or also consider real-world nuance within the "unsafe" category. In the real world, different kinds of side-channels might be (un)usable in different circumstances and to different extents.

I wouldn't say that with data dependent lookups it completely doesn't matter that we also have data dependent division or data dependent branching. Ideally, in that flavor we'd have only the data dependent lookups and neither of the other two side-channel leaks.

So to me this PR would make most sense if it removes a side-channel leak (division) without introducing another (data dependent branching). This can in fact be done with only a little bit more code.

Also moved `instance->lanes` onto the stack and fixed formatting.
@Sc00bz
Copy link
Contributor Author

Sc00bz commented Feb 22, 2021

Yeah you're right. OK I removed the if statement.

I kind of want to fix the uint64_t variables, ref_index and ref_lane, that are actually uint32_t but was done probably out of laziness or not wanting to explicitly convert between the two. This is probably causing a few extra instructions up converting a uint32_t during additions and whatnot and 32 bit compiled doing unnecessary add with carries. Also pseudo_rand would get split into 2 uint32_t: pseudo_rand_hi and pseudo_rand_lo. Then getting the high 32 bits from a 32 bit multiply would be replaced with hi32 = (uint32_t) (((uint64_t) a32 * b32) >> 32);. I think after this gets merged I'll fix this.

Oh right I found another divide. It's at the end of index_alpha(). I'll fix this "tomorrow".

absolute_position = (start_position + relative_position) %
instance->lane_length; /* absolute position */

@solardiz
Copy link

Another reason to get rid of these divides is the recently disclosed AMD Zen1 DIV bug:

https://www.openwall.com/lists/oss-security/2023/09/25/3

In the Zen1 microarchitecure, there is one divider in the pipeline which
services uops from both threads.  In the case of #DE, the latched result
from the previous DIV to execute will be forwarded speculatively.

This is a covert channel that allows two threads to communicate without
any system calls.  In also allows userspace to obtain the result of the
most recent DIV instruction executed (even speculatively) in the core,
which can be from a higher privilege context.
The patches for Xen overwrite the buffer in the divider on the
return-to-guest path.

However, as with some prior speculative vulnerabilities, the fix is only
effective in combination with disabling SMT.  For the same reasons as
before, Xen does not disable SMT by default.

The system administrator is required to risk-assess their workload, and
choose whether to enable or disable SMT.  Xen will issue a warning if
SMT is active and the user has not provided an explicit choice via the
smt=<bool> command line option.

I guess with SMT enabled it's possible to infer the other thread's DIV argument ranges anyway due to the instruction's data-dependent timings, but this bug makes it more direct (just trigger and return from #DE in the other hw thread?) and exact (but I'm not sure whether it's maybe only for the quotient or also for the remainder? someone would need to test).

@solardiz
Copy link

@Sc00bz FWIW, the remainder may also be computed more directly, not via subtraction, see: openwall/john#5246 (comment)

"Instead of taking the integer part of the fixed-point quotient, this takes its fractional part, brings it to the target range for the remainder, and then takes the integer part of that."

@solardiz
Copy link

the remainder may also be computed more directly, not via subtraction

So far, I failed to generalize this approach to arbitrary divisors across the full 32-bit dividend range. So I proceeded to test @Sc00bz's approach with subtraction and a fix-up more. It passes my testing, except for divisor = 1, which this PR's code handles specially anyway (along with all powers of 2). It appears possible to get it to work for divisor = 1 as well by simply changing 0x100000000 to 0xffffffff when computing the reciprocal, so maybe this should be done. I guess it can also be a simpler instruction on some architectures (no 64-bit input). I am testing with this program:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint32_t j = 1;
	do {
		printf("Testing %u\n", j);
		uint32_t rec = 0xffffffffU / j;
		//uint32_t rec = (1ull << 32) / j;
		//uint32_t rec = ((1ull << 32) + j - 1) / j;
		//uint32_t rec = ((1ull << 33) + j - 1) / j;
		uint32_t i = 0;
		do {
			//uint32_t res = ((i + 1) * rec * (uint64_t)j) >> 32;
			//uint32_t res = (i * rec * (uint64_t)j) >> 32;
			uint32_t res = i - ((i * (uint64_t)rec) >> 32) * j;
			res += (j & (((uint64_t)res - j) >> 32)) - j;
			//uint32_t res = i - ((i * (uint64_t)rec) >> 33) * j;
			if (res != i % j) {
				printf("%u %% %u = %u got %u\n", i, j, i % j, res);
				break;
			}
		} while (++i);
	} while (++j);

	return 0;
}

Indeed, can't practically test the full (almost) 64-bit space in this way, but can adjust it to test expected edge-cases.

@sneves
Copy link
Contributor

sneves commented Dec 10, 2023

Lemire, Kaser, and Kurz have worked out the details of that approach: https://arxiv.org/abs/1902.01961

@Sc00bz
Copy link
Contributor Author

Sc00bz commented Dec 10, 2023

So I recently found out this is called Barrett reduction (or Wikipedia article on it). It's more for a*b (mod n) where a<n and b<n, but it can be made to work in more general cases like this.

You can think of it as doing:

easyDivide = 2^32
x = floor(easyDivide / n)

quotient = floor(rand * x / easyDivide)
    which is basically:
        quotient = "floor(rand * easyDivide / n / easyDivide)"
        quotient = "floor(rand / n)"

remainder = rand - quotient * n

But since x is an approximation, the quotient is an approximation and is either correct or 1 less assuming rand<easyDivide.


In your test code, you only need to check j<0x1000000 because max lanes is 0xffffff. Also you only need to check near multiples of i because the approximation is on the quotient. So you only need to check if the quotient is correct. Also you can skip nice binary numbers for j.

Here's my test code. I didn't remove nice binary numbers for j. Each output step should take about the same amount of time except the first one "j=1 to 1". Anyway I ran this with rec = 0xffffffffU / j; (for j = 1 to 0xffffff) and rec = (uint32_t) (UINT64_C(0x100000000) / j); (for j = 2 to 0xffffff) and it didn't get any wrong:

#include <stdio.h>
#include <stdint.h>

#ifdef _WIN32
	#include <windows.h>
	typedef LARGE_INTEGER TIMER_TYPE;
	#define TIMER_FUNC(t)             QueryPerformanceCounter(&t)

	inline double TIMER_DIFF(LARGE_INTEGER s, LARGE_INTEGER e)
	{
		LARGE_INTEGER f;
		QueryPerformanceFrequency(&f);
		return ((double) (e.QuadPart - s.QuadPart)) / f.QuadPart;
	}
#else
	#include <sys/time.h>

	typedef timeval TIMER_TYPE;
	#define TIMER_FUNC(t)             gettimeofday(&t, NULL)
	#define TIMER_DIFF(s,e)           ((e.tv_sec - s.tv_sec) + (e.tv_usec - s.tv_usec) / (double) 1000000.0)
#endif

int test(uint32_t i, uint32_t j, uint32_t rec, uint32_t correctQuotient)
{
	uint32_t quotient = (uint32_t) (((uint64_t) i * rec) >> 32) + 1; // correctQuotient or correctQuotient+1
	uint32_t remainder = i - quotient * j; // i < quotient * j, otherwise the remainder is correct
	uint32_t offByOne = ((int32_t) remainder >> 31); // 0 or 0xffffffff "-1"
	quotient  += offByOne;     // quotient or quotient-1
	remainder += j & offByOne; // remainder or remainder+j
	if (quotient != correctQuotient)
	{
		printf("\nError %u / %u = %u got %u\n", i, j, correctQuotient, quotient);
		return 1;
	}
	return 0;
}

int main()
{
	TIMER_TYPE s, sFirst, e;
	uint32_t rec;

	TIMER_FUNC(s);
	sFirst = s;

	// j = 1
	printf("Testing 1 to 1...");
	rec = 0xffffffffU / 1;
	for (uint32_t i = 1; i != 0; i++)
	{
		if (test(i, 1, rec, i))
		{
			return 1;
		}
	}

	for (uint32_t j = 2; j < 0x1000000; j++)
	{
		if (((j - 1) & j) == 0)
		{
			TIMER_FUNC(e);
			printf(" took %f\nTesting %u to %u...", TIMER_DIFF(s, e), j, 2 * j - 1);
			s = e;
		}
		// rec = (uint32_t) (UINT64_C(0x100000000) / j);
		rec = 0xffffffffU / j;
		uint32_t correctQuotient = 0;
		for (uint32_t nj = j; ; nj += j)
		{
			// n*j-1 overflowed
			if (nj - 1 < nj - j) break;
			if (test(nj - 1, j, rec, correctQuotient)) return 1;

			correctQuotient++;

			// n*j overflowed
			if (nj == 0) break;
			if (test(nj, j, rec, correctQuotient)) return 1;
		}
	}
	TIMER_FUNC(e);
	printf(" took %f\nTotal time %f\n", TIMER_DIFF(s, e), TIMER_DIFF(sFirst, e));
	return 0;
}

Oh by:

res += (j & (((uint64_t) res - j) >> 32)) - j;

You meant to do signed right shift:

res += (j & (((int32_t) res - j) >> 31)) - j;

OH LOL you copied that from me... because ref_lane is a uint64_t even though it should be a uint32_t (or I guess size_t) and I didn't want to make more changes than necessary. Either works it's just your test code makes what I did look weirder.


P.S. Using uint32_t rec = 0xffffffffU / j; might break others because you are doing i / j * 0xffffffff / 0x100000000. So the quotient could theoretically be off by 2... I think. Anyway I ran my test code with the rest of j=0x1000000 to 0xffffffff and it didn't get any wrong (with either method).

@Sc00bz
Copy link
Contributor Author

Sc00bz commented Dec 10, 2023

Just realized I should also test max value test(0xffffffff, j, rec, 0xffffffff / j)... OK this also works.

@solardiz
Copy link

solardiz commented Dec 13, 2023

Lemire, Kaser, and Kurz have worked out the details of that approach: https://arxiv.org/abs/1902.01961

Oh, thanks! Also Lemire's blog posts https://lemire.me/blog/2019/02/08/faster-remainders-when-the-divisor-is-a-constant-beating-compilers-and-libdivide/ and https://lemire.me/blog/2019/02/20/more-fun-with-fast-remainders-when-the-divisor-is-a-constant/

So for the full 32-bit dividend range and arbitrary (even if not full 32-bit) divisors, we need wider than 32x32->64 multiplies to retain sufficient precision. (For mod 3 and a few others, I managed to find parameters that work within 32x32->64 across the full 32-bit dividend range, but per the above paper and per my experiments this can't be generalized to work for a wide range of divisors.)

Edit: for those skimming these comments - the above is about my (non-)suggestion to compute the remainder more directly than is done in this PR. The code currently in this PR is likely correct as-is, and should probably be merged. I am not arguing against it.

@codecov-commenter
Copy link

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 81.25000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 77.22%. Comparing base (440ceb9) to head (28cccd1).
Report is 12 commits behind head on master.

Files Patch % Lines
src/opt.c 75.00% 3 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #306      +/-   ##
==========================================
- Coverage   77.34%   77.22%   -0.13%     
==========================================
  Files           9        9              
  Lines         971      979       +8     
==========================================
+ Hits          751      756       +5     
- Misses        220      223       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants