Optimization: Removed unnecessary divides #306

Sc00bz · 2020-10-21T19:59:23Z

No description provided.

codecov-io · 2020-10-21T20:20:34Z

Codecov Report

Merging #306 (05a8c29) into master (440ceb9) will decrease coverage by 0.14%.
The diff coverage is 75.00%.

@@            Coverage Diff             @@
##           master     #306      +/-   ##
==========================================
- Coverage   77.34%   77.19%   -0.15%     
==========================================
  Files           9        9              
  Lines         971      978       +7     
==========================================
+ Hits          751      755       +4     
- Misses        220      223       +3

Impacted Files	Coverage Δ
src/opt.c	`95.50% <75.00%> (-3.28%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 440ceb9...28cccd1. Read the comment docs.

Sc00bz · 2020-10-21T20:26:08Z

I think the only thing that failed was coverage because it doesn't test for lanes=3 (or in general not 2**N).

sneves · 2020-11-03T16:47:30Z

Do you have some performance numbers from before/after to show that this is worth doing?

I would personally hoist the divisionless divisions to a set of inline functions, to improve readability. As it stands it looks like magic.

solardiz · 2021-01-21T00:09:15Z

BTW, I think ~~this~~ getting rid of divides is desirable not only for performance, but also for side-channel safety.

Edit: strike out "this" because of the data-dependent branching introduced with this PR as it currently is. Desirable is getting rid of the divides, but not along with introduction of that other issue.

solardiz · 2021-01-21T11:19:50Z

src/opt.c

+          ref_lane = (pseudo_rand >> 32) - (((pseudo_rand >> 32) * lanes_reciprocal) >> 32) * instance->lanes;
+          if (ref_lane >= instance->lanes) {
+            ref_lane -= instance->lanes;
+          }


As I understand, when running with data dependent addressing the ref_lane >= instance->lane check makes things worse by also introducing data dependent branching. Can we avoid that?

Edit: technically, this check and subtraction can be replaced e.g. with signed subtraction, arithmetic right shift to produce a mask, AND, and subtraction. However, perhaps this is worth a comment on why this condition is even possible (is it?) and why at most one subtraction is enough.

why at most one subtraction is enough

When you do reciprocal = floor(1 / n * 2**32) and floor(x * reciprocal / 2**32) (ie floor(x * floor(1 / n * 2**32) / 2**32)) where x is a 32 bit integer:

x / n = x * 1 / n = x * (1 / n * 2**32) / 2**32 < x * floor(1 / n * 2**32) / 2**32 // the thing < x * (1 / n * 2**32 + 1) / 2**32 = (x / n * 2**32 + x) / 2**32 = x / n + x / 2**32 < x / n + 1 // note x < 2**32 thus x / 2**32 < 1 so x / n < x * floor(1 / n * 2**32) / 2**32 < x / n + 1 thus floor(x / n) ≤ floor(x * floor(1 / n * 2**32) / 2**32) ≤ floor(x / n) + 1

Sorry I've been a lazy piece of shit. I don't know what's wrong with me. I need to like do stuff. Anyway hopefully I didn't fuck up that proof since I'm quite drunk but sort of sobered up in the middle it. (I've proved it to myself a few times but I think this is the first time I've been like oh yeah that's like a real proof.)

solardiz · 2021-01-21T11:30:16Z

The change made in the "Removed unnecessary divides" commit is also a reminder that we have several checks of the form "0 == a && 0 == b" (and that commit adds one more of those). Checks like this can also be written as "0 == (a | b)", so that the code would typically compile to only one conditional branch instead of two. This is generally beneficial, except where the "0 == a" condition is almost always false (thus, almost always triggering short-circuit evaluation and not reaching the second check).

Sc00bz · 2021-02-21T03:28:11Z

... the data-dependent branching introduced with this PR as it currently is. Desirable is getting rid of the divides, but not along with introduction of that other issue.

When I wrote this I was thinking about this but it's either public if it's doing data independent look ups or doesn't matter if it's doing data dependent look ups.

... also a reminder that we have several checks of the form "0 == a && 0 == b" (and that commit adds one more of those). Checks like this can also be written as "0 == (a | b)", so that the code would typically compile to only one conditional branch instead of two. This is generally beneficial, except where the "0 == a" condition is almost always false (thus, almost always triggering short-circuit evaluation and not reaching the second check).

This would be better but it shouldn't matter too much since this only runs 4 times per pass over memory. Also I was going for less edits.

    starting_index = 0;

    /* Offset of the current block */
    curr_offset = position.lane * instance->lane_length +
                  position.slice * instance->segment_length;
    prev_offset = curr_offset - 1;

    if (0 == position.slice) {
        prev_offset += instance->lane_length;
        if (0 == position.pass) {
            /* we have already generated the first two blocks */
            starting_index = 2;
            curr_offset += 2;
            prev_offset = curr_offset - 1;

            /* Don't forget to generate the first block of addresses: */
            if (data_independent_addressing) {
                next_addresses(&address_block, &input_block);
            }
        }
    }

solardiz · 2021-02-21T12:08:23Z

it's either public if it's doing data independent look ups

Right.

or doesn't matter if it's doing data dependent look ups.

This depends on whether you're only thinking in coarse categories such as "side-channel safe vs. unsafe" or also consider real-world nuance within the "unsafe" category. In the real world, different kinds of side-channels might be (un)usable in different circumstances and to different extents.

I wouldn't say that with data dependent lookups it completely doesn't matter that we also have data dependent division or data dependent branching. Ideally, in that flavor we'd have only the data dependent lookups and neither of the other two side-channel leaks.

So to me this PR would make most sense if it removes a side-channel leak (division) without introducing another (data dependent branching). This can in fact be done with only a little bit more code.

Also moved `instance->lanes` onto the stack and fixed formatting.

Sc00bz · 2021-02-22T01:51:09Z

Yeah you're right. OK I removed the if statement.

I kind of want to fix the uint64_t variables, ref_index and ref_lane, that are actually uint32_t but was done probably out of laziness or not wanting to explicitly convert between the two. This is probably causing a few extra instructions up converting a uint32_t during additions and whatnot and 32 bit compiled doing unnecessary add with carries. Also pseudo_rand would get split into 2 uint32_t: pseudo_rand_hi and pseudo_rand_lo. Then getting the high 32 bits from a 32 bit multiply would be replaced with hi32 = (uint32_t) (((uint64_t) a32 * b32) >> 32);. I think after this gets merged I'll fix this.

Oh right I found another divide. It's at the end of index_alpha(). I'll fix this "tomorrow".

phc-winner-argon2/src/core.c

Lines 254 to 255 in 60b31a3

    
           absolute_position = (start_position + relative_position) % 
        
                               instance->lane_length; /* absolute position */

solardiz · 2023-09-25T16:46:54Z

Another reason to get rid of these divides is the recently disclosed AMD Zen1 DIV bug:

https://www.openwall.com/lists/oss-security/2023/09/25/3

In the Zen1 microarchitecure, there is one divider in the pipeline which
services uops from both threads.  In the case of #DE, the latched result
from the previous DIV to execute will be forwarded speculatively.

This is a covert channel that allows two threads to communicate without
any system calls.  In also allows userspace to obtain the result of the
most recent DIV instruction executed (even speculatively) in the core,
which can be from a higher privilege context.

The patches for Xen overwrite the buffer in the divider on the
return-to-guest path.

However, as with some prior speculative vulnerabilities, the fix is only
effective in combination with disabling SMT.  For the same reasons as
before, Xen does not disable SMT by default.

The system administrator is required to risk-assess their workload, and
choose whether to enable or disable SMT.  Xen will issue a warning if
SMT is active and the user has not provided an explicit choice via the
smt=<bool> command line option.

I guess with SMT enabled it's possible to infer the other thread's DIV argument ranges anyway due to the instruction's data-dependent timings, but this bug makes it more direct (just trigger and return from #DE in the other hw thread?) and exact (but I'm not sure whether it's maybe only for the quotient or also for the remainder? someone would need to test).

solardiz · 2023-12-10T03:16:46Z

@Sc00bz FWIW, the remainder may also be computed more directly, not via subtraction, see: openwall/john#5246 (comment)

"Instead of taking the integer part of the fixed-point quotient, this takes its fractional part, brings it to the target range for the remainder, and then takes the integer part of that."

solardiz · 2023-12-10T17:15:36Z

the remainder may also be computed more directly, not via subtraction

So far, I failed to generalize this approach to arbitrary divisors across the full 32-bit dividend range. So I proceeded to test @Sc00bz's approach with subtraction and a fix-up more. It passes my testing, except for divisor = 1, which this PR's code handles specially anyway (along with all powers of 2). It appears possible to get it to work for divisor = 1 as well by simply changing 0x100000000 to 0xffffffff when computing the reciprocal, so maybe this should be done. I guess it can also be a simpler instruction on some architectures (no 64-bit input). I am testing with this program:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint32_t j = 1;
	do {
		printf("Testing %u\n", j);
		uint32_t rec = 0xffffffffU / j;
		//uint32_t rec = (1ull << 32) / j;
		//uint32_t rec = ((1ull << 32) + j - 1) / j;
		//uint32_t rec = ((1ull << 33) + j - 1) / j;
		uint32_t i = 0;
		do {
			//uint32_t res = ((i + 1) * rec * (uint64_t)j) >> 32;
			//uint32_t res = (i * rec * (uint64_t)j) >> 32;
			uint32_t res = i - ((i * (uint64_t)rec) >> 32) * j;
			res += (j & (((uint64_t)res - j) >> 32)) - j;
			//uint32_t res = i - ((i * (uint64_t)rec) >> 33) * j;
			if (res != i % j) {
				printf("%u %% %u = %u got %u\n", i, j, i % j, res);
				break;
			}
		} while (++i);
	} while (++j);

	return 0;
}

Indeed, can't practically test the full (almost) 64-bit space in this way, but can adjust it to test expected edge-cases.

sneves · 2023-12-10T20:55:49Z

Lemire, Kaser, and Kurz have worked out the details of that approach: https://arxiv.org/abs/1902.01961

Sc00bz · 2023-12-10T21:45:25Z

So I recently found out this is called Barrett reduction (or Wikipedia article on it). It's more for a*b (mod n) where a<n and b<n, but it can be made to work in more general cases like this.

You can think of it as doing:

easyDivide = 2^32
x = floor(easyDivide / n)

quotient = floor(rand * x / easyDivide)
    which is basically:
        quotient = "floor(rand * easyDivide / n / easyDivide)"
        quotient = "floor(rand / n)"

remainder = rand - quotient * n

But since x is an approximation, the quotient is an approximation and is either correct or 1 less assuming rand<easyDivide.

In your test code, you only need to check j<0x1000000 because max lanes is 0xffffff. Also you only need to check near multiples of i because the approximation is on the quotient. So you only need to check if the quotient is correct. Also you can skip nice binary numbers for j.

Here's my test code. I didn't remove nice binary numbers for j. Each output step should take about the same amount of time except the first one "j=1 to 1". Anyway I ran this with rec = 0xffffffffU / j; (for j = 1 to 0xffffff) and rec = (uint32_t) (UINT64_C(0x100000000) / j); (for j = 2 to 0xffffff) and it didn't get any wrong:

#include <stdio.h>
#include <stdint.h>

#ifdef _WIN32
	#include <windows.h>
	typedef LARGE_INTEGER TIMER_TYPE;
	#define TIMER_FUNC(t)             QueryPerformanceCounter(&t)

	inline double TIMER_DIFF(LARGE_INTEGER s, LARGE_INTEGER e)
	{
		LARGE_INTEGER f;
		QueryPerformanceFrequency(&f);
		return ((double) (e.QuadPart - s.QuadPart)) / f.QuadPart;
	}
#else
	#include <sys/time.h>

	typedef timeval TIMER_TYPE;
	#define TIMER_FUNC(t)             gettimeofday(&t, NULL)
	#define TIMER_DIFF(s,e)           ((e.tv_sec - s.tv_sec) + (e.tv_usec - s.tv_usec) / (double) 1000000.0)
#endif

int test(uint32_t i, uint32_t j, uint32_t rec, uint32_t correctQuotient)
{
	uint32_t quotient = (uint32_t) (((uint64_t) i * rec) >> 32) + 1; // correctQuotient or correctQuotient+1
	uint32_t remainder = i - quotient * j; // i < quotient * j, otherwise the remainder is correct
	uint32_t offByOne = ((int32_t) remainder >> 31); // 0 or 0xffffffff "-1"
	quotient  += offByOne;     // quotient or quotient-1
	remainder += j & offByOne; // remainder or remainder+j
	if (quotient != correctQuotient)
	{
		printf("\nError %u / %u = %u got %u\n", i, j, correctQuotient, quotient);
		return 1;
	}
	return 0;
}

int main()
{
	TIMER_TYPE s, sFirst, e;
	uint32_t rec;

	TIMER_FUNC(s);
	sFirst = s;

	// j = 1
	printf("Testing 1 to 1...");
	rec = 0xffffffffU / 1;
	for (uint32_t i = 1; i != 0; i++)
	{
		if (test(i, 1, rec, i))
		{
			return 1;
		}
	}

	for (uint32_t j = 2; j < 0x1000000; j++)
	{
		if (((j - 1) & j) == 0)
		{
			TIMER_FUNC(e);
			printf(" took %f\nTesting %u to %u...", TIMER_DIFF(s, e), j, 2 * j - 1);
			s = e;
		}
		// rec = (uint32_t) (UINT64_C(0x100000000) / j);
		rec = 0xffffffffU / j;
		uint32_t correctQuotient = 0;
		for (uint32_t nj = j; ; nj += j)
		{
			// n*j-1 overflowed
			if (nj - 1 < nj - j) break;
			if (test(nj - 1, j, rec, correctQuotient)) return 1;

			correctQuotient++;

			// n*j overflowed
			if (nj == 0) break;
			if (test(nj, j, rec, correctQuotient)) return 1;
		}
	}
	TIMER_FUNC(e);
	printf(" took %f\nTotal time %f\n", TIMER_DIFF(s, e), TIMER_DIFF(sFirst, e));
	return 0;
}

Oh by:

res += (j & (((uint64_t) res - j) >> 32)) - j;

You meant to do signed right shift:

res += (j & (((int32_t) res - j) >> 31)) - j;

OH LOL you copied that from me... because ref_lane is a uint64_t even though it should be a uint32_t (or I guess size_t) and I didn't want to make more changes than necessary. Either works it's just your test code makes what I did look weirder.

P.S. Using uint32_t rec = 0xffffffffU / j; might break others because you are doing i / j * 0xffffffff / 0x100000000. So the quotient could theoretically be off by 2... I think. Anyway I ran my test code with the rest of j=0x1000000 to 0xffffffff and it didn't get any wrong (with either method).

Sc00bz · 2023-12-10T21:54:45Z

Just realized I should also test max value test(0xffffffff, j, rec, 0xffffffff / j)... OK this also works.

solardiz · 2023-12-13T23:37:59Z

Lemire, Kaser, and Kurz have worked out the details of that approach: https://arxiv.org/abs/1902.01961

Oh, thanks! Also Lemire's blog posts https://lemire.me/blog/2019/02/08/faster-remainders-when-the-divisor-is-a-constant-beating-compilers-and-libdivide/ and https://lemire.me/blog/2019/02/20/more-fun-with-fast-remainders-when-the-divisor-is-a-constant/

So for the full 32-bit dividend range and arbitrary (even if not full 32-bit) divisors, we need wider than 32x32->64 multiplies to retain sufficient precision. (For mod 3 and a few others, I managed to find parameters that work within 32x32->64 across the full 32-bit dividend range, but per the above paper and per my experiments this can't be generalized to work for a wide range of divisors.)

Edit: for those skimming these comments - the above is about my (non-)suggestion to compute the remainder more directly than is done in this PR. The code currently in this PR is likely correct as-is, and should probably be merged. I am not arguing against it.

codecov-commenter · 2024-07-10T11:59:16Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 81.25000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 77.22%. Comparing base (440ceb9) to head (28cccd1).
Report is 12 commits behind head on master.

Files	Patch %	Lines
src/opt.c	75.00%	3 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #306      +/-   ##
==========================================
- Coverage   77.34%   77.22%   -0.13%     
==========================================
  Files           9        9              
  Lines         971      979       +8     
==========================================
+ Hits          751      756       +5     
- Misses        220      223       +3

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Sc00bz added 2 commits October 16, 2020 18:50

Removed unnecessary divides

dc6dc27

Replaced divide with reciprocal fix point multiply

bba500b

solardiz reviewed Jan 21, 2021

View reviewed changes

Removed data dependent branching

05a8c29

Also moved `instance->lanes` onto the stack and fixed formatting.

Removed unnecessary divide

28cccd1

solardiz mentioned this pull request Mar 30, 2023

NT-long OpenCL: Support lengths up to 125 openwall/john#5246

Merged

solardiz mentioned this pull request Dec 10, 2023

Get rid of modulo division in Argon2 openwall/john#5385

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimization: Removed unnecessary divides #306

Optimization: Removed unnecessary divides #306

Sc00bz commented Oct 21, 2020

codecov-io commented Oct 21, 2020 •

edited

Loading

Sc00bz commented Oct 21, 2020

sneves commented Nov 3, 2020

solardiz commented Jan 21, 2021 •

edited

Loading

solardiz Jan 21, 2021 •

edited

Loading

Sc00bz Feb 20, 2021

solardiz commented Jan 21, 2021

Sc00bz commented Feb 21, 2021

solardiz commented Feb 21, 2021

Sc00bz commented Feb 22, 2021

solardiz commented Sep 25, 2023

solardiz commented Dec 10, 2023

solardiz commented Dec 10, 2023

sneves commented Dec 10, 2023

Sc00bz commented Dec 10, 2023

Sc00bz commented Dec 10, 2023

solardiz commented Dec 13, 2023 •

edited

Loading

codecov-commenter commented Jul 10, 2024

Optimization: Removed unnecessary divides #306

Are you sure you want to change the base?

Optimization: Removed unnecessary divides #306

Conversation

Sc00bz commented Oct 21, 2020

codecov-io commented Oct 21, 2020 • edited Loading

Codecov Report

Sc00bz commented Oct 21, 2020

sneves commented Nov 3, 2020

solardiz commented Jan 21, 2021 • edited Loading

solardiz Jan 21, 2021 • edited Loading

Choose a reason for hiding this comment

Sc00bz Feb 20, 2021

Choose a reason for hiding this comment

solardiz commented Jan 21, 2021

Sc00bz commented Feb 21, 2021

solardiz commented Feb 21, 2021

Sc00bz commented Feb 22, 2021

solardiz commented Sep 25, 2023

solardiz commented Dec 10, 2023

solardiz commented Dec 10, 2023

sneves commented Dec 10, 2023

Sc00bz commented Dec 10, 2023

Sc00bz commented Dec 10, 2023

solardiz commented Dec 13, 2023 • edited Loading

codecov-commenter commented Jul 10, 2024

Codecov Report

codecov-io commented Oct 21, 2020 •

edited

Loading

solardiz commented Jan 21, 2021 •

edited

Loading

solardiz Jan 21, 2021 •

edited

Loading

solardiz commented Dec 13, 2023 •

edited

Loading