Operators only implemented on references #57

UnlawfulMonad · 2017-05-22T02:34:31Z

I've noticed that operators on ExtendedPoint and the like are only implemented on &'a ExtendedPoint. Is there any reason for a PR that also implemented them for the concrete types wouldn't be accepted? The types are already copy and as far as I know the compile is better at reasoning about the code when there isn't explicit memory (i.e. a reference) involved.

The text was updated successfully, but these errors were encountered:

burdges · 2017-05-22T08:58:59Z

If I understand, the benchmarks suffer if you pass these structures by value, but maybe one can tweak that with optimizations modes, etc. In particular, passing by value permits the recipient to mutate the value, so you stack space optimizations might inhibit doing a pass by reference behind the scenes. I'd also think making the operators #[inline(always)] wrappers around call by reference functions should fix this, but maybe not.

hdevalence · 2017-05-23T06:01:59Z

When we implemented the arithmetic ops for T and not for &T, we had a lot of nested redundant copies: first, copying the LHS and RHS points into the function (each point is 160 bytes), plus copying the LHS and RHS of every field element operation inside (each field element is 40 bytes). It did make a difference in cutting out needless copies. However it's obviously worse ergonomically, as I'm sure you've noticed.

I didn't think that it would be possible to work around this, but it seems like @burdges' suggestion might work, at least on that toy example.

So, to answer your original question, @UnlawfulMonad, if there were implementations of the arithmetic ops that didn't require borrows but were just as fast, we would be happy to take that PR.

isislovecruft · 2017-05-26T21:46:12Z

Hi @UnlawfulMonad!

As @hdevalence said, we'd be stoked to take a patch for this, provided it doesn't slow anything down by introducing needless copies everywhere. If you want to implement this, my suggestion for the steps would be:

Implement the operators for non-borrow types, using @burdges suggestion to use #[inline(always)]. (Bonus points for adding #[inline(always)] to all the operator implementations!)
Find anywhere in our code (or in ed25519-dalek or dalek-rangeproofs that is doing what I call an "Eye of Sauron" (i.e. &(&(&(&( )))) like &(&(&s1 * &P1) + &(s2 * &P2)) + &(s3 * &P3) or something similarly evil looking), change it to your non-borrow syntax, and compare—with various rustc optimisation levels—the assembly generated from the original code to the assembly for the new syntax to ensure there's no extra copies. https://rust.godbolt.org is really handy for this. (If you're not comfortable with assembly, feel free to paste links to the the two versions of the code in godbolt, and we'll review it.)
(optional) If all goes well, feel free to kill as many Eye of Saurons as you find. :)

@UnlawfulMonad Would you like me to assign you to this issue?

UnlawfulMonad · 2017-05-26T22:11:49Z

Hi @isislovecruft and @hdevalence!

Thanks for the suggestions. I'll get on this next week and update this issue with my progress. I assumed that inlining would already be occurring if the call by value variant were simply wrapping the call by reference versions.

Sidenote: Eye of Sauron is a great name for those things. :)

isislovecruft · 2017-05-28T23:30:31Z

@UnlawfulMonad Awesome, thanks! It appears like if we add #[inline(always)] also to the borrow versions, then the by-value versions inline directly to the borrow version code (i.e. so that rustc is basically adding the Eye of Sauron for you).

(Also I used to work at the FreeGeek in Portland! It is an excellent project!)

UnlawfulMonad · 2017-06-12T23:36:30Z

@isislovecruft Is code bloat not a concern at that point? #[inline(always)] forces rustc/LLVM to inline. Another option would be to leave the Eyes of Sauron in the library code but gives users the option to use either impl (overgeneralization warning) since in practice the overhead in user code should be negligible. I'm going to test simply adding #[inline] attributes and compare the overhead vs #[inline(always)]ing all the operators.

Sorry for the lack of progress. It's been a hectic couple of weeks.

(And that's awesome! Agreed, it's an awesome project!)

UnlawfulMonad · 2017-06-13T00:20:40Z

Alright. Here are my results:

All readings with rustc version 1.19.0-nightly (148e91714 2017-06-08).

Baseline from commit 97d6974:

test curve::bench::add_extended_and_affine_niels_output_completed     ... bench:         212 ns/iter (+/- 87)
test curve::bench::add_extended_and_affine_niels_output_extended      ... bench:         480 ns/iter (+/- 136)
test curve::bench::add_extended_and_projective_niels_output_completed ... bench:         275 ns/iter (+/- 6)
test curve::bench::add_extended_and_projective_niels_output_extended  ... bench:         525 ns/iter (+/- 23)
test curve::bench::basepoint_mult                                     ... bench:      35,181 ns/iter (+/- 12,430)
test curve::bench::bench_select_precomputed_point                     ... bench:       1,032 ns/iter (+/- 50)
test curve::bench::edwards_compress                                   ... bench:      12,023 ns/iter (+/- 1,532)
test curve::bench::edwards_decompress                                 ... bench:      12,837 ns/iter (+/- 2,573)
test curve::bench::extended_double_output_extended                    ... bench:         486 ns/iter (+/- 100)
test curve::bench::mult_by_cofactor                                   ... bench:       1,316 ns/iter (+/- 251)
test curve::bench::projective_double_output_completed                 ... bench:         226 ns/iter (+/- 118)
test curve::bench::scalar_mult                                        ... bench:     152,515 ns/iter (+/- 2,688)
test curve::bench::vartime::bench_double_scalar_mult_basepoint        ... bench:     159,619 ns/iter (+/- 2,291)
test curve::bench::vartime::ten_fold_scalar_mult                      ... bench:     399,000 ns/iter (+/- 7,540)
test field::bench::fieldelement_a_inv                                 ... bench:      12,200 ns/iter (+/- 1,889)
test field::bench::fieldelement_a_mul_a                               ... bench:          67 ns/iter (+/- 1)
test field::bench::fieldelement_a_sq                                  ... bench:          48 ns/iter (+/- 1)
test scalar::bench::invert                                            ... bench:      67,871 ns/iter (+/- 803)
test scalar::bench::scalar_multiply_add                               ... bench:         270 ns/iter (+/- 68)
test scalar::bench::scalar_random                                     ... bench:       1,115 ns/iter (+/- 101)
test scalar::bench::scalar_unpacked_multiply_add                      ... bench:         199 ns/iter (+/- 12)

With #[inline] on all operators (commit 6cdae94 on the callbyvalue branch of my fork):

test curve::bench::add_extended_and_affine_niels_output_completed              ... bench:         220 ns/iter (+/- 9)
test curve::bench::add_extended_and_affine_niels_output_extended               ... bench:         493 ns/iter (+/- 28)
test curve::bench::add_extended_and_projective_niels_output_completed          ... bench:         287 ns/iter (+/- 23)
test curve::bench::add_extended_and_projective_niels_output_completed_by_value ... bench:         299 ns/iter (+/- 9)
test curve::bench::add_extended_and_projective_niels_output_extended           ... bench:         558 ns/iter (+/- 72)
test curve::bench::add_extended_and_projective_niels_output_extended_by_value  ... bench:         593 ns/iter (+/- 33)
test curve::bench::basepoint_mult                                              ... bench:      36,890 ns/iter (+/- 6,329)
test curve::bench::bench_select_precomputed_point                              ... bench:         891 ns/iter (+/- 60)
test curve::bench::edwards_compress                                            ... bench:      12,197 ns/iter (+/- 2,828)
test curve::bench::edwards_decompress                                          ... bench:      12,710 ns/iter (+/- 1,029)
test curve::bench::extended_double_output_extended                             ... bench:         493 ns/iter (+/- 14)
test curve::bench::mult_by_cofactor                                            ... bench:       1,318 ns/iter (+/- 96)
test curve::bench::projective_double_output_completed                          ... bench:         220 ns/iter (+/- 10)
test curve::bench::scalar_mult                                                 ... bench:     156,671 ns/iter (+/- 10,735)
test curve::bench::scalar_mult_by_value                                        ... bench:     163,941 ns/iter (+/- 13,653)
test curve::bench::vartime::bench_double_scalar_mult_basepoint                 ... bench:     160,486 ns/iter (+/- 10,722)
test curve::bench::vartime::ten_fold_scalar_mult                               ... bench:     402,349 ns/iter (+/- 16,962)
test field::bench::fieldelement_a_inv                                          ... bench:      11,891 ns/iter (+/- 2,288)
test field::bench::fieldelement_a_mul_a                                        ... bench:          65 ns/iter (+/- 2)
test field::bench::fieldelement_a_mul_a_by_value                               ... bench:          67 ns/iter (+/- 6)
test field::bench::fieldelement_a_sq                                           ... bench:          47 ns/iter (+/- 3)
test scalar::bench::invert                                                     ... bench:      65,925 ns/iter (+/- 3,157)
test scalar::bench::scalar_multiply_add                                        ... bench:         257 ns/iter (+/- 12)
test scalar::bench::scalar_random                                              ... bench:       1,135 ns/iter (+/- 61)
test scalar::bench::scalar_unpacked_multiply_add                               ... bench:         195 ns/iter (+/- 23)

With #[inline(always)] on all operators (commit 156c92d on the call by value branch of my fork):

test curve::bench::add_extended_and_affine_niels_output_completed              ... bench:         208 ns/iter (+/- 6)
test curve::bench::add_extended_and_affine_niels_output_extended               ... bench:         488 ns/iter (+/- 19)
test curve::bench::add_extended_and_projective_niels_output_completed          ... bench:         278 ns/iter (+/- 9)
test curve::bench::add_extended_and_projective_niels_output_completed_by_value ... bench:         295 ns/iter (+/- 11)
test curve::bench::add_extended_and_projective_niels_output_extended           ... bench:         550 ns/iter (+/- 20)
test curve::bench::add_extended_and_projective_niels_output_extended_by_value  ... bench:         565 ns/iter (+/- 16)
test curve::bench::basepoint_mult                                              ... bench:      36,467 ns/iter (+/- 1,810)
test curve::bench::bench_select_precomputed_point                              ... bench:         882 ns/iter (+/- 36)
test curve::bench::edwards_compress                                            ... bench:      12,003 ns/iter (+/- 1,698)
test curve::bench::edwards_decompress                                          ... bench:      12,630 ns/iter (+/- 1,812)
test curve::bench::extended_double_output_extended                             ... bench:         496 ns/iter (+/- 32)
test curve::bench::mult_by_cofactor                                            ... bench:       1,331 ns/iter (+/- 32)
test curve::bench::projective_double_output_completed                          ... bench:         217 ns/iter (+/- 8)
test curve::bench::scalar_mult                                                 ... bench:     155,447 ns/iter (+/- 5,160)
test curve::bench::scalar_mult_by_value                                        ... bench:     156,946 ns/iter (+/- 6,148)
test curve::bench::vartime::bench_double_scalar_mult_basepoint                 ... bench:     157,789 ns/iter (+/- 9,171)
test curve::bench::vartime::ten_fold_scalar_mult                               ... bench:     405,442 ns/iter (+/- 19,138)
test field::bench::fieldelement_a_inv                                          ... bench:      11,806 ns/iter (+/- 1,706)
test field::bench::fieldelement_a_mul_a                                        ... bench:          66 ns/iter (+/- 3)
test field::bench::fieldelement_a_mul_a_by_value                               ... bench:          67 ns/iter (+/- 3)
test field::bench::fieldelement_a_sq                                           ... bench:          47 ns/iter (+/- 4)
test scalar::bench::invert                                                     ... bench:      72,665 ns/iter (+/- 1,737)
test scalar::bench::scalar_multiply_add                                        ... bench:         267 ns/iter (+/- 53)
test scalar::bench::scalar_random                                              ... bench:       1,130 ns/iter (+/- 195)
test scalar::bench::scalar_unpacked_multiply_add                               ... bench:         216 ns/iter (+/- 40)

Isolating a couple of the tests we get:

Baseline:
test curve::bench::scalar_mult                                        ... bench:     152,515 ns/iter (+/- 2,688)
test curve::bench::vartime::bench_double_scalar_mult_basepoint        ... bench:     159,619 ns/iter (+/- 2,291)
test curve::bench::vartime::ten_fold_scalar_mult                      ... bench:     399,000 ns/iter (+/- 7,540)
test field::bench::fieldelement_a_inv                                 ... bench:      12,200 ns/iter (+/- 1,889)
test field::bench::fieldelement_a_mul_a                               ... bench:          67 ns/iter (+/- 1)
test field::bench::fieldelement_a_sq                                  ... bench:          48 ns/iter (+/- 1)

#[inline]
test curve::bench::scalar_mult                                                 ... bench:     156,671 ns/iter (+/- 10,735)
test curve::bench::scalar_mult_by_value                                        ... bench:     163,941 ns/iter (+/- 13,653)
test curve::bench::vartime::bench_double_scalar_mult_basepoint                 ... bench:     160,486 ns/iter (+/- 10,722)
test curve::bench::vartime::ten_fold_scalar_mult                               ... bench:     402,349 ns/iter (+/- 16,962)
test field::bench::fieldelement_a_inv                                          ... bench:      11,891 ns/iter (+/- 2,288)
test field::bench::fieldelement_a_mul_a                                        ... bench:          65 ns/iter (+/- 2)
test field::bench::fieldelement_a_mul_a_by_value                               ... bench:          67 ns/iter (+/- 6)
test field::bench::fieldelement_a_sq                                           ... bench:          47 ns/iter (+/- 3)

#[inline(always)]
test curve::bench::scalar_mult                                                 ... bench:     155,447 ns/iter (+/- 5,160)
test curve::bench::scalar_mult_by_value                                        ... bench:     156,946 ns/iter (+/- 6,148)
test curve::bench::vartime::bench_double_scalar_mult_basepoint                 ... bench:     157,789 ns/iter (+/- 9,171)
test curve::bench::vartime::ten_fold_scalar_mult                               ... bench:     405,442 ns/iter (+/- 19,138)
test field::bench::fieldelement_a_inv                                          ... bench:      11,806 ns/iter (+/- 1,706)
test field::bench::fieldelement_a_mul_a                                        ... bench:          66 ns/iter (+/- 3)
test field::bench::fieldelement_a_mul_a_by_value                               ... bench:          67 ns/iter (+/- 3)
test field::bench::fieldelement_a_sq                                           ... bench:          47 ns/iter (+/- 4)

Note: the by_value variants of tests modify the test to use the call by value variants. They don't change anything else. I didn't include any by_value variants to any tests that broke just by removing ampersands.

My assembly-fu isn't at the point where I can tell exactly what changed between runs but a cursory look suggests that it's very similar if not identical. Are these acceptable results?

hdevalence · 2017-06-16T06:33:31Z

I started looking at this using cargo benchcmp on a machine with Turbo Boost disabled, so that the timings are actually consistent. I'm not entirely clear on what's going on, but hopefully I can get a clearer picture on the weekend or next week (I just wanted to note that I didn't forget about it).

hdevalence · 2017-06-21T05:40:00Z

I compared the existing implementation with two variants of this trick: first, just having #[inline(always)] on the move/copy operators without #[inline] on all the other methods; second, having #[inline(always)] on the move/copy operators with the additional #[inline]s.

Roughly speaking, the results I got were a slight across-the-board decline in performance for the first option, which ends up inserting a bunch of extra copies. For the second option, there's a slight increase in performance for some functions, and some microbenchmarks, but there's a decrease in performance for the inversion used in decompression. (I believe this is because it's inlined all ~260 field operations).

What I suspect is happening here is that the extra inline annotations are interacting with the cost model in unexpected and not always positive ways. So while, on a single example, using #[inline(always)] on the move/copy operator works out, when it's done across the board there are unexpected interactions. It almost works -- but I think a cleaner solution would be to make arithmetic operators in Rust work the way the . operator does.

However, for point operations, which are what people who use -dalek want to work with, the cost of a copy is tiny compared to the cost of the operation. For those operations, it doesn't matter whether someone uses the move operator or the borrow operator. So I think that it makes sense to keep the operator-generating macros for those, at least until we can fix Rust.

UnlawfulMonad · 2017-06-22T17:35:24Z

Hmm. I'm interested in the idea of arithmetic operators behaving like operator .. If there hasn't already been discussion on this topic can I steal the idea and see if I can propose this as a (Rust) RFC?

burdges · 2017-09-12T05:55:22Z

There is more serious discussion on that topic now in rust-lang/rfcs#2147

hdevalence · 2017-11-21T00:15:57Z

Just to circle back on this, the refactoring in #86 more cleanly separates the internal API from the external API. I think the fix for this issue is to just implement operators for both T and &T for all public types T, using something like @UnlawfulMonad's macro idea.

hdevalence · 2018-01-31T17:57:03Z

Closing this issue since it's fixed in #101

Rand 0.6 version bump

UnlawfulMonad mentioned this issue May 27, 2017

Replace call by reference by call by value #58

Closed

isislovecruft assigned UnlawfulMonad May 27, 2017

burdges mentioned this issue Jun 13, 2017

Consider making operators transparently work on borrows wherever possible rust-lang/rfcs#1936

Open

burdges mentioned this issue Sep 12, 2017

Allow autoderef and autoref in operators rust-lang/rfcs#2147

Closed

isislovecruft changed the title ~~Operators only implmented on references~~ Operators only implemented on references Sep 21, 2017

isislovecruft added the enhancement label Sep 21, 2017

isislovecruft added api - pain point good first issue research - rustlang labels Oct 16, 2017

hdevalence mentioned this issue Jan 24, 2018

Implement public-facing ops traits on all combos of &T/T #101

Merged

hdevalence closed this as completed Jan 31, 2018

burdges mentioned this issue Jun 8, 2018

Operator Overloading zkcrypto/pairing#85

Open

pinkforest pushed a commit to pinkforest/curve25519-dalek that referenced this issue Jun 27, 2023

Merge pull request dalek-cryptography#57 from IronCoreLabs/rand-0.6

8528557

Rand 0.6 version bump

pinkforest pushed a commit to pinkforest/curve25519-dalek that referenced this issue Jun 27, 2023

Cleanup RNG usage after merging dalek-cryptography#57.

80ae5d0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Operators only implemented on references #57

Operators only implemented on references #57

UnlawfulMonad commented May 22, 2017

burdges commented May 22, 2017 •

edited

Loading

hdevalence commented May 23, 2017

isislovecruft commented May 26, 2017 •

edited

Loading

UnlawfulMonad commented May 26, 2017

isislovecruft commented May 28, 2017

UnlawfulMonad commented Jun 12, 2017 •

edited

Loading

UnlawfulMonad commented Jun 13, 2017 •

edited

Loading

hdevalence commented Jun 16, 2017 •

edited

Loading

hdevalence commented Jun 21, 2017

UnlawfulMonad commented Jun 22, 2017

burdges commented Sep 12, 2017

hdevalence commented Nov 21, 2017

hdevalence commented Jan 31, 2018

Operators only implemented on references #57

Operators only implemented on references #57

Comments

UnlawfulMonad commented May 22, 2017

burdges commented May 22, 2017 • edited Loading

hdevalence commented May 23, 2017

isislovecruft commented May 26, 2017 • edited Loading

UnlawfulMonad commented May 26, 2017

isislovecruft commented May 28, 2017

UnlawfulMonad commented Jun 12, 2017 • edited Loading

UnlawfulMonad commented Jun 13, 2017 • edited Loading

hdevalence commented Jun 16, 2017 • edited Loading

hdevalence commented Jun 21, 2017

UnlawfulMonad commented Jun 22, 2017

burdges commented Sep 12, 2017

hdevalence commented Nov 21, 2017

hdevalence commented Jan 31, 2018

burdges commented May 22, 2017 •

edited

Loading

isislovecruft commented May 26, 2017 •

edited

Loading

UnlawfulMonad commented Jun 12, 2017 •

edited

Loading

UnlawfulMonad commented Jun 13, 2017 •

edited

Loading

hdevalence commented Jun 16, 2017 •

edited

Loading