Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Operators only implemented on references #57

Closed
UnlawfulMonad opened this issue May 22, 2017 · 13 comments
Closed

Operators only implemented on references #57

UnlawfulMonad opened this issue May 22, 2017 · 13 comments

Comments

@UnlawfulMonad
Copy link

I've noticed that operators on ExtendedPoint and the like are only implemented on &'a ExtendedPoint. Is there any reason for a PR that also implemented them for the concrete types wouldn't be accepted? The types are already copy and as far as I know the compile is better at reasoning about the code when there isn't explicit memory (i.e. a reference) involved.

@burdges
Copy link
Contributor

burdges commented May 22, 2017

If I understand, the benchmarks suffer if you pass these structures by value, but maybe one can tweak that with optimizations modes, etc. In particular, passing by value permits the recipient to mutate the value, so you stack space optimizations might inhibit doing a pass by reference behind the scenes. I'd also think making the operators #[inline(always)] wrappers around call by reference functions should fix this, but maybe not.

@hdevalence
Copy link
Contributor

When we implemented the arithmetic ops for T and not for &T, we had a lot of nested redundant copies: first, copying the LHS and RHS points into the function (each point is 160 bytes), plus copying the LHS and RHS of every field element operation inside (each field element is 40 bytes). It did make a difference in cutting out needless copies. However it's obviously worse ergonomically, as I'm sure you've noticed.

I didn't think that it would be possible to work around this, but it seems like @burdges' suggestion might work, at least on that toy example.

So, to answer your original question, @UnlawfulMonad, if there were implementations of the arithmetic ops that didn't require borrows but were just as fast, we would be happy to take that PR.

@isislovecruft
Copy link
Member

isislovecruft commented May 26, 2017

Hi @UnlawfulMonad!

As @hdevalence said, we'd be stoked to take a patch for this, provided it doesn't slow anything down by introducing needless copies everywhere. If you want to implement this, my suggestion for the steps would be:

  1. Implement the operators for non-borrow types, using @burdges suggestion to use #[inline(always)]. (Bonus points for adding #[inline(always)] to all the operator implementations!)
  2. Find anywhere in our code (or in ed25519-dalek or dalek-rangeproofs that is doing what I call an "Eye of Sauron" (i.e. &(&(&(&( )))) like &(&(&s1 * &P1) + &(s2 * &P2)) + &(s3 * &P3) or something similarly evil looking), change it to your non-borrow syntax, and compare—with various rustc optimisation levels—the assembly generated from the original code to the assembly for the new syntax to ensure there's no extra copies. https://rust.godbolt.org is really handy for this. (If you're not comfortable with assembly, feel free to paste links to the the two versions of the code in godbolt, and we'll review it.)
  3. (optional) If all goes well, feel free to kill as many Eye of Saurons as you find. :)

@UnlawfulMonad Would you like me to assign you to this issue?

@UnlawfulMonad
Copy link
Author

Hi @isislovecruft and @hdevalence!

Thanks for the suggestions. I'll get on this next week and update this issue with my progress. I assumed that inlining would already be occurring if the call by value variant were simply wrapping the call by reference versions.

Sidenote: Eye of Sauron is a great name for those things. :)

@isislovecruft
Copy link
Member

@UnlawfulMonad Awesome, thanks! It appears like if we add #[inline(always)] also to the borrow versions, then the by-value versions inline directly to the borrow version code (i.e. so that rustc is basically adding the Eye of Sauron for you).

(Also I used to work at the FreeGeek in Portland! It is an excellent project!)

@UnlawfulMonad
Copy link
Author

UnlawfulMonad commented Jun 12, 2017

@isislovecruft Is code bloat not a concern at that point? #[inline(always)] forces rustc/LLVM to inline. Another option would be to leave the Eyes of Sauron in the library code but gives users the option to use either impl (overgeneralization warning) since in practice the overhead in user code should be negligible. I'm going to test simply adding #[inline] attributes and compare the overhead vs #[inline(always)]ing all the operators.

Sorry for the lack of progress. It's been a hectic couple of weeks.

(And that's awesome! Agreed, it's an awesome project!)

@UnlawfulMonad
Copy link
Author

UnlawfulMonad commented Jun 13, 2017

Alright. Here are my results:

All readings with rustc version 1.19.0-nightly (148e91714 2017-06-08).

Baseline from commit 97d6974:

test curve::bench::add_extended_and_affine_niels_output_completed     ... bench:         212 ns/iter (+/- 87)
test curve::bench::add_extended_and_affine_niels_output_extended      ... bench:         480 ns/iter (+/- 136)
test curve::bench::add_extended_and_projective_niels_output_completed ... bench:         275 ns/iter (+/- 6)
test curve::bench::add_extended_and_projective_niels_output_extended  ... bench:         525 ns/iter (+/- 23)
test curve::bench::basepoint_mult                                     ... bench:      35,181 ns/iter (+/- 12,430)
test curve::bench::bench_select_precomputed_point                     ... bench:       1,032 ns/iter (+/- 50)
test curve::bench::edwards_compress                                   ... bench:      12,023 ns/iter (+/- 1,532)
test curve::bench::edwards_decompress                                 ... bench:      12,837 ns/iter (+/- 2,573)
test curve::bench::extended_double_output_extended                    ... bench:         486 ns/iter (+/- 100)
test curve::bench::mult_by_cofactor                                   ... bench:       1,316 ns/iter (+/- 251)
test curve::bench::projective_double_output_completed                 ... bench:         226 ns/iter (+/- 118)
test curve::bench::scalar_mult                                        ... bench:     152,515 ns/iter (+/- 2,688)
test curve::bench::vartime::bench_double_scalar_mult_basepoint        ... bench:     159,619 ns/iter (+/- 2,291)
test curve::bench::vartime::ten_fold_scalar_mult                      ... bench:     399,000 ns/iter (+/- 7,540)
test field::bench::fieldelement_a_inv                                 ... bench:      12,200 ns/iter (+/- 1,889)
test field::bench::fieldelement_a_mul_a                               ... bench:          67 ns/iter (+/- 1)
test field::bench::fieldelement_a_sq                                  ... bench:          48 ns/iter (+/- 1)
test scalar::bench::invert                                            ... bench:      67,871 ns/iter (+/- 803)
test scalar::bench::scalar_multiply_add                               ... bench:         270 ns/iter (+/- 68)
test scalar::bench::scalar_random                                     ... bench:       1,115 ns/iter (+/- 101)
test scalar::bench::scalar_unpacked_multiply_add                      ... bench:         199 ns/iter (+/- 12)

With #[inline] on all operators (commit 6cdae94 on the callbyvalue branch of my fork):

test curve::bench::add_extended_and_affine_niels_output_completed              ... bench:         220 ns/iter (+/- 9)
test curve::bench::add_extended_and_affine_niels_output_extended               ... bench:         493 ns/iter (+/- 28)
test curve::bench::add_extended_and_projective_niels_output_completed          ... bench:         287 ns/iter (+/- 23)
test curve::bench::add_extended_and_projective_niels_output_completed_by_value ... bench:         299 ns/iter (+/- 9)
test curve::bench::add_extended_and_projective_niels_output_extended           ... bench:         558 ns/iter (+/- 72)
test curve::bench::add_extended_and_projective_niels_output_extended_by_value  ... bench:         593 ns/iter (+/- 33)
test curve::bench::basepoint_mult                                              ... bench:      36,890 ns/iter (+/- 6,329)
test curve::bench::bench_select_precomputed_point                              ... bench:         891 ns/iter (+/- 60)
test curve::bench::edwards_compress                                            ... bench:      12,197 ns/iter (+/- 2,828)
test curve::bench::edwards_decompress                                          ... bench:      12,710 ns/iter (+/- 1,029)
test curve::bench::extended_double_output_extended                             ... bench:         493 ns/iter (+/- 14)
test curve::bench::mult_by_cofactor                                            ... bench:       1,318 ns/iter (+/- 96)
test curve::bench::projective_double_output_completed                          ... bench:         220 ns/iter (+/- 10)
test curve::bench::scalar_mult                                                 ... bench:     156,671 ns/iter (+/- 10,735)
test curve::bench::scalar_mult_by_value                                        ... bench:     163,941 ns/iter (+/- 13,653)
test curve::bench::vartime::bench_double_scalar_mult_basepoint                 ... bench:     160,486 ns/iter (+/- 10,722)
test curve::bench::vartime::ten_fold_scalar_mult                               ... bench:     402,349 ns/iter (+/- 16,962)
test field::bench::fieldelement_a_inv                                          ... bench:      11,891 ns/iter (+/- 2,288)
test field::bench::fieldelement_a_mul_a                                        ... bench:          65 ns/iter (+/- 2)
test field::bench::fieldelement_a_mul_a_by_value                               ... bench:          67 ns/iter (+/- 6)
test field::bench::fieldelement_a_sq                                           ... bench:          47 ns/iter (+/- 3)
test scalar::bench::invert                                                     ... bench:      65,925 ns/iter (+/- 3,157)
test scalar::bench::scalar_multiply_add                                        ... bench:         257 ns/iter (+/- 12)
test scalar::bench::scalar_random                                              ... bench:       1,135 ns/iter (+/- 61)
test scalar::bench::scalar_unpacked_multiply_add                               ... bench:         195 ns/iter (+/- 23)

With #[inline(always)] on all operators (commit 156c92d on the call by value branch of my fork):

test curve::bench::add_extended_and_affine_niels_output_completed              ... bench:         208 ns/iter (+/- 6)
test curve::bench::add_extended_and_affine_niels_output_extended               ... bench:         488 ns/iter (+/- 19)
test curve::bench::add_extended_and_projective_niels_output_completed          ... bench:         278 ns/iter (+/- 9)
test curve::bench::add_extended_and_projective_niels_output_completed_by_value ... bench:         295 ns/iter (+/- 11)
test curve::bench::add_extended_and_projective_niels_output_extended           ... bench:         550 ns/iter (+/- 20)
test curve::bench::add_extended_and_projective_niels_output_extended_by_value  ... bench:         565 ns/iter (+/- 16)
test curve::bench::basepoint_mult                                              ... bench:      36,467 ns/iter (+/- 1,810)
test curve::bench::bench_select_precomputed_point                              ... bench:         882 ns/iter (+/- 36)
test curve::bench::edwards_compress                                            ... bench:      12,003 ns/iter (+/- 1,698)
test curve::bench::edwards_decompress                                          ... bench:      12,630 ns/iter (+/- 1,812)
test curve::bench::extended_double_output_extended                             ... bench:         496 ns/iter (+/- 32)
test curve::bench::mult_by_cofactor                                            ... bench:       1,331 ns/iter (+/- 32)
test curve::bench::projective_double_output_completed                          ... bench:         217 ns/iter (+/- 8)
test curve::bench::scalar_mult                                                 ... bench:     155,447 ns/iter (+/- 5,160)
test curve::bench::scalar_mult_by_value                                        ... bench:     156,946 ns/iter (+/- 6,148)
test curve::bench::vartime::bench_double_scalar_mult_basepoint                 ... bench:     157,789 ns/iter (+/- 9,171)
test curve::bench::vartime::ten_fold_scalar_mult                               ... bench:     405,442 ns/iter (+/- 19,138)
test field::bench::fieldelement_a_inv                                          ... bench:      11,806 ns/iter (+/- 1,706)
test field::bench::fieldelement_a_mul_a                                        ... bench:          66 ns/iter (+/- 3)
test field::bench::fieldelement_a_mul_a_by_value                               ... bench:          67 ns/iter (+/- 3)
test field::bench::fieldelement_a_sq                                           ... bench:          47 ns/iter (+/- 4)
test scalar::bench::invert                                                     ... bench:      72,665 ns/iter (+/- 1,737)
test scalar::bench::scalar_multiply_add                                        ... bench:         267 ns/iter (+/- 53)
test scalar::bench::scalar_random                                              ... bench:       1,130 ns/iter (+/- 195)
test scalar::bench::scalar_unpacked_multiply_add                               ... bench:         216 ns/iter (+/- 40)

Isolating a couple of the tests we get:

Baseline:
test curve::bench::scalar_mult                                        ... bench:     152,515 ns/iter (+/- 2,688)
test curve::bench::vartime::bench_double_scalar_mult_basepoint        ... bench:     159,619 ns/iter (+/- 2,291)
test curve::bench::vartime::ten_fold_scalar_mult                      ... bench:     399,000 ns/iter (+/- 7,540)
test field::bench::fieldelement_a_inv                                 ... bench:      12,200 ns/iter (+/- 1,889)
test field::bench::fieldelement_a_mul_a                               ... bench:          67 ns/iter (+/- 1)
test field::bench::fieldelement_a_sq                                  ... bench:          48 ns/iter (+/- 1)

#[inline]
test curve::bench::scalar_mult                                                 ... bench:     156,671 ns/iter (+/- 10,735)
test curve::bench::scalar_mult_by_value                                        ... bench:     163,941 ns/iter (+/- 13,653)
test curve::bench::vartime::bench_double_scalar_mult_basepoint                 ... bench:     160,486 ns/iter (+/- 10,722)
test curve::bench::vartime::ten_fold_scalar_mult                               ... bench:     402,349 ns/iter (+/- 16,962)
test field::bench::fieldelement_a_inv                                          ... bench:      11,891 ns/iter (+/- 2,288)
test field::bench::fieldelement_a_mul_a                                        ... bench:          65 ns/iter (+/- 2)
test field::bench::fieldelement_a_mul_a_by_value                               ... bench:          67 ns/iter (+/- 6)
test field::bench::fieldelement_a_sq                                           ... bench:          47 ns/iter (+/- 3)

#[inline(always)]
test curve::bench::scalar_mult                                                 ... bench:     155,447 ns/iter (+/- 5,160)
test curve::bench::scalar_mult_by_value                                        ... bench:     156,946 ns/iter (+/- 6,148)
test curve::bench::vartime::bench_double_scalar_mult_basepoint                 ... bench:     157,789 ns/iter (+/- 9,171)
test curve::bench::vartime::ten_fold_scalar_mult                               ... bench:     405,442 ns/iter (+/- 19,138)
test field::bench::fieldelement_a_inv                                          ... bench:      11,806 ns/iter (+/- 1,706)
test field::bench::fieldelement_a_mul_a                                        ... bench:          66 ns/iter (+/- 3)
test field::bench::fieldelement_a_mul_a_by_value                               ... bench:          67 ns/iter (+/- 3)
test field::bench::fieldelement_a_sq                                           ... bench:          47 ns/iter (+/- 4)

Note: the by_value variants of tests modify the test to use the call by value variants. They don't change anything else. I didn't include any by_value variants to any tests that broke just by removing ampersands.

My assembly-fu isn't at the point where I can tell exactly what changed between runs but a cursory look suggests that it's very similar if not identical. Are these acceptable results?

@hdevalence
Copy link
Contributor

hdevalence commented Jun 16, 2017

I started looking at this using cargo benchcmp on a machine with Turbo Boost disabled, so that the timings are actually consistent. I'm not entirely clear on what's going on, but hopefully I can get a clearer picture on the weekend or next week (I just wanted to note that I didn't forget about it).

@hdevalence
Copy link
Contributor

I compared the existing implementation with two variants of this trick: first, just having #[inline(always)] on the move/copy operators without #[inline] on all the other methods; second, having #[inline(always)] on the move/copy operators with the additional #[inline]s.

Roughly speaking, the results I got were a slight across-the-board decline in performance for the first option, which ends up inserting a bunch of extra copies. For the second option, there's a slight increase in performance for some functions, and some microbenchmarks, but there's a decrease in performance for the inversion used in decompression. (I believe this is because it's inlined all ~260 field operations).

What I suspect is happening here is that the extra inline annotations are interacting with the cost model in unexpected and not always positive ways. So while, on a single example, using #[inline(always)] on the move/copy operator works out, when it's done across the board there are unexpected interactions. It almost works -- but I think a cleaner solution would be to make arithmetic operators in Rust work the way the . operator does.

However, for point operations, which are what people who use -dalek want to work with, the cost of a copy is tiny compared to the cost of the operation. For those operations, it doesn't matter whether someone uses the move operator or the borrow operator. So I think that it makes sense to keep the operator-generating macros for those, at least until we can fix Rust.

@UnlawfulMonad
Copy link
Author

Hmm. I'm interested in the idea of arithmetic operators behaving like operator .. If there hasn't already been discussion on this topic can I steal the idea and see if I can propose this as a (Rust) RFC?

@burdges
Copy link
Contributor

burdges commented Sep 12, 2017

There is more serious discussion on that topic now in rust-lang/rfcs#2147

@isislovecruft isislovecruft changed the title Operators only implmented on references Operators only implemented on references Sep 21, 2017
@hdevalence
Copy link
Contributor

Just to circle back on this, the refactoring in #86 more cleanly separates the internal API from the external API. I think the fix for this issue is to just implement operators for both T and &T for all public types T, using something like @UnlawfulMonad's macro idea.

@hdevalence
Copy link
Contributor

Closing this issue since it's fixed in #101

pinkforest pushed a commit to pinkforest/curve25519-dalek that referenced this issue Jun 27, 2023
pinkforest pushed a commit to pinkforest/curve25519-dalek that referenced this issue Jun 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants