Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add bignum_mont{mul,sqr}_p384_neon, speed improvements/refactoring in tactics #122

Merged
merged 1 commit into from
May 10, 2024

Conversation

aqjune-aws
Copy link
Collaborator

This patch adds bignum_mont{mul,sqr}_p384_neon which are slightly faster than bignum_mont{mul,sqr}_p384.
They use SIMD instructions and better scheduling found with SLOTHY. Their correctness is verified using equivalence check w.r.t. specifications of their scalar ops. The new SUBROUTINE lemmas are added to the specification list using

./tools/collect-specs.sh arm >arm/proofs/specifications.txt

Benchmark results on Graviton2:

bignum_montsqr_p384             :    58.6 ns each (var  0.3%, corr  0.06) =   17053295 ops/sec
bignum_montsqr_p384_neon        :    52.6 ns each (var  0.4%, corr -0.04) =   19017192 ops/sec
bignum_montmul_p384             :    72.9 ns each (var  0.2%, corr -0.02) =   13726633 ops/sec
bignum_montmul_p384_neon        :    68.1 ns each (var  0.3%, corr  0.02) =   14680905 ops/sec

Test and benchmark were updated to include these & fix incorrect naming bugs in my previous p256_neon patch.

Also, some speedups in tactics are made:

  1. ARM_STEPS'_AND_ABBREV_TAC and ARM_STEPS'_AND_REWRITE_TAC.

They are tactics for symbolic execution when showing equivalence of two programs after reordering instructions.
ARM_STEPS'_AND_ABBREV_TAC does symbolic execution of the 'left' program and abbreviates every RHS of new read comp s = RHSs, meaning that after the tactic is done there are a bunch of equality assumptions whose number increases linearly to the number of instructions. ARM_STEPS'_AND_REWRITE_TAC then does symbolic execution of the 'right' program and rewrites the results using the assumptions.
This means the overall complexity of ARM_STEPS'_AND_REWRITE_TAC was quadratic to the number of instructions (# assum * # insts = (# insts)^2). This is fixed to be (close to) linear, by separately maintaining the abbreviations as a list of theorems internally rather than assumptions. This doesn’t mean that the therotical time complexity is now linear, but many tactics inside ARM_STEPS'_AND_REWRITE_TAC that inspect assumptions now run linearly.

  1. FIND_HOLE_TAC

FIND_HOLE_TAC tactic finds the 'hole' in the memory space that can place the machine code that is used in program equivalence. This is done by inspecting nonoverlapping assumptions, properly segmenting the memory with fixed-width ranges and doing case analysis. Previously the # splitted cases was something like 2^((# segments)^2), but now it is reduced to (# segments)^(#segments). Comparing these two numbers is easier if logarithm is used.

Finally, some lemmas in existing _neon.ml proofs are updated so that they do not mix usage of '_mc' and '_core_mc'. '_core_mc' is a machine code that is a sub-list of '_mc' retrieved by stripping the callee-save register store/loads as well as the ret instruction.
If possible, a lemmas is updated to only use '*_core_mc' because this makes the modular usage of the lemma possible in bigger theorems.

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@aqjune-aws aqjune-aws marked this pull request as ready for review April 28, 2024 17:33
…in tactics

This patch adds `bignum_mont{mul,sqr}_p384_neon` which are slightly faster than
`bignum_mont{mul,sqr}_p384`.
They use SIMD instructions and better scheduling found with SLOTHY.
Their correctness is verified using equivalence check w.r.t. specifications of their scalar ops.
The new SUBROUTINE lemmas are added to the specification list using
```
./tools/collect-specs.sh arm >arm/proofs/specifications.txt
```

Benchmark results on Graviton2:
```
bignum_montsqr_p384             :    58.6 ns each (var  0.3%, corr  0.06) =   17053295 ops/sec
bignum_montsqr_p384_neon        :    52.6 ns each (var  0.4%, corr -0.04) =   19017192 ops/sec
bignum_montmul_p384             :    72.9 ns each (var  0.2%, corr -0.02) =   13726633 ops/sec
bignum_montmul_p384_neon        :    68.1 ns each (var  0.3%, corr  0.02) =   14680905 ops/sec
```

Test and benchmark were updated to include these & fix incorrect naming bugs
in my previous p256_neon patch.

Also, some speedups in tactics are made:

1. `ARM_STEPS'_AND_ABBREV_TAC` and `ARM_STEPS'_AND_REWRITE_TAC`.

They are tactics for symbolic execution when showing equivalence of two programs
after reordering instructions.
`ARM_STEPS'_AND_ABBREV_TAC` does symbolic execution of the 'left' program and
abbreviates every RHS of new `read comp s = RHS`s,
meaning that after the tactic is done there are a bunch of equality assumptions whose
number increases linearly to the number of instructions.
`ARM_STEPS'_AND_REWRITE_TAC` then does symbolic execution of the 'right' program
and rewrites the results using the assumptions.
This means the overall complexity of `ARM_STEPS'_AND_REWRITE_TAC` was quadratic
to the number of instructions (# assum * # insts = (# insts)^2).
This is fixed to be (close to) linear, by separately maintaining the
abbreviations as a list of theorems internally rather than assumptions.
This doesn’t mean that the therotical time complexity is now linear,
but many tactics inside `ARM_STEPS'_AND_REWRITE_TAC` that inspect assumptions
now run linearly.

2. `FIND_HOLE_TAC`

`FIND_HOLE_TAC` tactic finds the 'hole' in the memory space that can place the
machine code that is used in program equivalence. This is done by inspecting
`nonoverlapping` assumptions, properly segmenting the memory with fixed-width
ranges and doing case analysis. Previously the # splitted cases was something
like 2^((# segments)^2), but now it is reduced to (# segments)^(#segments).
Comparing these two numbers is easier if logarithm is used.

Finally, some lemmas in existing `_neon.ml` proofs are updated so that
they do not mix usage of '*_mc' and '*_core_mc'. '*_core_mc' is a machine
code that is a sub-list of '*_mc' retrieved by stripping the callee-save register
store/loads as well as the ret instruction.
If possible, a lemmas is updated to only use '*_core_mc' because this
makes the modular usage of the lemma possible in bigger theorems.
@aqjune-aws aqjune-aws marked this pull request as draft May 9, 2024 00:36
@aqjune-aws
Copy link
Collaborator Author

Marking as a draft until CI checks pass.

@aqjune-aws aqjune-aws marked this pull request as ready for review May 9, 2024 07:05
@aqjune-aws
Copy link
Collaborator Author

Marked as ready. :)

@aqjune-aws aqjune-aws requested a review from jargh May 9, 2024 19:45
Copy link
Contributor

@jargh jargh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All looks very good, thanks!

@jargh jargh merged commit cbef866 into awslabs:main May 10, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants