Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Supercop AMD64 ASM for Wallet Importing [Do not Merge] #2317

Closed

Conversation

vtnerd
Copy link
Contributor

@vtnerd vtnerd commented Aug 20, 2017

The first commit imports amd64-51-30k and amd64-64-24k from supercop-20170725/crypto_sign/ed25519. I encourage @iamsmooth @moneromooo-monero @hyc @fluffypony @luigi1111 @kenshi84 to verify that the code is the same from that source.

The next commit (ping @hyc @luigi1111) updates the supercop code to be compliant with -fPIC or -fPIE. Some platforms (OSX for example) compile in this mode by default, and I do not see a good way to detect whether position independent code is being used to select a non-position independent variant.

The last commit are the changes to the monero baseline to integrate the ASM code into the wallet (@hyc and @luigi1111 might want to peek at the *choose_tp.s files adapted from supercop code to include z field). The current ref10 implementation is not modified, and this code is optionally used for wallet scanning only. See associated readme. @iamsmooth @luigi1111 @hyc @moneromooo-monero ... should this be enabled by default? Is it worth the extra maintenance to even use this code? The bulk of this code is upstream from supercop, and now written by me.

Performance Numbers (OSX on Mac mini mobile i5)

Note that the monerod system in my test setup should have slightly higher than normal latency. There was still a ~8min improvement. A much larger drop in user CPU time was seen. Improvements to the fetching code should help. The amd64-51-30k version always fetched more blocks.

ref10

Refresh done, blocks received: 1379435
Balance 0.00, unlocked balance: 0.00

real 68m7.155s
user 39m40.531s
sys 0m27.386s

amd64-51-30k

Refresh done, blocks received: 1379469
Balance 0.00, unlocked balance: 0.00

real 61m14.040s
user 15m3.630s
sys 0m26.726s

EDIT: Updated reference to last commit. The original last commit did not have amd64-51-30k.cmake and amd64-64-24k.cmake.

vtnerd added 2 commits August 20, 2017 11:24
SHA-256 hash of supercop-20170725.tar :
    87cf6b3306fa4cb5c688774d0a8a367d74e519c9ea6733d96cfce322a228044e
@kenshi84
Copy link
Contributor

@vtnerd I confirmed that the copied files are identical to the original. Just one small thing: the entire content of the folder amd64-51-30k/ and amd64-64-24k/ seems to be copied to the folders with the same name, which seems like a mistake.

Also, I couldn't build the accelerated version because the amd64-51-30k.cmake or amd64-64-24k.cmake couldn't be found in the repository. Didn't you forget to include them?

@vtnerd vtnerd force-pushed the improvement/wallet_x86_64_asm branch from 9d72ff2 to b610a47 Compare August 22, 2017 13:32
@vtnerd
Copy link
Contributor Author

vtnerd commented Aug 22, 2017

@vtnerd I confirmed that the copied files are identical to the original. Just one small thing: the entire content of the folder amd64-51-30k/ and amd64-64-24k/ seems to be copied to the folders with the same name, which seems like a mistake.

I do not understand - the first commit is supposed to be an exact duplicate of those folders in supercop. The next commit converts the constant references to position independent references.

Also, I couldn't build the accelerated version because the amd64-51-30k.cmake or amd64-64-24k.cmake couldn't be found in the repository. Didn't you forget to include them?

The .gitignore file for monero has *.cmake! So locally it appeared like I had uploaded everything. I just did a force push to the last commit so those files should now be available.

Default behavior is to use amd64-51-30k when targeting amd64 architecture.
`-DWALLET_CRYPTO` can be used to optionally disable or enable amd64-51-30k
or amd64-64-24. See `src/wallet/crypto/README.md` for more info.
@vtnerd vtnerd force-pushed the improvement/wallet_x86_64_asm branch from b610a47 to 8f38b22 Compare August 22, 2017 14:05
@kenshi84
Copy link
Contributor

I do not understand - the first commit is supposed to be an exact duplicate of those folders in supercop.

Maybe I was unclear: I see in the first commit that there are two exactly identical files, e.g. amd64-51-30k/choose_t.s and amd64-51-30k/amd64-51-30k/choose_t.s. Is this really intended?

@vtnerd
Copy link
Contributor Author

vtnerd commented Aug 22, 2017

I do not understand - the first commit is supposed to be an exact duplicate of those folders in supercop.

Maybe I was unclear: I see in the first commit that there are two exactly identical files, e.g. amd64-51-30k/choose_t.s and amd64-51-30k/amd64-51-30k/choose_t.s. Is this really intended?

The files are similar but not identical. ed25519/amd64-51-30k/choose_t.s takes a ge25519_niels array as a parameter whereas amd64-51-30k-choose_tp.s uses a ge25519_pniels array as a parameter. The latter has a z field so that no field inversions have to be calculated during the pre-computation stage. This was not a problem for the supercop code because it always does multiplication with a fixed base (pre-computation is hard-coded into a static variable) - the z field would just be wasted space. The readme also discusses this.

EDIT: Should've said good "eye" @kenshi84 ! Noticing the similarity should taken a decent amount of effort.

@anonimal
Copy link
Contributor

@vtnerd Nice improvements 👍

For various reasons, including the frequency of which supercop is updated, will you consider doing something like this for the repo? Also referencing #2133.

@kenshi84
Copy link
Contributor

kenshi84 commented Aug 22, 2017

Here are the timings for a full blockchain scan (with restore height 0) on testnet (981463 blocks):

  • ref10: 4m 48s
  • amd64-51-30k: 2m 07s
  • amd64-64-24k: 1m 50s

Computer: MacBook Pro 15-inch/2016, 2.9 GHz Intel Core i7, 16GB RAM

The speedup is quite substantial!

@vtnerd
Copy link
Contributor Author

vtnerd commented Aug 23, 2017

For various reasons, including the frequency of which supercop is updated, will you consider doing something like this for the repo? Also referencing #2133.

What is the benefit of the separate project? Would kovri or other projects use the ASM speedups in some way? Supposedly one of the variants will end up in NaCL...

@anonimal
Copy link
Contributor

What is the benefit of the separate project?

I think you mean repo?

The first commit imports amd64-51-30k and amd64-64-24k from supercop-20170725/crypto_sign/ed25519. I encourage @iamsmooth @moneromooo-monero @hyc @fluffypony @luigi1111 @kenshi84 to verify that the code is the same from that source.

The next commit (ping @hyc @luigi1111) updates the supercop code to be compliant with -fPIC or -fPIE.

And, as referenced in #2133, paraphrasing:

I think we should aim to adapt our build process to work cleanly with other projects (within reason). This includes working from a submodule with monero-specific branch so svn rebasing can be applied cleanly while keeping the log intact for review.

Answer: to streamline the review process for the sake of accurate auditing ^

Would kovri or other projects use the ASM speedups in some way?

Not that I know of, but your actual work (versus what the library produces) will be much easier to review (thus why we've began moving to submodules).

@anonimal
Copy link
Contributor

Hi @vtnerd, continuing where we left off on IRC:

  1. Either you or I would create a supercop repo and pull the respective bits we need from ed25519 into said repo's master branch. The needed changes are then done to a respective monero and kovri branch.
  2. @fluffypony creates monero-project/supercop but we can work from either of our supercop repos until then.
  3. Said repo is then submoduled into monero's external/ directory.
  4. Re: this PR or any supercop usage, monero's implementation code is then fully abstracted (as in separated to the point of abstraction, not polymorphism etc.) so our supercop changes remain within the submodule / supercop work is not mixed in with monero/kovri code (I feel strongly about not mixing dependency code with our code outside of a library's API but I know there are special cases). Any adjustments to CMake would still be trivial to make.
  5. Voila! Kill two birds with one stone (poor birdies)

What do you think?

Would kovri or other projects use the ASM speedups in some way?

Not that I know of

But that doesn't mean no (even for kovri). The future is bright with potential 🌞

@anonimal anonimal mentioned this pull request Aug 30, 2017
Copy link
Collaborator

@moneromooo-monero moneromooo-monero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Contingent on a theory OK from luigi

@fluffypony
Copy link
Contributor

@vtnerd I've created a supercop repository on monero-project

@fireice-uk
Copy link
Contributor

I would say that amd64-64-24k is the better choice for future proofing as fe operations are simpler.

@vtnerd it won't be too hard to make that code run on Windows if you put your mind to it.

@vtnerd
Copy link
Contributor Author

vtnerd commented Sep 24, 2017

The field unpacking functions amd64-64-24k are more simple, is that what you mean? I am not sure what you mean by "future proofing" in this context. Otherwise, the 51-bit radix variant has the advantage of reduced overflow checks on the limbs in some cases due to the "headroom" within each limb. The downside is the extra limb puts more pressure on the CPU cache. I selected the amd64-51-30k version because the benchmarks showed a slight edge on recent Intel chips (but a decrease in performance on AMD chips). I recall seeing similar results on my newish Mac-mini, but I would have to re-check my numbers. I thought about changing to amd64-64-24k because I think in the actual use (wallet) less CPU cache usage would provide better overall numbers. They are really close and the latency / variance introduced by network traffic makes it somewhat difficult to say.

And making that code run in Windows would require an update to every ASM function to handle the different ABI. Its certainly not more difficult than the ASM work I already completed, but is additional work and a burden on reviewers looking for correctness / non-malicious behavior. I was hoping to provide minimal changes to the supercop code for auditing purposes; the current changes are far more than I wanted to do. I would prefer if making it work in Windows were done in a future patch.

@fireice-uk
Copy link
Contributor

The field unpacking functions

I'm using Dr. Bernstein nomenclature here:

  • fe is short for field, so fe_add should be read as "addition over a finite field", fe functions are the basic building blocks. They are one dimensional.

  • ge (I think) is short for geometric, ge_add adds two Edwards points. That operation is obviously different from just adding a one dimensional value and is multi-dimensional.

Otherwise, the 51-bit radix variant has the advantage of reduced overflow checks on the limbs in some cases due to the "headroom" within each limb.

The distinct advantage of asm for doing fe operations is that you don't need to check overflows (go ahead, read the 64-bit fe_add which will be the easiest to interpret as it only adds two large numbers), since you have access to the cf.

Arithmetic overflow checks are a red herring and cmov instructions to make sure we stay in the field and avoid division are not that expensive on asm level (they cost an order of magnitude more on c level)

You are also missing the security dimension. 64-bit version is going to be inherently more well-behaved as there is one and only one way of representing each fe number, whereas a representation in any partial number system is going to involve a normalisation step that will need to occur sooner or later - ( have you checked whether the bounds of all operations added together are certain not to roll off the field? Think doing fe_add 10,000 times )

Lastly the 64-bit operations will probably lend themselves better to future cpu architecture optimisations, whereas with 51-bits you are stuck with what you have

Another important point - ge_scalarmult_base is your performance bottleneck, not ge_scalarmult, make sure you benchmark the former and not the latter.

@vtnerd
Copy link
Contributor Author

vtnerd commented Sep 25, 2017

The field unpacking functions

I'm using Dr. Bernstein nomenclature here:

  • fe is short for field, so fe_add should be read as "addition over a finite field", fe functions are the basic building blocks. They are one dimensional.
  • ge (I think) is short for geometric, ge_add adds two Edwards points. That operation is obviously different from just adding a one dimensional value and is multi-dimensional.

I know what you meant by fe in the original post. I wanted to know how the field representation of amd64-64-24k was a better choice for future proofing. Its not immediately obvious to me what representation would be better on future CPU architectures - it seems to be determined by the relative cost of specific x86 instructions and the cache size. Cache sizes are likely to increase, but I am not sure how the relative cost of the instructions will change over time.

Otherwise, the 51-bit radix variant has the advantage of reduced overflow checks on the limbs in some cases due to the "headroom" within each limb.

The distinct advantage of asm for doing fe operations is that you don't need to check overflows (go ahead, read the 64-bit fe_add which will be the easiest to interpret as it only adds two large numbers), since you have access to the cf.

Arithmetic overflow checks are a red herring and cmov instructions to make sure we stay in the field and avoid division are not that expensive on asm level (they cost an order of magnitude more on c level).

The overflow checks are not a red herring. According to the ed25519 whitepaper, it is the entire reason the amd64-51-30k variant exists ("to reduce the number of expensive adc/subc instructions"). The whitepaper discusses where and what instructions can be reduced as a result of the representation.

You are also missing the security dimension. 64-bit version is going to be inherently more well-behaved as there is one and only one way of representing each fe number, whereas a representation in any partial number system is going to involve a normalisation step that will need to occur sooner or later - ( have you checked whether the bounds of all operations added together are certain not to roll off the field? Think doing fe_add 10,000 times )

Both are used only for wallet scanning due to possible issues as a result of less testing/auditing and general complexity. A bug in this context means an output could be lost until re-scanned with the option disabled. I agree that the 51-bit radix representation appears to have additional complexity (just read the whitepaper). I will change the default to amd64-64-24k for this reason and the less CPU cache destruction ( @kenshi84 showed it was faster, but is likely "within the noise"). The amd64-51-30k still may be useful to alternate wallet scanning implementations (mymonero for instance is likely to continue using amd64-51-30k in the short-term).

Lastly the 64-bit operations will probably lend themselves better to future cpu architecture optimisations, whereas with 51-bits you are stuck with what you have.

I still do not understand what you mean - the 51-bit variant uses 5 64-bit numbers. The future performance of each seems hard to predict because it depends on the CPU cache size/performance/eviction and the relative cost of specific instructions.

Another important point - ge_scalarmult_base is your performance bottleneck, not ge_scalarmult, make sure you benchmark the former and not the latter.

I was benchmarking 10000 transactions each with varying numbers of outputs (I think I settled on 4 for testing). So both functions were being tested indirectly. It was still artificial - in real usage the CPU is more likely to evict the associated cachelines since the current processing loop is not as "tight" as my benchmark. So again, that's why I was considered making amd64-64-24k the default.

Also, its worth mentioning that this was designed to compile both in the same binary for benchmarking. I just need to cleanup the cmake primarily before inclusion into mainline (i.e. some future patch).

@fireice-uk
Copy link
Contributor

fireice-uk commented Sep 26, 2017

We seem to be arguing the difference between an arithmetic overflow and adc instruction. Since I don't really want to get intro arguments here, I will leave you to it. Sorry if you found me annoying.

@luigi1111
Copy link
Collaborator

@moneromooo-monero I approve the idea, indeed.

It's a pretty good way IMO to "get its foot in the door" for potential future use elsewhere (more critical functions), as it is a significant speedup.

I have not looked at the code yet.

@vtnerd
Copy link
Contributor Author

vtnerd commented Apr 17, 2018

Once the ASM code is merged into the sub-repo, I will revisit with a new PR.

@vtnerd vtnerd closed this Apr 17, 2018
@vtnerd vtnerd deleted the improvement/wallet_x86_64_asm branch October 5, 2019 02:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants