Optimized support for P-384r1 and P-521r1 #99
Update - I was able to incorporate the NIST optimized equation for the P-384 curve that I added, and it works (it passes the NIST test vectors I also included in the test_ecdh_all.c file I made). I am trying to add the NIST equation for the P-521 curve, but I think the odd number of bits, and the fact that I have to define num_words as 17 (32-bit words, for a total of 544 bits available), means that the add routine NEVER generates a normal carry. I think the carry is only a mid-range bit in the highest word. I'm going to try to check the 522nd bit (bit 521) by adding a separate vli_add (and maybe vli_sub) routine to detect it, and see what happens. Will update if I figure it out. Ed B.
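To make the carry problem concrete, here is a minimal sketch of a modular add for p = 2^521 - 1 using the layout described above (17 little-endian 32-bit words). Because two reduced inputs sum to less than 2^522, the word-level carry out of the top word is always zero; the overflow shows up as bit 9 of word 16 instead. The name `p521_mod_add` is hypothetical (not a micro-ecc function), and rather than subtracting p with a separate vli_sub, this sketch folds the overflow bit back in using 2^521 ≡ 1 (mod p). It is not constant-time.

```c
#include <assert.h>
#include <stdint.h>

#define P521_WORDS 17 /* 17 x 32 = 544 bits; only the low 521 are used */

/* Returns 1 if r == p = 2^521 - 1 (all 521 low bits set). Not constant-time. */
static int p521_is_p(const uint32_t r[P521_WORDS]) {
    int i;
    for (i = 0; i < P521_WORDS - 1; ++i) {
        if (r[i] != 0xFFFFFFFFu) {
            return 0;
        }
    }
    return r[P521_WORDS - 1] == 0x1FFu;
}

/* Hypothetical helper: r = (a + b) mod p, for reduced inputs a, b < p. */
void p521_mod_add(uint32_t r[P521_WORDS], const uint32_t a[P521_WORDS],
                  const uint32_t b[P521_WORDS]) {
    uint64_t acc = 0;
    uint32_t overflow;
    int i;

    /* Inputs must already be reduced: at most 9 bits in the top word. */
    assert((a[P521_WORDS - 1] >> 9) == 0 && (b[P521_WORDS - 1] >> 9) == 0);

    /* Plain multi-word add. a + b < 2^522 fits in 544 bits, so acc is
       always 0 after this loop: the "normal" carry never appears. */
    for (i = 0; i < P521_WORDS; ++i) {
        acc += (uint64_t)a[i] + b[i];
        r[i] = (uint32_t)acc;
        acc >>= 32;
    }

    /* The real overflow is bit 521, i.e. bit 9 of the top word. */
    overflow = r[P521_WORDS - 1] >> 9;
    r[P521_WORDS - 1] &= 0x1FFu;

    /* 2^521 = p + 1, so dropping bit 521 and adding 1 subtracts p. */
    acc = overflow;
    for (i = 0; i < P521_WORDS && acc != 0; ++i) {
        acc += r[i];
        r[i] = (uint32_t)acc;
        acc >>= 32;
    }

    /* The sum can also land exactly on p (e.g. 1 + (p - 1)); map it to 0. */
    if (p521_is_p(r)) {
        for (i = 0; i < P521_WORDS; ++i) {
            r[i] = 0;
        }
    }
}
```

The fold never carries past bit 520: when the overflow bit is set, the masked sum is at most 2^521 - 4, so adding 1 stays below p.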
Update - I was able to incorporate the NIST optimized equation for the P-521 curve that I added, and it works (it passes the NIST test vectors I also included in the test_ecdh_all.c file I made). I just had to keep only the low 9 bits of the high word and rearrange everything so that each tmp[i] takes the upper 23 bits of one word shifted right by 9 bits and ORs in the low 9 bits of the next word shifted left by 23 bits. Run the loop up to index num_words_secp521r1 - 2 (inclusive), and set the last element, tmp[num_words_secp521r1 - 1], to the upper 23 bits shifted right by 9 and ANDed with 0x000001FF. No changes to any other files were needed (except enabling an optimization level > 0 in uECC.h). Thanks for making the base software, and thanks for including the references to the math papers so I could learn about the mmod and Jacobian algorithms. Ed B.
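Ed's description of the folding step can be sketched as a standalone function. This assumes 32-bit words, a 34-word little-endian product below 2^1042, and hypothetical helpers (`p521_geq_p`, `p521_sub_p`) standing in for micro-ecc's uECC_vli_* routines; the final loop mirrors the compare-and-subtract loop in the code quoted later in this thread. Here the loop bound is read as inclusive, so s[0] through s[15] come from the loop and s[16] comes from product[32]. It is not constant-time.

```c
#include <assert.h>
#include <stdint.h>

#define P521_WORDS 17 /* 32-bit words: 16 full words + 9 bits */

/* p = 2^521 - 1, little-endian 32-bit words */
static const uint32_t P521[P521_WORDS] = {
    0xFFFFFFFFu, 0xFFFFFFFFu, 0xFFFFFFFFu, 0xFFFFFFFFu,
    0xFFFFFFFFu, 0xFFFFFFFFu, 0xFFFFFFFFu, 0xFFFFFFFFu,
    0xFFFFFFFFu, 0xFFFFFFFFu, 0xFFFFFFFFu, 0xFFFFFFFFu,
    0xFFFFFFFFu, 0xFFFFFFFFu, 0xFFFFFFFFu, 0xFFFFFFFFu,
    0x000001FFu
};

/* Returns 1 if r >= p, comparing from the most significant word down. */
static int p521_geq_p(const uint32_t r[P521_WORDS]) {
    int i;
    for (i = P521_WORDS - 1; i >= 0; --i) {
        if (r[i] != P521[i]) {
            return r[i] > P521[i];
        }
    }
    return 1; /* equal */
}

/* r -= p with borrow propagation. */
static void p521_sub_p(uint32_t r[P521_WORDS]) {
    uint64_t borrow = 0;
    int i;
    for (i = 0; i < P521_WORDS; ++i) {
        uint64_t d = (uint64_t)r[i] - P521[i] - borrow;
        r[i] = (uint32_t)d;
        borrow = (d >> 32) & 1u; /* 1 if this word's subtraction wrapped */
    }
}

/* result = product mod p, for a 2 * P521_WORDS-word product < 2^1042. */
void p521_mmod_fast32(uint32_t result[P521_WORDS],
                      const uint32_t product[2 * P521_WORDS]) {
    uint32_t s[P521_WORDS];
    uint64_t acc = 0;
    int i;

    assert(product[2 * P521_WORDS - 1] == 0 &&
           (product[2 * P521_WORDS - 2] >> 18) == 0);

    /* s = product >> 521: upper 23 bits of word 16 + i shifted right by 9,
       OR'd with the low 9 bits of word 17 + i shifted left by 23.
       i runs up to P521_WORDS - 2 inclusive. */
    for (i = 0; i < P521_WORDS - 1; ++i) {
        s[i] = (product[P521_WORDS - 1 + i] >> 9) |
               (product[P521_WORDS + i] << 23);
    }
    /* Top word of s: only product[32] is left (its bits 9..17). */
    s[P521_WORDS - 1] = product[2 * P521_WORDS - 2] >> 9;

    /* result = t + s, with t = product mod 2^521 (top word masked to 9 bits). */
    for (i = 0; i < P521_WORDS; ++i) {
        uint32_t t = (i == P521_WORDS - 1) ? (product[i] & 0x1FFu) : product[i];
        acc += (uint64_t)t + s[i];
        result[i] = (uint32_t)acc;
        acc >>= 32;
    }
    /* acc is 0 here: t + s < 2^522 fits in 17 words. */

    while (p521_geq_p(result)) {
        p521_sub_p(result);
    }
}
```

For products below p^2 the final loop subtracts p at most once; the assertion only guarantees a product below 2^1042, where a second pass can occur, which is why it is written as a loop.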
I didn't include secp384r1 or secp521r1 because they are unusably slow on 8-bit platforms, and the optimized assembly would be very large. If a lot of people want to use them, I could add them, though. To merge your changes into micro-ecc, you would submit a pull request. Thanks for your interest!
@balchick, do you have a fork of this repository with support for P-384 and P-521? I need support for those curves and was planning on adding them (at least for ARMv7 and ARM64; other platforms are less important to me). Since you already have them implemented, I would be grateful if you could share that code.
@balchick, @kmackay, I am trying to implement P-521 in my code, and it works with optimisation disabled (uECC_OPTIMIZATION_LEVEL==0); however, I am struggling to write an optimised mmod function based on the NIST paper and the suggestions from @balchick.
So far, it looks like this (for 64-bit platforms):

    static void vli_mmod_fast_secp521r1(uint64_t *result, uint64_t *product) {
        uint64_t tmp[ num_words_secp521r1 ];
        int carry;
        int i;
        /* t */
        uECC_vli_set(result, product, num_words_secp521r1);
        result[ num_words_secp521r1 - 1 ] &= 0x01FF;
        /* s */
        for ( i = 0; i < num_words_secp521r1 - 2; ++i ) {
            tmp[ i ] = ( product[ num_words_secp521r1 - 1 + i ] >> 9 ) | ( product[ num_words_secp521r1 + i ] << 55 );
        }
        tmp[ num_words_secp521r1 - 1 ] = ( product[ num_words_secp521r1 + num_words_secp521r1 - 1 ] >> 9 ) & 0x01FF;
        carry = (int)uECC_vli_add(result, result, tmp, num_words_secp521r1);
        while (carry || uECC_vli_cmp_unsafe(curve_secp521r1.p, result, num_words_secp521r1) != 1) {
            carry -= uECC_vli_sub( result, result, curve_secp521r1.p, num_words_secp521r1);
        }
    }

However, this does not work and I cannot find the reason why. I would be very grateful if you could give me some hints.
- the implementation does not work for an unknown reason, but gives an overview of the final performance of P-521 - details can be tracked here: kmackay#99 (comment)
Hi,
I have some code on another computer.
I'll compare my 32-bit version and let you know.
Ed
-----Original Message-----
From: DoDo <notifications@github.com>
To: kmackay/micro-ecc <micro-ecc@noreply.github.com>
Cc: balchick <balchickej@aol.com>; Mention <mention@noreply.github.com>
Sent: Thu, Aug 10, 2017 12:34 pm
Subject: Re: [kmackay/micro-ecc] Optimized support for P-384r1 and P-521r1 (#99)
@balchick, this is my 32-bit version (in case it helps with the comparison). It doesn't work either (same as the 64-bit one). The only difference is in the left shift: it is now by 23 bits, not 55 as in the 64-bit case:

    static void vli_mmod_fast_secp521r1(uint32_t *result, uint32_t *product) {
        uint32_t tmp[ num_words_secp521r1 ];
        int carry;
        int i;
        /* t */
        uECC_vli_set(result, product, num_words_secp521r1);
        result[ num_words_secp521r1 - 1 ] &= 0x01FF;
        /* s */
        for ( i = 0; i < num_words_secp521r1 - 2; ++i ) {
            tmp[ i ] = ( product[ num_words_secp521r1 - 1 + i ] >> 9 ) | ( product[ num_words_secp521r1 + i ] << 23 );
        }
        tmp[ num_words_secp521r1 - 1 ] = ( product[ num_words_secp521r1 + num_words_secp521r1 - 1 ] >> 9 ) & 0x01FF;
        carry = (int)uECC_vli_add(result, result, tmp, num_words_secp521r1);
        while (carry || uECC_vli_cmp_unsafe(curve_secp521r1.p, result, num_words_secp521r1) != 1) {
            carry -= uECC_vli_sub( result, result, curve_secp521r1.p, num_words_secp521r1);
        }
    }
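For comparison with the two attempts above, here is a minimal self-contained sketch of the same reduction for 64-bit words (9 words, shifts of 9 and 55). It uses hypothetical helper names rather than micro-ecc's uECC_vli_* functions and assumes the 18-word little-endian product is below 2^1042. Unlike the attempts above, its first loop fills s[0] through s[7] (stopping one word before the top), and it takes the top word of s from product[16]; those two indices are where the attempts go wrong. It is not constant-time.

```c
#include <assert.h>
#include <stdint.h>

#define P521_WORDS64 9 /* 64-bit words: 8 full words + 9 bits = 576 bits */

/* p = 2^521 - 1, little-endian 64-bit words */
static const uint64_t P521_64[P521_WORDS64] = {
    UINT64_MAX, UINT64_MAX, UINT64_MAX, UINT64_MAX,
    UINT64_MAX, UINT64_MAX, UINT64_MAX, UINT64_MAX,
    0x1FFu
};

/* Returns 1 if r >= p. */
static int p521_geq_p64(const uint64_t r[P521_WORDS64]) {
    int i;
    for (i = P521_WORDS64 - 1; i >= 0; --i) {
        if (r[i] != P521_64[i]) {
            return r[i] > P521_64[i];
        }
    }
    return 1;
}

/* r -= p with explicit borrow detection (no 128-bit type needed). */
static void p521_sub_p64(uint64_t r[P521_WORDS64]) {
    uint64_t borrow = 0;
    int i;
    for (i = 0; i < P521_WORDS64; ++i) {
        uint64_t pi = P521_64[i];
        /* Borrow out if r[i] < pi, or if r[i] - pi == 0 and a borrow comes in. */
        uint64_t next = (r[i] < pi) | ((r[i] - pi) < borrow);
        r[i] = r[i] - pi - borrow;
        borrow = next;
    }
}

/* result = product mod p, for an 18-word product < 2^1042. */
void p521_mmod_fast64(uint64_t result[P521_WORDS64],
                      const uint64_t product[2 * P521_WORDS64]) {
    uint64_t s[P521_WORDS64];
    uint64_t carry = 0;
    int i;

    assert(product[2 * P521_WORDS64 - 1] == 0 &&
           (product[2 * P521_WORDS64 - 2] >> 18) == 0);

    /* Same split as the 32-bit case; the shifts are 9 and 64 - 9 = 55. */
    for (i = 0; i < P521_WORDS64 - 1; ++i) {
        s[i] = (product[P521_WORDS64 - 1 + i] >> 9) |
               (product[P521_WORDS64 + i] << 55);
    }
    s[P521_WORDS64 - 1] = product[2 * P521_WORDS64 - 2] >> 9;

    /* result = t + s, where t keeps only 9 bits of word 8. */
    for (i = 0; i < P521_WORDS64; ++i) {
        uint64_t t = (i == P521_WORDS64 - 1) ? (product[i] & 0x1FFu) : product[i];
        uint64_t sum = t + carry;
        uint64_t c = sum < carry;     /* wrapped while adding the carry */
        sum += s[i];
        carry = c | (sum < s[i]);     /* wrapped while adding s[i] */
        result[i] = sum;
    }

    while (p521_geq_p64(result)) {
        p521_sub_p64(result);
    }
}
```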
Hi,
I put my changes in below.
Also, make sure that the non-optimized version (uECC_OPTIMIZATION_LEVEL set to 0) produces the correct results on the NIST test vectors, to confirm that you entered all the curve data correctly.
The micro-ecc author doesn't have those tests; I wrote them for myself when I was adding support for the P-384 and P-521 curves.
I assume you also know that you have to deal with the extra bits in the public keys and shared secrets if you want to compare them, since the NIST data is exactly the right length. I use 17 32-bit words and lop off the extra bytes when printing and comparing.
I also wrote a routine to substitute a constant private key into the private-key variable and then call only compute_public_key in the test.
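The "lop off the extra bytes" step can be sketched like this, assuming the value is held in 17 little-endian 32-bit words and the NIST vectors are 66-byte big-endian strings (521 bits rounded up to whole bytes). The function name `p521_words_to_be_bytes` is hypothetical and not part of micro-ecc.

```c
#include <assert.h>
#include <stdint.h>

#define P521_WORDS 17
#define P521_BYTES 66 /* (521 + 7) / 8 */

/* Hypothetical helper: serialize a 521-bit value held in 17 little-endian
   32-bit words into the 66-byte big-endian form used by the NIST vectors.
   The two unused high bytes of the 68-byte (544-bit) buffer are dropped. */
void p521_words_to_be_bytes(uint8_t out[P521_BYTES],
                            const uint32_t w[P521_WORDS]) {
    int i;
    /* Anything above bit 520 would be silently lost by the truncation. */
    assert((w[P521_WORDS - 1] >> 9) == 0);
    for (i = 0; i < P521_BYTES; ++i) {
        /* Byte i of the number, counting from the least significant end. */
        uint8_t b = (uint8_t)(w[i / 4] >> (8 * (i % 4)));
        out[P521_BYTES - 1 - i] = b;
    }
}
```

Comparing against a NIST value is then a `memcmp` of the 66 bytes.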
Hope this helps.
Ed
-----Original Message-----
From: DoDo <notifications@github.com>
To: kmackay/micro-ecc <micro-ecc@noreply.github.com>
Cc: balchick <balchickej@aol.com>; Mention <mention@noreply.github.com>
Sent: Fri, Aug 11, 2017 4:32 am
Subject: Re: [kmackay/micro-ecc] Optimized support for P-384r1 and P-521r1 (#99)

    static void vli_mmod_fast_secp521r1(uint32_t *result, uint32_t *product) {
        uint32_t tmp[ num_words_secp521r1 ];
        uECC_word_t tmp2 = 0; // new by ejb
        int carry;
        int i;
        /* t */
        tmp2 = product[num_words_secp521r1 - 1]; // new by ejb
        product[num_words_secp521r1 - 1] = tmp2 & 0x000001FF; // new by ejb
        uECC_vli_set(result, product, num_words_secp521r1); // moved this down
        product[num_words_secp521r1 - 1] = tmp2; // new by ejb
        // delete this line (I added a slightly different one above): result[ num_words_secp521r1 - 1 ] &= 0x01FF;
        /* s */
        for ( i = 0; i < num_words_secp521r1 - 1; i++ ) // Note that I changed the line to only subtract 1 instead of 2 and do i++. Your original code may work if you just fix this, not sure. ejb
        {
            tmp[i] = ((product[i + num_words_secp521r1] & 0x000001ff) << 23) | ((product[i + num_words_secp521r1 - 1] & 0xFFFFFE00) >> 9); // new by ejb to replace the line you have below
            // delete this line and use mine: tmp[ i ] = ( product[ num_words_secp521r1 - 1 + i ] >> 9 ) | ( product[ num_words_secp521r1 + i ] << 23 );
        }
        tmp[ num_words_secp521r1 - 1 ] = ( product[ 2 * num_words_secp521r1 - 2 ] & 0xFFFFFE00 ) >> 9; // NOTE - I edited this line for you. ejb
        carry = (int)uECC_vli_add(result, result, tmp, num_words_secp521r1); // NOTE - I edited this line for you. ejb
        if (carry < 0)
        {
            do
            {
                carry += uECC_vli_add( result, result, curve_secp521r1.p, num_words_secp521r1); // new line by ejb
            }
            while (carry < 0);
        }
        else
        {
            while (carry || uECC_vli_cmp_unsafe(curve_secp521r1.p, result, num_words_secp521r1) != 1)
            {
                carry -= uECC_vli_sub( result, result, curve_secp521r1.p, num_words_secp521r1);
            }
        } // don't forget this extra bracket. ejb
    }

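The bounds behind the corrected routine can be checked with a short derivation, assuming the product is below p^2 (which holds when both multiplicands are already reduced modulo p). Here t is the low 521 bits of the product and s is the product shifted right by 521:

```latex
\begin{aligned}
p &= 2^{521} - 1, \qquad x = t + 2^{521} s, \qquad 0 \le t < 2^{521},\\
2^{521} &\equiv 1 \pmod{p} \;\Rightarrow\; x \equiv t + s \pmod{p},\\
x &\le p^2 - 1 = 2^{521}\,(2^{521} - 2) \;\Rightarrow\; s \le 2^{521} - 2,\\
t + s &\le (2^{521} - 1) + (2^{521} - 2) = 2p - 1 < 2^{522}.
\end{aligned}
```

So t + s fits in 17 32-bit words (544 bits), uECC_vli_add never reports a carry in this function, at most one subtraction of p is needed, and the `carry < 0` branch is never taken (harmless, but unreachable here).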
Hi,
I also implemented the 32-bit version of the NIST equation for the P-384 curve.
Let me know if you want that.
I did not do any ARM optimizations. I wish I knew how to add some of those routines for the 32-bit ARM platform :)
Ed
- secp521r1 currently works only for uECC_OPTIMIZATION_LEVEL <= 2 and is around 6 times slower than secp256k1 at the same optimisation level and around 12 times slower than secp256k1 at optimisation level 4
Hi @balchick! Thank you a lot! With your help, I was able to fix my code, and the secp521r1 tests now pass for me. The bug was in the for loop (it had to go until num_words_secp521r1 - 1, as you noticed), and the shifting of the last element of tmp was done incorrectly. My final (working) code can now be found in my fork of this repository; however, on ARM it only works with uECC_OPTIMIZATION_LEVEL <= 2, since optimisation levels beyond 2 expect ARM assembly-optimised uECC_vli_* functions, which would be very large for secp521r1 (as @kmackay stated in his last comment).
> I also implemented the 32-bit version of the NIST equation for the P-384 curve.
> Let me know if you want that.
For the sake of completeness, it would be good to add that too: it would make this library complete for those who aren't looking for assembly-optimised versions of the routines or don't target 8-bit computers (that would have to be stated in the README, so people don't expect an assembly-optimised version for these curves too). Based on your code, I will add a 64-bit version (I don't have any 8-bit devices to add and test an 8-bit version) and send a pull request to @kmackay so he can include it in the original software, if he wants.
Hi,
Glad that worked.
I'll send you an e-mail with the NIST equation for P-384 tomorrow.
Later this month I might take a shot at writing the 32-bit ARM asm. I don't know ARM asm, but I have a book, and if I can follow his 32-bit code for P-256, I figure I can expand it to make it work.
I have some 32-bit performance numbers for the supported curves on an ARM platform, but I can't remember whether it helps a lot over the NIST math. I'll get a list of the numbers for your info.
Ed
-----Original Message-----
From: DoDo <notifications@github.com>
To: kmackay/micro-ecc <micro-ecc@noreply.github.com>
Cc: balchick <balchickej@aol.com>; Mention <mention@noreply.github.com>
Sent: Sun, Aug 13, 2017 9:57 am
Subject: Re: [kmackay/micro-ecc] Optimized support for P-384r1 and P-521r1 (#99)
Hi!
Good luck! I tried that and concluded that the optimised assembly would be very large (@kmackay also stated that previously). However, a good way to do it might be to start from C code, first optimise it until the compiler produces the best possible assembly (this tool is good for that task), and finally hand-optimise the generated assembly for the best possible performance. It's a lot of work, but it could pay off in cases where performance is critical.
On my benchmarks (dual-core Cortex-A9 in a low-end Android device), However, with the advancement of technology, once the average Android device is faster, I will eventually upgrade my software to use
Hi,
I am new to GitHub and found your micro-ecc software. I am interested in using it on ARM microcontrollers and was able to get it working just fine.
I was able to modify the curve-specific.inc file to include support for the NIST P-384r1 and P-521r1 curves, and I added code to the test-ecdh.c program to include and test the NIST test vectors. I swap their private key into the variable, and I modified the make_key function to create a make_public_key function.
My question is that I have no idea how to give you my code. Can you point me to a guide that would help? Would you even want it?
What I'd really like is to see whether you can add the NIST ECC optimizations for these two curves, the way you have for the other ones. When I compile those in, my code is 20-40x faster, depending on whether I run it on Linux or ARM bare metal on my dev board.
Thanks for the software and thanks for your time!
Ed B.