Skip to content

Commit 59a3c2e

Browse files
author
ipl_ci
committed
WW07'25 source code update
1 parent b2a11da commit 59a3c2e

File tree

60 files changed

+574
-326
lines changed

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

60 files changed

+574
-326
lines changed

CHANGELOG.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -11,6 +11,8 @@ Cmake build options `-DMERGED_BLD:BOOL=off -DMBX_PLATFORM_LIST="k1;l9"` may be u
1111
[BUILD.md](./BUILD.md) for the details.
1212
- Fixed AVX512 IFMA implementation (k1 branch) of SM2 signature and verification single-buffer algorithm. The optimized path is re-enabled.
1313
- Added `ippsHashMethod_SM3_NI` and `ippsHashMethod_SM3_TT` methods for SM3 hash algorithm optimization with the new SM3 instructions for Lunar Lake and Arrow Lake S CPUs. The runtime dispatch introduced in Intel(R) Cryptography Primitives Library 1.0.0 release `ippsHashMethod_SM3` is moved to `ippsHashMethod_SM3_TT` and the behavior of `ippsHashMethod` API is aligned with SHA hash family.
14+
- Deprecated `fips_selftest_ippsRSASignVerify_PKCS1v15_rmf_get_size_keys` and `fips_selftest_ippsRSASignVerify_PKCS1v15_rmf_get_size`.
15+
Please see [DEPRECATION_NOTES](DEPRECATION_NOTES.md) for more details.
1416

1517
## Intel(R) Cryptography Primitives Library 1.0.1
1618
- Fixed an issue with invalid memory access for AES-GCM algorithm with Intel® Advanced Vector Extensions 2 (Intel® AVX2) vector extensions of Intel® AES New Instructions (Intel® AES-NI) in case of corner sizes.

DEPRECATION_NOTES.md

Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4,6 +4,20 @@ This document describes deprecated API in different Intel® Cryptography Primiti
44

55
The deprecated API means it is obsolete and will be removed in one of future Intel® Cryptography Primitives Library releases. If you have any concerns, please use the following link for opening a ticket and providing feedback: <https://supporttickets.intel.com.>
66

7+
## Intel® Cryptography Primitives Library v1.1.0
8+
9+
### FIPS self-tests
10+
11+
The common `fips_selftest_ippsRSASignVerify_PKCS1v15_rmf_get_size*` functions have been specialized for signing and signature verification.
12+
The memory footprint of the corresponding FIPS self-tests was reduced.
13+
14+
| Deprecated | Recommended replacement | Context |
15+
| :--------------------------------------------------------- | :----------------------------------------------------: | :--------------------------------------------------------------------: |
16+
| fips_selftest_ippsRSASignVerify_PKCS1v15_rmf_get_size_keys | fips_selftest_ippsRSASign_PKCS1v15_rmf_get_size_keys | Understand memory requirements of RSA PKCS1 v1.5 signature generation |
17+
| fips_selftest_ippsRSASignVerify_PKCS1v15_rmf_get_size | fips_selftest_ippsRSASign_PKCS1v15_rmf_get_size | Understand memory requirements of RSA PKCS1 v1.5 signature generation |
18+
| fips_selftest_ippsRSASignVerify_PKCS1v15_rmf_get_size_keys | fips_selftest_ippsRSAVerify_PKCS1v15_rmf_get_size_keys | Understand memory requirements of RSA PKCS1 v1.5 signature verfication |
19+
| fips_selftest_ippsRSASignVerify_PKCS1v15_rmf_get_size | fips_selftest_ippsRSAVerify_PKCS1v15_rmf_get_size | Understand memory requirements of RSA PKCS1 v1.5 signature verfication |
20+
721
## Intel® Cryptography Primitives Library v1.0.0
822

923
### Service Functions

README_FIPS.md

Lines changed: 20 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -276,15 +276,31 @@ fips_test_status fips_selftest_ippsHMACMessage_rmf (void);
276276
##### RSA sign/verify
277277
278278
```cpp
279-
fips_test_status fips_selftest_ippsRSASignVerify_PKCS1v15_rmf_get_size_keys (int *pKeysBufferSize);
280-
fips_test_status fips_selftest_ippsRSASignVerify_PKCS1v15_rmf_get_size (int *pBufferSize Ipp8u *pKeysBuffer);
279+
fips_test_status fips_selftest_ippsRSASign_PKCS1v15_rmf_get_size_keys (int *pKeysBufferSize);
280+
fips_test_status fips_selftest_ippsRSASign_PKCS1v15_rmf_get_size (int *pBufferSize Ipp8u *pKeysBuffer);
281281
fips_test_status fips_selftest_ippsRSASign_PKCS1v15_rmf (Ipp8u *pBuffer Ipp8u *pKeysBuffer);
282+
```
283+
284+
, where `pBuffer` is the valid buffer for selftest of size indicated by
285+
`fips_selftest_ippsRSASign_PKCS1v15_rmf_get_size` and `pKeysBuffer` is the
286+
valid buffer for selftest of size indicated by `fips_selftest_ippsRSASign_PKCS1v15_rmf_get_size_keys`.
287+
288+
```cpp
289+
fips_test_status fips_selftest_ippsRSAVerify_PKCS1v15_rmf_get_size_keys (int *pKeysBufferSize);
290+
fips_test_status fips_selftest_ippsRSAVerify_PKCS1v15_rmf_get_size (int *pBufferSize Ipp8u *pKeysBuffer);
282291
fips_test_status fips_selftest_ippsRSAVerify_PKCS1v15_rmf (Ipp8u *pBuffer Ipp8u *pKeysBuffer);
283292
```
284293
285294
, where `pBuffer` is the valid buffer for selftest of size indicated by
286-
`fips_selftest_ippsRSASignVerify_PKCS1v15_rmf_get_size` and `pKeysBuffer` is the
287-
valid buffer for selftest of size indicated by `fips_selftest_ippsRSASignVerify_PKCS1v15_rmf_get_size_keys`.
295+
`fips_selftest_ippsRSAVerify_PKCS1v15_rmf_get_size` and `pKeysBuffer` is the
296+
valid buffer for selftest of size indicated by `fips_selftest_ippsRSAVerify_PKCS1v15_rmf_get_size_keys`.
297+
298+
Following APIs have been deprecated:
299+
```cpp
300+
fips_test_status fips_selftest_ippsRSASignVerify_PKCS1v15_rmf_get_size_keys (int *pKeysBufferSize);
301+
fips_test_status fips_selftest_ippsRSASignVerify_PKCS1v15_rmf_get_size (int *pBufferSize Ipp8u *pKeysBuffer);
302+
```
303+
Their transition plan can be found in [DEPRECATION_NOTES](./DEPRECATION_NOTES.md).
288304

289305
```cpp
290306
fips_test_status fips_selftest_ippsRSASignVerify_PSS_rmf_get_size_keys (int *pKeysBufferSize);

include/ippcp/fips_cert.h

Lines changed: 12 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -84,8 +84,20 @@ IPPAPI(fips_test_status, fips_selftest_ippsHMACUpdate_rmf, (Ipp8u *pBuffer))
8484
IPPAPI(fips_test_status, fips_selftest_ippsHMACMessage_rmf, (void))
8585

8686
/* RSA sign/verify */
87+
88+
#define LEAN_GET_SIZE "Function reports more memory than required. A leaner alternative is available: \
89+
fips_selftest_ippsRSASign_PKCS1v15_rmf_get_size_keys, fips_selftest_ippsRSASign_PKCS1v15_rmf_get_size, \
90+
fips_selftest_ippsRSAVerify_PKCS1v15_rmf_get_size_keys, fips_selftest_ippsRSAVerify_PKCS1v15_rmf_get_size."
91+
92+
IPP_DEPRECATED(LEAN_GET_SIZE) \
8793
IPPAPI(fips_test_status, fips_selftest_ippsRSASignVerify_PKCS1v15_rmf_get_size_keys, (int *pKeysBufferSize))
94+
IPP_DEPRECATED(LEAN_GET_SIZE) \
8895
IPPAPI(fips_test_status, fips_selftest_ippsRSASignVerify_PKCS1v15_rmf_get_size, (int *pBufferSize, Ipp8u *pKeysBuffer))
96+
97+
IPPAPI(fips_test_status, fips_selftest_ippsRSASign_PKCS1v15_rmf_get_size_keys, (int *pKeysBufferSize))
98+
IPPAPI(fips_test_status, fips_selftest_ippsRSASign_PKCS1v15_rmf_get_size, (int *pBufferSize, Ipp8u *pKeysBuffer))
99+
IPPAPI(fips_test_status, fips_selftest_ippsRSAVerify_PKCS1v15_rmf_get_size_keys, (int *pKeysBufferSize))
100+
IPPAPI(fips_test_status, fips_selftest_ippsRSAVerify_PKCS1v15_rmf_get_size, (int *pBufferSize, Ipp8u *pKeysBuffer))
89101
IPPAPI(fips_test_status, fips_selftest_ippsRSASign_PKCS1v15_rmf, (Ipp8u *pBuffer, Ipp8u *pKeysBuffer))
90102
IPPAPI(fips_test_status, fips_selftest_ippsRSAVerify_PKCS1v15_rmf, (Ipp8u *pBuffer, Ipp8u *pKeysBuffer))
91103

sources/ippcp/asm_ia32/pcpsha512l9ni.asm

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -76,7 +76,6 @@ IPPASM UpdateSHA512ni,PUBLIC
7676
;; hash infrastructure (caller) sends the block size in bytes
7777
;; the algorithm requires the number of 2^7=128 byte blocks
7878
shr arg_num_blks, 7
79-
or arg_num_blks, arg_num_blks
8079
je .done_hash
8180

8281
mov arg_hash, pDigest

sources/ippcp/asm_intel64/pcpsha1l9as.asm

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -35,7 +35,7 @@
3535
;;
3636
;; assignments
3737
;;
38-
%xdefine hA eax ;; hash values into GPR registers
38+
%xdefine hA eax ;; hash values into general purpose registers (GPRs)
3939
%xdefine F ebp
4040
%xdefine hB ebx
4141
%xdefine hC ecx

sources/ippcp/asm_intel64/pcpsha256l9as.asm

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -34,7 +34,7 @@
3434
;;
3535
;; assignments
3636
;;
37-
%xdefine hA eax ;; hash values into GPR registers
37+
%xdefine hA eax ;; hash values into general purpose registers (GPRs)
3838
%xdefine hB ebx
3939
%xdefine hC ecx
4040
%xdefine hD edx

sources/ippcp/asm_intel64/pcpsha512l9ni.asm

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -78,7 +78,6 @@ IPPASM UpdateSHA512ni,PUBLIC
7878
;; hash infrastructure (caller) sends the block size in bytes
7979
;; the algorithm requires the number of 2^7=128 byte blocks
8080
shr arg_num_blks, 7
81-
or arg_num_blks, arg_num_blks
8281
je .done_hash
8382

8483
;; ===========================================================

sources/ippcp/asm_intel64/pcpsm4l9_ni_as.asm

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -90,6 +90,9 @@ IPPASM cpSMS4_SetRoundKeys_ni,PUBLIC
9090
%assign i (i + 1)
9191
%endrep
9292

93+
; zeroize
94+
vpxor xmm0, xmm0
95+
9396
REST_XMM_AVX
9497
REST_GPR
9598
ret
@@ -127,6 +130,9 @@ IPPASM cpSMS4_ECB_ni,PUBLIC
127130
vpshufb xmm0, [rel out_shufb]
128131
vmovdqu [pOut], xmm0
129132

133+
; zeroize
134+
vpxor xmm0, xmm0
135+
130136
REST_XMM_AVX
131137
REST_GPR
132138
ret
@@ -165,6 +171,10 @@ IPPASM cpSMS4_ECB_ni_256,PUBLIC
165171
vpshufb ymm0, [rel out_shufb]
166172
vmovdqu [pOut], ymm0
167173

174+
; zeroize
175+
vpxor ymm0, ymm0
176+
vpxor ymm1, ymm1
177+
168178
REST_XMM_AVX
169179
REST_GPR
170180
ret

sources/ippcp/crypto_mb/src/rsa/internal_avx2/ifma_ams52x10_diagonal_mb4.c

Lines changed: 98 additions & 102 deletions
Original file line numberDiff line numberDiff line change
@@ -64,108 +64,104 @@ static void ams52x10_square_diagonal_mb4(__m256i* res, const int64u* inpA_mb)
6464
// 1st triangle - sum the products, double and square
6565
r0 = zero_simd;
6666

67-
res[0] = _mm256_madd52lo_epu64(r0, inpA[0], inpA[0]);
68-
r1 = zero_simd;
69-
r2 = zero_simd;
70-
r3 = zero_simd;
71-
r4 = zero_simd;
72-
r5 = zero_simd;
73-
r6 = zero_simd;
74-
r7 = zero_simd;
75-
r8 = zero_simd;
76-
AL = inpA[0];
77-
r0 = _mm256_madd52lo_epu64(r0, AL, inpA[1]); // Sum(1)
78-
r1 = _mm256_madd52lo_epu64(r1, AL, inpA[2]); // Sum(2)
79-
r2 = _mm256_madd52lo_epu64(r2, AL, inpA[3]); // Sum(3)
80-
r3 = _mm256_madd52lo_epu64(r3, AL, inpA[4]); // Sum(4)
81-
r4 = _mm256_madd52lo_epu64(r4, AL, inpA[5]); // Sum(5)
82-
r5 = _mm256_madd52lo_epu64(r5, AL, inpA[6]); // Sum(6)
83-
r6 = _mm256_madd52lo_epu64(r6, AL, inpA[7]); // Sum(7)
84-
r7 = _mm256_madd52lo_epu64(r7, AL, inpA[8]); // Sum(8)
85-
r1 = _mm256_madd52hi_epu64(r1, AL, inpA[1]); // Sum(1)
86-
r2 = _mm256_madd52hi_epu64(r2, AL, inpA[2]); // Sum(2)
87-
r3 = _mm256_madd52hi_epu64(r3, AL, inpA[3]); // Sum(3)
88-
r4 = _mm256_madd52hi_epu64(r4, AL, inpA[4]); // Sum(4)
89-
r5 = _mm256_madd52hi_epu64(r5, AL, inpA[5]); // Sum(5)
90-
r6 = _mm256_madd52hi_epu64(r6, AL, inpA[6]); // Sum(6)
91-
r7 = _mm256_madd52hi_epu64(r7, AL, inpA[7]); // Sum(7)
92-
r8 = _mm256_madd52hi_epu64(r8, AL, inpA[8]); // Sum(8)
93-
AL = inpA[1];
94-
r2 = _mm256_madd52lo_epu64(r2, AL, inpA[2]); // Sum(3)
95-
r3 = _mm256_madd52lo_epu64(r3, AL, inpA[3]); // Sum(4)
96-
r4 = _mm256_madd52lo_epu64(r4, AL, inpA[4]); // Sum(5)
97-
r5 = _mm256_madd52lo_epu64(r5, AL, inpA[5]); // Sum(6)
98-
r6 = _mm256_madd52lo_epu64(r6, AL, inpA[6]); // Sum(7)
99-
r7 = _mm256_madd52lo_epu64(r7, AL, inpA[7]); // Sum(8)
100-
r3 = _mm256_madd52hi_epu64(r3, AL, inpA[2]); // Sum(3)
101-
r4 = _mm256_madd52hi_epu64(r4, AL, inpA[3]); // Sum(4)
102-
r5 = _mm256_madd52hi_epu64(r5, AL, inpA[4]); // Sum(5)
103-
r6 = _mm256_madd52hi_epu64(r6, AL, inpA[5]); // Sum(6)
104-
r7 = _mm256_madd52hi_epu64(r7, AL, inpA[6]); // Sum(7)
105-
r8 = _mm256_madd52hi_epu64(r8, AL, inpA[7]); // Sum(8)
106-
AL = inpA[2];
107-
r4 = _mm256_madd52lo_epu64(r4, AL, inpA[3]); // Sum(5)
108-
r5 = _mm256_madd52lo_epu64(r5, AL, inpA[4]); // Sum(6)
109-
r6 = _mm256_madd52lo_epu64(r6, AL, inpA[5]); // Sum(7)
110-
r7 = _mm256_madd52lo_epu64(r7, AL, inpA[6]); // Sum(8)
111-
r5 = _mm256_madd52hi_epu64(r5, AL, inpA[3]); // Sum(5)
112-
r6 = _mm256_madd52hi_epu64(r6, AL, inpA[4]); // Sum(6)
113-
r7 = _mm256_madd52hi_epu64(r7, AL, inpA[5]); // Sum(7)
114-
r8 = _mm256_madd52hi_epu64(r8, AL, inpA[6]); // Sum(8)
115-
AL = inpA[3];
116-
r6 = _mm256_madd52lo_epu64(r6, AL, inpA[4]); // Sum(7)
117-
r7 = _mm256_madd52lo_epu64(r7, AL, inpA[5]); // Sum(8)
118-
r7 = _mm256_madd52hi_epu64(r7, AL, inpA[4]); // Sum(7)
119-
r8 = _mm256_madd52hi_epu64(r8, AL, inpA[5]); // Sum(8)
120-
r0 = _mm256_add_epi64(r0, r0); // Double(1)
121-
r0 = _mm256_madd52hi_epu64(r0, inpA[0], inpA[0]); // Add square(1)
122-
res[1] = r0;
123-
r1 = _mm256_add_epi64(r1, r1); // Double(2)
124-
r1 = _mm256_madd52lo_epu64(r1, inpA[1], inpA[1]); // Add square(2)
125-
res[2] = r1;
126-
r2 = _mm256_add_epi64(r2, r2); // Double(3)
127-
r2 = _mm256_madd52hi_epu64(r2, inpA[1], inpA[1]); // Add square(3)
128-
res[3] = r2;
129-
r3 = _mm256_add_epi64(r3, r3); // Double(4)
130-
r3 = _mm256_madd52lo_epu64(r3, inpA[2], inpA[2]); // Add square(4)
131-
res[4] = r3;
132-
r4 = _mm256_add_epi64(r4, r4); // Double(5)
133-
r4 = _mm256_madd52hi_epu64(r4, inpA[2], inpA[2]); // Add square(5)
134-
res[5] = r4;
135-
r5 = _mm256_add_epi64(r5, r5); // Double(6)
136-
r5 = _mm256_madd52lo_epu64(r5, inpA[3], inpA[3]); // Add square(6)
137-
res[6] = r5;
138-
r6 = _mm256_add_epi64(r6, r6); // Double(7)
139-
r6 = _mm256_madd52hi_epu64(r6, inpA[3], inpA[3]); // Add square(7)
140-
res[7] = r6;
141-
r7 = _mm256_add_epi64(r7, r7); // Double(8)
142-
r7 = _mm256_madd52lo_epu64(r7, inpA[4], inpA[4]); // Add square(8)
143-
res[8] = r7;
144-
r0 = r8;
145-
r1 = zero_simd;
146-
AL = inpA[0];
147-
r0 = _mm256_madd52lo_epu64(r0, AL, inpA[9]); // Sum(9)
148-
r1 = _mm256_madd52hi_epu64(r1, AL, inpA[9]); // Sum(9)
149-
AL = inpA[1];
150-
r0 = _mm256_madd52lo_epu64(r0, AL, inpA[8]); // Sum(9)
151-
r1 = _mm256_madd52hi_epu64(r1, AL, inpA[8]); // Sum(9)
152-
AL = inpA[2];
153-
r0 = _mm256_madd52lo_epu64(r0, AL, inpA[7]); // Sum(9)
154-
r1 = _mm256_madd52hi_epu64(r1, AL, inpA[7]); // Sum(9)
155-
AL = inpA[3];
156-
r0 = _mm256_madd52lo_epu64(r0, AL, inpA[6]); // Sum(9)
157-
r1 = _mm256_madd52hi_epu64(r1, AL, inpA[6]); // Sum(9)
158-
AL = inpA[4];
159-
r0 = _mm256_madd52lo_epu64(r0, AL, inpA[5]); // Sum(9)
160-
r1 = _mm256_madd52hi_epu64(r1, AL, inpA[5]); // Sum(9)
161-
AL = inpA[5];
162-
AL = inpA[6];
163-
AL = inpA[7];
164-
r0 = _mm256_add_epi64(r0, r0); // Double(9)
165-
r0 = _mm256_madd52hi_epu64(r0, inpA[4], inpA[4]); // Add square(9)
166-
res[9] = r0;
167-
r0 = r1;
168-
res[10] = r0; // finish up 1st triangle
67+
res[0] = _mm256_madd52lo_epu64(r0, inpA[0], inpA[0]);
68+
r1 = zero_simd;
69+
r2 = zero_simd;
70+
r3 = zero_simd;
71+
r4 = zero_simd;
72+
r5 = zero_simd;
73+
r6 = zero_simd;
74+
r7 = zero_simd;
75+
r8 = zero_simd;
76+
AL = inpA[0];
77+
r0 = _mm256_madd52lo_epu64(r0, AL, inpA[1]); // Sum(1)
78+
r1 = _mm256_madd52lo_epu64(r1, AL, inpA[2]); // Sum(2)
79+
r2 = _mm256_madd52lo_epu64(r2, AL, inpA[3]); // Sum(3)
80+
r3 = _mm256_madd52lo_epu64(r3, AL, inpA[4]); // Sum(4)
81+
r4 = _mm256_madd52lo_epu64(r4, AL, inpA[5]); // Sum(5)
82+
r5 = _mm256_madd52lo_epu64(r5, AL, inpA[6]); // Sum(6)
83+
r6 = _mm256_madd52lo_epu64(r6, AL, inpA[7]); // Sum(7)
84+
r7 = _mm256_madd52lo_epu64(r7, AL, inpA[8]); // Sum(8)
85+
r1 = _mm256_madd52hi_epu64(r1, AL, inpA[1]); // Sum(1)
86+
r2 = _mm256_madd52hi_epu64(r2, AL, inpA[2]); // Sum(2)
87+
r3 = _mm256_madd52hi_epu64(r3, AL, inpA[3]); // Sum(3)
88+
r4 = _mm256_madd52hi_epu64(r4, AL, inpA[4]); // Sum(4)
89+
r5 = _mm256_madd52hi_epu64(r5, AL, inpA[5]); // Sum(5)
90+
r6 = _mm256_madd52hi_epu64(r6, AL, inpA[6]); // Sum(6)
91+
r7 = _mm256_madd52hi_epu64(r7, AL, inpA[7]); // Sum(7)
92+
r8 = _mm256_madd52hi_epu64(r8, AL, inpA[8]); // Sum(8)
93+
AL = inpA[1];
94+
r2 = _mm256_madd52lo_epu64(r2, AL, inpA[2]); // Sum(3)
95+
r3 = _mm256_madd52lo_epu64(r3, AL, inpA[3]); // Sum(4)
96+
r4 = _mm256_madd52lo_epu64(r4, AL, inpA[4]); // Sum(5)
97+
r5 = _mm256_madd52lo_epu64(r5, AL, inpA[5]); // Sum(6)
98+
r6 = _mm256_madd52lo_epu64(r6, AL, inpA[6]); // Sum(7)
99+
r7 = _mm256_madd52lo_epu64(r7, AL, inpA[7]); // Sum(8)
100+
r3 = _mm256_madd52hi_epu64(r3, AL, inpA[2]); // Sum(3)
101+
r4 = _mm256_madd52hi_epu64(r4, AL, inpA[3]); // Sum(4)
102+
r5 = _mm256_madd52hi_epu64(r5, AL, inpA[4]); // Sum(5)
103+
r6 = _mm256_madd52hi_epu64(r6, AL, inpA[5]); // Sum(6)
104+
r7 = _mm256_madd52hi_epu64(r7, AL, inpA[6]); // Sum(7)
105+
r8 = _mm256_madd52hi_epu64(r8, AL, inpA[7]); // Sum(8)
106+
AL = inpA[2];
107+
r4 = _mm256_madd52lo_epu64(r4, AL, inpA[3]); // Sum(5)
108+
r5 = _mm256_madd52lo_epu64(r5, AL, inpA[4]); // Sum(6)
109+
r6 = _mm256_madd52lo_epu64(r6, AL, inpA[5]); // Sum(7)
110+
r7 = _mm256_madd52lo_epu64(r7, AL, inpA[6]); // Sum(8)
111+
r5 = _mm256_madd52hi_epu64(r5, AL, inpA[3]); // Sum(5)
112+
r6 = _mm256_madd52hi_epu64(r6, AL, inpA[4]); // Sum(6)
113+
r7 = _mm256_madd52hi_epu64(r7, AL, inpA[5]); // Sum(7)
114+
r8 = _mm256_madd52hi_epu64(r8, AL, inpA[6]); // Sum(8)
115+
AL = inpA[3];
116+
r6 = _mm256_madd52lo_epu64(r6, AL, inpA[4]); // Sum(7)
117+
r7 = _mm256_madd52lo_epu64(r7, AL, inpA[5]); // Sum(8)
118+
r7 = _mm256_madd52hi_epu64(r7, AL, inpA[4]); // Sum(7)
119+
r8 = _mm256_madd52hi_epu64(r8, AL, inpA[5]); // Sum(8)
120+
r0 = _mm256_add_epi64(r0, r0); // Double(1)
121+
r0 = _mm256_madd52hi_epu64(r0, inpA[0], inpA[0]); // Add square(1)
122+
res[1] = r0;
123+
r1 = _mm256_add_epi64(r1, r1); // Double(2)
124+
r1 = _mm256_madd52lo_epu64(r1, inpA[1], inpA[1]); // Add square(2)
125+
res[2] = r1;
126+
r2 = _mm256_add_epi64(r2, r2); // Double(3)
127+
r2 = _mm256_madd52hi_epu64(r2, inpA[1], inpA[1]); // Add square(3)
128+
res[3] = r2;
129+
r3 = _mm256_add_epi64(r3, r3); // Double(4)
130+
r3 = _mm256_madd52lo_epu64(r3, inpA[2], inpA[2]); // Add square(4)
131+
res[4] = r3;
132+
r4 = _mm256_add_epi64(r4, r4); // Double(5)
133+
r4 = _mm256_madd52hi_epu64(r4, inpA[2], inpA[2]); // Add square(5)
134+
res[5] = r4;
135+
r5 = _mm256_add_epi64(r5, r5); // Double(6)
136+
r5 = _mm256_madd52lo_epu64(r5, inpA[3], inpA[3]); // Add square(6)
137+
res[6] = r5;
138+
r6 = _mm256_add_epi64(r6, r6); // Double(7)
139+
r6 = _mm256_madd52hi_epu64(r6, inpA[3], inpA[3]); // Add square(7)
140+
res[7] = r6;
141+
r7 = _mm256_add_epi64(r7, r7); // Double(8)
142+
r7 = _mm256_madd52lo_epu64(r7, inpA[4], inpA[4]); // Add square(8)
143+
res[8] = r7;
144+
r0 = r8; // First, lo_ of 9th chunk is hi_ of 8th chunk
145+
r1 = zero_simd;
146+
AL = inpA[0];
147+
r0 = _mm256_madd52lo_epu64(r0, AL, inpA[9]); // Sum(9)
148+
r1 = _mm256_madd52hi_epu64(r1, AL, inpA[9]); // Sum(9)
149+
AL = inpA[1];
150+
r0 = _mm256_madd52lo_epu64(r0, AL, inpA[8]); // Sum(9)
151+
r1 = _mm256_madd52hi_epu64(r1, AL, inpA[8]); // Sum(9)
152+
AL = inpA[2];
153+
r0 = _mm256_madd52lo_epu64(r0, AL, inpA[7]); // Sum(9)
154+
r1 = _mm256_madd52hi_epu64(r1, AL, inpA[7]); // Sum(9)
155+
AL = inpA[3];
156+
r0 = _mm256_madd52lo_epu64(r0, AL, inpA[6]); // Sum(9)
157+
r1 = _mm256_madd52hi_epu64(r1, AL, inpA[6]); // Sum(9)
158+
AL = inpA[4];
159+
r0 = _mm256_madd52lo_epu64(r0, AL, inpA[5]); // Sum(9)
160+
r1 = _mm256_madd52hi_epu64(r1, AL, inpA[5]); // Sum(9)
161+
r0 = _mm256_add_epi64(r0, r0); // Double(9)
162+
r0 = _mm256_madd52hi_epu64(r0, inpA[4], inpA[4]); // Add square(9)
163+
res[9] = r0;
164+
r0 = r1; // finish up 1st triangle
169165

170166
ASM("jmp l0\nl0:\n");
171167

0 commit comments

Comments
 (0)