
Optimized SHA2 implementation. #78

Open · wants to merge 20 commits into master
Conversation

cheatfate (Owner):
Should address #36

m128i* {.importc: "__m128i", x86type.} = object
  data: array[2, uint64]

let
Contributor:

let introduces a global variable which, when you access it, is a side effect - this is (one of) the reasons func can't be used. It should be safe to switch this to const as long as only unaligned loads are used.

sha2 needs to be func for lots of reasons, ie one should be able to assume that it has no side effects (in fact, our code doesn't compile with this branch)

cheatfate (Owner, Dec 18, 2024):

const is not working here, because when Nim encounters <const>[x] in the code it substitutes it with the value, but for the AVX/AVX2 implementations I need the constants to be stored in memory in a specific order, one by one, so that AVX/AVX2 operations can load not a single 64-bit value but 4 or 8 values with one instruction.
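As a minimal illustration of the memory-layout point (illustrative values, not the real round constants): a global `let` materializes the table as one contiguous array in memory, so a wide vector load has an address to read several constants from, whereas a `const` indexed as `K0[i]` would be folded to a scalar value at the use site.

```nim
# Illustrative values only - not SHA-2 round constants.
let K0: array[8, uint64] = [1'u64, 2, 3, 4, 5, 6, 7, 8]

# A 256-bit load would read 4 consecutive 64-bit constants starting here;
# this only works because `let` keeps them laid out contiguously in memory.
let p = cast[ptr UncheckedArray[uint64]](K0[0].unsafeAddr)
doAssert p[3] == K0[3]  # contiguous: element 3 is reachable from element 0
```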

Contributor:

in that case, access to it needs to go through the same {.noSideEffect.} trick so that the public interface remains side-effect-free.

This is really a bug in nim though, that there is no native mechanism to make this kind of code side-effect-free - because at the end of the day, there's actually no side effect in the generated code, not even with let - it's just a semantic gap in nim.
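A minimal sketch of the trick being referred to (hypothetical names, not the PR's actual code): the global-table access is wrapped in a `cast(noSideEffect)` block so the public entry point can stay a `func`.

```nim
let K0 = [1'u64, 2, 3, 4]  # global lookup table; `let` so it stays in memory

func compressOne(x: uint64): uint64 =
  {.cast(noSideEffect).}:
    # Reading K0 is logically pure - it never changes after startup - but
    # Nim counts global access as a side effect, hence the cast block.
    result = x + K0[0]

doAssert compressOne(1) == 2
```

The cast is a promise to the compiler, not a check: it is only sound because the table is never mutated after initialization.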

cheatfate (Owner, Dec 19, 2024):

I'm still not sure about this {.noSideEffect.} thing.
init() updates the context - side effect present
update() updates the context - side effect present
finish() updates the context - side effect present
reset() updates the context - side effect present
What is the reason to avoid side effects for these specific API functions?

Contributor:

side effect in this context means that the function depends on or updates something outside of the given parameters - for example a global. The "test" to determine whether something has a side effect is to reason about whether two calls with the same inputs can yield different outputs - in this case, they can't. The reason the compiler doesn't understand this is a limitation in the nim compiler - it can't guarantee that a global variable (K0 in this case) keeps the same value between one call to update and the next.

Hash functions by definition are side-effect-free, or they wouldn't be hash functions - ie the output should depend only on the inputs and not anything else - "side-effect-free" for example allows the compiler to remove repeated computation of the same thing - for i in 0..9: echo digest("hello") can be rewritten by the compiler as let tmp = digest("hello"); for i in 0..9: echo tmp if it can prove (or is told) that digest is side-effect-free.

In libp2p for example, there's a type DigestFcn = proc(...): ... {.noSideEffect.} - it requires that all digest functions are side-effect-free so that when someone uses a DigestFcn, the compiler can optimise it like above.
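A toy illustration of the optimisation described above (a hypothetical `digest`, not the real API): because equal inputs give equal outputs, repeated calls may be computed once and reused.

```nim
# A side-effect-free digest: output depends only on the input.
func digest(s: string): int =
  for c in s:
    result = result * 31 + ord(c)

# The compiler is allowed to hoist the repeated call out of the loop,
# exactly the `let tmp = digest("hello")` rewrite described above.
let tmp = digest("hello")
for i in 0 .. 9:
  doAssert digest("hello") == tmp
```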

Contributor:

Nim added (partial) support for addressable constants, IIRC.

I made the request here nim-lang/RFCs#257 though I'm not sure what part is implemented and what not. cc @Araq

Contributor:

I can confirm that we can take the address of a const since at least 1.6, see example with ssse3 impl of SHA256:

https://github.com/mratsim/constantine/blob/7cffd2f/constantine/hashes/sha256/sha256_generic.nim#L42-L51

const K256* = [
  0x428a2f98'u32, 0x71374491'u32, 0xb5c0fbcf'u32, 0xe9b5dba5'u32, 0x3956c25b'u32, 0x59f111f1'u32, 0x923f82a4'u32, 0xab1c5ed5'u32,
  0xd807aa98'u32, 0x12835b01'u32, 0x243185be'u32, 0x550c7dc3'u32, 0x72be5d74'u32, 0x80deb1fe'u32, 0x9bdc06a7'u32, 0xc19bf174'u32,
  0xe49b69c1'u32, 0xefbe4786'u32, 0x0fc19dc6'u32, 0x240ca1cc'u32, 0x2de92c6f'u32, 0x4a7484aa'u32, 0x5cb0a9dc'u32, 0x76f988da'u32,
  0x983e5152'u32, 0xa831c66d'u32, 0xb00327c8'u32, 0xbf597fc7'u32, 0xc6e00bf3'u32, 0xd5a79147'u32, 0x06ca6351'u32, 0x14292967'u32,
  0x27b70a85'u32, 0x2e1b2138'u32, 0x4d2c6dfc'u32, 0x53380d13'u32, 0x650a7354'u32, 0x766a0abb'u32, 0x81c2c92e'u32, 0x92722c85'u32,
  0xa2bfe8a1'u32, 0xa81a664b'u32, 0xc24b8b70'u32, 0xc76c51a3'u32, 0xd192e819'u32, 0xd6990624'u32, 0xf40e3585'u32, 0x106aa070'u32,
  0x19a4c116'u32, 0x1e376c08'u32, 0x2748774c'u32, 0x34b0bcb5'u32, 0x391c0cb3'u32, 0x4ed8aa4a'u32, 0x5b9cca4f'u32, 0x682e6ff3'u32,
  0x748f82ee'u32, 0x78a5636f'u32, 0x84c87814'u32, 0x8cc70208'u32, 0x90befffa'u32, 0xa4506ceb'u32, 0xbef9a3f7'u32, 0xc67178f2'u32
]

https://github.com/mratsim/constantine/blob/7cffd2f/constantine/hashes/sha256/sha256_x86_ssse3.nim#L52-L66

const VecWords = 16 div sizeof(Word) # sizeof(m128i) / sizeof(Word)

func initMessageSchedule(
       msnext: var array[VecNum, m128i],
       ms: var Sha256_MessageSchedule,
       message: ptr UncheckedArray[byte]) {.inline.} =
  ## Initial state, from data
  ## - Precompute steps for the future message schedule `msnext`
  ## - compute the current message schedule `ms`

  let mask = setr_u32x4(0x00010203, 0x04050607, 0x08090a0b, 0x0c0d0e0f)
  let pK256 = K256.unsafeAddr()

  staticFor i, 0, VecNum:
    msnext[i] = loadu_u128(message[i * sizeof(m128i)].addr)
    msnext[i] = shuf_u8x16(msnext[i], mask)
    storea_u128(ms.w[VecWords*i].addr, add_u32x4(msnext[i], loadu_u128(pK256[VecWords*i].addr)))

Contributor:

I think it still doesn't support alignment, though


proc init*(ctx: var Sha2Context,
           implementation = Sha2Implementation.Auto,
           cpufeatures: set[CpuFeature] = {}) {.noinit.} =
Contributor:

with empty features by default, it becomes impractical to use the cpuid-based selection - users will end up with the slow ref implementation in most cases, including when using the convenience helpers found elsewhere in the library.

Also, this is not really a setting that is useful to have at the context level: we create a new context every time we hash something - doing the feature selection here is wasteful (hashing a beacon state creates millions of contexts).

The classic way to do this is something like:

let sha256Update: proc(p: pointer, size: int) {.noSideEffect.} = selectBestImpl()

# alternatively, lazy selection on first call:
var sha256Update = proc(p: pointer, size: int) =
  sha256Update = selectBestImplementation()
  sha256Update(p, size)

template callUpdate(p, s) =
  # There's no side effect here logically, the compiler just doesn't understand it
  {.noSideEffect.}:
    sha256Update(p, s)

there are fancier options but they tend to be overkill.

cheatfate (Owner):

It's supposed that you will obtain the CPU features list when the application starts, and use it everywhere you instantiate contexts.

Contributor:

use it everywhere when you instantiate contexts.

This is very impractical - generally, there is no reason to want the slower implementation outside of benchmarking and testing, so the fast one should be the default. This is how I would expect things to work; openssl, blst, hashtree etc. all use autodetection of this kind by default.

The simplest example is nimcrypto itself, which fails to do this when calling sha256.digest(myvalue) - it would be silly not to use the shaext version in this case since most computers have it by now.

cheatfate (Owner):

It's not impractical when you start tuning your application to use different algorithms for different purposes, for example:
You have a CPU which supports only AVX and AVX2 and does not support SHAEXT. If your application mostly hashes 64-byte blocks it is better to use the AVX implementation, but if you are going to hash big blobs of data, you should use AVX2.

arnetheduck (Contributor, Dec 19, 2024):

start tuning

I'm not talking about removing the ability to choose an algo - I'm highlighting that defaulting to "use the slowest algo" is impractical (the vast majority of users will not care about avx or avx2 - they just want a decently fast hash and certainly do not want the slowest one available)

Contributor:

If you look at my sample code from above, what I'm proposing is that you have a function pointer that you replace with a cpu-specific implementation on first call - the first time it's called, it calls cpuid and picks the best general-purpose implementation.
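A sketch of that lazy-selection pattern (hypothetical names, a simplified standalone version of the proposal): a global proc var starts out pointing at a trampoline that performs detection once, swaps itself out, and forwards the call.

```nim
type CompressFn = proc(p: pointer, size: int) {.nimcall.}

proc refCompress(p: pointer, size: int) {.nimcall.} = discard

proc pickBestImpl(): CompressFn =
  # a real version would run cpuid here; the reference impl stands in
  refCompress

var sha256Compress: CompressFn

proc firstCall(p: pointer, size: int) {.nimcall.} =
  sha256Compress = pickBestImpl()   # detection happens exactly once
  sha256Compress(p, size)           # forward the original call

sha256Compress = firstCall
sha256Compress(nil, 0)              # first call resolves, then forwards
doAssert sha256Compress == refCompress
```

Note the swap is not thread-safe as written; concurrent first calls would need a once-guard or an eager assignment at startup.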

cheatfate (Owner):

It looks like we are discussing 2 points in this comment.

  1. How to optimize compress function invocation, using a procedure variable in the context.
  2. Which implementation should be used when no arguments are passed at context initialization.

All my comments are exactly about point 2.

arnetheduck (Contributor, Dec 20, 2024):

My comments are about both:

  1. a proc var removes case and assert overhead
  2. choosing a default value for it based on cpuid on first invocation is simple, easy, safe and desirable

cheatfate (Owner):

It looks like you are trying to propose a global variable or a thread-local variable?

Contributor:

yes - for the default

arnetheduck (Contributor):

A few benchmarks for hashing the beacon state - this is certainly not an exhaustive benchmark because it only tests 64-byte values, but it's still indicative at that particular sample size:

11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz

# best of 3 runs of each

# current nimcrypto
arnetheduck@praeceps:~/status/nimbus-eth2$ ncli/ncli --print-times hashTreeRoot deneb_state state.ssz 
683ed74f8fb7f3322e2b746796d22c1a03023e0aa82299b536f66598bc928407
All time are ms
     Average,       StdDev,          Min,          Max,      Samples,         Test
    1293.098,        0.000,     1293.098,     1293.098,            1, Load file
    6333.745,        0.000,     6333.745,     6333.745,            1, Compute

# new-sha2 with reference implementation - slightly slower
     Average,       StdDev,          Min,          Max,      Samples,         Test
    1328.006,        0.000,     1328.006,     1328.006,            1, Load file
    6780.952,        0.000,     6780.952,     6780.952,            1, Compute

# new-sha2 with cpuid with `shaext` implementation, cpu detection for every new context
     Average,       StdDev,          Min,          Max,      Samples,         Test
    1156.908,        0.000,     1156.908,     1156.908,            1, Load file
    4662.638,        0.000,     4662.638,     4662.638,            1, Compute

# new-sha2 with hardcoded `shaext` implementation,
     Average,       StdDev,          Min,          Max,      Samples,         Test
     714.325,        0.000,      714.325,      714.325,            1, Load file
    1512.727,        0.000,     1512.727,     1512.727,            1, Compute

# new-sha2 with hardcoded `avx2`
     Average,       StdDev,          Min,          Max,      Samples,         Test
    1250.886,        0.000,     1250.886,     1250.886,            1, Load file
    5794.621,        0.000,     5794.621,     5794.621,            1, Compute

# new-sha2 with hardcoded `avx` - oddly, this one is a bit faster than avx2
     Average,       StdDev,          Min,          Max,      Samples,         Test
    1225.362,        0.000,     1225.362,     1225.362,            1, Load file
    5662.962,        0.000,     5662.962,     5662.962,            1, Compute


# blst
     Average,       StdDev,          Min,          Max,      Samples,         Test
     747.602,        0.000,      747.602,      747.602,            1, Load file
    1581.679,        0.000,     1581.679,     1581.679,            1, Compute

cheatfate (Owner):

AVX is faster than AVX2 because of the data size... the AVX2 implementation uses the AVX implementation for 64 bytes of data.

  ctx.module = ctx.getImplementation(Sha2Implementation.Auto, cpufeatures)
  ctx.reset()

proc clear*(ctx: var Sha2Context) {.noinit.} =
Contributor:

noinit?

template compress(ctx: var Sha2Context, data: openArray[byte], blocks: int) =
  when (ctx is sha224) or (ctx is sha256):
    mixin sha256Compress
    case ctx.module
Contributor:

a function pointer here should also result in better code

cheatfate (Owner):

?

Contributor:

ctx.module could be a proc(...) {.nimcall.} that is set on init - then you don't need a case traversal for every compress call
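A minimal sketch of that proc-field idea (hypothetical types, not the PR's actual declarations): the context stores a `{.nimcall.}` pointer chosen once at init time, so each compress is a single indirect call instead of a case dispatch.

```nim
type
  Sha256State = array[8, uint32]
  CompressProc = proc(state: var Sha256State, data: openArray[byte],
                      blocks: int) {.nimcall.}
  Ctx = object
    state: Sha256State
    compress: CompressProc   # set once in init, replaces the enum + case

proc refCompress(state: var Sha256State, data: openArray[byte],
                 blocks: int) {.nimcall.} = discard

var ctx = Ctx(compress: refCompress)   # init would pick this via cpuid
var buf: seq[byte] = @[]
(ctx.compress)(ctx.state, buf, 0)      # no case traversal per call
doAssert ctx.compress == refCompress
```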

when compiles(SHA2_AVX_sha256Compress):
  sha2_avx.sha256Compress(ctx.state, data, blocks)
else:
  raiseAssert "AVX implementation is not available for " &
Contributor:

this cannot really happen since init is supposed to pick a working version - asserts like this add a lot of bloat to the generated C code, which tends to slow things down by a tiny bit. That said, a pointer-based impl wouldn't have this case statement at all.

cheatfate (Owner):

It's possible that some compilation settings will not bundle some of the implementations into the application binary, but ctx.module is a variable which is set at runtime.

  glength: int
  bytesWrite: int

when not((M is byte) or (M is char)):
Contributor:

why not pbkdf2*[M, N: char|byte]?

cheatfate (Owner):

It looks like a workaround for some old Nim compiler version that nimcrypto supports.

Contributor:

in other places, the N: bchar style is used

when defined(arm64):
  import "."/sha2_common

  {.passC: "-march=armv8-a+crypto".}
Contributor:

is localPassC enough here? this will override -march for all C files, which might not be desired
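For reference, the suggested change would look roughly like this (a sketch, assuming the module otherwise stays the same): `localPassC` applies the flag only to the C file generated from this Nim module rather than to every C file in the project.

```nim
when defined(arm64):
  # scoped to this module's generated C file only, so generic code
  # elsewhere is not compiled with +crypto enabled
  {.localPassC: "-march=armv8-a+crypto".}
```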

W[14] = beLoad32(data, offset + 56); W[15] = beLoad32(data, offset + 60)

for i in 16 ..< 64:
  W[i] = SIG1(W[i - 2]) + W[i - 7] + SIG0(W[i - 15]) + W[i - 16]
arnetheduck (Contributor, Dec 18, 2024):

profiler says a lot of time is spent here - maybe a good target for unrolling? W should probably be aligned to 32/64 as well and maybe noinit?
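One way the unrolling could look (a standalone sketch with stand-in message words; `rotr`/`sig0`/`sig1` are written out here rather than taken from the PR): expanding four schedule words per iteration removes most of the loop overhead while keeping the dependency order valid.

```nim
template rotr(x: uint32, n: int): uint32 = (x shr n) or (x shl (32 - n))
# SHA-256 sigma functions (FIPS 180-4)
template sig0(x: uint32): uint32 = rotr(x, 7) xor rotr(x, 18) xor (x shr 3)
template sig1(x: uint32): uint32 = rotr(x, 17) xor rotr(x, 19) xor (x shr 10)

var W {.noinit.}: array[64, uint32]
for i in 0 ..< 16: W[i] = uint32(i)   # stand-in message words

var i = 16
while i < 64:                          # 4-way unrolled schedule expansion
  W[i]     = sig1(W[i - 2]) + W[i - 7] + sig0(W[i - 15]) + W[i - 16]
  W[i + 1] = sig1(W[i - 1]) + W[i - 6] + sig0(W[i - 14]) + W[i - 15]
  W[i + 2] = sig1(W[i])     + W[i - 5] + sig0(W[i - 13]) + W[i - 14]
  W[i + 3] = sig1(W[i + 1]) + W[i - 4] + sig0(W[i - 12]) + W[i - 13]
  inc i, 4
```

The unrolled body computes exactly the same values as the plain loop; whether it wins in practice depends on what the C compiler already does with the rolled version.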

when defined(amd64):
  import "."/sha2_common

  {.passC: "-msha".}
arnetheduck (Contributor, Dec 18, 2024):

same here, localPassC - passC will enable these extensions for all C files, which is a problem because gcc will then make use of these features in "general" code that is not hidden behind the cpuid-based detection, and the whole program will SIGILL on older machines.

mratsim (Contributor, Dec 19, 2024):

Note that you can also bench vs Constantine, which includes OpenSSL:

git clone https://github.com/mratsim/constantine
cd constantine
CC=clang nimble bench_sha256

arnetheduck (Contributor):

includes OpenSSL

from what I remember, openssl == blst more or less

3 participants