
Optimized SHA2 implementation. #78

Open · wants to merge 20 commits into master
Conversation

cheatfate (Owner):
Should address #36

m128i* {.importc: "__m128i", x86type.} = object
  data: array[2, uint64]

let
Contributor:

let introduces a global variable which, when you access it, is a side effect - this is (one of) the reasons func can't be used. It should be safe to switch this to const as long as only unaligned loads are used.

sha2 needs to be func for lots of reasons, ie one should be able to assume that it has no side effects (in fact, our code doesn't compile with this branch)

cheatfate (Owner, Dec 18, 2024):

const is not working here, because when Nim encounters <const>[x] in the code it substitutes it with the value, but for the AVX/AVX2 implementations I need the constants to be stored in memory in a specific order, one by one, so that AVX/AVX2 operations can load not a single 64-bit value but 4 or 8 values with one instruction.
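As a minimal illustration of the memory-layout point (illustrative values, not the real round constants): a global `let` materializes the table as one contiguous array in memory, so a wide vector load has an address to read several constants from, whereas a `const` indexed as `K0[i]` would be folded to a scalar value at the use site.

```nim
# Illustrative values only - not SHA-2 round constants.
let K0: array[8, uint64] = [1'u64, 2, 3, 4, 5, 6, 7, 8]

# A 256-bit load would read 4 consecutive 64-bit constants starting here;
# this only works because `let` keeps them laid out contiguously in memory.
let p = cast[ptr UncheckedArray[uint64]](K0[0].unsafeAddr)
doAssert p[3] == K0[3]  # contiguous: element 3 is reachable from element 0
```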

Contributor:

in that case, access to it needs to go through the same {.noSideEffect.} trick so that the public interface remains side-effect-free.

This is really a bug in nim though, that there is no native mechanism to make this kind of code side-effect-free - because at the end of the day, there's actually no side effect in the generated code, not even with let - it's just a semantic gap in nim.
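A minimal sketch of the trick being referred to (hypothetical names, not the PR's actual code): the global-table access is wrapped in a `cast(noSideEffect)` block so the public entry point can stay a `func`.

```nim
let K0 = [1'u64, 2, 3, 4]  # global lookup table; `let` so it stays in memory

func compressOne(x: uint64): uint64 =
  {.cast(noSideEffect).}:
    # Reading K0 is logically pure - it never changes after startup - but
    # Nim counts global access as a side effect, hence the cast block.
    result = x + K0[0]

doAssert compressOne(1) == 2
```

The cast is a promise to the compiler, not a check: it is only sound because the table is never mutated after initialization.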

cheatfate (Owner, Dec 19, 2024):

I'm still not sure about this {.noSideEffect.} thing.
init() updates the context - side effect present
update() updates the context - side effect present
finish() updates the context - side effect present
reset() updates the context - side effect present
What is the reason to avoid side effects for these specific API functions?

Contributor:

side effect in this context means that the function depends on or updates something outside of the given parameters - for example a global. The "test" to determine whether something has a side effect is to reason about whether two calls with the same inputs can yield different outputs - in this case, they can't. The reason the compiler doesn't understand this is a limitation in the nim compiler - it can't guarantee that a global variable (K0 in this case) keeps the same value between one call to update and the next.

Hash functions by definition are side-effect-free, or they wouldn't be hash functions - ie the output should depend only on the inputs and not anything else - "side-effect-free" for example allows the compiler to remove repeated computation of the same thing - for i in 0..9: echo digest("hello") can be rewritten by the compiler as let tmp = digest("hello"); for i in 0..9: echo tmp if it can prove (or is told) that digest is side-effect-free.

In libp2p for example, there's a type DigestFcn = proc(...): ... {.noSideEffect.} - it requires that all digest functions are side-effect-free so that when someone uses a DigestFcn, the compiler can optimise it like above.
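A toy illustration of the optimisation described above (a hypothetical `digest`, not the real API): because equal inputs give equal outputs, repeated calls may be computed once and reused.

```nim
# A side-effect-free digest: output depends only on the input.
func digest(s: string): int =
  for c in s:
    result = result * 31 + ord(c)

# The compiler is allowed to hoist the repeated call out of the loop,
# exactly the `let tmp = digest("hello")` rewrite described above.
let tmp = digest("hello")
for i in 0 .. 9:
  doAssert digest("hello") == tmp
```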

Contributor:

Nim added (partial) support for addressable constants, IIRC.

I made the request here nim-lang/RFCs#257 though I'm not sure what part is implemented and what not. cc @Araq

Contributor:

I can confirm that we can take the address of a const since at least 1.6, see example with ssse3 impl of SHA256:

https://github.com/mratsim/constantine/blob/7cffd2f/constantine/hashes/sha256/sha256_generic.nim#L42-L51

const K256* = [
  0x428a2f98'u32, 0x71374491'u32, 0xb5c0fbcf'u32, 0xe9b5dba5'u32, 0x3956c25b'u32, 0x59f111f1'u32, 0x923f82a4'u32, 0xab1c5ed5'u32,
  0xd807aa98'u32, 0x12835b01'u32, 0x243185be'u32, 0x550c7dc3'u32, 0x72be5d74'u32, 0x80deb1fe'u32, 0x9bdc06a7'u32, 0xc19bf174'u32,
  0xe49b69c1'u32, 0xefbe4786'u32, 0x0fc19dc6'u32, 0x240ca1cc'u32, 0x2de92c6f'u32, 0x4a7484aa'u32, 0x5cb0a9dc'u32, 0x76f988da'u32,
  0x983e5152'u32, 0xa831c66d'u32, 0xb00327c8'u32, 0xbf597fc7'u32, 0xc6e00bf3'u32, 0xd5a79147'u32, 0x06ca6351'u32, 0x14292967'u32,
  0x27b70a85'u32, 0x2e1b2138'u32, 0x4d2c6dfc'u32, 0x53380d13'u32, 0x650a7354'u32, 0x766a0abb'u32, 0x81c2c92e'u32, 0x92722c85'u32,
  0xa2bfe8a1'u32, 0xa81a664b'u32, 0xc24b8b70'u32, 0xc76c51a3'u32, 0xd192e819'u32, 0xd6990624'u32, 0xf40e3585'u32, 0x106aa070'u32,
  0x19a4c116'u32, 0x1e376c08'u32, 0x2748774c'u32, 0x34b0bcb5'u32, 0x391c0cb3'u32, 0x4ed8aa4a'u32, 0x5b9cca4f'u32, 0x682e6ff3'u32,
  0x748f82ee'u32, 0x78a5636f'u32, 0x84c87814'u32, 0x8cc70208'u32, 0x90befffa'u32, 0xa4506ceb'u32, 0xbef9a3f7'u32, 0xc67178f2'u32
]

https://github.com/mratsim/constantine/blob/7cffd2f/constantine/hashes/sha256/sha256_x86_ssse3.nim#L52-L66

const VecWords = 16 div sizeof(Word) # sizeof(m128i) / sizeof(Word)

func initMessageSchedule(
       msnext: var array[VecNum, m128i],
       ms: var Sha256_MessageSchedule,
       message: ptr UncheckedArray[byte]) {.inline.} =
  ## Initial state, from data
  ## - Precompute steps for the future message schedule `msnext`
  ## - compute the current message schedule `ms`

  let mask = setr_u32x4(0x00010203, 0x04050607, 0x08090a0b, 0x0c0d0e0f)
  let pK256 = K256.unsafeAddr()

  staticFor i, 0, VecNum:
    msnext[i] = loadu_u128(message[i * sizeof(m128i)].addr)
    msnext[i] = shuf_u8x16(msnext[i], mask)
    storea_u128(ms.w[VecWords*i].addr, add_u32x4(msnext[i], loadu_u128(pK256[VecWords*i].addr)))

Contributor:

I think it still doesn't support alignment, though


proc init*(ctx: var Sha2Context,
           implementation = Sha2Implementation.Auto,
           cpufeatures: set[CpuFeature] = {}) {.noinit.} =
Contributor:

with empty features by default, it becomes impractical to use the cpuid-based selection - users will end up with the slow ref implementation in most cases, including when using the convenience helpers found elsewhere in the library.

Also, this is not really a setting that is useful to have at the context level: we create a new context every time we hash something - doing the feature selection here is wasteful (hashing a beacon state creates millions of contexts).

The classic way to do this is something like:

let sha256Update: proc(p: pointer, size: int) {.noSideEffect.} = selectBestImpl()

# alternatively, lazy selection on first call:
var sha256Update = proc(p: pointer, size: int) =
  sha256Update = selectBestImplementation()
  sha256Update(p, size)

template callUpdate(p, s) =
  # There's no side effect here logically, the compiler just doesn't understand it
  {.noSideEffect.}:
    sha256Update(p, s)

there are fancier options but they tend to be overkill.

cheatfate (Owner):

It's supposed that you will obtain the CPU features list when the application starts, and use it everywhere you instantiate contexts.

Contributor:

use it everywhere when you instantiate contexts.

This is very impractical - generally, there is no reason to want the slower implementation outside of benchmarking and testing, so the fast one should be the default. This is how I would expect things to work; openssl, blst, hashtree etc. all use autodetection of this kind by default.

The simplest example is nimcrypto itself, which fails to do this when calling sha256.digest(myvalue) - it would be silly not to use the shaext version in this case since most computers have it by now.

cheatfate (Owner):

It's not impractical when you start tuning your application to use different algorithms for different purposes, for example:
You have a CPU which supports only AVX and AVX2 and does not support SHAEXT. If your application mostly hashes 64-byte blocks it is better to use the AVX implementation, but if you are going to hash big blobs of data, you should use AVX2.

arnetheduck (Contributor, Dec 19, 2024):

start tuning

I'm not talking about removing the ability to choose an algo - I'm highlighting that defaulting to "use the slowest algo" is impractical (the vast majority of users will not care about avx or avx2 - they just want a decently fast hash and certainly do not want the slowest one available)

Contributor:

If you look at my sample code from above, what I'm proposing is that you have a function pointer that you replace with a cpu-specific implementation on first call - the first time it's called, it calls cpuid and picks the best general-purpose implementation.
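A sketch of that lazy-selection pattern (hypothetical names, a simplified standalone version of the proposal): a global proc var starts out pointing at a trampoline that performs detection once, swaps itself out, and forwards the call.

```nim
type CompressFn = proc(p: pointer, size: int) {.nimcall.}

proc refCompress(p: pointer, size: int) {.nimcall.} = discard

proc pickBestImpl(): CompressFn =
  # a real version would run cpuid here; the reference impl stands in
  refCompress

var sha256Compress: CompressFn

proc firstCall(p: pointer, size: int) {.nimcall.} =
  sha256Compress = pickBestImpl()   # detection happens exactly once
  sha256Compress(p, size)           # forward the original call

sha256Compress = firstCall
sha256Compress(nil, 0)              # first call resolves, then forwards
doAssert sha256Compress == refCompress
```

Note the swap is not thread-safe as written; concurrent first calls would need a once-guard or an eager assignment at startup.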

cheatfate (Owner):

It looks like we are discussing 2 points in this comment.

  1. How to optimize compress function invocation, using a procedure variable in the context.
  2. Which implementation should be used when no arguments are passed at context initialization.

All my comments are exactly about point 2.

arnetheduck (Contributor, Dec 20, 2024):

My comments are about both:

  1. a proc var removes case and assert overhead
  2. choosing a default value for it based on cpuid on first invocation is simple, easy, safe and desirable

cheatfate (Owner):

It looks like you are trying to propose a global variable or a thread-local variable?

Contributor:

yes - for the default

arnetheduck (Contributor):

A few benchmarks for hashing the beacon state - this is certainly not an exhaustive benchmark because it only tests 64-byte values, but it's still indicative at that particular sample size:

11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz

# best of 3 runs of each

# current nimcrypto
arnetheduck@praeceps:~/status/nimbus-eth2$ ncli/ncli --print-times hashTreeRoot deneb_state state.ssz 
683ed74f8fb7f3322e2b746796d22c1a03023e0aa82299b536f66598bc928407
All time are ms
     Average,       StdDev,          Min,          Max,      Samples,         Test
    1293.098,        0.000,     1293.098,     1293.098,            1, Load file
    6333.745,        0.000,     6333.745,     6333.745,            1, Compute

# new-sha2 with reference implementation - slightly slower
     Average,       StdDev,          Min,          Max,      Samples,         Test
    1328.006,        0.000,     1328.006,     1328.006,            1, Load file
    6780.952,        0.000,     6780.952,     6780.952,            1, Compute

# new-sha2 with cpuid with `shaext` implementation, cpu detection for every new context
     Average,       StdDev,          Min,          Max,      Samples,         Test
    1156.908,        0.000,     1156.908,     1156.908,            1, Load file
    4662.638,        0.000,     4662.638,     4662.638,            1, Compute

# new-sha2 with hardcoded `shaext` implementation,
     Average,       StdDev,          Min,          Max,      Samples,         Test
     714.325,        0.000,      714.325,      714.325,            1, Load file
    1512.727,        0.000,     1512.727,     1512.727,            1, Compute

# new-sha2 with hardcoded `avx2`
     Average,       StdDev,          Min,          Max,      Samples,         Test
    1250.886,        0.000,     1250.886,     1250.886,            1, Load file
    5794.621,        0.000,     5794.621,     5794.621,            1, Compute

# new-sha2 with hardcoded `avx` - oddly, this one is a bit faster than avx2
     Average,       StdDev,          Min,          Max,      Samples,         Test
    1225.362,        0.000,     1225.362,     1225.362,            1, Load file
    5662.962,        0.000,     5662.962,     5662.962,            1, Compute


# blst
     Average,       StdDev,          Min,          Max,      Samples,         Test
     747.602,        0.000,      747.602,      747.602,            1, Load file
    1581.679,        0.000,     1581.679,     1581.679,            1, Compute

cheatfate (Owner):

AVX is faster than AVX2 because of the data size... the AVX2 implementation uses the AVX implementation for 64 bytes of data.

  ctx.module = ctx.getImplementation(Sha2Implementation.Auto, cpufeatures)
  ctx.reset()

proc clear*(ctx: var Sha2Context) {.noinit.} =
Contributor:

noinit?

template compress(ctx: var Sha2Context, data: openArray[byte], blocks: int) =
  when (ctx is sha224) or (ctx is sha256):
    mixin sha256Compress
    case ctx.module
Contributor:

a function pointer here should also result in better code

cheatfate (Owner):

?

Contributor:

ctx.module could be a proc(...) {.nimcall.} that is set on init - then you don't need a case traversal for every compress call
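A minimal sketch of that proc-field idea (hypothetical types, not the PR's actual declarations): the context stores a `{.nimcall.}` pointer chosen once at init time, so each compress is a single indirect call instead of a case dispatch.

```nim
type
  Sha256State = array[8, uint32]
  CompressProc = proc(state: var Sha256State, data: openArray[byte],
                      blocks: int) {.nimcall.}
  Ctx = object
    state: Sha256State
    compress: CompressProc   # set once in init, replaces the enum + case

proc refCompress(state: var Sha256State, data: openArray[byte],
                 blocks: int) {.nimcall.} = discard

var ctx = Ctx(compress: refCompress)   # init would pick this via cpuid
var buf: seq[byte] = @[]
(ctx.compress)(ctx.state, buf, 0)      # no case traversal per call
doAssert ctx.compress == refCompress
```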

when compiles(SHA2_AVX_sha256Compress):
  sha2_avx.sha256Compress(ctx.state, data, blocks)
else:
  raiseAssert "AVX implementation is not available for " &
Contributor:

this cannot really happen since init is supposed to pick a working version - asserts like this add a lot of bloat to the generated C code, which tends to slow things down by a tiny bit. That said, a pointer-based impl wouldn't have this case statement at all.

cheatfate (Owner):

It's possible that some compilation settings will not bundle some of the implementations into the application binary, but ctx.module is a variable which is set at runtime.

  glength: int
  bytesWrite: int

when not((M is byte) or (M is char)):
Contributor:

why not pbkdf2*[M, N: char|byte]?

cheatfate (Owner):

It looks like a workaround for some old Nim compiler version that nimcrypto supports.

Contributor:

in other places, the N: bchar style is used

when defined(arm64):
  import "."/sha2_common

  {.passC: "-march=armv8-a+crypto".}
Contributor:

is localPassC enough here? this will override -march for all C files, which might not be desired
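For reference, the suggested change would look roughly like this (a sketch, assuming the module otherwise stays the same): `localPassC` applies the flag only to the C file generated from this Nim module rather than to every C file in the project.

```nim
when defined(arm64):
  # scoped to this module's generated C file only, so generic code
  # elsewhere is not compiled with +crypto enabled
  {.localPassC: "-march=armv8-a+crypto".}
```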

W[14] = beLoad32(data, offset + 56); W[15] = beLoad32(data, offset + 60)

for i in 16 ..< 64:
  W[i] = SIG1(W[i - 2]) + W[i - 7] + SIG0(W[i - 15]) + W[i - 16]
arnetheduck (Contributor, Dec 18, 2024):

profiler says a lot of time is spent here - maybe a good target for unrolling? W should probably be aligned to 32/64 as well and maybe noinit?
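One way the unrolling could look (a standalone sketch with stand-in message words; `rotr`/`sig0`/`sig1` are written out here rather than taken from the PR): expanding four schedule words per iteration removes most of the loop overhead while keeping the dependency order valid.

```nim
template rotr(x: uint32, n: int): uint32 = (x shr n) or (x shl (32 - n))
# SHA-256 sigma functions (FIPS 180-4)
template sig0(x: uint32): uint32 = rotr(x, 7) xor rotr(x, 18) xor (x shr 3)
template sig1(x: uint32): uint32 = rotr(x, 17) xor rotr(x, 19) xor (x shr 10)

var W {.noinit.}: array[64, uint32]
for i in 0 ..< 16: W[i] = uint32(i)   # stand-in message words

var i = 16
while i < 64:                          # 4-way unrolled schedule expansion
  W[i]     = sig1(W[i - 2]) + W[i - 7] + sig0(W[i - 15]) + W[i - 16]
  W[i + 1] = sig1(W[i - 1]) + W[i - 6] + sig0(W[i - 14]) + W[i - 15]
  W[i + 2] = sig1(W[i])     + W[i - 5] + sig0(W[i - 13]) + W[i - 14]
  W[i + 3] = sig1(W[i + 1]) + W[i - 4] + sig0(W[i - 12]) + W[i - 13]
  inc i, 4
```

The unrolled body computes exactly the same values as the plain loop; whether it wins in practice depends on what the C compiler already does with the rolled version.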

when defined(amd64):
  import "."/sha2_common

  {.passC: "-msha".}
arnetheduck (Contributor, Dec 18, 2024):

same here, localPassC - passC will enable these extensions for all C files, which is a problem because gcc will then make use of these features in "general" code that is not hidden behind the cpuid-based detection, and the whole program will SIGILL on older machines.

mratsim (Contributor, Dec 19, 2024):

Note that you can also bench vs Constantine, which includes OpenSSL:

git clone https://github.com/mratsim/constantine
cd constantine
CC=clang nimble bench_sha256

arnetheduck (Contributor):

includes OpenSSL

from what I remember, openssl == blst more or less

3 participants