
vectorized activation vm #1008

Draft: wants to merge 25 commits into main

Conversation


@kali commented Mar 21, 2023

This is a side-project I'm working on during live-coding sessions on Twitch. The primary time slot for the streams is Monday evenings (around 8:30 CET) on https://www.twitch.tv/kalicoding . The unedited videos are also available at https://www.youtube.com/playlist?list=PL2L_7bZXXxQuNpcxjg8gipegP32pmwVZF .

The landing page for the Twitch streams, with some explanations, is here: https://github.com/kali/coding .

  • Activation function inventory and Rust POC (parts 1 and 2)
  • Early benchmarks (part 3)
  • Bringing up assembly integration in tract (parts 4, 5, 6)
  • Rework the protocol so constants are in the flow of instructions (Rust, assembly) (part 7)
  • Jump table generation (part 8)
  • Support for "bloc-1" activations (ReLU to HardSwish) (part 8, part 9)
  • Tests should be one-liners (Rust) (part 8)
  • bench (part 10)
  • integrate in tract-core
  • ARMv8.2 f16 support
  • other activations: Sigmoid (part 10), Tanh, Exp2f
  • implement on ARMv7
  • implement on intel (primary target AVX2+FMA)
  • consider compiling instead of interpreting

@kali marked this pull request as draft March 22, 2023 06:49

@kali commented Mar 22, 2023

  • We are interested in element-wise operations:

    • the result is independent of tensor geometry (scalar operands only, no broadcasting of tensors)
    • output = f(x) for x in input
  • "simple ops + if/then/else" (scalar reference sketch after this list)

    • Relu(x) = max(0, x)
    • Affine = alpha * x + beta
    • LeakyRelu = if x >= 0 { x } else { alpha * x }
    • ThresholdRelu = if x >= alpha { x } else { 0 }
    • HardSigmoid = max(0, min(1, alpha * x + beta))
    • Softsign = x / (1 + abs(x))
    • HardSwish = x * max(0, min(1, alpha * x + beta)) (alpha = 1/6 and beta = 1/2)
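
For reference, here is a scalar transcription of these definitions in Rust. This is just the math spelled out per element, not tract code; `alpha`/`beta` stand for the per-operator parameters.

```rust
// Scalar reference semantics for the "simple" activations (f32).
// Not tract code: just the definitions above, spelled out per element.
fn relu(x: f32) -> f32 {
    x.max(0.0)
}

fn affine(x: f32, alpha: f32, beta: f32) -> f32 {
    alpha * x + beta
}

fn leaky_relu(x: f32, alpha: f32) -> f32 {
    if x >= 0.0 { x } else { alpha * x }
}

fn threshold_relu(x: f32, alpha: f32) -> f32 {
    if x >= alpha { x } else { 0.0 }
}

fn hard_sigmoid(x: f32, alpha: f32, beta: f32) -> f32 {
    (alpha * x + beta).min(1.0).max(0.0)
}

fn softsign(x: f32) -> f32 {
    x / (1.0 + x.abs())
}

fn hard_swish(x: f32, alpha: f32, beta: f32) -> f32 {
    // alpha = 1/6 and beta = 1/2 in practice
    x * (alpha * x + beta).min(1.0).max(0.0)
}
```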

Proposed virtual machine def

  • 4 (?) big registers (mapped to one or several hardware vector registers)
    • on armv8, one big register could be 4 (or 5?) NEON registers (4×4 f32 values or 4×8 f16 values)
    • if 16 NEON registers are used, it leaves 16 for housekeeping (pre-fetch caching + operators)
  • focusing on the "simple" activation functions
    • BigRegs are a,b,c,d
    • min, max, abs, +, -, *, /, ite, ifpos
    • constants
  • calling convention: the framework preloads x into a and expects y back in a (see the sketch after this list)
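
As a rough illustration of that layout (names and sizing are placeholders, not actual tract code), the interpreter state could be modeled like this, assuming one big register = 4 NEON registers:

```rust
// Illustrative sizing, assuming one big register = 4 NEON registers.
const NEON_REGS_PER_BIG_REG: usize = 4;
const F32_LANES: usize = NEON_REGS_PER_BIG_REG * 4; // 4×4 = 16 f32 values
const F16_LANES: usize = NEON_REGS_PER_BIG_REG * 8; // 4×8 = 32 f16 values

/// One big register, modeled here as a plain array of f32 lanes.
type BigReg = [f32; F32_LANES];

/// The four big registers a, b, c, d. With 4 big registers × 4 NEON registers,
/// 16 of the 32 ARMv8 NEON registers are used, leaving 16 for housekeeping.
#[derive(Default, Clone, Copy)]
struct Regs {
    a: BigReg,
    b: BigReg,
    c: BigReg,
    d: BigReg,
}

// Calling convention: the framework preloads x into `a` and reads y back from `a`.
```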

Moves and const loads to any reg; unary, binary, ternary ops on a, b, c only

  • we want to limit the combinations of op × operands (an interpreter sketch follows this list)

  • moves: 12 move ops (4 possible source registers × 3 destinations)

  • all other ops have fixed registers: unary ops a <- a, binary a <- a#b, ternary a <- a#b#c

  • Relu: load(b, 0) | max

  • Affine: load(b, alpha) | mul | load(b, beta) | add

  • LeakyRelu: b <- a | load(a, alpha) | mul | c <- a | ifpos

  • ThresholdRelu: c <- a | load(b, alpha) | sub | b <- c | load(c, 0) | ifpos

  • Softsign: c <- a | abs | load(b, 1) | add | recip | b <- c | mul

  • HardSwish: c <- a | load(b, alpha) | mul | load(b, beta) | add | load(b, 1) | min | load(b, 0) | max | b <- c | mul
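
A sketch of what this fixed-register instruction set and its interpreter could look like in plain Rust (illustrative only; the real target is hand-written assembly dispatched through a jump table). `ifpos` is read here as `a <- if a >= 0 { b } else { c }`, an assumption consistent with the ThresholdRelu sequence above; the register-file types repeat the previous sketch so the block stands alone.

```rust
const F32_LANES: usize = 16; // one big register = 4 NEON registers × 4 f32 lanes
type BigReg = [f32; F32_LANES];

#[derive(Default, Clone, Copy)]
struct Regs { a: BigReg, b: BigReg, c: BigReg, d: BigReg }

#[derive(Clone, Copy)]
enum Reg { A, B, C, D }

/// Ops with fixed operands: unary a <- f(a), binary a <- a # b, ternary a <- a # b # c.
#[derive(Clone, Copy)]
enum Op {
    Move(Reg, Reg), // dst <- src (the real encoding allows 4 sources × 3 destinations)
    Load(Reg, f32), // dst <- broadcast constant
    Abs, Recip,     // unary
    Min, Max, Add, Sub, Mul, Div, // binary
    IfPos,          // ternary: a <- if a >= 0 { b } else { c } (assumed semantics)
}

impl Regs {
    fn get(&self, r: Reg) -> BigReg {
        match r { Reg::A => self.a, Reg::B => self.b, Reg::C => self.c, Reg::D => self.d }
    }
    fn set(&mut self, r: Reg, v: BigReg) {
        match r { Reg::A => self.a = v, Reg::B => self.b = v, Reg::C => self.c = v, Reg::D => self.d = v }
    }
}

/// Calling convention: x is preloaded into `a`, the result is read back from `a`.
fn run(program: &[Op], x: BigReg) -> BigReg {
    let mut r = Regs { a: x, ..Regs::default() };
    for &op in program {
        match op {
            Op::Move(dst, src) => { let v = r.get(src); r.set(dst, v) }
            Op::Load(dst, k) => r.set(dst, [k; F32_LANES]),
            _ => {
                for i in 0..F32_LANES {
                    let (a, b, c) = (r.a[i], r.b[i], r.c[i]);
                    r.a[i] = match op {
                        Op::Abs => a.abs(),
                        Op::Recip => 1.0 / a,
                        Op::Min => a.min(b),
                        Op::Max => a.max(b),
                        Op::Add => a + b,
                        Op::Sub => a - b,
                        Op::Mul => a * b,
                        Op::Div => a / b,
                        Op::IfPos => if a >= 0.0 { b } else { c },
                        Op::Move(..) | Op::Load(..) => unreachable!(),
                    };
                }
            }
        }
    }
    r.a
}

fn main() {
    // Relu: load(b, 0) | max
    let relu = [Op::Load(Reg::B, 0.0), Op::Max];
    assert_eq!(run(&relu, [-1.0; F32_LANES]), [0.0; F32_LANES]);
}
```

With a fixed destination and fixed operands, each op carries at most one register index or one constant, which keeps the op × operand combinations small and makes a jump-table dispatch in assembly straightforward.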

Better spec:

  • what if we add: a <- a#K (K constant)? (encoding sketch below)

  • Relu: max(0)

  • Affine: mul(alpha) | add(beta)

  • LeakyRelu: b <- a | mul(alpha) | c <- a | ifpos

  • ThresholdRelu: b <- a | sub(alpha) | load(c, 0) | ifpos

  • Softsign: b <- a | abs | add(1) | recip | mul

  • HardSwish: b <- a | mul(alpha) | add(beta) | min(1) | max(0) | mul

Much better :)
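
Written down with the a <- a # K immediates, the same programs could be encoded roughly like this (op names, the ALPHA/BETA values and the encoding itself are placeholders for illustration):

```rust
/// Ops extended with an immediate constant operand (a <- a # K).
#[derive(Clone, Copy)]
enum Op {
    MoveBA, MoveCA,            // b <- a, c <- a (only the moves used below)
    LoadC(f32),                // c <- K
    AddK(f32), SubK(f32), MulK(f32), MinK(f32), MaxK(f32), // a <- a # K
    Abs, Recip,                // a <- f(a)
    Mul,                       // a <- a * b
    IfPos,                     // a <- if a >= 0 { b } else { c }
}
use self::Op::*;

// Placeholder parameters; in practice alpha/beta come from the operator.
const ALPHA: f32 = 0.1;
const BETA: f32 = 0.5;

const RELU: &[Op] = &[MaxK(0.0)];
const AFFINE: &[Op] = &[MulK(ALPHA), AddK(BETA)];
const LEAKY_RELU: &[Op] = &[MoveBA, MulK(ALPHA), MoveCA, IfPos];
const THRESHOLD_RELU: &[Op] = &[MoveBA, SubK(ALPHA), LoadC(0.0), IfPos];
const SOFTSIGN: &[Op] = &[MoveBA, Abs, AddK(1.0), Recip, Mul];
const HARD_SWISH: &[Op] = &[MoveBA, MulK(1.0 / 6.0), AddK(0.5), MinK(1.0), MaxK(0.0), Mul];

fn main() {
    // The longest program (HardSwish) is only 6 ops with immediate constants.
    println!("HardSwish program length: {} ops", HARD_SWISH.len());
}
```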

What about other activation functions?

  • Some are implemented in tract with rational approximations

    • P(x)/Q(x) where P and Q are polys of x
    • Some can be expressed from e^x:
      • Tanh(x) = (1 - e^(-2x)) / (1 + e^(-2x))
      • Sigmoid(x) = 1 / (1 + e^(-x))
    • But it is faster to use rational approx. Tanh and Sigmoid have handcrafted vectorized assembly impls.
    • Q: Can we extend our VM to support rational approximations? (See the sketch after this list.)
    • Erf can only be computed from a rational approximation (as it is less used than tanh and sigmoid, it just has a Rust implementation).
  • More activation functions computed from exp (and also log, tanh)

    • currently implemented as separate expansions into several tract operators
    • should we look for an approximation? rational or polynomial?
    • can we / should we add e^x as a primitive?
      • can we do as well as / better than the default impl?
    • ScaledTanh = alpha * tanh(beta * x) (should be easily derived from tanh)
    • Softplus = log(1 + e^x)
    • Celu = max(0, x) + min(0, alpha * (exp(x / alpha) - 1))
    • Elu = if x < 0 { alpha * (exp(x) - 1) } else { x }
    • Selu = if x < 0 { gamma * (alpha * e^x - alpha) } else { gamma * x }
    • Mish(x) = x * tanh(softplus(x)) = x * tanh(ln(1 + e^x))
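
On the rational-approximation question: since the VM already has mul, add-with-constant and div, P(x)/Q(x) boils down to two Horner evaluations plus one divide. A scalar sketch of that shape (the coefficients below are placeholders, not tract's actual tanh/sigmoid tables):

```rust
/// Horner evaluation of c[0] + c[1]*x + c[2]*x^2 + ...
fn horner(coeffs: &[f32], x: f32) -> f32 {
    coeffs.iter().rev().fold(0.0, |acc, &c| acc * x + c)
}

/// Rational approximation f(x) ≈ P(x) / Q(x), applied element-wise.
fn rational_approx(x: f32, p: &[f32], q: &[f32]) -> f32 {
    horner(p, x) / horner(q, x)
}

fn main() {
    // Placeholder coefficients, for illustration only.
    let p = [0.0, 1.0, 0.0, -0.3];
    let q = [1.0, 0.0, 0.1];
    for i in -3..=3 {
        let x = i as f32;
        println!("f({x}) ≈ {}", rational_approx(x, &p, &q));
    }
}
```

Mapping this onto the VM would mostly be a register-pressure question: x has to be preserved while the two polynomial accumulators are built up, which still fits within a/b/c/d.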

@kali changed the title from "beginning activation vm" to "vectorized activation vm" on Apr 17, 2023