
Stella Nera - Halutmatmul

Hashed Lookup Table based Matrix Multiplication (halutmatmul) - Stella Nera accelerator

Algorithmic CI: PyTorch Layer Test | PyTest | Python Linting | Mypy - Typechecking

ML CI: ResNet9 - 92%+ accuracy

Hardware CI: HW Synth + PAR (OpenROAD) | RTL Linting | HW Design Verification

Paper

Abstract

The recent Maddness method approximates Matrix Multiplication (MatMul) without the need for multiplication by using a hash-based version of product quantization (PQ). The hash function is a decision tree, allowing for efficient hardware implementation, as multiply-accumulate operations are replaced by decision tree passes and LUT lookups. Stella Nera is the first Maddness accelerator achieving 15x higher area efficiency (GMAC/s/mm^2) and 25x higher energy efficiency (TMAC/s/W) than direct MatMul accelerators in the same technology. In a commercial 14 nm technology and scaled to 3 nm, we achieve an energy efficiency of 161 TOp/s/W@0.55V with a Top-1 accuracy on CIFAR-10 of over 92.5% using ResNet9.

Algorithmic - Maddness

Maddness Animation
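
The animation illustrates the two phases of Maddness: offline, prototypes are learned for each input subspace and their dot products with B are precomputed into LUTs; online, the matmul reduces to encoding each row and summing LUT entries. Below is a minimal NumPy sketch of that idea. It is illustrative only: it selects buckets by nearest prototype, whereas real Maddness encodes with learned decision trees using cheap threshold comparisons, and all names in it are made up for the example.

import numpy as np

rng = np.random.default_rng(0)
N, D, M = 1000, 512, 10            # rows of A, shared dim, columns of B
C, K = 32, 16                      # codebooks (subspaces) and buckets per codebook
A = rng.random((N, D))
B = rng.random((D, M))
subs = np.split(np.arange(D), C)   # C disjoint column subspaces

# "Offline": pick K prototypes per subspace (a stand-in for the learned
# decision trees) and precompute LUT[c, k, m] = <prototype_k, B[subspace_c, m]>
protos = [A[rng.choice(N, K, replace=False)][:, s] for s in subs]
luts = np.stack([p @ B[s] for p, s in zip(protos, subs)])  # shape (C, K, M)

# "Online": encode each row (here by nearest prototype per subspace; real
# Maddness uses threshold comparisons only), then accumulate LUT rows.
C_approx = np.zeros((N, M))
for c, s in enumerate(subs):
    dists = ((A[:, None, s] - protos[c][None]) ** 2).sum(-1)  # (N, K)
    C_approx += luts[c][dists.argmin(1)]

print(np.square(C_approx - A @ B).mean())  # approximation error vs. exact matmul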

Differentiable Maddness

ResNet-9 LUTs, Thresholds, Dims
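
Differentiable Maddness makes the learned encoding trainable end to end. One standard trick for training through a discrete bucket selection is a straight-through estimator: hard selection in the forward pass, soft gradients in the backward pass. The PyTorch sketch below is a hypothetical illustration of that general idea, not the exact formulation from the paper; lut_matmul and all its parameters are assumptions.

import torch

def lut_matmul(a_sub, prototypes, lut, temperature=1.0):
    # a_sub: (N, D_sub) rows of one input subspace
    # prototypes: (K, D_sub) bucket prototypes; lut: (K, M) precomputed dot products
    scores = -torch.cdist(a_sub, prototypes) / temperature   # (N, K) similarity
    soft = scores.softmax(dim=-1)                            # differentiable selection
    hard = torch.nn.functional.one_hot(
        soft.argmax(dim=-1), num_classes=soft.shape[-1]
    ).to(soft.dtype)
    sel = hard + soft - soft.detach()  # straight-through: hard forward, soft backward
    return sel @ lut                   # (N, M) partial result for this codebook

Summing such partial results over all C codebooks yields the approximate matmul, and gradients flow to both the LUT values and the prototypes.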

Halutmatmul example

import numpy as np
from halutmatmul.halutmatmul import HalutMatmul

# training data and a held-out test split
A = np.random.random((10000, 512))
A_train = A[:8000]
A_test = A[8000:]
B = np.random.random((512, 10))
C = np.matmul(A_test, B)  # exact result for comparison

# C codebooks (subspaces), K buckets per codebook
hm = HalutMatmul(C=32, K=16)
hm.learn_offline(A_train, B)        # learn the encoding and precompute LUTs
C_halut = hm.matmul_online(A_test)  # approximate A_test @ B

mse = np.square(C_halut - C).mean()
print(mse)
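
Assuming the same API as above, the accuracy/cost trade-off can be explored by sweeping the number of codebooks C; more codebooks typically lower the error at the price of larger LUTs and more lookups per output:

for C_codebooks in (8, 16, 32, 64):
    hm = HalutMatmul(C=C_codebooks, K=16)
    hm.learn_offline(A_train, B)
    err = np.square(hm.matmul_online(A_test) - C).mean()
    print(f"C={C_codebooks}: MSE={err}")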

Hardware - OpenROAD flow results from CI

All NanGate45 results are NOT optimized! They are provided for reference only, to show that the flow works.

Reports and run history for all NanGate45 designs are available from the CI artifacts.

Full design (halutmatmul)

Run locally with:

git submodule update --init --recursive
cd hardware
# design parameters are passed via environment variables
ACC_TYPE=INT DATA_WIDTH=8 NUM_M=8 NUM_DECODER_UNITS=4 NUM_C=16 make halut-open-synth-and-pnr-halut_matmul

Full design

halut_matmul (NanGate45)
  Area [μm^2]:   128816
  Freq [MHz]:    166.7
  GE:            161.423 kGE
  Std cells [#]: 65496
  Voltage [V]:   1.1
  Util [%]:      50.4
  TNS (total negative slack): 0
  Clock net, routing, and GDS views are available via the CI reports (GDS download).

Encoder

halut_encoder_4 (NanGate45)
  Area [μm^2]:   46782
  Freq [MHz]:    166.7
  GE:            58.624 kGE
  Std cells [#]: 23130
  Voltage [V]:   1.1
  Util [%]:      48.7
  TNS:           0
  Clock net, routing, and GDS views are available via the CI reports (GDS download).

Decoder

halut_decoder (NanGate45)
  Area [μm^2]:   24667.5
  Freq [MHz]:    166.7
  GE:            30.911 kGE
  Std cells [#]: 12256
  Voltage [V]:   1.1
  Util [%]:      52.1
  TNS:           0
  Clock net, routing, and GDS views are available via the CI reports (GDS download).

Install

# install the conda environment & activate it
# (mamba is recommended for a faster install)
conda env create -f environment_gpu.yml
conda activate halutmatmul

# IIS-prefixed environment
conda env create -f environment_gpu.yml --prefix /scratch/janniss/conda/halutmatmul_gpu

References

Hacker News mention (comments only) and discussion
