Description
Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
https://arxiv.org/pdf/2407.00088
Answer
T-MAC (Table-based Matrix-Activation Computation) is a method for efficiently deploying low-bit Large Language Models (LLMs) on edge devices using only CPUs. Here are the key aspects of T-MAC:
Purpose: T-MAC addresses the challenge of deploying weight-quantized LLMs on edge devices with limited resources, focusing on efficient mixed-precision matrix multiplication (mpGEMM) without relying on GPUs.
Core Technique: It uses a lookup table (LUT)-based approach to directly support mpGEMM without the need for weight dequantization. This method transforms traditional data-type-centric multiplication into bit-wise table lookup operations.
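For intuition, here is a minimal scalar sketch of the lookup-table idea in C++ (my own illustration under simplifying assumptions, not T-MAC's actual kernel; `lut_gemv` and its packing layout are hypothetical). Assuming 1-bit weights in {-1, +1} and groups of G = 4 activations, all 2^G partial sums per group are precomputed once and then reused by every output row, so each weight group costs one table lookup and one add instead of G multiply-adds:

```cpp
#include <array>
#include <cstdint>
#include <vector>

constexpr int G = 4;  // activations per lookup group

// y = W * a for 1-bit weights w in {-1, +1}.
// w_groups[row][g] packs the G weight bits of group g (bit i = sign of w_i).
void lut_gemv(const std::vector<std::vector<uint8_t>>& w_groups,
              const std::vector<float>& a,
              std::vector<float>& y) {
    const int n_groups = static_cast<int>(a.size()) / G;

    // Step 1: build one table per activation group, shared by every row.
    // tables[g][m] = sum_i (bit i of m set ? +a[G*g+i] : -a[G*g+i])
    std::vector<std::array<float, 1 << G>> tables(n_groups);
    for (int g = 0; g < n_groups; ++g)
        for (int m = 0; m < (1 << G); ++m) {
            float s = 0.0f;
            for (int i = 0; i < G; ++i)
                s += ((m >> i) & 1) ? a[G * g + i] : -a[G * g + i];
            tables[g][m] = s;
        }

    // Step 2: each output row now costs one lookup + one add per group,
    // with no multiplications at all.
    for (size_t r = 0; r < w_groups.size(); ++r) {
        float acc = 0.0f;
        for (int g = 0; g < n_groups; ++g)
            acc += tables[g][w_groups[r][g]];
        y[r] = acc;
    }
}
```

T-MAC additionally quantizes these tables and keeps them in registers (see Implementation Techniques below); this scalar version only shows the dataflow.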
Performance Improvements:
- Up to 4x increase in throughput compared to llama.cpp
- 70% reduction in energy consumption
- For the BitNet-b1.58-3B model:
  - 30 tokens/s with a single core on M2 Ultra
  - 71 tokens/s with eight cores on M2 Ultra
  - 11 tokens/s on Raspberry Pi 5
Key Features:
- Scales linearly with weight bit-width: an n-bit weight matrix decomposes into n one-bit bit-planes, each served by the same table lookups (see the sketch after this list)
- Eliminates multiplications and reduces additions
- Supports various activation types (fp8, fp16, int8) using fast table-lookup and add instructions
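A small sketch of why the cost scales linearly with bit-width, with a hypothetical `one_bit_dot` standing in for the LUT kernel above: a k-bit unsigned weight decomposes into k one-bit planes, w[i] = sum_b planes[b][i] * 2^b, so a k-bit dot product is just k one-bit dot products combined with power-of-two scales.

```cpp
#include <cstdint>
#include <vector>

// Stand-in for the LUT-based 1-bit kernel sketched earlier: dot product
// of a single bit-plane (w in {0, 1}) with the activations.
float one_bit_dot(const std::vector<uint8_t>& plane,
                  const std::vector<float>& a) {
    float s = 0.0f;
    for (size_t i = 0; i < plane.size(); ++i)
        if (plane[i]) s += a[i];
    return s;
}

// k-bit unsigned weights: w[i] = sum_b planes[b][i] << b. The k-bit dot
// product is k one-bit dot products combined with power-of-two scales,
// so the work grows linearly in k and needs no per-element multiplies.
float k_bit_dot(const std::vector<std::vector<uint8_t>>& planes,
                const std::vector<float>& a) {
    float acc = 0.0f;
    for (size_t b = 0; b < planes.size(); ++b)
        acc += static_cast<float>(1u << b) * one_bit_dot(planes[b], a);
    return acc;
}
```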
Implementation Techniques:
- LUT-centric data layout for efficient on-chip memory usage
- Table quantization and mirror consolidation to reduce table size (exploiting the symmetry that negating all weight bits negates the partial sum, so only half the table needs to be stored)
- Use of the tbl (ARM NEON) and pshufb (x86 SSSE3) instructions for fast in-register table lookup on CPUs (see the example below)
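As a concrete example of the in-register lookup, here is a minimal, self-contained use of the SSSE3 intrinsic `_mm_shuffle_epi8` (pshufb), which resolves sixteen 4-bit indices against a 16-entry byte table in a single instruction; ARM NEON's `tbl` (`vqtbl1q_u8`) is the analogous operation. The table contents here are placeholders, not T-MAC's actual quantized partial sums:

```cpp
#include <tmmintrin.h>  // SSSE3; compile with -mssse3
#include <cstdint>
#include <cstdio>

int main() {
    // 16-entry table of int8 values (in T-MAC these would be quantized
    // partial sums; here they are placeholder values).
    alignas(16) int8_t table_bytes[16] = {0, 1, 2, 3, 4, 5, 6, 7,
                                          8, 9, 10, 11, 12, 13, 14, 15};
    // Sixteen 4-bit indices, one per byte (high nibble must be zero).
    alignas(16) uint8_t idx_bytes[16] = {3, 0, 15, 7, 1, 1, 2, 8,
                                         4, 9, 11, 5, 6, 10, 12, 13};

    __m128i table = _mm_load_si128(reinterpret_cast<const __m128i*>(table_bytes));
    __m128i idx   = _mm_load_si128(reinterpret_cast<const __m128i*>(idx_bytes));

    // Sixteen parallel table lookups in one instruction.
    __m128i out = _mm_shuffle_epi8(table, idx);

    alignas(16) int8_t res[16];
    _mm_store_si128(reinterpret_cast<__m128i*>(res), out);
    for (int i = 0; i < 16; ++i) printf("%d ", res[i]);
    printf("\n");
}
```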
Evaluation:
- Tested on various edge devices including Apple M2 Ultra, Jetson AGX Orin, Surface Book 3, and Raspberry Pi 5
- Achieved up to 6.6x kernel speedup (average 3.6x) compared to llama.cpp
- End-to-end inference speedup of 2.8x for the Llama-2-7B-2bit model
Significance: T-MAC offers a practical path to deploying LLMs on edge devices using widely available CPUs, making CPU inference speed comparable to, and in some cases better than, GPU inference on the same devices.
Availability: T-MAC is open source and available on GitHub.
Motivation
This looks like a good addition to the existing BitNet b1.58 (1.58-bit) support, to speed it up even further.