bitsandbytes is a lightweight wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and quantization functions. This fork is the ROCm adaptation of bitsandbytes 0.39.1. The repo is inspired by agrocylo/bitsandbytes-rocm, which is a ROCm version of bitsandbytes 0.37. While this fork incorporates the majority of features from bitsandbytes 0.39.1, including the crucial 4-bit quantization feature, certain features such as hipblaslt and hip_bfloat16 have been disabled. Enabling these features is planned as future work.
Resources:
- 8-bit Optimizer Paper -- Video -- Docs
- LLM.int8() Paper -- LLM.int8() Software Blog Post -- LLM.int8() Emergent Features Blog Post
Requirements: Python >=3.8. Linux distribution (Ubuntu, MacOS, etc.) + ROCm >= 5.4.2 or CUDA > 10.0
Installation:
You need to compile from source.
Compilation quickstart:
git clone https://github.com/Lzy17/bitsandbytes-rocm
cd bitsandbytes-rocm
make hip
python setup.py install
# To test whether the installation succeeded
python -m bitsandbytes
# To run the accuracy benchmark from https://github.com/TimDettmers/bitsandbytes/issues/565
cd benchmarking/accuracy
python bnb_accuracy.py
# Accurate results should look like
#tensor(526.7872, device='cuda:0')
#tensor(551.2297, device='cuda:0')
#tensor(574.9075, device='cuda:0')
#tensor(3435.1819, device='cuda:0')
#tensor(3480.1541, device='cuda:0')
Using Int8 inference with HuggingFace Transformers
import torch
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    'decapoda-research/llama-7b-hf',
    device_map='auto',
    load_in_8bit=True,
    # leave roughly 2 GB of the currently free GPU memory unused
    max_memory=f'{int(torch.cuda.mem_get_info()[0]/1024**3)-2}GB')
A more detailed example can be found in examples/int8_inference_huggingface.py.
Using 8-bit optimizer:
- Comment out optimizer: #torch.optim.Adam(....)
- Add 8-bit optimizer of your choice: bnb.optim.Adam8bit(....) (arguments stay the same)
- Replace embedding layer if necessary: torch.nn.Embedding(..) -> bnb.nn.Embedding(..)
Using 8-bit Inference:
- Comment out torch.nn.Linear: #linear = torch.nn.Linear(...)
- Add bnb 8-bit linear light module: linear = bnb.nn.Linear8bitLt(...) (base arguments stay the same)
- There are two modes:
  - Mixed 8-bit training with 16-bit main weights. Pass the argument has_fp16_weights=True (default)
  - Int8 inference. Pass the argument has_fp16_weights=False
- To use the full LLM.int8() method, use the threshold=k argument. We recommend k=6.0.
# LLM.int8()
linear = bnb.nn.Linear8bitLt(dim1, dim2, bias=True, has_fp16_weights=False, threshold=6.0)
# inputs need to be fp16
out = linear(x.to(torch.float16))
- 8-bit Matrix multiplication with mixed precision decomposition
- LLM.int8() inference
- 8-bit Optimizers: Adam, AdamW, RMSProp, LARS, LAMB, Lion (saves 75% memory)
- Stable Embedding Layer: Improved stability through better initialization, and normalization
- 8-bit quantization: Quantile, Linear, and Dynamic quantization
- Fast quantile estimation: Up to 100x faster than other algorithms
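The 75% figure quoted for 8-bit optimizers follows from simple arithmetic on optimizer state size. The sketch below is a back-of-the-envelope estimate, not a bitsandbytes API: `optimizer_state_bytes` is a hypothetical helper, it assumes an Adam-style optimizer that keeps two state tensors per parameter, and it ignores the small overhead of the per-block quantization constants.

```python
def optimizer_state_bytes(num_params: int, bits_per_state: int,
                          states_per_param: int = 2) -> int:
    """Bytes needed for optimizer state.

    Hypothetical helper for illustration only; not part of bitsandbytes.
    Adam keeps two states per parameter (first and second moment).
    """
    return num_params * states_per_param * bits_per_state // 8


num_params = 7_000_000_000  # illustrative model size, chosen arbitrarily

bytes_32bit = optimizer_state_bytes(num_params, bits_per_state=32)
bytes_8bit = optimizer_state_bytes(num_params, bits_per_state=8)
saving = 1 - bytes_8bit / bytes_32bit

print(f"32-bit Adam state: {bytes_32bit / 1024**3:.1f} GiB")
print(f"8-bit Adam state:  {bytes_8bit / 1024**3:.1f} GiB")
print(f"state memory saved: {saving:.0%}")
```

The saving applies to optimizer state only; weights, gradients, and activations are unaffected by the choice of optimizer.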
For straight Int8 matrix multiplication with mixed precision decomposition you can use bnb.matmul(...). To enable mixed precision decomposition, use the threshold parameter:
bnb.matmul(..., threshold=6.0)
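To make the role of the threshold concrete, here is a minimal pure-Python sketch of the idea behind mixed precision decomposition as described in the LLM.int8() paper. It is not the bitsandbytes implementation: `absmax_quantize` and `matmul_mixed` are hypothetical names, the matrices are plain nested lists, and the exact outlier criterion (a column counts as an outlier when any entry reaches the threshold) is an assumption made for illustration.

```python
def absmax_quantize(vec):
    """Map a vector to integers in [-127, 127].

    Returns the integers and the factor that converts them back to floats.
    """
    absmax = max(abs(v) for v in vec) or 1.0  # avoid dividing by zero
    return [round(v / absmax * 127) for v in vec], absmax / 127


def matmul_mixed(X, W, threshold=6.0):
    """Compute X @ W for X (n x k) and W (k x m), both nested lists.

    Feature columns of X that contain an entry with magnitude >= threshold
    are multiplied in full precision; the rest go through int8.
    """
    n, k, m = len(X), len(W), len(W[0])
    is_outlier = [any(abs(X[i][j]) >= threshold for i in range(n))
                  for j in range(k)]
    regular = [j for j in range(k) if not is_outlier[j]]

    # Full-precision part: only the outlier feature dimensions.
    out = [[sum(X[i][j] * W[j][c] for j in range(k) if is_outlier[j])
            for c in range(m)] for i in range(n)]

    if regular:
        # Int8 part: quantize each row of X and each column of W separately.
        Xq = [absmax_quantize([X[i][j] for j in regular]) for i in range(n)]
        Wq = [absmax_quantize([W[j][c] for j in regular]) for c in range(m)]
        for i in range(n):
            x_int, x_scale = Xq[i]
            for c in range(m):
                w_int, w_scale = Wq[c]
                acc = sum(a * b for a, b in zip(x_int, w_int))  # integer sum
                out[i][c] += acc * x_scale * w_scale  # dequantize
    return out


X = [[1.0, -0.5, 20.0],   # the third feature column holds an outlier
     [0.25, 0.75, -0.1]]
W = [[0.5, -1.0],
     [1.5, 0.25],
     [-0.2, 0.3]]

for row in matmul_mixed(X, W, threshold=6.0):
    print(row)
```

Keeping the outlier column out of the int8 path means the large value 20.0 does not stretch the quantization range of the remaining, much smaller entries.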
For instructions on how to use LLM.int8() inference layers in your own code, see the TL;DR above, or for extended instructions see this blog post.
With bitsandbytes, 8-bit optimizers can be used by changing a single line of code in your codebase. For NLP models we also recommend using the StableEmbedding layer (see below), which improves results and helps with stable 8-bit optimization. To get started with 8-bit optimizers, it is sufficient to replace your old optimizer with the 8-bit optimizer in the following way:
import bitsandbytes as bnb
# adam = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # comment out old optimizer
adam = bnb.optim.Adam8bit(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # add bnb optimizer
adam = bnb.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995), optim_bits=8) # equivalent
torch.nn.Embedding(...) -> bnb.nn.StableEmbedding(...) # recommended for NLP models
Note that by default all parameter tensors with fewer than 4096 elements are kept at 32-bit even if you initialize those parameters with 8-bit optimizers. This is done because such small tensors do not save much memory and often contain highly variable parameters (biases) or parameters that require high precision (batch norm, layer norm). You can change this behavior like so:
# parameter tensors with less than 16384 values are optimized in 32-bit
# it is recommended to use multiples of 4096
adam = bnb.optim.Adam8bit(model.parameters(), min_8bit_size=16384)
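The size rule above can be illustrated without a GPU. The following sketch only mirrors the rule as stated in this README (tensors with fewer elements than the threshold stay at 32-bit); `uses_8bit_state` and the parameter names and shapes are hypothetical and are not taken from the bitsandbytes source.

```python
from math import prod


def uses_8bit_state(shape, min_8bit_size=4096):
    """Return True if a tensor of this shape would get 8-bit optimizer state.

    Hypothetical helper: tensors with fewer than min_8bit_size elements
    are assumed to stay at 32-bit, as described in the text.
    """
    return prod(shape) >= min_8bit_size


# Illustrative parameter shapes, not taken from any particular model.
params = {
    "embed.weight": (32000, 512),
    "fc.weight": (128, 64),
    "fc.bias": (128,),
    "norm.weight": (512,),
}

for threshold in (4096, 16384):
    print(f"min_8bit_size={threshold}")
    for name, shape in params.items():
        bits = 8 if uses_8bit_state(shape, threshold) else 32
        print(f"  {name:<14} {prod(shape):>10} elements -> {bits}-bit state")
```

With the larger threshold, the 8192-element `fc.weight` falls back to 32-bit state, while biases and normalization weights stay at 32-bit under both settings.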
If you want to optimize some unstable parameters with 32-bit Adam and others with 8-bit Adam, you can use the GlobalOptimManager. With this, we can also configure specific hyperparameters for particular layers, such as embedding layers. To do that, we need two things: (1) register the parameters while they are still on the CPU, (2) override the config with the new desired hyperparameters (anytime, anywhere). See our guide for more details.
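The two steps can be sketched as follows. This is a minimal example under stated assumptions rather than a tested recipe for this fork: `TinyModel` and `build_optimizer_with_overrides` are hypothetical, the calls follow the GlobalOptimManager usage shown in the upstream bitsandbytes documentation, and running the function requires a working GPU build of bitsandbytes.

```python
def build_optimizer_with_overrides():
    """Create an 8-bit Adam optimizer with one parameter kept at 32-bit.

    Imports are deferred so that merely defining this function does not
    require torch, bitsandbytes, or a GPU.
    """
    import torch
    import bitsandbytes as bnb

    class TinyModel(torch.nn.Module):  # hypothetical model for illustration
        def __init__(self):
            super().__init__()
            self.fc1 = torch.nn.Linear(128, 128)
            self.fc2 = torch.nn.Linear(128, 128)

        def forward(self, x):
            return self.fc2(torch.relu(self.fc1(x)))

    mng = bnb.optim.GlobalOptimManager.get_instance()

    model = TinyModel()
    # (1) register the parameters while they are still on the CPU
    mng.register_parameters(model.parameters())

    model = model.cuda()
    # request 8-bit optimizer state for all parameters by default
    adam = bnb.optim.Adam(model.parameters(), lr=0.001, optim_bits=8)

    # (2) override the config: fc1.weight is optimized with 32-bit Adam
    mng.override_config(model.fc1.weight, "optim_bits", 32)
    return model, adam
```

On a machine with a supported GPU, calling `build_optimizer_with_overrides()` returns the model and optimizer ready for a normal training loop.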
To use the Stable Embedding Layer, override the respective build_embedding(...) function of your model. Make sure to also use the --no-scale-embedding flag to disable scaling of the word embedding layer (now replaced with layer norm). You can use the optimizers by replacing the optimizer in the respective file (adam.py etc.).
For upcoming features and changes and full history see Patch Notes.
- RuntimeError: CUDA error: no kernel image is available for execution on the device. Solution
- _fatbinwrap.. Solution
The majority of bitsandbytes is licensed under MIT, however portions of the project are available under separate license terms: Pytorch is licensed under the BSD license.
We thank Fabio Cannizzo for his work on FastBinarySearch which we use for CPU quantization.
If you found this library and LLM.int8() useful, please consider citing our work:
@article{dettmers2022llmint8,
title={LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale},
author={Dettmers, Tim and Lewis, Mike and Belkada, Younes and Zettlemoyer, Luke},
journal={arXiv preprint arXiv:2208.07339},
year={2022}
}
For 8-bit optimizers or quantization routines, please consider citing the following work:
@article{dettmers2022optimizers,
title={8-bit Optimizers via Block-wise Quantization},
author={Dettmers, Tim and Lewis, Mike and Shleifer, Sam and Zettlemoyer, Luke},
journal={9th International Conference on Learning Representations, ICLR},
year={2022}
}