
New OOM bug introduced in bitsandbytes 0.38.x? #324

Closed
KukumavMozolo opened this issue Apr 17, 2023 · 7 comments

Comments


KukumavMozolo commented Apr 17, 2023

Hi there, apparently 0.38.0 or 0.38.1 introduced a bug that greatly increases memory consumption when trying to save a model.
For details, see this post in the alpaca-lora GitHub repo.

The bug doesn't happen when using bitsandbytes==0.37.2.

Note: while the script output below says otherwise, I am pretty sure I have CUDA 11.7 installed.
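In case it helps to reproduce, a few standard commands can confirm which CUDA runtime is actually being picked up (the log below itself suggests conda list | grep cuda); nothing here is bitsandbytes-specific, and paths will differ per machine:

```bash
# Generic sanity checks for the CUDA runtime that bitsandbytes will see
nvidia-smi                                            # driver and max supported CUDA version
nvcc --version                                        # toolkit version on PATH, if installed
python -c "import torch; print(torch.version.cuda)"   # CUDA build PyTorch was compiled against
conda list | grep cuda                                # as suggested by the CUDA SETUP log below
echo $LD_LIBRARY_PATH                                 # where libcudart.so is resolved from
```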

```
python -m bitsandbytes

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

CUDA SETUP: CUDA version lower than 11 are currently not supported for LLM.int8(). You will be only to use 8-bit optimizers and quantization routines!!
CUDA SETUP: CUDA runtime path found: ...lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 102
CUDA SETUP: Required library version not found: libbitsandbytes_cuda102.so. Maybe you need to compile it from source?
CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...

================================================ERROR=====================================
CUDA SETUP: CUDA detection failed! Possible reasons:

  1. CUDA driver not installed
  2. CUDA not installed
  3. You have multiple conflicting CUDA libraries
  4. Required library not pre-compiled for this bitsandbytes release!
    CUDA SETUP: If you compiled from source, try again with make CUDA_VERSION=DETECTED_CUDA_VERSION for example, make CUDA_VERSION=113.
    CUDA SETUP: The CUDA version for the compile might depend on your conda install. Inspect CUDA version via conda list | grep cuda.
    ================================================================================

CUDA SETUP: Something unexpected happened. Please compile from source:
git clone git@github.com:TimDettmers/bitsandbytes.git
cd bitsandbytes
CUDA_VERSION=102
python setup.py install
CUDA SETUP: CUDA version lower than 11 are currently not supported for LLM.int8(). You will be only to use 8-bit optimizers and quantization routines!!
CUDA SETUP: CUDA runtime path found: .../lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 102
CUDA SETUP: Required library version not found: libbitsandbytes_cuda102.so. Maybe you need to compile it from source?
CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...

================================================ERROR=====================================
CUDA SETUP: CUDA detection failed! Possible reasons:

  1. CUDA driver not installed
  2. CUDA not installed
  3. You have multiple conflicting CUDA libraries
  4. Required library not pre-compiled for this bitsandbytes release!
    CUDA SETUP: If you compiled from source, try again with make CUDA_VERSION=DETECTED_CUDA_VERSION for example, make CUDA_VERSION=113.
    CUDA SETUP: The CUDA version for the compile might depend on your conda install. Inspect CUDA version via conda list | grep cuda.
    ================================================================================

CUDA SETUP: Something unexpected happened. Please compile from source:
git clone git@github.com:TimDettmers/bitsandbytes.git
cd bitsandbytes
CUDA_VERSION=102
python setup.py install
CUDA SETUP: Setup Failed!
CUDA SETUP: CUDA version lower than 11 are currently not supported for LLM.int8(). You will be only to use 8-bit optimizers and quantization routines!!
CUDA SETUP: CUDA runtime path found:.../lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 102
CUDA SETUP: Required library version not found: libbitsandbytes_cuda102.so. Maybe you need to compile it from source?
CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...

================================================ERROR=====================================
CUDA SETUP: CUDA detection failed! Possible reasons:

  1. CUDA driver not installed
  2. CUDA not installed
  3. You have multiple conflicting CUDA libraries
  4. Required library not pre-compiled for this bitsandbytes release!
    CUDA SETUP: If you compiled from source, try again with make CUDA_VERSION=DETECTED_CUDA_VERSION for example, make CUDA_VERSION=113.
    CUDA SETUP: The CUDA version for the compile might depend on your conda install. Inspect CUDA version via conda list | grep cuda.
    ================================================================================

CUDA SETUP: Something unexpected happened. Please compile from source:
git clone git@github.com:TimDettmers/bitsandbytes.git
cd bitsandbytes
CUDA_VERSION=102
python setup.py install
CUDA SETUP: Setup Failed!
CUDA SETUP: CUDA version lower than 11 are currently not supported for LLM.int8(). You will be only to use 8-bit optimizers and quantization routines!!
CUDA SETUP: CUDA runtime path found: .../lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 102
CUDA SETUP: Required library version not found: libbitsandbytes_cuda102.so. Maybe you need to compile it from source?
CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...

================================================ERROR=====================================
CUDA SETUP: CUDA detection failed! Possible reasons:

  1. CUDA driver not installed
  2. CUDA not installed
  3. You have multiple conflicting CUDA libraries
  4. Required library not pre-compiled for this bitsandbytes release!
    CUDA SETUP: If you compiled from source, try again with make CUDA_VERSION=DETECTED_CUDA_VERSION for example, make CUDA_VERSION=113.
    CUDA SETUP: The CUDA version for the compile might depend on your conda install. Inspect CUDA version via conda list | grep cuda.
    ================================================================================

CUDA SETUP: Something unexpected happened. Please compile from source:
git clone git@github.com:TimDettmers/bitsandbytes.git
cd bitsandbytes
CUDA_VERSION=102
python setup.py install
CUDA SETUP: Setup Failed!
CUDA SETUP: Something unexpected happened. Please compile from source:
git clone git@github.com:TimDettmers/bitsandbytes.git
cd bitsandbytes
CUDA_VERSION=102
python setup.py install
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 187, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/usr/lib/python3.10/runpy.py", line 146, in _get_module_details
    return _get_module_details(pkg_main_name, error)
  File "/usr/lib/python3.10/runpy.py", line 110, in _get_module_details
    __import__(pkg_name)
  File ".../lib/python3.10/site-packages/bitsandbytes/__init__.py", line 7, in <module>
    from .autograd._functions import (
  File ".../lib/python3.10/site-packages/bitsandbytes/autograd/__init__.py", line 1, in <module>
    from ._functions import undo_layout, get_inverse_transform_indices
  File ".../lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 9, in <module>
    import bitsandbytes.functional as F
  File ".../lib/python3.10/site-packages/bitsandbytes/functional.py", line 17, in <module>
    from .cextension import COMPILED_WITH_CUDA, lib
  File ".../lib/python3.10/site-packages/bitsandbytes/cextension.py", line 22, in <module>
    raise RuntimeError('''
RuntimeError:
CUDA Setup failed despite GPU being available. Inspect the CUDA SETUP outputs above to fix your environment!
If you cannot find any issues and suspect a bug, please open an issue with detals about your environment:
https://github.com/TimDettmers/bitsandbytes/issues

```

mryab (Collaborator) commented Apr 20, 2023

Hi, thanks for reporting this! I have investigated this and proposed a fix in #330. Overall, it looks like the issue is caused by weight layout conversions when .state_dict is called; unfortunately, changing the layout in place is not trivial. Before @TimDettmers reviews this, I'd be glad if you could try the fix and see if it resolves the issue.


better629 commented May 6, 2023

I hit a similar problem when fine-tuning with alpaca-lora and other similar LLaMA-based repos.

Environment
cuda 11.7
bitsandbytes 0.38.1
peft 0.2.0
transformers 4.28.1

nvidia card: A100 80G

Log

Fine-tuning llama-30b with LoRA failed using FastChat, and then trying alpaca-lora hit the same problem.

# two cards
model = LlamaForCausalLM.from_pretrained(
        base_model,
        load_in_8bit=True,
        torch_dtype=torch.float16,
        device_map={"":0},
    )

After loading the model, it uses about ~34G, which rises to ~53G after model = get_peft_model(model, config). But an OOM occurs when trying to save the model at a save_steps checkpoint or in the model.save_pretrained(output_dir) stage. Below is the error log:

│ /data/llm/alpaca-lora/finetune.py:280 in train                                                   │
│                                                                                                  │
│   277 │                                                                                          │
│   278 │   trainer.train(resume_from_checkpoint=resume_from_checkpoint)                           │
│   279 │                                                                                          │
│ ❱ 280 │   model.save_pretrained(output_dir)                                                      │
│   281 │                                                                                          │
│   282 │   print(                                                                                 │
│   283 │   │   "\n If there's a warning about missing keys above, please disregard :)"            │
│                                                                                                  │
│ /root/anaconda3/lib/python3.10/site-packages/peft/peft_model.py:102 in save_pretrained           │
│                                                                                                  │
│    99 │   │   os.makedirs(save_directory, exist_ok=True)                                         │
│   100 │   │                                                                                      │
│   101 │   │   # save only the trainable weights                                                  │
│ ❱ 102 │   │   output_state_dict = get_peft_model_state_dict(self, kwargs.get("state_dict", Non   │
│   103 │   │   torch.save(output_state_dict, os.path.join(save_directory, WEIGHTS_NAME))          │
│   104 │   │                                                                                      │
│   105 │   │   # save the config and change the inference mode to `True`                          │
│                                                                                                  │
│ /root/anaconda3/lib/python3.10/site-packages/peft/utils/save_and_load.py:31 in                   │
│ get_peft_model_state_dict                                                                        │
│                                                                                                  │
│   28 │   │   will be used.                                                                       │
│   29 │   """                                                                                     │
│   30 │   if state_dict is None:                                                                  │
│ ❱ 31 │   │   state_dict = model.state_dict()                                                     │
│   32 │   if model.peft_config.peft_type == PeftType.LORA:                                        │
│   33 │   │   # to_return = lora_state_dict(model, bias=model.peft_config.bias)                   │
│   34 │   │   # adapted from `https://github.com/microsoft/LoRA/blob/main/loralib/utils.py`       │
│                                                                                                  │
│ /data/llm/alpaca-lora/finetune.py:271 in <lambda>                                                │
│                                                                                                  │
│   268 │   old_state_dict = model.state_dict                                                      │
│   269 │   model.state_dict = (                                                                   │
│   270 │   │   lambda self, *_, **__: get_peft_model_state_dict(                                  │
│ ❱ 271 │   │   │   self, old_state_dict()                                                         │
│   272 │   │   )                                                                                  │
│   273 │   ).__get__(model, type(model))                                                          │
│   274                                                                                            │
│                                                                                                  │
│ /root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py:1448 in state_dict       │
│                                                                                                  │
│   1445 │   │   self._save_to_state_dict(destination, prefix, keep_vars)                          │
│   1446 │   │   for name, module in self._modules.items():                                        │
│   1447 │   │   │   if module is not None:                                                        │
│ ❱ 1448 │   │   │   │   module.state_dict(destination=destination, prefix=prefix + name + '.', k  │
│   1449 │   │   for hook in self._state_dict_hooks.values():                                      │
│   1450 │   │   │   hook_result = hook(self, destination, prefix, local_metadata)                 │
│   1451 │   │   │   if hook_result is not None:                                                   │

......

│ /root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py:1445 in state_dict       │
│                                                                                                  │
│   1442 │   │   if hasattr(destination, "_metadata"):                                             │
│   1443 │   │   │   destination._metadata[prefix[:-1]] = local_metadata                           │
│   1444 │   │                                                                                     │
│ ❱ 1445 │   │   self._save_to_state_dict(destination, prefix, keep_vars)                          │
│   1446 │   │   for name, module in self._modules.items():                                        │
│   1447 │   │   │   if module is not None:                                                        │
│   1448 │   │   │   │   module.state_dict(destination=destination, prefix=prefix + name + '.', k  │
│                                                                                                  │
│ /root/anaconda3/lib/python3.10/site-packages/bitsandbytes/nn/modules.py:268 in                   │
│ _save_to_state_dict                                                                              │
│                                                                                                  │
│   265 │   │                                                                                      │
│   266 │   │   try:                                                                               │
│   267 │   │   │   if reorder_layout:                                                             │
│ ❱ 268 │   │   │   │   self.weight.data = undo_layout(self.state.CxB, self.state.tile_indices)    │
│   269 │   │   │                                                                                  │
│   270 │   │   │   super()._save_to_state_dict(destination, prefix, keep_vars)                    │
│   271                                                                                            │
│                                                                                                  │
│ /root/anaconda3/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py:96 in           │
│ undo_layout                                                                                      │
│                                                                                                  │
│    93 │   (rows, cols), (tile_rows, tile_cols) = permuted_tensor.shape, tile_indices.shape       │
│    94 │   assert rows % tile_rows == cols % tile_cols == 0, "tensor must contain a whole numbe   │
│    95 │   tensor = permuted_tensor.reshape(-1, tile_indices.numel()).t()                         │
│ ❱  96 │   outputs = torch.empty_like(tensor)  # note: not using .index_copy because it was slo   │
│    97 │   outputs[tile_indices.flatten()] = tensor                                               │
│    98 │   outputs = outputs.reshape(tile_rows, tile_cols, cols // tile_cols, rows // tile_rows   │
│    99 │   outputs = outputs.permute(3, 0, 2, 1)  # (rows // tile_rows, tile_rows), (cols // ti   │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
OutOfMemoryError: CUDA out of memory. Tried to allocate 44.00 MiB (GPU 0; 79.21 GiB total capacity; 74.26 GiB already allocated; 39.56 MiB free; 77.52 GiB 
reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory 
Management and PYTORCH_CUDA_ALLOC_CONF

Solution
Downgrading bitsandbytes to 0.37.2 or 0.37.0 makes training and saving work well. GPU memory stays at about 58G during training.
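For reference, the downgrade is just a version pin (pick whichever of the two releases above you prefer):

```bash
# Pin bitsandbytes to a pre-0.38 release reported above to avoid the save-time OOM
pip install bitsandbytes==0.37.2
```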

@cfhammill

Does anyone know if this issue is fixed in 0.39?

@Qubitium

@cfhammill This is not fixed in 0.39.0. Note that this OOM bug only happens with 8-bit; 4-bit is fine. I am trying to figure out what is causing this.
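If changing precision is an option for your use case, the 4-bit path mentioned above would look roughly like the sketch below. This is only an illustration: it assumes a transformers release new enough to expose load_in_4bit (which arrived around the bitsandbytes 0.39 timeframe), and base_model is a placeholder for your own checkpoint path.

```python
# Hedged sketch: load in 4-bit instead of 8-bit, since the save-time OOM is
# reported above to affect 8-bit only. Requires 4-bit support in transformers
# and bitsandbytes >= 0.39.
import torch
from transformers import LlamaForCausalLM

base_model = "path/to/llama-checkpoint"  # placeholder; use your own model path

model = LlamaForCausalLM.from_pretrained(
    base_model,
    load_in_4bit=True,          # instead of load_in_8bit=True
    torch_dtype=torch.float16,
    device_map={"": 0},
)
```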


psinger commented Jun 2, 2023

Has anyone found a workaround for this? It's a really frustrating bug that makes it basically impossible to save the state_dict of large 8-bit models.

Downgrading does not seem to be a good solution.

cc @TimDettmers

mryab (Collaborator) commented Jun 9, 2023

Hi everybody, and sorry for the delay with this issue! I took a closer look at the underlying problem, and the issue seems to be here: https://github.com/TimDettmers/bitsandbytes/blob/main/bitsandbytes/nn/modules.py#L335-L338.

To the best of my understanding, the problem is that the current logic duplicates the memory usage of each Linear8bitLt.weight when we call state_dict. For efficiency, bitsandbytes rearranges the weight matrix into a GPU-dependent format; my original implementation of 8-bit serialization was designed to save weights in a standard row-major format instead. As a result, the state dict tensors no longer share their underlying storage with the model weights: when we build the state dict, we need to allocate a new tensor for each weight matrix, which leads to a significant increase in memory consumption.
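To make that concrete, here is a minimal, purely illustrative sketch with hypothetical names (not the actual bitsandbytes internals): for every 8-bit linear layer, assembling the state dict materializes a second copy of the weight on the GPU, and those copies stay alive in the destination dict alongside the packed buffers, so peak memory grows with the number of layers.

```python
# Illustrative sketch only: why peak GPU memory grows while the state dict
# is assembled for 8-bit layers. Names are stand-ins, not bitsandbytes APIs.
import torch

def build_state_dict_sketch(layers):
    destination = {}
    for name, packed_weight in layers:  # packed_weight: GPU-format buffer (like state.CxB)
        # Stand-in for the row-major copy produced by the layout conversion:
        # a brand-new allocation of the same size, while the packed buffer
        # is still held by the layer itself.
        row_major = torch.empty_like(packed_weight)
        destination[f"{name}.weight"] = row_major  # stays alive until the dict is saved or freed
    return destination
```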

I made an attempt to fix this problem in #503: in my setup, the OOM issue seems to disappear, but the solution involves a memory overhead when you load the checkpoint (which should be no more than ~100-200 MB of GPU RAM for most setups).

@KukumavMozolo @better629 @cfhammill @Qubitium @psinger if you have the time, I'd be very happy if you could try the fix from the PR above and see if it resolves the issue in your case. Also, if the GPU memory overhead at checkpoint loading time is not acceptable, we can try to think of another solution.
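In case it helps anyone test: one way to try an unmerged PR is to fetch GitHub's pull/<N>/head ref and reinstall from source, roughly as sketched below. The make invocation follows the form the CUDA SETUP log above suggests; adjust the version to your installed toolkit.

```bash
# Rough sketch for testing the fix from PR #503 before it is merged
git clone https://github.com/TimDettmers/bitsandbytes.git
cd bitsandbytes
git fetch origin pull/503/head:pr-503
git checkout pr-503
make CUDA_VERSION=117        # match your installed CUDA toolkit version
python setup.py install
```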

@TimDettmers (Collaborator)

Thank you, this has been addressed in 2d321a7.
