
New OOM bug introduced in bitsandbytes 0.38.x? #324

Closed
KukumavMozolo opened this issue Apr 17, 2023 · 7 comments

Comments


KukumavMozolo commented Apr 17, 2023

Hi there, apparently 0.38.0 or 0.38.1 introduced a bug that greatly increases memory consumption when trying to save a model.
For details, see this post in the alpaca-lora GitHub repo.

The bug doesn't happen when using bitsandbytes==0.37.2.

Note: while the script output below says otherwise, I am pretty sure I have CUDA 11.7 installed.
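In case it helps to reproduce, a few standard commands can confirm which CUDA runtime is actually being picked up (the log below itself suggests conda list | grep cuda); nothing here is bitsandbytes-specific, and paths will differ per machine:

```bash
# Generic sanity checks for the CUDA runtime that bitsandbytes will see
nvidia-smi                                            # driver and max supported CUDA version
nvcc --version                                        # toolkit version on PATH, if installed
python -c "import torch; print(torch.version.cuda)"   # CUDA build PyTorch was compiled against
conda list | grep cuda                                # as suggested by the CUDA SETUP log below
echo $LD_LIBRARY_PATH                                 # where libcudart.so is resolved from
```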

```
python -m bitsandbytes

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

CUDA SETUP: CUDA version lower than 11 are currently not supported for LLM.int8(). You will be only to use 8-bit optimizers and quantization routines!!
CUDA SETUP: CUDA runtime path found: ...lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 102
CUDA SETUP: Required library version not found: libbitsandbytes_cuda102.so. Maybe you need to compile it from source?
CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...

================================================ERROR=====================================
CUDA SETUP: CUDA detection failed! Possible reasons:

  1. CUDA driver not installed
  2. CUDA not installed
  3. You have multiple conflicting CUDA libraries
  4. Required library not pre-compiled for this bitsandbytes release!
    CUDA SETUP: If you compiled from source, try again with make CUDA_VERSION=DETECTED_CUDA_VERSION for example, make CUDA_VERSION=113.
    CUDA SETUP: The CUDA version for the compile might depend on your conda install. Inspect CUDA version via conda list | grep cuda.
    ================================================================================

CUDA SETUP: Something unexpected happened. Please compile from source:
git clone git@github.com:TimDettmers/bitsandbytes.git
cd bitsandbytes
CUDA_VERSION=102
python setup.py install
CUDA SETUP: CUDA version lower than 11 are currently not supported for LLM.int8(). You will be only to use 8-bit optimizers and quantization routines!!
CUDA SETUP: CUDA runtime path found: .../lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 102
CUDA SETUP: Required library version not found: libbitsandbytes_cuda102.so. Maybe you need to compile it from source?
CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...

================================================ERROR=====================================
CUDA SETUP: CUDA detection failed! Possible reasons:

  1. CUDA driver not installed
  2. CUDA not installed
  3. You have multiple conflicting CUDA libraries
  4. Required library not pre-compiled for this bitsandbytes release!
    CUDA SETUP: If you compiled from source, try again with make CUDA_VERSION=DETECTED_CUDA_VERSION for example, make CUDA_VERSION=113.
    CUDA SETUP: The CUDA version for the compile might depend on your conda install. Inspect CUDA version via conda list | grep cuda.
    ================================================================================

CUDA SETUP: Something unexpected happened. Please compile from source:
git clone git@github.com:TimDettmers/bitsandbytes.git
cd bitsandbytes
CUDA_VERSION=102
python setup.py install
CUDA SETUP: Setup Failed!
CUDA SETUP: CUDA version lower than 11 are currently not supported for LLM.int8(). You will be only to use 8-bit optimizers and quantization routines!!
CUDA SETUP: CUDA runtime path found:.../lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 102
CUDA SETUP: Required library version not found: libbitsandbytes_cuda102.so. Maybe you need to compile it from source?
CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...

================================================ERROR=====================================
CUDA SETUP: CUDA detection failed! Possible reasons:

  1. CUDA driver not installed
  2. CUDA not installed
  3. You have multiple conflicting CUDA libraries
  4. Required library not pre-compiled for this bitsandbytes release!
    CUDA SETUP: If you compiled from source, try again with make CUDA_VERSION=DETECTED_CUDA_VERSION for example, make CUDA_VERSION=113.
    CUDA SETUP: The CUDA version for the compile might depend on your conda install. Inspect CUDA version via conda list | grep cuda.
    ================================================================================

CUDA SETUP: Something unexpected happened. Please compile from source:
git clone git@github.com:TimDettmers/bitsandbytes.git
cd bitsandbytes
CUDA_VERSION=102
python setup.py install
CUDA SETUP: Setup Failed!
CUDA SETUP: CUDA version lower than 11 are currently not supported for LLM.int8(). You will be only to use 8-bit optimizers and quantization routines!!
CUDA SETUP: CUDA runtime path found: .../lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 102
CUDA SETUP: Required library version not found: libbitsandbytes_cuda102.so. Maybe you need to compile it from source?
CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...

================================================ERROR=====================================
CUDA SETUP: CUDA detection failed! Possible reasons:

  1. CUDA driver not installed
  2. CUDA not installed
  3. You have multiple conflicting CUDA libraries
  4. Required library not pre-compiled for this bitsandbytes release!
    CUDA SETUP: If you compiled from source, try again with make CUDA_VERSION=DETECTED_CUDA_VERSION for example, make CUDA_VERSION=113.
    CUDA SETUP: The CUDA version for the compile might depend on your conda install. Inspect CUDA version via conda list | grep cuda.
    ================================================================================

CUDA SETUP: Something unexpected happened. Please compile from source:
git clone git@github.com:TimDettmers/bitsandbytes.git
cd bitsandbytes
CUDA_VERSION=102
python setup.py install
CUDA SETUP: Setup Failed!
CUDA SETUP: Something unexpected happened. Please compile from source:
git clone git@github.com:TimDettmers/bitsandbytes.git
cd bitsandbytes
CUDA_VERSION=102
python setup.py install
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 187, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/usr/lib/python3.10/runpy.py", line 146, in _get_module_details
    return _get_module_details(pkg_main_name, error)
  File "/usr/lib/python3.10/runpy.py", line 110, in _get_module_details
    __import__(pkg_name)
  File ".../lib/python3.10/site-packages/bitsandbytes/__init__.py", line 7, in <module>
    from .autograd._functions import (
  File ".../lib/python3.10/site-packages/bitsandbytes/autograd/__init__.py", line 1, in <module>
    from ._functions import undo_layout, get_inverse_transform_indices
  File ".../lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 9, in <module>
    import bitsandbytes.functional as F
  File ".../lib/python3.10/site-packages/bitsandbytes/functional.py", line 17, in <module>
    from .cextension import COMPILED_WITH_CUDA, lib
  File ".../lib/python3.10/site-packages/bitsandbytes/cextension.py", line 22, in <module>
    raise RuntimeError('''
RuntimeError:
CUDA Setup failed despite GPU being available. Inspect the CUDA SETUP outputs above to fix your environment!
If you cannot find any issues and suspect a bug, please open an issue with detals about your environment:
https://github.com/TimDettmers/bitsandbytes/issues

```

mryab (Collaborator) commented Apr 20, 2023

Hi, thanks for reporting this! I have investigated this and proposed a fix in #330. Overall, it looks like the issue is caused by weight layout conversions when .state_dict is called; unfortunately, changing the layout in place is not trivial. Before @TimDettmers reviews this, I'd be glad if you could try the fix and see if it resolves the issue.


better629 commented May 6, 2023

I hit a similar problem when fine-tuning with alpaca-lora and other similar LLaMA-based repos.

Environment
cuda 11.7
bitsandbytes 0.38.1
peft 0.2.0
transformers 4.28.1

nvidia card: A100 80G

Log

Fine-tuning llama-30b with LoRA failed using FastChat, and then trying alpaca-lora hit the same problem.

# two cards
model = LlamaForCausalLM.from_pretrained(
        base_model,
        load_in_8bit=True,
        torch_dtype=torch.float16,
        device_map={"":0},
    )

After loading the model, it uses about ~34G, which rises to ~53G after model = get_peft_model(model, config). But an OOM occurs when trying to save the model at a save_steps checkpoint or in the model.save_pretrained(output_dir) stage. Below is the error log:

│ /data/llm/alpaca-lora/finetune.py:280 in train                                                   │
│                                                                                                  │
│   277 │                                                                                          │
│   278 │   trainer.train(resume_from_checkpoint=resume_from_checkpoint)                           │
│   279 │                                                                                          │
│ ❱ 280 │   model.save_pretrained(output_dir)                                                      │
│   281 │                                                                                          │
│   282 │   print(                                                                                 │
│   283 │   │   "\n If there's a warning about missing keys above, please disregard :)"            │
│                                                                                                  │
│ /root/anaconda3/lib/python3.10/site-packages/peft/peft_model.py:102 in save_pretrained           │
│                                                                                                  │
│    99 │   │   os.makedirs(save_directory, exist_ok=True)                                         │
│   100 │   │                                                                                      │
│   101 │   │   # save only the trainable weights                                                  │
│ ❱ 102 │   │   output_state_dict = get_peft_model_state_dict(self, kwargs.get("state_dict", Non   │
│   103 │   │   torch.save(output_state_dict, os.path.join(save_directory, WEIGHTS_NAME))          │
│   104 │   │                                                                                      │
│   105 │   │   # save the config and change the inference mode to `True`                          │
│                                                                                                  │
│ /root/anaconda3/lib/python3.10/site-packages/peft/utils/save_and_load.py:31 in                   │
│ get_peft_model_state_dict                                                                        │
│                                                                                                  │
│   28 │   │   will be used.                                                                       │
│   29 │   """                                                                                     │
│   30 │   if state_dict is None:                                                                  │
│ ❱ 31 │   │   state_dict = model.state_dict()                                                     │
│   32 │   if model.peft_config.peft_type == PeftType.LORA:                                        │
│   33 │   │   # to_return = lora_state_dict(model, bias=model.peft_config.bias)                   │
│   34 │   │   # adapted from `https://github.com/microsoft/LoRA/blob/main/loralib/utils.py`       │
│                                                                                                  │
│ /data/llm/alpaca-lora/finetune.py:271 in <lambda>                                                │
│                                                                                                  │
│   268 │   old_state_dict = model.state_dict                                                      │
│   269 │   model.state_dict = (                                                                   │
│   270 │   │   lambda self, *_, **__: get_peft_model_state_dict(                                  │
│ ❱ 271 │   │   │   self, old_state_dict()                                                         │
│   272 │   │   )                                                                                  │
│   273 │   ).__get__(model, type(model))                                                          │
│   274                                                                                            │
│                                                                                                  │
│ /root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py:1448 in state_dict       │
│                                                                                                  │
│   1445 │   │   self._save_to_state_dict(destination, prefix, keep_vars)                          │
│   1446 │   │   for name, module in self._modules.items():                                        │
│   1447 │   │   │   if module is not None:                                                        │
│ ❱ 1448 │   │   │   │   module.state_dict(destination=destination, prefix=prefix + name + '.', k  │
│   1449 │   │   for hook in self._state_dict_hooks.values():                                      │
│   1450 │   │   │   hook_result = hook(self, destination, prefix, local_metadata)                 │
│   1451 │   │   │   if hook_result is not None:                                                   │

......

│ /root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py:1445 in state_dict       │
│                                                                                                  │
│   1442 │   │   if hasattr(destination, "_metadata"):                                             │
│   1443 │   │   │   destination._metadata[prefix[:-1]] = local_metadata                           │
│   1444 │   │                                                                                     │
│ ❱ 1445 │   │   self._save_to_state_dict(destination, prefix, keep_vars)                          │
│   1446 │   │   for name, module in self._modules.items():                                        │
│   1447 │   │   │   if module is not None:                                                        │
│   1448 │   │   │   │   module.state_dict(destination=destination, prefix=prefix + name + '.', k  │
│                                                                                                  │
│ /root/anaconda3/lib/python3.10/site-packages/bitsandbytes/nn/modules.py:268 in                   │
│ _save_to_state_dict                                                                              │
│                                                                                                  │
│   265 │   │                                                                                      │
│   266 │   │   try:                                                                               │
│   267 │   │   │   if reorder_layout:                                                             │
│ ❱ 268 │   │   │   │   self.weight.data = undo_layout(self.state.CxB, self.state.tile_indices)    │
│   269 │   │   │                                                                                  │
│   270 │   │   │   super()._save_to_state_dict(destination, prefix, keep_vars)                    │
│   271                                                                                            │
│                                                                                                  │
│ /root/anaconda3/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py:96 in           │
│ undo_layout                                                                                      │
│                                                                                                  │
│    93 │   (rows, cols), (tile_rows, tile_cols) = permuted_tensor.shape, tile_indices.shape       │
│    94 │   assert rows % tile_rows == cols % tile_cols == 0, "tensor must contain a whole numbe   │
│    95 │   tensor = permuted_tensor.reshape(-1, tile_indices.numel()).t()                         │
│ ❱  96 │   outputs = torch.empty_like(tensor)  # note: not using .index_copy because it was slo   │
│    97 │   outputs[tile_indices.flatten()] = tensor                                               │
│    98 │   outputs = outputs.reshape(tile_rows, tile_cols, cols // tile_cols, rows // tile_rows   │
│    99 │   outputs = outputs.permute(3, 0, 2, 1)  # (rows // tile_rows, tile_rows), (cols // ti   │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
OutOfMemoryError: CUDA out of memory. Tried to allocate 44.00 MiB (GPU 0; 79.21 GiB total capacity; 74.26 GiB already allocated; 39.56 MiB free; 77.52 GiB 
reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory 
Management and PYTORCH_CUDA_ALLOC_CONF

Solution
Downgrading bitsandbytes to 0.37.2 or 0.37.0 makes training and saving work well. GPU memory stays at about 58G during training.
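For reference, the downgrade is just a version pin (pick whichever of the two releases above you prefer):

```bash
# Pin bitsandbytes to a pre-0.38 release reported above to avoid the save-time OOM
pip install bitsandbytes==0.37.2
```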

@cfhammill

Does anyone know if this issue is fixed in 0.39?

@Qubitium

@cfhammill This is not fixed in 0.39.0. Note that this OOM bug only happens with 8-bit; 4-bit is fine. I am trying to figure out what is causing this.
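If changing precision is an option for your use case, the 4-bit path mentioned above would look roughly like the sketch below. This is only an illustration: it assumes a transformers release new enough to expose load_in_4bit (which arrived around the bitsandbytes 0.39 timeframe), and base_model is a placeholder for your own checkpoint path.

```python
# Hedged sketch: load in 4-bit instead of 8-bit, since the save-time OOM is
# reported above to affect 8-bit only. Requires 4-bit support in transformers
# and bitsandbytes >= 0.39.
import torch
from transformers import LlamaForCausalLM

base_model = "path/to/llama-checkpoint"  # placeholder; use your own model path

model = LlamaForCausalLM.from_pretrained(
    base_model,
    load_in_4bit=True,          # instead of load_in_8bit=True
    torch_dtype=torch.float16,
    device_map={"": 0},
)
```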


psinger commented Jun 2, 2023

Has anyone found a workaround for this? It's a really frustrating bug that makes it basically impossible to save the state_dict of large 8-bit models.

Downgrading does not seem to be a good solution.

cc @TimDettmers

mryab (Collaborator) commented Jun 9, 2023

Hi everybody, and sorry for the delay with this issue! I took a closer look at the underlying problem, and the issue seems to be here: https://github.com/TimDettmers/bitsandbytes/blob/main/bitsandbytes/nn/modules.py#L335-L338.

To the best of my understanding, the problem is that the current logic duplicates the memory usage of each Linear8bitLt.weight when we call state_dict. For efficiency, bitsandbytes rearranges the weight matrix into a GPU-dependent format; my original implementation of 8-bit serialization was designed to save weights in a standard row-major format instead. As a result, the state dict tensors no longer share their underlying storage with the model weights: when we build the state dict, we need to allocate a new tensor for each weight matrix, which leads to a significant increase in memory consumption.
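To make that concrete, here is a minimal, purely illustrative sketch with hypothetical names (not the actual bitsandbytes internals): for every 8-bit linear layer, assembling the state dict materializes a second copy of the weight on the GPU, and those copies stay alive in the destination dict alongside the packed buffers, so peak memory grows with the number of layers.

```python
# Illustrative sketch only: why peak GPU memory grows while the state dict
# is assembled for 8-bit layers. Names are stand-ins, not bitsandbytes APIs.
import torch

def build_state_dict_sketch(layers):
    destination = {}
    for name, packed_weight in layers:  # packed_weight: GPU-format buffer (like state.CxB)
        # Stand-in for the row-major copy produced by the layout conversion:
        # a brand-new allocation of the same size, while the packed buffer
        # is still held by the layer itself.
        row_major = torch.empty_like(packed_weight)
        destination[f"{name}.weight"] = row_major  # stays alive until the dict is saved or freed
    return destination
```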

I made an attempt to fix this problem in #503: in my setup, the OOM issue seems to disappear, but the solution involves a memory overhead when you load the checkpoint (which should be no more than ~100-200 MB of GPU RAM for most setups).

@KukumavMozolo @better629 @cfhammill @Qubitium @psinger if you have the time, I'd be very happy if you could try the fix from the PR above and see if it resolves the issue in your case. Also, if the GPU memory overhead at checkpoint loading time is not acceptable, we can try to think of another solution.
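In case it helps anyone test: one way to try an unmerged PR is to fetch GitHub's pull/<N>/head ref and reinstall from source, roughly as sketched below. The make invocation follows the form the CUDA SETUP log above suggests; adjust the version to your installed toolkit.

```bash
# Rough sketch for testing the fix from PR #503 before it is merged
git clone https://github.com/TimDettmers/bitsandbytes.git
cd bitsandbytes
git fetch origin pull/503/head:pr-503
git checkout pr-503
make CUDA_VERSION=117        # match your installed CUDA toolkit version
python setup.py install
```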

@TimDettmers (Collaborator)

Thank you, this has been addressed in 2d321a7.
