Getting OOM #46

Open
alior101 opened this issue Apr 12, 2023 · 2 comments

Comments

@alior101

Training on T4:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 44.00 MiB (GPU 0; 14.56 GiB total capacity; 13.25 GiB already allocated; 10.44 MiB free; 13.83 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I suspect a change of versions in peft or transformers... Does that make sense?
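For what it's worth, the allocator hint mentioned in the error message (max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF) can be set before torch initializes CUDA. A minimal sketch, assuming the training script is launched from Python; the 128 MiB value is an arbitrary example, not something suggested in this thread:

import os

# Must be set before the first CUDA allocation, i.e. before torch touches the GPU.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after setting the env var so the caching allocator picks it up

This only mitigates fragmentation; it does not reduce how much memory the model itself needs.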

@MillionthOdin16

Same here. This didn't happen before.

{'loss': 1.1006, 'learning_rate': 2.748091603053435e-05, 'epoch': 0.92}
{'train_runtime': 350.2814, 'train_samples_per_second': 0.374, 'train_steps_per_second': 0.374, 'train_loss': 1.0609159615203625, 'epoch': 1.0}
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/gradio/routes.py", line 395, in run_predict
    output = await app.get_blocks().process_api(
  File "/usr/local/lib/python3.9/dist-packages/gradio/blocks.py", line 1193, in process_api
    result = await self.call_function(
  File "/usr/local/lib/python3.9/dist-packages/gradio/blocks.py", line 916, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/usr/local/lib/python3.9/dist-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.9/dist-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.9/dist-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.9/dist-packages/gradio/helpers.py", line 588, in tracked_fn
    response = fn(*args)
  File "/content/simple-llama-finetuner/main.py", line 253, in tokenize_and_train
    model.save_pretrained(output_dir)
  File "/usr/local/lib/python3.9/dist-packages/peft/peft_model.py", line 116, in save_pretrained
    output_state_dict = get_peft_model_state_dict(
  File "/usr/local/lib/python3.9/dist-packages/peft/utils/save_and_load.py", line 32, in get_peft_model_state_dict
    state_dict = model.state_dict()
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1818, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1818, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1818, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  [Previous line repeated 4 more times]
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1815, in state_dict
    self._save_to_state_dict(destination, prefix, keep_vars)
  File "/usr/local/lib/python3.9/dist-packages/bitsandbytes/nn/modules.py", line 268, in _save_to_state_dict
    self.weight.data = undo_layout(self.state.CxB, self.state.tile_indices)
  File "/usr/local/lib/python3.9/dist-packages/bitsandbytes/autograd/_functions.py", line 96, in undo_layout
    outputs = torch.empty_like(tensor)  # note: not using .index_copy because it was slower on cuda
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 68.00 MiB (GPU 0; 39.56 GiB total capacity; 35.96 GiB already allocated; 4.56 MiB free; 37.96 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Keyboard interruption in main thread... closing server.
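Reading the traceback, the OOM is raised inside bitsandbytes' _save_to_state_dict: undo_layout allocates a temporary full copy of the quantized weights on the GPU, so the failure happens at save time, after training has already finished. One thing sometimes tried is releasing cached memory right before saving; a sketch under that assumption (not a workaround posted in this thread), reusing the model/output_dir names from main.py line 253:

import gc
import torch

gc.collect()              # drop leftover Python references from the finished training loop
torch.cuda.empty_cache()  # return cached-but-unused blocks to the driver

model.save_pretrained(output_dir)

Note that in the log above most of the 37.96 GiB is actually allocated rather than merely cached, so this may not be enough on its own; the version pin in the next comment is the more reliable fix.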

@ark1st

ark1st commented Apr 24, 2023

In my case it was the bitsandbytes error.

As described in the issue below, downgrading to bitsandbytes==0.37.2 makes the problem go away.

bitsandbytes-foundation/bitsandbytes#324
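To confirm which bitsandbytes build is actually active after downgrading, a quick check (sketch, not from this thread):

# pip install bitsandbytes==0.37.2
from importlib.metadata import version
print(version("bitsandbytes"))  # expect "0.37.2" after the downgrade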
