[BUG] AMD - Out of memory errors despite having plenty of VRAM #662

Open
3 tasks done
RSAStudioGames opened this issue Oct 27, 2024 · 0 comments
Labels
bug Something isn't working

RSAStudioGames commented Oct 27, 2024

OS

Linux

GPU Library

AMD ROCm

Python version

3.11

Pytorch version

2.4.0

Model

Meta-Llama-3.1-70B-Instruct-6.0bpw-h6-exl2

Describe the bug

I am running Llama 3.1 70B at 6.0 bpw with the ExLlamav2_HF loader, using a 64K context with no_flash_attn and autosplit enabled.
After the model is fully loaded with these settings, at least 20 GB of VRAM is still free.
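
As a rough sanity check on the "20 GB left over" figure, the memory budget can be estimated by hand. The sketch below is only an approximation under stated assumptions: a nominal 70e9 parameters, the published Llama 3.1 70B layout (80 layers, 8 KV heads, head dimension 128), an unquantized FP16 KV cache, and 32 GiB per card. The helper names are made up for illustration and are not part of ExLlamaV2.

```python
GIB = 1024 ** 3

def weights_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    return n_params * bits_per_weight / 8 / GIB

def kv_cache_gib(n_tokens: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV cache size in GiB (factor 2 = one K and one V tensor)."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * n_tokens / GIB

# Assumed values, see the paragraph above; none of them are measured.
weights = weights_gib(70e9, 6.0)
cache = kv_cache_gib(65536, n_layers=80, n_kv_heads=8, head_dim=128)
total_vram = 3 * 32  # three cards, treated as 32 GiB each

print(f"weights  ~ {weights:.1f} GiB")
print(f"KV cache ~ {cache:.1f} GiB")
print(f"headroom ~ {total_vram - weights - cache:.1f} GiB across all GPUs")
```

Under these assumptions the headroom comes out in the same range as the reported figure, but it is a total across all three cards and says nothing about how much is free on any single GPU.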

The first few messages in the chat tab work, but as soon as the context passes roughly 6K to 7K tokens, generation fails with an OOM error even though there is still more than enough VRAM free.

Reproduction steps

  1. Load the Llama 3.1 70B model with ExLlamav2_HF, 64K context, and both no_flash_attn and autosplit enabled.
  2. Send one large message of around 7K - 8K tokens, which is well within what the system should be able to handle.
  3. Generation fails with an out-of-memory error.

Expected behavior

On version 0.1.8, EXL2 models worked as expected. Since updating to 0.2.3 through Oobabooga, I get out-of-memory errors very often, sometimes even under 4K context, despite having the context set to 64K and having used the full 64K successfully on ExLlamav2 0.1.8.
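
Because the regression is tied to specific package versions, it helps to confirm what is actually installed in the web UI's environment. A minimal sketch using only the standard library follows; `installed_version` is a hypothetical helper name, and the package list is an assumption about which distributions are relevant here.

```python
from importlib import metadata

def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

# Run this inside the same Python environment the web UI uses.
for name in ("exllamav2", "torch", "transformers"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```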

Logs

16:36:40-789264 INFO Loading "Meta-Llama-3.1-70B-Instruct-6.0bpw-h6-exl2"
16:36:42-092387 WARNING Failed to load flash-attention due to the following error:

Traceback (most recent call last):
File "/home/rsa/text-generation-webui/modules/exllamav2_hf.py", line 23, in <module>
import flash_attn
ModuleNotFoundError: No module named 'flash_attn'
/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:600: UserWarning: do_sample is set to False. However, min_p is set to 0.0 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset min_p.
warnings.warn(
/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/exllamav2/model.py:575: UserWarning: expandable_segments not supported on this platform (Triggered internally at ../c10/hip/HIPAllocatorConfig.h:29.)
reserved_vram_tensors.append(torch.empty((b,), dtype = torch.int8, device = _torch_device(current_device)))
16:38:03-832021 INFO Loaded "Meta-Llama-3.1-70B-Instruct-6.0bpw-h6-exl2" in 83.04 seconds.
16:38:03-833268 INFO LOADER: "ExLlamav2_HF"
16:38:03-833786 INFO TRUNCATION LENGTH: 65536
16:38:03-834231 INFO INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"
Output generated in 8.11 seconds (5.92 tokens/s, 48 tokens, context 83, seed 614014744)
Traceback (most recent call last):
File "/home/rsa/text-generation-webui/modules/callbacks.py", line 61, in gentask
ret = self.mfunc(callback=_callback, *args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rsa/text-generation-webui/modules/text_generation.py", line 398, in generate_with_callback
shared.model.generate(**kwargs)
File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/transformers/generation/utils.py", line 2215, in generate
result = self._sample(
^^^^^^^^^^^^^
File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/transformers/generation/utils.py", line 3206, in _sample
outputs = self(**model_inputs, return_dict=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rsa/text-generation-webui/modules/exllamav2_hf.py", line 129, in __call__
self.ex_model.forward(seq_tensor[longest_prefix:-1].view(1, -1), ex_cache, preprocess_only=True, loras=self.loras)
File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/exllamav2/model.py", line 878, in forward
r = self.forward_chunk(
^^^^^^^^^^^^^^^^^^^
File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/exllamav2/model.py", line 984, in forward_chunk
x = module.forward(x, cache = cache, attn_params = attn_params, past_len = past_len, loras = loras, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/exllamav2/attn.py", line 1102, in forward
attn_output = attn_func(batch_size, q_len, q_states, k_states, v_states, attn_params, cfg)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/exllamav2/attn.py", line 856, in _attn_torch
attn_output = F.scaled_dot_product_attention(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/torch/nn/attention/bias.py", line 281, in torch_function
return cls._dispatch(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/torch/nn/attention/bias.py", line 258, in _dispatch
return scaled_dot_product_attention(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: HIP out of memory. Tried to allocate 512.00 MiB. GPU 0 has a total capacity of 31.98 GiB of which 82.00 MiB is free. Of the allocated memory 30.73 GiB is allocated by PyTorch, and 792.56 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Output generated in 0.68 seconds (0.00 tokens/s, 0 tokens, context 8527, seed 1232530018)
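
The 512.00 MiB allocation in the error can be reproduced with simple arithmetic. With flash-attention unavailable, the traceback goes through `_attn_torch` and `F.scaled_dot_product_attention`. If that fallback materializes the full attention score matrix (an assumption, not something the log confirms), its size is heads x query length x key length x bytes per element. The sketch below assumes 64 attention heads, FP16 scores, and a 2048 x 2048 chunk; the function name is hypothetical.

```python
MIB = 1024 ** 2

def attn_scores_mib(n_heads: int, q_len: int, kv_len: int,
                    bytes_per_elem: int = 2, batch: int = 1) -> float:
    """Size in MiB of a fully materialized attention score matrix."""
    return batch * n_heads * q_len * kv_len * bytes_per_elem / MIB

# Assumed: 64 heads, FP16, one 2048-token chunk attending to 2048 keys.
print(attn_scores_mib(64, 2048, 2048))
```

Under those assumptions the result matches the size of the failed allocation. The request itself is modest; it fails because the log shows GPU 0 with only 82.00 MiB free at that moment.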

Additional context

System Info:
OS: Ubuntu 22.04
GPU: 3x Radeon Instinct MI100 (32GB VRAM each)
CPU: AMD EPYC 9334
ROCm 6.1.2
Text Gen Web UI V1.16
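
Since the error message refers to GPU 0 specifically, per-device free memory is more informative than the total across the three cards. The sketch below is one way to print it; it assumes a PyTorch build where `torch.cuda.mem_get_info` is available (ROCm builds expose HIP devices through the `torch.cuda` namespace). `summarize` and `query_devices` are hypothetical helper names.

```python
GIB = 1024 ** 3

def summarize(mem_info):
    """Format a list of (free_bytes, total_bytes) pairs, one per device."""
    return [
        f"GPU {idx}: {free / GIB:.2f} GiB free of {total / GIB:.2f} GiB"
        for idx, (free, total) in enumerate(mem_info)
    ]

def query_devices():
    """Ask PyTorch for (free, total) bytes on every visible device."""
    import torch  # imported lazily so summarize() works without PyTorch
    return [torch.cuda.mem_get_info(i) for i in range(torch.cuda.device_count())]

if __name__ == "__main__":
    try:
        info = query_devices()
    except ImportError:
        info = []
    print("\n".join(summarize(info)) or "no devices found")
```

Running this right after the model loads, and again just before the failing request, would show whether the free VRAM sits on GPU 1 and GPU 2 while GPU 0 is already close to full.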

Acknowledgements

  • I have looked for similar issues before submitting this one.
  • I understand that the developers have lives and my issue will be answered when possible.
  • I understand the developers of this program are human, and I will ask my questions politely.
@RSAStudioGames RSAStudioGames added the bug Something isn't working label Oct 27, 2024
@RSAStudioGames RSAStudioGames changed the title [BUG] AMD - Out of Memory Errors [BUG] AMD - Out of memory errors despite having plenty of VRAM Oct 27, 2024