OS
Linux
GPU Library
AMD ROCm
Python version
3.11
Pytorch version
2.4.0
Model
Meta-Llama-3.1-70B-Instruct-6.0bpw-h6-exl2
Describe the bug
I am running Llama 3.1 70B at 6.0 bpw with the ExLlamav2_HF loader, 64K context, no_flash_attn, and autosplit.
I still have at least 20 GB of VRAM left over after fully loading the model with the above parameters.
I can send some messages to the AI in the chat tab at first, but as soon as the context passes 6-7K tokens, I get an OOM error despite still having more than enough VRAM.
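For reference, the load settings above map roughly onto the following standalone ExLlamaV2 call sequence. This is a minimal sketch based on the upstream examples, not the webui's own loader code; the model directory is a placeholder.

```python
# Minimal sketch of the equivalent standalone load (assumed from the upstream
# ExLlamaV2 examples; the model path below is a placeholder).
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "/models/Meta-Llama-3.1-70B-Instruct-6.0bpw-h6-exl2"  # placeholder path
config.prepare()
config.max_seq_len = 65536     # 64K context, matching the truncation length in the logs
config.no_flash_attn = True    # flash_attn is not installed on this ROCm setup

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # FP16 cache sized for the full 64K context
model.load_autosplit(cache)               # spread layers across the three MI100s as they fill
tokenizer = ExLlamaV2Tokenizer(config)
```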
Reproduction steps
Send a large message of around 7K-8K tokens in length, well within the limits the system can handle, to test. The result is an out-of-memory error.
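The same failure can also be triggered outside the chat tab. Below is a hedged repro sketch that pushes a prompt of roughly 8K tokens through the webui's OpenAI-compatible endpoint; it assumes the server was started with --api on the default port 5000, and the repeated filler sentence is only there to inflate the token count.

```python
# Hedged repro sketch: send an ~8K-token prompt to the webui's OpenAI-compatible
# API (assumes the --api flag and the default port 5000; adjust the URL if needed).
import requests

filler = "The quick brown fox jumps over the lazy dog. " * 800  # roughly 8K tokens of padding
payload = {
    "messages": [{"role": "user", "content": filler + "\n\nSummarize the text above."}],
    "max_tokens": 64,
}

r = requests.post("http://127.0.0.1:5000/v1/chat/completions", json=payload, timeout=600)
print(r.status_code)
print(r.json())  # on the affected setup, generation aborts once the context passes ~6-7K
```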
Expected behavior
On ExLlamaV2 0.1.8, EXL2 models worked as expected. Since updating to 0.2.3 through Oobabooga, I get out-of-memory errors very often, even under 4K context, despite having the context set to 64K and having successfully used the full 64K on ExLlamaV2 0.1.8.
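Since the regression appears to coincide with the ExLlamaV2 bump from 0.1.8 to 0.2.3, it may be worth confirming which versions the webui environment actually picked up. A minimal check, run with the installer's Python, assuming the packages expose __version__ the way the upstream repos do:

```python
# Quick version sanity check inside the webui's conda environment
# (attribute names assumed from the upstream packages).
import torch
import exllamav2

print("torch:", torch.__version__)          # 2.4.0 per the issue form
print("hip:", torch.version.hip)            # ROCm/HIP build string; None on CUDA builds
print("exllamav2:", exllamav2.__version__)  # 0.2.3 on the broken setup, 0.1.8 on the working one
```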
Logs
16:36:40-789264 INFO Loading "Meta-Llama-3.1-70B-Instruct-6.0bpw-h6-exl2"
16:36:42-092387 WARNING Failed to load flash-attention due to the following error:
Traceback (most recent call last):
File "/home/rsa/text-generation-webui/modules/exllamav2_hf.py", line 23, in
import flash_attn
ModuleNotFoundError: No module named 'flash_attn'
/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:600: UserWarning: do_sample is set to False. However, min_p is set to 0.0 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset min_p.
warnings.warn(
/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/exllamav2/model.py:575: UserWarning: expandable_segments not supported on this platform (Triggered internally at ../c10/hip/HIPAllocatorConfig.h:29.)
reserved_vram_tensors.append(torch.empty((b,), dtype = torch.int8, device = _torch_device(current_device)))
16:38:03-832021 INFO Loaded "Meta-Llama-3.1-70B-Instruct-6.0bpw-h6-exl2" in 83.04 seconds.
16:38:03-833268 INFO LOADER: "ExLlamav2_HF"
16:38:03-833786 INFO TRUNCATION LENGTH: 65536
16:38:03-834231 INFO INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"
Output generated in 8.11 seconds (5.92 tokens/s, 48 tokens, context 83, seed 614014744)
Traceback (most recent call last):
File "/home/rsa/text-generation-webui/modules/callbacks.py", line 61, in gentask
ret = self.mfunc(callback=_callback, *args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rsa/text-generation-webui/modules/text_generation.py", line 398, in generate_with_callback
shared.model.generate(**kwargs)
File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/transformers/generation/utils.py", line 2215, in generate
result = self._sample(
^^^^^^^^^^^^^
File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/transformers/generation/utils.py", line 3206, in _sample
outputs = self(**model_inputs, return_dict=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rsa/text-generation-webui/modules/exllamav2_hf.py", line 129, in call
self.ex_model.forward(seq_tensor[longest_prefix:-1].view(1, -1), ex_cache, preprocess_only=True, loras=self.loras)
File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/exllamav2/model.py", line 878, in forward
r = self.forward_chunk(
^^^^^^^^^^^^^^^^^^^
File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/exllamav2/model.py", line 984, in forward_chunk
x = module.forward(x, cache = cache, attn_params = attn_params, past_len = past_len, loras = loras, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/exllamav2/attn.py", line 1102, in forward
attn_output = attn_func(batch_size, q_len, q_states, k_states, v_states, attn_params, cfg)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/exllamav2/attn.py", line 856, in _attn_torch
attn_output = F.scaled_dot_product_attention(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/torch/nn/attention/bias.py", line 281, in torch_function
return cls._dispatch(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/torch/nn/attention/bias.py", line 258, in _dispatch
return scaled_dot_product_attention(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: HIP out of memory. Tried to allocate 512.00 MiB. GPU 0 has a total capacity of 31.98 GiB of which 82.00 MiB is free. Of the allocated memory 30.73 GiB is allocated by PyTorch, and 792.56 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Output generated in 0.68 seconds (0.00 tokens/s, 0 tokens, context 8527, seed 1232530018)
Additional context
System Info:
OS: Ubuntu 22.04
GPU: 3x Radeon Instinct MI100 (32GB VRAM each)
CPU: AMD Epyc 9334
ROCm 6.1.2
Text Generation Web UI v1.16
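One way to see how autosplit actually distributed the weights across the three MI100s, and which card runs out first, is to poll free memory per GPU before and after a failing request. A small sketch using PyTorch's HIP-backed torch.cuda API (nothing here is specific to the webui):

```python
# Hedged sketch: report free/total VRAM per GPU via PyTorch
# (torch.cuda maps to ROCm/HIP on this build).
import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)  # returned in bytes
    name = torch.cuda.get_device_name(i)
    print(f"GPU {i} ({name}): {free / 2**30:.2f} GiB free of {total / 2**30:.2f} GiB")
```

In the traceback above it is GPU 0 that ends up with only 82 MiB free, so the spare VRAM presumably sits on the other two cards. The allocator's suggestion to set PYTORCH_HIP_ALLOC_CONF=expandable_segments:True probably will not help here, since the earlier UserWarning reports that expandable_segments is unsupported on this HIP build.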
Acknowledgements
I have looked for similar issues before submitting this one.
I understand that the developers have lives and my issue will be answered when possible.
I understand the developers of this program are human, and I will ask my questions politely.