NVIDIA GPU and numpy #1979

Open
MarioRossiGithub opened this issue Jun 21, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@MarioRossiGithub

MarioRossiGithub commented Jun 21, 2024

Hi,
I'm trying to set up Private GPT on Windows with WSL.
I followed the instructions here and here, but I can't get PGPT to run correctly.
If I follow these instructions:
poetry install --extras "ui llms-llama-cpp embeddings-huggingface vector-stores-qdrant"
I can run PGPT with numpy 1.26.4, but only with BLAS=0 (CPU).

If I run this instead:
CMAKE_ARGS='-DLLAMA_CUBLAS=on' poetry run pip install --force-reinstall --no-cache-dir llama-cpp-python
I get BLAS=1 (GPU), but the command automatically upgrades numpy to a 2.x version, and PGPT then stops working with an error like "A module that was compiled using NumPy 1.x cannot be run in NumPy 2.0.0 as it may crash".

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.

packagex requires numpy x.y.z but you have numpy 2.0.0 which is incompatible.

Is there a way to downgrade numpy AND still use the GPU (BLAS=1)?

@MarioRossiGithub
Author

MarioRossiGithub commented Jun 21, 2024

After several hours of troubleshooting I finally managed to solve the issue.

Install

First of all, reinstall llama-cpp-python while pinning numpy to a specific version below 2:

CMAKE_ARGS='-DLLAMA_CUBLAS=on' poetry run pip install --force-reinstall --no-cache-dir llama-cpp-python numpy==1.26.0
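
To confirm that the pin held after the reinstall, it may help to check which numpy version actually ended up in the environment. This is only a minimal sketch: `numpy_is_1x` is a hypothetical helper name (not part of Private GPT or llama-cpp-python), and the usage lines assume the Poetry-managed virtualenv used in this thread.

```shell
# Hypothetical helper (not from the thread): succeeds only when the given
# version string has a major component below 2.
numpy_is_1x() {
    major="${1%%.*}"                 # keep the text before the first dot
    [ "$major" -lt 2 ] 2>/dev/null   # non-numeric input counts as a failure
}

# Usage inside the project directory (assumes Poetry manages the virtualenv):
#   ver="$(poetry run python -c 'import numpy; print(numpy.__version__)')"
#   if numpy_is_1x "$ver"; then echo "numpy $ver: OK"; else echo "numpy $ver: too new"; fi
```

If the check reports a 2.x version, some other package probably pulled numpy forward again, and rerunning the install command above with the pin should restore it.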

Make sure to:

  • Update your Windows drivers to the latest version (I'm not sure whether this helped solve the issue, but I did it anyway).
  • Reboot your system.

Run Private GPT:

PGPT_PROFILES=local make run

If this solves your problem, good, you're done.


If you instead hit another error mentioning "CUDA error: out of memory" and "TOKENIZERS_PARALLELISM=(true | false)", set this variable to true:

TOKENIZERS_PARALLELISM=true

Then rerun Private GPT as usual:

PGPT_PROFILES=local make run
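
One detail that is easy to miss: typing `TOKENIZERS_PARALLELISM=true` on a line by itself only sets a variable in the current shell, and child processes such as `make run` do not inherit it unless it is exported. A minimal sketch of the two usual ways to pass it along, assuming a POSIX-style shell inside WSL:

```shell
# Export for the current shell session so every later command inherits it.
export TOKENIZERS_PARALLELISM=true

# Alternatively, scope both variables to a single invocation:
#   TOKENIZERS_PARALLELISM=true PGPT_PROFILES=local make run
```

The exported form lasts until the terminal is closed; the inline form applies to that one command only.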

This solved the issue for me.
Now Private GPT uses my NVIDIA GPU, is super fast and replies in 2-3 seconds.

I also think the first command should be updated in the official documentation.



On a side note:
I get this warning at the end of the run. I don't quite understand it and haven't been able to get rid of it. If anyone has a suggestion, thanks in advance.

py.warnings - /home/<user>/.cache/pypoetry/virtualenvs/private-gpt-ta_62_V8-py3.11/lib/python3.11/site-packages/llama_cpp/llama.py:1054: RuntimeWarning: Detected duplicate leading "<s>" in prompt, this will likely reduce response quality, consider removing it...
  warnings.warn(

@theodufort

Hey thank you for that numpy part!
Like you, I'm also having a GPU memory problem. About 7 of my 8 GB fill up as soon as I start the UI. Sometimes, when the file is too big, I see 8 GB in use in NVTOP and then get this memory error:
CUDA out of memory. Tried to allocate 22.00 MiB. GPU 0 has a total capacty of 7.92 GiB of which 4.62 MiB is free. Including non-PyTorch memory, this process has 7.91 GiB memory in use. Of the allocated memory 2.99 GiB is allocated by PyTorch, and 41.72 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
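
The error message itself points at `PYTORCH_CUDA_ALLOC_CONF` and `max_split_size_mb`. A minimal sketch of setting it follows; the value 128 is purely illustrative and not a recommendation from this thread. Note that the message reports only 41.72 MiB as reserved but unallocated, and about 4.92 GiB (7.91 GiB minus 2.99 GiB) in use outside PyTorch, so fragmentation may not be the main cause here and this setting may not help much.

```shell
# Allocator hint named in the error message; 128 is an illustrative value,
# not a recommendation from this thread.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

# Then start Private GPT from the same shell:
#   PGPT_PROFILES=local make run
# To watch GPU memory while it runs (assumes nvidia-smi is available in WSL):
#   nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```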

@jaluma added the bug label Jul 8, 2024
@jaluma
Collaborator

jaluma commented Aug 5, 2024

I've just opened a PR that adds a CUDA-compatible Dockerfile with fixes for these problems. Can you try it?
#2044
@MarioRossiGithub @theodufort

@MarioRossiGithub
Author

Sorry, I cannot do it anymore.
