Run Llama 3 70b locally combining ram and vram like with other apps? #5965

Open · 311-code opened this issue Apr 30, 2024 · 1 comment
Labels: enhancement (New feature or request)

Comments

311-code commented Apr 30, 2024

Sorry, I am pretty novice here in the LLM space. I have noticed that some users are able to run the Llama 3 70B model locally as a quantized GGUF by offloading part of the model to CPU/RAM and the rest to VRAM in other programs, somehow with a larger context. I don't really see any information on how to do this in text-generation-webui (which I much prefer).

I have 24 GB of VRAM and 64 GB of RAM. Can anyone explain which model to download, e.g. from TheBloke (or a better version, uncensored, etc., if one exists)? Someone referred me to these, which allow a larger context length for Llama 3: https://huggingface.co/models?sort=modified&search=llama+gradient+exl2. What settings should I use, and is this currently possible with text-generation-webui?

I am also unsure whether this can be done with EXL2 or if I should be using GGUF.

Edit: Apparently flash attention was added to llama.cpp today (ggerganov/llama.cpp#5021) for larger contexts over 64k; not sure if this is relevant.

311-code added the enhancement (New feature or request) label on Apr 30, 2024
@MTStrothers

So a few things... first off, I was asking about this earlier in the discussions, and it takes a little while (a few weeks, I guess) for updates to llama.cpp to trickle into this program. That's because text-generation-webui doesn't use https://github.com/ggerganov/llama.cpp directly; it uses abetlen/llama-cpp-python, which, as I understand it, provides Python bindings around llama.cpp. So once llama.cpp updates, llama-cpp-python has to update, and THEN text-generation-webui has to bump its pinned version of llama-cpp-python. You can see in requirements.txt that this program was just bumped to llama-cpp-python 0.2.64, while the most recent release of llama-cpp-python is 0.2.68. I guess you could edit the requirements.txt of your local install, but there's a good chance you'd break something.
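If you want to see what your local install is actually running before touching requirements.txt, something like this should work (a minimal sketch; run it inside the webui's own Python environment, and note that the CUDA build may be published under a slightly different distribution name):

```python
# Check which llama-cpp-python release is installed in the current environment.
from importlib.metadata import version, PackageNotFoundError

try:
    print("llama-cpp-python:", version("llama-cpp-python"))
except PackageNotFoundError:
    print("llama-cpp-python is not installed in this environment")
```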

As to your main question I'd recommend this version: https://huggingface.co/bartowski/Meta-Llama-3-70B-Instruct-GGUF

It wasn't quantized with the newest version of llama.cpp, but it's still pretty recent. The person who makes them says he'll have an even newer quant of Llama 3 70B up today-ish, so keep an eye out for that. TheBloke is apparently retired, by the way.
There are a ton of different versions with decensoring, extended context, etc.; it really depends on your use case. But I'm kind of skeptical of those finetunes at this stage: Llama 3 only just came out, so I don't think many of them are really dialed in yet.

To my knowledge, the only way to properly use your CPU and GPU together is GGUF, so that's what you want. With your memory you should be able to run the Q6_K quant without too much trouble; I think that's a pretty good sweet spot for reducing memory use without losing much accuracy. The Q6_K upload comes as two split files, so you'll have to join them first, but that's pretty easy to do from the command line; just look it up.
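As a rough sketch of the joining step (the file names here are hypothetical, and this only applies if the model card says the parts are a plain byte split; parts produced by llama.cpp's gguf-split tool should be merged with that tool instead):

```python
# Concatenate byte-split GGUF parts into a single file the webui can load.
from pathlib import Path

parts = sorted(Path("models").glob("Meta-Llama-3-70B-Instruct-Q6_K.gguf.part*"))
out = Path("models/Meta-Llama-3-70B-Instruct-Q6_K.gguf")

with out.open("wb") as dst:
    for part in parts:
        with part.open("rb") as src:
            # Copy in chunks so we never hold a multi-GB file in memory.
            while chunk := src.read(64 * 1024 * 1024):
                dst.write(chunk)
```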

To get it to run in text-generation-webui, just drop it into your models folder and load it. It should automatically default to 8k context. The only thing you'll have to play with is n-gpu-layers on the Model tab. Start with something like 20 and keep an eye on your resource monitor and the webui's console output; every layer you add to n-gpu-layers increases the VRAM usage on your GPU. You just have to find the sweet spot.
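For context, those settings map onto llama-cpp-python, the library the webui calls into. A minimal sketch of the equivalent direct call (hypothetical model path, not the webui's actual loader code):

```python
# n_gpu_layers controls how many transformer layers are offloaded to VRAM;
# whatever doesn't fit stays in system RAM and runs on the CPU.
from llama_cpp import Llama

llm = Llama(
    model_path="models/Meta-Llama-3-70B-Instruct-Q6_K.gguf",  # hypothetical path
    n_ctx=8192,       # Llama 3's native context length
    n_gpu_layers=20,  # start low, raise until VRAM is nearly full
)

out = llm("Q: What is the capital of France? A:", max_tokens=16)
print(out["choices"][0]["text"])
```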
