
Model files are big? #6

Closed
python273 opened this issue Apr 19, 2023 · 9 comments

Comments

@python273

python273 commented Apr 19, 2023

https://huggingface.co/stabilityai/stablelm-base-alpha-3b/tree/main

Looks like 3B is 14.7GB, and if I understand correctly, it's supposed to be f16. Even with f32, it should be about 11.2G. With f16, 5.6G. Am I missing something?

For reference LLaMA 7B (f16) is 12.6G.

upd: I guess it's actually f32. But still seems a little bigger than should be?
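
For reference, the arithmetic behind those estimates, as a quick sketch assuming a nominal 3.0B parameters for the 3B model and ~6.7B for LLaMA 7B (the actual counts differ, as noted below):

```python
# Rough checkpoint-size estimate: parameter count * bytes per parameter, in GiB.
# The parameter counts here are nominal guesses, not official figures.
def size_gib(n_params: int, bytes_per_param: int) -> float:
    return n_params * bytes_per_param / 1024**3

print(size_gib(3_000_000_000, 4))  # "3B" in f32 -> ~11.2 GiB
print(size_gib(3_000_000_000, 2))  # "3B" in f16 -> ~5.6 GiB
print(size_gib(6_740_000_000, 2))  # LLaMA 7B in f16 -> ~12.6 GiB
```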

@jon-tow
Collaborator

jon-tow commented Apr 19, 2023

The actual model sizes are:
3B: 3,638,525,952
7B: 7,869,358,080

The fp32 weights are provided to allow users to reduce precision to suit their needs. We will consider providing the weights in f16 since this is a common complaint :)

Thank you for pointing it out!
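
For anyone who wants to do the precision reduction themselves in the meantime, here is a minimal sketch, assuming PyTorch and Hugging Face Transformers are installed. It still downloads the fp32 files once, but the saved local copy is roughly half the size:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "stabilityai/stablelm-base-alpha-3b"

# Download the fp32 checkpoint, cast the weights to fp16 while loading,
# and save a local half-precision copy.
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(name)

model.save_pretrained("stablelm-base-alpha-3b-fp16")
tokenizer.save_pretrained("stablelm-base-alpha-3b-fp16")
```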

@python273
Author

Ok, the size seems about right then.

# file sizes taken from disk; Hugging Face displays sizes as bytes / 1000**3
>>> (10_161_140_290+4_656_666_941) / 1024 / 1024 / 1024
13.800158380530775
>>> (3_638_525_952 * 4) / 1024 / 1024 / 1024
13.5545654296875

f16 weights would be nice, to download less stuff

@amrrs amrrs mentioned this issue Apr 19, 2023
@andysalerno

@jon-tow on this topic, do you expect these models to quantize well down to 4bits (or lower) via GPTQ and/or other quantizing strategies?

I don't see why not, since GPTQ seems to be a general technique that works well for different transformer models. But I'm asking because part of the reason behind Stable Diffusion's success is how well it runs on consumer hardware. So I'm wondering if these models will follow a similar goal of running very well on consumer hardware, and therefore consider quantization from the very beginning?

@jon-tow
Collaborator

jon-tow commented Apr 20, 2023

Hi, @andysalerno! I do expect these models to quantize quite well. They're pretty wide, which should help reduce memory-bandwidth boundedness compared to models of similar size when quantized.
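
For intuition, here's a toy round-to-nearest sketch of 4-bit weight quantization with per-group scales. Real GPTQ goes further and corrects for quantization error layer by layer, but the memory math is the same; all names below are made up for illustration:

```python
import torch

def quantize_4bit_rtn(w: torch.Tensor, group_size: int = 64):
    """Naive symmetric round-to-nearest 4-bit quantization with per-group scales."""
    w = w.reshape(-1, group_size)
    # One scale per group, mapping the group's max magnitude to the int4 range.
    scale = (w.abs().max(dim=1, keepdim=True).values / 7).clamp_min(1e-8)
    q = torch.clamp(torch.round(w / scale), -8, 7)
    return q.to(torch.int8), scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4096 * 4096)        # pretend this is one weight matrix, flattened
q, scale = quantize_4bit_rtn(w)
err = (dequantize(q, scale).flatten() - w).abs().mean()
print(f"mean abs error: {err:.4f}")  # small relative to the weights' scale
```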

@MarkSchmidty
Contributor

MarkSchmidty commented Apr 20, 2023

There's a 4.9GB ggml 4bit GPTQ quantization for StableLM-7B up on HuggingFace which works in llama.cpp for fast CPU inference.

(For comparison, LLaMA-7B in the same format is 4.1GB. But, StableLM-7B is actually closer to 8B parameters than 7B.)

@vvsotnikov

For the sake of convenience (half the download size/RAM/VRAM), I've uploaded 16-bit versions of the tuned models to the HF Hub:
https://huggingface.co/vvsotnikov/stablelm-tuned-alpha-7b-16bit
https://huggingface.co/vvsotnikov/stablelm-tuned-alpha-3b-16bit
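
In case it's useful, a minimal sketch of loading one of these 16-bit checkpoints and generating, assuming PyTorch/Transformers are installed and a GPU with enough VRAM; the prompt and sampling settings are just placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "vvsotnikov/stablelm-tuned-alpha-3b-16bit"

tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).to("cuda")

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Note that the tuned models also expect the chat-style prompt format described in this repo's README; the plain prompt above is just to sanity-check loading.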

@iboyles

iboyles commented Apr 20, 2023

Yeah, we need a Colab for this stuff that doesn't crash from running out of RAM, lol

@jon-tow jon-tow pinned this issue Apr 24, 2023
@twmmason twmmason unpinned this issue Apr 25, 2023
@twmmason twmmason pinned this issue Apr 25, 2023
@jrincayc

jrincayc commented Apr 26, 2023

There's a 4.9GB ggml 4bit GPTQ quantization for StableLM-7B up on HuggingFace which works in llama.cpp for fast CPU inference.

(For comparison, LLaMA-7B in the same format is 4.1GB. But, StableLM-7B is actually closer to 8B parameters than 7B.)

Hm, how do you actually run this?
I tried https://github.com/ggerganov/llama.cpp (4afcc378698e057fcde64e23eb664e5af8dd6956 and also 5addcb120cf2682c7ede0b1c520592700d74c87c)

and got:

./main -m ../ggml-q4_0-stablelm-tuned-alpha-7b/ggml-model-stablelm-tuned-alpha-7b-q4_0.bin -p "this is a test"
main: seed = 1682468827
llama.cpp: loading model from ../ggml-q4_0-stablelm-tuned-alpha-7b/ggml-model-stablelm-tuned-alpha-7b-q4_0.bin
error loading model: missing tok_embeddings.weight
llama_init_from_file: failed to load model
main: error: failed to load model '../ggml-q4_0-stablelm-tuned-alpha-7b/ggml-model-stablelm-tuned-alpha-7b-q4_0.bin'

@pratikchhapolika

Hi @jon-tow @python273, why do we have multiple .bin files inside stabilityai/stablelm-base-alpha-7b? When we load the model, which .bin file is loaded?
