Int 8 / FP 8 quantization support similar to bnb #24
Labels: enhancement (New feature or request)
Comments
4 bit! Wow! This reminds me of https://github.com/THUDM/GLM-130B/blob/main/docs/quantization.md
Hi there @ggerganov, and great work. The performance on CPU is just amazing.
Would it be possible in the future to also implement Int8 / FP8 loading of models (a few layers would still need to be kept in their original fp16 or fp32 weights), similar to the bitsandbytes library: https://github.com/TimDettmers/bitsandbytes
This would allow loading bigger models on systems with a limited amount of CPU RAM, and perhaps even faster inference for models like GPT-J.
In theory, on a Mac or x64 (AVX2 or AVX512) system with 128 GB of CPU RAM you would be able to load a 120B model this way: at int8 the weights take roughly 120 GB, versus about 240 GB at fp16. Wouldn't that be amazing :)))
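To make the request a bit more concrete, here is a minimal sketch of absmax int8 quantization, the core trick behind approaches like bitsandbytes' LLM.int8(): each block of fp32 weights is stored as int8 values plus one fp32 scale, cutting weight memory roughly 4x. The function names, block size, and layout below are illustrative assumptions, not code from llama.cpp or bitsandbytes.

```c
// Sketch of absmax int8 block quantization. Each block of fp32 weights is
// scaled by its maximum absolute value so values map into [-127, 127].
// Names and block size are hypothetical, for illustration only.
#include <math.h>
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE 64  // hypothetical block granularity

// Quantize n fp32 values into int8 plus one fp32 scale per block.
static void quantize_q8(const float *src, int8_t *dst, float *scales, int n) {
    for (int b = 0; b < n / BLOCK_SIZE; ++b) {
        const float *x = src + b * BLOCK_SIZE;
        float amax = 0.0f;
        for (int i = 0; i < BLOCK_SIZE; ++i) {
            amax = fmaxf(amax, fabsf(x[i]));
        }
        const float scale = amax / 127.0f;
        scales[b] = scale;
        for (int i = 0; i < BLOCK_SIZE; ++i) {
            dst[b * BLOCK_SIZE + i] = (int8_t) roundf(scale > 0.0f ? x[i] / scale : 0.0f);
        }
    }
}

// Dequantize back to fp32 (in practice this happens on the fly inside matmul).
static void dequantize_q8(const int8_t *src, const float *scales, float *dst, int n) {
    for (int b = 0; b < n / BLOCK_SIZE; ++b) {
        for (int i = 0; i < BLOCK_SIZE; ++i) {
            dst[b * BLOCK_SIZE + i] = src[b * BLOCK_SIZE + i] * scales[b];
        }
    }
}

int main(void) {
    enum { N = 2 * BLOCK_SIZE };
    float w[N], w2[N];
    int8_t q[N];
    float scales[N / BLOCK_SIZE];

    for (int i = 0; i < N; ++i) w[i] = sinf((float) i);  // dummy weights

    quantize_q8(w, q, scales, N);
    dequantize_q8(q, scales, w2, N);

    // int8 storage is ~4x smaller than fp32, plus a small per-block scale overhead.
    printf("w[1]=%f  reconstructed=%f\n", w[1], w2[1]);
    return 0;
}
```

As the request notes, a few precision-sensitive layers (e.g. embeddings or the output projection) would typically stay in fp16/fp32, so the real memory saving is slightly below the ideal 2x (vs fp16) or 4x (vs fp32).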