[ENHANCEMENT] New MPT 30B + CUDA support. #1971
Comments
I'll just add my usual 2c on this subject: I would love it if llama.cpp supported all major model types, bringing its hundreds of wonderful features to as many models as possible. PS. FYI, KoboldCpp release 1.32 has now added OpenCL acceleration for MPT, as well as GPT-2 (StarCoder), GPT-J and GPTNeoX. I tested my MPT 30B Instruct and Chat GGML uploads with it earlier and it's working pretty well: 8 tokens/s on shorter responses. (But I'd still love llama.cpp to support this and other model types, and eventually bring CUDA and Metal acceleration to them.)
Yes, this is it! Would love to see us start with MPT as it contains quite a few features that other models also use. Supporting MPT models also means supporting Replit models since Replit chose LLM Foundry.
Not sure what your setup looks like, but it sounds like there is lots of room for improvement if we add CUDA acceleration in llama.cpp. I remember LLaMa 33B running at 29 tokens/s on your 4090 + i9-13900K rig. My bet is that MPT 30B could run faster than that if we give it full optimization.
It is possible to test it here: The results are impressive!
Too much work. Maybe once I get around to writing a binary that runs an exported ggml graph using CUDA (realistically in a few months at the earliest).
I am working on something similar, but it will be at least a few weeks until it can be merged.
If you do it that's fine with me too.
A binary execution graph would be amazing. But yes, large task. My biggest wish is that repos like llama/ggml can enable optimized inference for quantized models that are commercially available. There is not a lot of tech that can do that right now.
Would be amazing if MPT and Falcon support could be built in!
MosaicML (MPT creators) was just acquired by Databricks for $1.3B, so I expect more initiatives for LLMs. Even more of an argument to start supporting their Foundry models. @slaren since you said you will have it ready in a few weeks, I wanted to ask you the following. Do you see a path to letting most models export to a graph that runs with CUDA execution? It would be huge to have this kind of native support for most popular models.
@casperbh96 That's crazy. I hope they don't change the policies with the MPT series.
That's the goal in the long run. At first, some of the operations required by non-llama models may be missing a CUDA implementation, but eventually we should add everything that is needed to support the different models.
anything happening on this?
Looks like the authors do not have a plan to support MPT models
There needs to be just one developer capable and interested enough to start it...
We now kind of have a process for adding new models to
and there is a complete stack of all original mpt models quantized to gguf at
closed in #3417
Hi, how can you convert from MPT to GGUF? I have an issue when running convert-hf-to-gguf.py with the latest version of gguf and torch==2.1.1: "Can not map tensor 'transformer.wpe.weight'". Looking for some help.
MosaicML released its MPT 30B version today with 8k context, with Apache 2.0 license.
Why you should support MPT 30B
Let me present my argument for why MPT should be supported, including CUDA support. Arguably, LLaMa models or Falcon models are great on paper and in evaluation, but what they really lack is commercial licensing (in the case of LLaMa) and an actively maintained tech stack (in the case of Falcon).
Tech stack:
Performance:
Evaluation: The performance on generic benchmarks of LLaMa 33B, Falcon 40B, and MPT 30B is mostly the same. Although MPT 30B is the smallest model, the performance is incredibly close, and the difference is negligible except for HumanEval. There, MPT 30B (base) scores 25% and LLaMa 33B scores 20%, while Falcon scores 1.2% (did not generate code) in MosaicML's tests.
Inference speed: The inference speed of MPT models is roughly 1.5-2.0x faster than LLaMa models because of FlashAttention and Low Precision Layernorm.
Memory usage: The MPT 30B model fits on 1x A100-80GB at 16 bits. Falcon 40B requires 85-100GB VRAM at 16 bits which means it conventionally needs 2x GPUs without the use of quantization.
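The memory figures above can be sanity-checked with a rough, weights-only estimate. This is a minimal sketch, not a measurement: it assumes the nominal parameter counts implied by the model names (30e9 and 40e9) and ignores activations, the KV cache, and framework overhead, which is why real usage (such as the 85-100GB quoted for Falcon 40B) is higher than the numbers it prints. The function name `weight_memory_gb` is made up for illustration.

```python
def weight_memory_gb(params: float, bits_per_weight: float) -> float:
    """Memory needed to hold the weights alone, in decimal GB (1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


# Nominal parameter counts taken from the model names (an assumption).
NOMINAL_PARAMS = {
    "MPT-30B": 30e9,
    "Falcon-40B": 40e9,
}

if __name__ == "__main__":
    for name, params in NOMINAL_PARAMS.items():
        fp16 = weight_memory_gb(params, 16)
        q4 = weight_memory_gb(params, 4)
        print(f"{name}: ~{fp16:.0f} GB at 16 bits, ~{q4:.0f} GB at 4 bits (weights only)")
```

Under these assumptions the 16-bit MPT 30B weights leave headroom on a single 80GB card, while the 16-bit Falcon 40B weights already fill it before any runtime overhead is counted.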
Cost:
LLaMa is roughly 1.44x more expensive and Falcon 1.27x more expensive in compute power used to train the full models. This is remarkable because it means the MPT models can achieve the same performance as more expensive models at a much lower cost.
MPT-30B FLOPs ~= 6 * 30e9 [params] * 1.05e12 [tokens] = 1.89e23 FLOPs
LLaMa-30B FLOPs ~= 6 * 32.5e9 [params] * 1.4e12 [tokens] = 2.73e23 FLOPs (1.44x more)
Falcon-40B FLOPs ~= 6 * 40e9 [params] * 1e12 [tokens] = 2.40e23 FLOPs (1.27x more)
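The arithmetic above can be reproduced with a short script. It is a sketch that applies the same 6 * params * tokens approximation to the parameter and token counts quoted in this issue; the helper names `training_flops` and `relative_cost` are made up for illustration.

```python
def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute as 6 * params * tokens."""
    return 6.0 * params * tokens


# (parameters, training tokens) as quoted in the issue text.
MODELS = {
    "MPT-30B": (30e9, 1.05e12),
    "LLaMa-30B": (32.5e9, 1.4e12),
    "Falcon-40B": (40e9, 1e12),
}


def relative_cost(baseline: str = "MPT-30B") -> dict:
    """Training FLOPs of each model divided by the baseline's FLOPs."""
    base = training_flops(*MODELS[baseline])
    return {name: training_flops(*spec) / base for name, spec in MODELS.items()}


if __name__ == "__main__":
    for name, ratio in relative_cost().items():
        flops = training_flops(*MODELS[name])
        print(f"{name}: {flops:.2e} FLOPs ({ratio:.2f}x MPT-30B)")
```

Swapping in different parameter or token counts shows how sensitive the cost ratios are to the assumed training set size.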
Conclusion
If the community decides to support MPT models with CUDA support, we gain the following benefits:
Links
https://www.mosaicml.com/blog/mpt-30b
https://huggingface.co/mosaicml/mpt-30b