
[ENHANCEMENT] New MPT 30B + CUDA support. #1971

Closed · casper-hansen opened this issue Jun 22, 2023 · 18 comments

Comments

casper-hansen commented Jun 22, 2023

MosaicML released its MPT 30B model today with an 8k context window, under an Apache 2.0 license.


Why you should support MPT 30B

Let me present my argument for why MPT should be supported, including CUDA support. LLaMa and Falcon models are great on paper and in evaluations, but LLaMa lacks commercial licensing and Falcon lacks an actively maintained tech stack.

Tech stack:

  1. MosaicML has 8 employees actively contributing to their own open-source repo, LLM-Foundry, and a few more researching improvements. They recently upgraded to PyTorch 2.0 and added H100 support just before this 30B version was released.
  2. A streaming library: train and fine-tune models while streaming your dataset from S3/GCP/Azure storage. This reduces cost at training time and lets you easily resume after hardware failures (see the sketch after this list).
  3. They have developed tools like Composer that let you train and fine-tune models much faster (e.g. GPT-2 for roughly $145 with Composer vs. $255 with vanilla PyTorch).
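
For context on point 2, here is a rough sketch of what training-time streaming looks like with MosaicML's streaming library. The bucket path, cache directory, and batch size are placeholders I made up for illustration, not anything from the announcement:

```python
# Hedged sketch of MosaicML's `streaming` library (pip install mosaicml-streaming).
# The S3 path and local cache directory below are placeholders, not real resources.
from torch.utils.data import DataLoader
from streaming import StreamingDataset

# Shards are fetched on demand from the remote bucket and cached locally,
# which is what allows cheap resumption after a hardware failure.
dataset = StreamingDataset(
    remote="s3://my-bucket/my-tokenized-dataset",  # placeholder remote path
    local="/tmp/streaming-cache",                  # local shard cache
    shuffle=True,
)

loader = DataLoader(dataset, batch_size=8)
for batch in loader:
    pass  # training step would go here
```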

Performance:

Evaluation: On generic benchmarks, LLaMa 33B, Falcon 40B, and MPT 30B perform about the same. Although MPT 30B is the smallest model, the gap is negligible except for HumanEval, where in MPT's own tests MPT 30B (base) scores 25%, LLaMa 33B scores 20%, and Falcon scores 1.2% (it did not generate code).

Inference speed: MPT models run roughly 1.5-2.0x faster than LLaMa models thanks to FlashAttention and low-precision LayerNorm.
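
For reference on the FlashAttention point: PyTorch 2.0 exposes a fused scaled-dot-product attention kernel that can dispatch to FlashAttention on supported GPUs. A minimal illustrative sketch (the shapes and dtypes are invented; this is not MPT's actual attention code):

```python
# Illustrative only: PyTorch 2.0's fused scaled-dot-product attention, which can
# dispatch to a FlashAttention kernel on supported GPUs. Shapes are made up for
# the example; this is not MPT's actual attention implementation.
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

batch, heads, seq, head_dim = 1, 32, 1024, 128
q = torch.randn(batch, heads, seq, head_dim, device=device, dtype=dtype)
k = torch.randn_like(q)
v = torch.randn_like(q)

# is_causal=True applies the autoregressive mask inside the fused kernel.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 32, 1024, 128])
```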

Memory usage: The MPT 30B model fits on 1x A100-80GB at 16 bits. Falcon 40B requires 85-100GB of VRAM at 16 bits, which means it conventionally needs 2x GPUs unless quantization is used.
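
A back-of-the-envelope check of those numbers (weights only, ignoring KV cache and activations):

```python
# Rough weight-memory estimate at 16-bit precision (2 bytes per parameter).
# KV cache and activations are ignored, so real-world usage is higher.
def fp16_weight_gb(n_params: float) -> float:
    return n_params * 2 / 1e9  # bytes -> GB

print(f"MPT-30B weights:    ~{fp16_weight_gb(30e9):.0f} GB")  # ~60 GB -> fits on 1x A100-80GB
print(f"Falcon-40B weights: ~{fp16_weight_gb(40e9):.0f} GB")  # ~80 GB before any overhead
```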

Cost:

LLaMa used roughly 1.44x and Falcon roughly 1.27x more compute to train the full models. This is remarkable because it means the MPT models achieve the same performance as more expensive models at a much lower training cost.

MPT-30B FLOPs ~= 6 * 30e9 [params] * 1.05e12 [tokens] = 1.89e23 FLOPs
LLaMa-30B FLOPs ~= 6 * 32.5e9 [params] * 1.4e12 [tokens] = 2.73e23 FLOPs (1.44x more)
Falcon-40B FLOPs ~= 6 * 40e9 [params] * 1e12 [tokens] = 2.40e23 FLOPs (1.27x more)
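
These estimates use the standard ~6 * params * tokens approximation for training compute; a quick script to reproduce the ratios:

```python
# Training-compute estimate using the common ~6 * params * tokens approximation.
def train_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

mpt_30b    = train_flops(30e9,   1.05e12)  # ~1.89e23 FLOPs
llama_30b  = train_flops(32.5e9, 1.4e12)   # ~2.73e23 FLOPs
falcon_40b = train_flops(40e9,   1e12)     # ~2.40e23 FLOPs

print(f"LLaMa-30B / MPT-30B:  {llama_30b / mpt_30b:.2f}x")   # ~1.44x
print(f"Falcon-40B / MPT-30B: {falcon_40b / mpt_30b:.2f}x")  # ~1.27x
```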

Conclusion

If the community adds support for MPT models, including CUDA acceleration, we gain the following benefits:

  1. The ability to train and fine-tune LLMs at a lower cost than LLaMa models, with commercial usage enabled and llama.cpp/ggml available for inference.
  2. Faster LLMs compared to LLaMa, and even faster once quantized with CUDA support enabled.
  3. A much larger default context size (8k vs 2k), plus the ability to extend the context size using ALiBi (see the sketch after this list).
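
On point 3: ALiBi replaces positional embeddings with a per-head linear penalty on the attention scores, which is what makes extrapolating beyond the training context possible. A minimal sketch of the bias computation (the slope schedule follows the ALiBi paper for the power-of-two head-count case; this is illustrative, not MPT's or llama.cpp's code):

```python
# Minimal ALiBi sketch: each head adds a linear distance penalty to its raw
# attention scores instead of using positional embeddings. Illustrative only.
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # Head h uses slope 2^(-8 * (h + 1) / n_heads) (power-of-two head-count case).
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    pos = torch.arange(seq_len)
    # (j - i) is <= 0 for keys at or before the query; future positions are
    # clamped to 0 here and handled by the causal mask anyway.
    distance = (pos[None, :] - pos[:, None]).clamp(max=0)
    return slopes[:, None, None] * distance  # shape: (n_heads, seq_len, seq_len)

# The bias is added to the attention logits before softmax, e.g.:
#   scores = q @ k.transpose(-2, -1) / head_dim**0.5 + alibi_bias(n_heads, seq_len)
print(alibi_bias(n_heads=8, seq_len=4)[0])
```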

Links

https://www.mosaicml.com/blog/mpt-30b
https://huggingface.co/mosaicml/mpt-30b

casper-hansen changed the title from "[ENHANCEMENT] New Support MPT 30B + CUDA support." to "[ENHANCEMENT] New MPT 30B + CUDA support." on Jun 22, 2023
TheBloke (Contributor) commented Jun 22, 2023

I'll just add my usual 2c on this subject: I would love it if llama.cpp supported all major model types, bringing its hundreds of wonderful features to as many models as possible.

PS. FYI, KoboldCpp release 1.32 has now added OpenCL acceleration for MPT, as well as GPT-2 (StarCoder), GPT-J and GPTNeoX.

I tested my MPT 30B Instruct and Chat GGML uploads with it earlier and it's working pretty well - 8 tokens/s on shorter responses.

(But I'd still love llama.cpp to support this and other model types, and eventually bring CUDA and Metal acceleration to them.)

casper-hansen (Author) commented:

> I'll just add my usual 2c on this subject: I would love it if llama.cpp supported all major model types, bringing its hundreds of wonderful features to as many models as possible.

Yes, this is it! Would love to see us start with MPT as it contains quite a few features that other models also use. Supporting MPT models also means supporting Replit models since Replit chose LLM Foundry.

> 8 tokens/s on shorter responses.

Not sure what your setup looks like, but it sounds like there is a lot of room for improvement if we add CUDA acceleration in llama.cpp. I remember LLaMa 33B running at 29 tokens/s on your 4090 + i9-13900K rig. My bet is that MPT 30B could run faster than that if we give it full optimization.

bratao commented Jun 22, 2023

It is possible to test it here:
https://huggingface.co/spaces/mosaicml/mpt-30b-chat

The results are impressive!

JohannesGaessler (Collaborator) commented:

Too much work. Maybe once I get around to writing a binary that runs an exported ggml graph using CUDA (realistically in a few months at the earliest).

slaren (Collaborator) commented Jun 23, 2023

> a binary that runs an exported ggml graph using CUDA

I am working on something similar, but it will be at least a few weeks until it can be merged.

JohannesGaessler (Collaborator) commented:

If you do it, that's fine with me too.

casper-hansen (Author) commented:

> Too much work. Maybe once I get around to writing a binary that runs an exported ggml graph using CUDA (realistically in a few months at the earliest).

A binary that executes an exported graph would be amazing. But yes, it's a large task.

My biggest wish is that repos like llama.cpp/ggml can enable optimized inference for quantized models that are commercially usable. There is not a lot of tech that can do that right now.

CyborgArmy83 commented:

It would be amazing if MPT and Falcon support could be built in!

casper-hansen (Author) commented:

MosaicML (MPT creators) was just acquired by Databricks for $1.3B, so I expect more initiatives for LLMs. Even more of an argument to start supporting their Foundry models.

@slaren, since you said you will have it ready in a few weeks, I wanted to ask you the following: do you see a path to supporting graph export and CUDA execution for most models? It would be huge to have this kind of support native for the most popular models.

sirajperson commented:

@casperbh96 That's crazy. I hope they don't change the policies for the MPT series.

slaren (Collaborator) commented Jun 27, 2023

> Do you see a path to supporting graph export and CUDA execution for most models?

That's the goal in the long run. At first, some of the operations required by non-llama models may be missing a CUDA implementation, but eventually we should add everything that is needed to support the different models.

maddes8cht (Contributor) commented:

Anything happening on this?
Now that the new GGUF format is well established and stable, wasn't the idea that implementing new models would be easier?

casper-hansen (Author) commented:

Looks like the authors do not have a plan to support MPT models.

maddes8cht (Contributor) commented:

There just needs to be one developer capable and interested enough to start it...

ggerganov (Owner) commented Sep 27, 2023

We now kind of have a process for adding new models to llama.cpp (see Falcon, StarCoder, and Baichuan).
We are looking for contributions to do something similar for MPT.

maddes8cht (Contributor) commented:

And there is a complete stack of all the original MPT models quantized to GGUF on my Hugging Face profile (maddes8cht), with its own MPT collection.
Fine-tuned MPT models will follow next.

Galunid (Collaborator) commented Nov 5, 2023

Closed in #3417

Galunid closed this as completed Nov 5, 2023
AlexBlack2202 commented:

> original MPT models quantized to GGUF

Hi,

How can you convert from MPT to GGUF?

I have an issue when running convert-hf-to-gguf.py with the latest version of gguf and torch==2.1.1,
transformers==4.35.2:

"Can not map tensor 'transformer.wpe.weight'"

Looking for some help.
