[ENHANCEMENT] New MPT 30B + CUDA support. #1971

casper-hansen · 2023-06-22T19:44:30Z

MosaicML released its MPT 30B version today with 8k context, with Apache 2.0 license.

Why you should support MPT 30B

Let me present my argumentation for why MPT should be supported including CUDA support. Arguably, LLaMa models or Falcon models are great on paper and in evaluation, but what they really lack is commercial licensing (in the case of LLaMa) and an actively maintained tech stack (in the case of Falcon).

Tech stack:

MosaicML has 8 employees actively contributing to their own open-source repo LLM-Foundry and another few researching for improvements. Recently they upgraded to PyTorch 2.0 and added H100 support just before this 30B version was released.
A streaming library; train and fine-tune models while streaming your dataset from S3/GCP/Azure data storage options. This reduces costs at train time and you can easily resume upon hardware failures.
They have developed tools like Composer that lets you train and fine-tune models much faster (e.g. GPT-2 for roughly $145 with Composer, and $255 with vanilla PyTorch).

Performance:

Evaluation: The performance on generic benchmarks of LLaMa 33B, Falcon 40B, and MPT 30B is mostly the same. Although MPT 30B is the smallest model, the performance is incredibly close, and the difference is negligible except for HumanEval where MPT 30B (base) scores 25%, LLaMa 33B scores 20%, while Falcon scores 1.2% (did not generate code) in MPTs tests.

Inference speed: The inference speed of MPT models is roughly 1.5-2.0x faster than LLaMa models because of FlashAttention and Low Precision Layernorm.

Memory usage: The MPT 30B model fits on 1x A100-80GB at 16 bits. Falcon 40B requires 85-100GB VRAM at 16 bits which means it conventionally needs 2x GPUs without the use of quantization.

Cost:

LLaMa is roughly 1.44x more expensive and Falcon 1.27x more expensive in compute power used to train the full models. This is remarkable because it means the MPT models can achieve the same performance as more expensive models at a much lower cost.

MPT-30B FLOPs ~= 6 * 30e9 [params] * 1.05e12 [tokens] = 1.89e23 FLOPs
LLaMa-30B FLOPs ~= 6 * 32.5e9 [params] * 1.4e12 [tokens] = 2.73e23 FLOPs (1.44x more)
Falcon-40B FLOPs ~= 6 * 40e9 [params] * 1e12 [tokens] = 2.40e23 FLOps (1.27x more)

Conclusion

If the community decides to support MPT models with CUDA support, we gain the following benefits:

Being able to train and fine-tune LLMs at a lower cost than LLaMa models and enable commercial usage using llama.cpp/ggml for inference.
Faster LLMs compared to LLaMa. Even faster once quantized and CUDA support is enabled.
Much larger default context size (8k vs 2k), but also the ability to extend context size using ALiBi.

Links

https://www.mosaicml.com/blog/mpt-30b
https://huggingface.co/mosaicml/mpt-30b

TheBloke · 2023-06-22T20:32:59Z

I'll just add my usual 2c on this subject: I would love if llama.cpp supported all major model types, bringing its hundreds of wonderful features to as many models as possible.

PS. FYI, KoboldCpp release 1.32 has now added OpenCL acceleration for MPT, as well as GPT-2 (StarCoder), GPT-J and GPTNeoX.

I tested my MPT 30B Instruct and Chat GGML uploads with it earlier and it's working pretty well - 8 tokens/s on shorter responses.

(But I'd still love llama.cpp to support this and other model types, and eventually bring CUDA and Metal acceleration to them.)

casper-hansen · 2023-06-22T21:52:05Z

I'll just add my usual 2c on this subject: I would love if llama.cpp supported all major model types, bringing its hundreds of wonderful features to as many models as possible.

Yes, this is it! Would love to see us start with MPT as it contains quite a few features that other models also use. Supporting MPT models also means supporting Replit models since Replit chose LLM Foundry.

8 tokens/s on shorter responses.

Not sure what your setup looks like, but sounds like there is lots of room for improvement if we add CUDA acceleration in llama.cpp. I remember LLaMa 33B running at 29 tokens/s on your 4090 + i9-13900K rig. My bet is that MPT 30B could run faster than that if we give it full optimization.

bratao · 2023-06-22T22:04:41Z

It is possible to test it here:
https://huggingface.co/spaces/mosaicml/mpt-30b-chat

The results are impressive!

JohannesGaessler · 2023-06-23T16:03:07Z

Too much work. Maybe once I get around to writing a binary that runs an exported ggml graph using CUDA (realistically in a few months at the earliest).

slaren · 2023-06-23T16:08:53Z

a binary that runs an exported ggml graph using CUDA

I am working on something similar, but it will be at least a few weeks until it can be merged.

JohannesGaessler · 2023-06-23T16:14:07Z

If you do it that's fine with me too.

casper-hansen · 2023-06-23T19:08:48Z

Too much work. Maybe once I get around to writing a binary that runs an exported ggml graph using CUDA (realistically in a few months at the earliest).

A binary execution graph would be amazing. But yes, large task.

My biggest wish is that repos like llama/ggml can enable optimized inference for quantized models that are commercially available. There is not a lot of tech that can do that right now.

CyborgArmy83 · 2023-06-23T19:30:54Z

Would be amazing if MPT and Falcon support could be build-in!

casper-hansen · 2023-06-27T18:06:28Z

MosaicML (MPT creators) was just acquired by Databricks for $1.3B, so I expect more initiatives for LLMs. Even more of an argument to start supporting their Foundry models.

@slaren since you said you will have it ready in a few weeks, I wanted to ask you the following. Do you see the path to supporting most models to export to a graph to run CUDA execution? It would be huge to have this kind of support native for most popular models.

sirajperson · 2023-06-27T20:50:57Z

@casperbh96 That's crazy. I hope the don't change the policies with the MPT series.

slaren · 2023-06-27T22:10:59Z

Do you see the path to supporting most models to export to a graph to run CUDA execution?

That's the goal in the long run. At first, some of the operations required by non-llama models may be missing a CUDA implementation, but eventually we should add everything that is needed to support the different models.

maddes8cht · 2023-09-26T10:53:01Z

anything happening on this?
now as the new gguf format is well established and stable, wasn't the idea to implement new models easier?

casper-hansen · 2023-09-26T11:08:01Z

Looks like the authors do not have a plan to support MPT models

maddes8cht · 2023-09-26T13:07:16Z

There needs to be just one developer capable and intetested enough to start it...

ggerganov · 2023-09-27T15:29:36Z

We now kind of have a process for adding new models to llama.cpp (see Falcon, StarCoder and Baichuan).
Looking for contributions to do something similar for MPT

maddes8cht · 2023-11-02T09:14:39Z

and there is a complete stack of akl original mpt models quantized to gguf at
maddes8cht huggingface
with an own mpt-collection.
Finetuned mpt models will follow next.

Galunid · 2023-11-05T03:46:24Z

closed in #3417

AlexBlack2202 · 2023-12-04T09:27:57Z

original mpt models quantized to gguf at

Hi,

how can you convert from mpt to gguf

i have an isssue when run convert-hf-to-gguf.py with the lasted version of gguf and torch==2.1.1,
transformers==4.35.2

"Can not map tensor 'transformer.wpe.weight'"

looking for some help

casper-hansen changed the title ~~[ENHANCEMENT] New Support MPT 30B + CUDA support.~~ [ENHANCEMENT] New MPT 30B + CUDA support. Jun 22, 2023

Galunid closed this as completed Nov 5, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ENHANCEMENT] New MPT 30B + CUDA support. #1971

[ENHANCEMENT] New MPT 30B + CUDA support. #1971

casper-hansen commented Jun 22, 2023 •

edited

Loading

TheBloke commented Jun 22, 2023 •

edited

Loading

casper-hansen commented Jun 22, 2023

bratao commented Jun 22, 2023

JohannesGaessler commented Jun 23, 2023

slaren commented Jun 23, 2023

JohannesGaessler commented Jun 23, 2023

casper-hansen commented Jun 23, 2023

CyborgArmy83 commented Jun 23, 2023

casper-hansen commented Jun 27, 2023

sirajperson commented Jun 27, 2023

slaren commented Jun 27, 2023

maddes8cht commented Sep 26, 2023

casper-hansen commented Sep 26, 2023

maddes8cht commented Sep 26, 2023

ggerganov commented Sep 27, 2023 •

edited

Loading

maddes8cht commented Nov 2, 2023

Galunid commented Nov 5, 2023

AlexBlack2202 commented Dec 4, 2023

[ENHANCEMENT] New MPT 30B + CUDA support. #1971

[ENHANCEMENT] New MPT 30B + CUDA support. #1971

Comments

casper-hansen commented Jun 22, 2023 • edited Loading

Why you should support MPT 30B

Tech stack:

Performance:

Cost:

Conclusion

Links

TheBloke commented Jun 22, 2023 • edited Loading

casper-hansen commented Jun 22, 2023

bratao commented Jun 22, 2023

JohannesGaessler commented Jun 23, 2023

slaren commented Jun 23, 2023

JohannesGaessler commented Jun 23, 2023

casper-hansen commented Jun 23, 2023

CyborgArmy83 commented Jun 23, 2023

casper-hansen commented Jun 27, 2023

sirajperson commented Jun 27, 2023

slaren commented Jun 27, 2023

maddes8cht commented Sep 26, 2023

casper-hansen commented Sep 26, 2023

maddes8cht commented Sep 26, 2023

ggerganov commented Sep 27, 2023 • edited Loading

maddes8cht commented Nov 2, 2023

Galunid commented Nov 5, 2023

AlexBlack2202 commented Dec 4, 2023

casper-hansen commented Jun 22, 2023 •

edited

Loading

TheBloke commented Jun 22, 2023 •

edited

Loading

ggerganov commented Sep 27, 2023 •

edited

Loading