
Implement MosaicML's 7B model. #1333

Closed
rjb7731 opened this issue May 5, 2023 · 25 comments
Labels
help wanted (Extra attention is needed), model (Model specific)

Comments

@rjb7731

rjb7731 commented May 5, 2023

Comparable to LLaMA in results, I believe, and also commercially available for use!

https://huggingface.co/mosaicml/mpt-7b

https://www.mosaicml.com/blog/mpt-7b

@ggerganov added the model (Model specific) and help wanted (Extra attention is needed) labels on May 5, 2023
@ggerganov
Owner

Let's start with a basic inference example in the ggml repo.

If it lives up to the hype, we can think about also integrating it into llama.cpp so we get all the infrastructure benefits, or maybe something better depending on the results.

@sbsce

sbsce commented May 5, 2023

Licensed under Apache 2.0, and with a context length of 65k! Yes, it would be great to have this supported in llama.cpp.

@Green-Sky
Collaborator

> MPT-7B-StoryWriter-65k+ is a model designed to read and write stories with super long context lengths. It was built by finetuning MPT-7B with a context length of 65k tokens on a filtered fiction subset of the books3 dataset. At inference time, thanks to ALiBi, MPT-7B-StoryWriter-65k+ can extrapolate even beyond 65k tokens, and we have demonstrated generations as long as 84k tokens on a single node of A100-80GB GPUs.

that's a looooong context.
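
For readers wondering how ALiBi enables that kind of extrapolation: instead of positional embeddings, each attention head adds a fixed linear penalty proportional to the query/key distance, so nothing in the model is tied to a maximum trained position. A minimal sketch of the bias computation (illustration only, not the ggml implementation; the head count matches MPT-7B):

```cpp
// Sketch of the ALiBi bias: each head h gets a slope m_h, and the attention
// score for query position i and key position j (j <= i) is shifted by
// -m_h * (i - j). No positional embeddings are involved, which is why the
// model can extrapolate beyond the context length it was trained on.
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int n_head = 32; // MPT-7B uses 32 attention heads (a power of two)
    std::vector<float> slopes(n_head);
    for (int h = 0; h < n_head; ++h) {
        // geometric sequence of slopes: 2^(-8/n_head), 2^(-16/n_head), ...
        slopes[h] = std::pow(2.0f, -8.0f * (h + 1) / n_head);
    }

    // bias added to the pre-softmax attention score
    auto alibi_bias = [&](int h, int i, int j) {
        return -slopes[h] * (i - j);
    };

    printf("head 0, distance 1000: bias = %.1f\n", alibi_bias(0, 1000, 0));
    return 0;
}
```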

@DannyDaemonic
Contributor

As someone who likes to occasionally use LLMs as an aid in writing, the idea of a 65k token context is exciting. Only 7B parameters is a little disappointing, but one thing we're learning is not to judge a model by its parameter count.

One of the first things I did when I found this project was to hack my own custom context reset to restart on a sentence boundary and leave only ~10% free space for context generation instead of 50%, just to keep the context more relevant. It was terribly inefficient, but that's how badly I wanted a longer context length. There's really no substitute for having more (relevant) text in the context.
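
A rough sketch of that kind of context-reset heuristic (illustration only; the hack itself is not shown in this thread, and a real version would work on tokens rather than characters):

```cpp
// When the context window fills up, keep roughly 90% of it and restart on a
// sentence boundary, instead of discarding half of the window.
#include <string>

std::string trim_context(const std::string & ctx_text, size_t window_chars) {
    const size_t keep = window_chars * 9 / 10; // leave only ~10% free space
    if (ctx_text.size() <= keep) {
        return ctx_text;
    }
    std::string tail = ctx_text.substr(ctx_text.size() - keep);

    // drop the leading partial sentence so generation resumes on a boundary
    const size_t pos = tail.find_first_of(".!?");
    if (pos != std::string::npos && pos + 1 < tail.size()) {
        tail.erase(0, pos + 1);
    }
    return tail;
}
```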

@jploski
Contributor

jploski commented May 6, 2023

I also opened an issue for it here: mosaicml/llm-foundry#60

@drbh
Contributor

drbh commented May 7, 2023

@jploski thanks for starting this conversation in ggml and llm-foundry! I agree that adding Mosaic 7B is a great idea! I happened to see that you mentioned you started some work but ran into a tensor formatting issue.

Would you be open to sharing that branch of ggml? Mostly because I'm eager to learn more about the quantization process, and even if it is not the full implementation, it may be helpful to see others' starting points. Thanks!

@jploski
Contributor

jploski commented May 7, 2023

> @jploski thanks for starting this conversation in ggml and llm-foundry! I agree that adding Mosaic 7B is a great idea! I happened to see that you mentioned you started some work but ran into a tensor formatting issue.
>
> Would you be open to sharing that branch of ggml? Mostly because I'm eager to learn more about the quantization process, and even if it is not the full implementation, it may be helpful to see others' starting points. Thanks!

FWIW:

https://github.com/jploski/ggml/tree/mpt-experiment/examples/mpt

See commit comments and "TODO" in source code and README.md in examples/mpt for things that I do not understand. The main challenge seems to be that MPT uses a transformer model with customized code (found in their llm-foundry repository), so it is probably silly to expect the stablelm code to just work. All I did was some (rather uninformed) guessing.

Also note that the inference will not even start for mpt-7b-storywriter using the default context length of 65535 - it will just complain about "not enough space in the context's memory pool" and segfault. But this can be worked around by specifying a smaller n_ctx (rather than letting it load from GGML/model config).

Please do not let this first attempt raise false hopes or stand in the way of an actual implementation. I do not feel qualified enough (with respect to understanding transformers) to implement it. At best my branch can save someone some typing for the boilerplate (and at worst it can mislead, too).
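
A minimal sketch of the n_ctx workaround mentioned above, assuming hyperparameter handling along the lines of the other ggml examples (the struct and function names here are hypothetical):

```cpp
// Cap the context length read from the converted model file with a
// user-supplied value, so the KV cache and scratch buffers fit in the
// memory pool the example allocates.
#include <cstdint>

struct mpt_hparams {        // hypothetical, mirroring the other ggml examples
    int32_t n_ctx = 65535;  // value stored in the converted StoryWriter model
    // ... other hyperparameters ...
};

void apply_user_n_ctx(mpt_hparams & hparams, int32_t user_n_ctx) {
    if (user_n_ctx > 0 && user_n_ctx < hparams.n_ctx) {
        hparams.n_ctx = user_n_ctx; // e.g. 2048 instead of the full 65k
    }
}
```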

@jon-chuang
Contributor

> As someone who likes to occasionally use LLMs as an aid in writing, the idea of a 65k token context is exciting

Just be aware that your RAM may run out and even if you evict to disk, it will be extremely slow due to quadratic scaling.

@s-kostyaev

@jon-chuang I think with ALiBi it will not be quadratic scaling. Correct me if I am wrong.

@klosax
Contributor

klosax commented May 21, 2023

Generation speed for the StoryWriter model:

  • at token 1000, about 300 ms per token
  • at token 8000, about 2500 ms per token

So when the number of generated tokens increases 8 times, the generation time per token increases about 8.3 times.

@jon-chuang
Contributor

@s-kostyaev ALiBi is a positional encoding method and has nothing to do with the cost of attention.

https://paperswithcode.com/method/alibi

@klosax exactly, that is quadratic scaling: the per-token time grows roughly linearly with the number of tokens already in the context, so the total generation time grows quadratically.

Note that StoryWriter (and similarly Claude's 100K context length) is largely impractical at the claimed lengths. I am betting on next-gen models with log-linear context scaling based on long convolutions to gain prominence. See https://github.com/HazyResearch/safari
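
For a back-of-the-envelope sense of why: the fp16 KV cache grows linearly with the context, and the attention work per generated token grows linearly with the position, so total generation time grows quadratically (klosax's ~300 ms at token 1000 vs ~2500 ms at token 8000 above matches that roughly linear per-token growth). A small sketch assuming MPT-7B-like dimensions (32 layers, 4096-wide embeddings, fp16 K/V), not measured numbers:

```cpp
// Rough KV-cache size estimate at different context lengths (assumed
// MPT-7B-like dimensions; illustration only).
#include <cstdio>

int main() {
    const long long n_layer = 32;
    const long long n_embd  = 4096;
    const long long bytes_per_elem = 2; // fp16

    for (long long n_ctx : {2048LL, 8192LL, 65536LL}) {
        // K and V, per layer, per position, per embedding dimension
        const long long kv_bytes = 2 * n_layer * n_ctx * n_embd * bytes_per_elem;
        printf("n_ctx = %6lld  ->  KV cache ~ %5.1f GiB\n",
               n_ctx, kv_bytes / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}
```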

@acheong08

Any plans/updates?

@jploski
Contributor

jploski commented Jun 21, 2023

> Any plans/updates?

Maybe the integration will become easier after Falcon #1602 - because that could be the first non-LLaMA model to obtain llama.cpp support and pave the way for others.

@Green-Sky
Collaborator

Working MPT inference can be found here: ggml/examples/mpt

@tcnevin

tcnevin commented Jun 22, 2023

> Working MPT inference can be found here: ggml/examples/mpt

How close is this to building main.exe to work with mpt models?

@Jchang4

Jchang4 commented Jun 23, 2023

Just checking in as well: with the ggml example, would we be able to get an implementation? @ggerganov

@ggerganov
Owner

ggerganov commented Jun 24, 2023

I think the next big steps that need to happen are:

  • Finalize ggml : unified file format ggml#220 - this will give us a unified model format that will be more future-proof and would make sense to support long term
  • Refactor model loading llama.cpp - currently it is doing too much extra unnecessary stuff like supporting old models that no longer exist. The code needs a big refactor and simplification so that we can more easily start loading non-LLaMA models

We should put a big focus on these soon and throughout July and try to bring support for most new models (MPT, Falcon, etc.) into llama.cpp

Alternatively, a quick'n'dirty implementation of MPT in llama.cpp with tons of ifdefs and hacks can be done on a branch relatively quickly. But it is not something we want on master as it will bring further technical debt to the codebase.

@tcnevin

tcnevin commented Jul 1, 2023

> I think the next big steps that need to happen are:
>
>   • Finalize ggml : unified file format ggml#220 - this will give us a unified model format that will be more future-proof and would make sense to support long term
>   • Refactor model loading llama.cpp - currently it is doing too much extra unnecessary stuff like supporting old models that no longer exist. The code needs a big refactor and simplification so that we can more easily start loading non-LLaMA models
>
> We should put a big focus on these soon and throughout July and try to bring support for most new models (MPT, Falcon, etc.) into llama.cpp
>
> Alternatively, a quick'n'dirty implementation of MPT in llama.cpp with tons of ifdefs and hacks can be done on a branch relatively quickly. But it is not something we want on master as it will bring further technical debt to the codebase.

Are there any llama.cpp branches working on MPT implementations currently?

As far as the ggml: unified file format goes, that's really interesting and I'm trying to understand it better, but could a standard "descriptive file" be developed in conjunction with it, to support unknown formats by describing the hyperparameters of whatever ggml file is supplied with it? I'm just wondering if that even makes sense, to allow non-unified files to work with readers that may accept a second "descriptive file".

@mvsoom

mvsoom commented Sep 13, 2023

Could this be easier with the new GGUF format?
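
For what it's worth, the unified format discussed earlier in the thread did land as GGUF, and its self-describing key-value metadata covers the "descriptive file" idea above. A hedged sketch of reading that metadata with ggml's gguf API (the mpt.context_length key name follows the usual {arch}.* convention and is an assumption here):

```cpp
// Read GGUF metadata without loading tensor data. The file describes itself:
// architecture and hyperparameters are stored as key-value pairs.
#include "ggml.h" // the gguf_* API was declared here at the time
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s model.gguf\n", argv[0]);
        return 1;
    }

    struct gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ nullptr };
    struct gguf_context * ctx = gguf_init_from_file(argv[1], params);
    if (!ctx) {
        return 1;
    }

    const int arch_id = gguf_find_key(ctx, "general.architecture");
    const int nctx_id = gguf_find_key(ctx, "mpt.context_length"); // assumed key name

    if (arch_id >= 0) printf("architecture  : %s\n", gguf_get_val_str(ctx, arch_id));
    if (nctx_id >= 0) printf("context length: %u\n", gguf_get_val_u32(ctx, nctx_id));

    gguf_free(ctx);
    return 0;
}
```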

@tony352

tony352 commented Sep 16, 2023

I just wanted to see if there were any updates on this. It would be great to have MPT StoryWriter in Ollama.

@maddes8cht
Contributor

I'm also very interested in progress here 😊

@ggerganov
Owner

We now kind of have a process for adding new models to llama.cpp (see Falcon, StarCoder and Baichuan).
Looking for contributions to do something similar for Mosaic.

@jploski
Contributor

jploski commented Sep 30, 2023

Some progress, see #3417

(You can help with testing by checking out https://github.com/jploski/llama.cpp/tree/mpt)

@maddes8cht
Contributor

Maybe this is the place to note that there is a pretty complete set of GGUF-quantized MPT models available at my Hugging Face account, with a handy mpt-collection.

@Galunid
Collaborator

Galunid commented Nov 2, 2023

Implemented in #3417

@Galunid closed this as completed on Nov 2, 2023