
Implement MosaicML's 7B model. #1333

Closed
rjb7731 opened this issue May 5, 2023 · 25 comments
Labels
help wanted (Extra attention is needed), model (Model specific)

Comments

@rjb7731

rjb7731 commented May 5, 2023

Comparable to LLaMA in results, I believe, and also commercially available for use!

https://huggingface.co/mosaicml/mpt-7b

https://www.mosaicml.com/blog/mpt-7b

@ggerganov added the model (Model specific) and help wanted (Extra attention is needed) labels on May 5, 2023
@ggerganov
Owner

Let's start with a basic inference example in the ggml repo.

If it lives up to the hype, we can think about also integrating it into llama.cpp so we get all the infrastructure benefits, or maybe something better depending on the results.

@sbsce

sbsce commented May 5, 2023

Licensed under Apache 2.0, and with a context length of 65k! Yes, it would be great to have this supported in llama.cpp.

@Green-Sky
Collaborator

> MPT-7B-StoryWriter-65k+ is a model designed to read and write stories with super long context lengths. It was built by finetuning MPT-7B with a context length of 65k tokens on a filtered fiction subset of the books3 dataset. At inference time, thanks to ALiBi, MPT-7B-StoryWriter-65k+ can extrapolate even beyond 65k tokens, and we have demonstrated generations as long as 84k tokens on a single node of A100-80GB GPUs.

that's a looooong context.
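
For readers wondering how ALiBi enables that kind of extrapolation: instead of positional embeddings, each attention head adds a fixed linear penalty proportional to the query/key distance, so nothing in the model is tied to a maximum trained position. A minimal sketch of the bias computation (illustration only, not the ggml implementation; the head count matches MPT-7B):

```cpp
// Sketch of the ALiBi bias: each head h gets a slope m_h, and the attention
// score for query position i and key position j (j <= i) is shifted by
// -m_h * (i - j). No positional embeddings are involved, which is why the
// model can extrapolate beyond the context length it was trained on.
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int n_head = 32; // MPT-7B uses 32 attention heads (a power of two)
    std::vector<float> slopes(n_head);
    for (int h = 0; h < n_head; ++h) {
        // geometric sequence of slopes: 2^(-8/n_head), 2^(-16/n_head), ...
        slopes[h] = std::pow(2.0f, -8.0f * (h + 1) / n_head);
    }

    // bias added to the pre-softmax attention score
    auto alibi_bias = [&](int h, int i, int j) {
        return -slopes[h] * (i - j);
    };

    printf("head 0, distance 1000: bias = %.1f\n", alibi_bias(0, 1000, 0));
    return 0;
}
```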

@DannyDaemonic
Contributor

As someone who likes to occasionally use LLMs as an aid in writing, the idea of a 65k token context is exciting. Only 7B parameters is a little disappointing, but one thing we're learning is not to judge a model by its parameter count.

One of the first things I did when I found this project was to hack my own custom context reset to restart on a sentence boundary and leave only ~10% free space for context generation instead of 50%, just to keep the context more relevant. It was terribly inefficient, but that's how badly I wanted a longer context length. There's really no substitute for having more (relevant) text in the context.
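
A rough sketch of that kind of context-reset heuristic (illustration only; the hack itself is not shown in this thread, and a real version would work on tokens rather than characters):

```cpp
// When the context window fills up, keep roughly 90% of it and restart on a
// sentence boundary, instead of discarding half of the window.
#include <string>

std::string trim_context(const std::string & ctx_text, size_t window_chars) {
    const size_t keep = window_chars * 9 / 10; // leave only ~10% free space
    if (ctx_text.size() <= keep) {
        return ctx_text;
    }
    std::string tail = ctx_text.substr(ctx_text.size() - keep);

    // drop the leading partial sentence so generation resumes on a boundary
    const size_t pos = tail.find_first_of(".!?");
    if (pos != std::string::npos && pos + 1 < tail.size()) {
        tail.erase(0, pos + 1);
    }
    return tail;
}
```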

@jploski
Contributor

jploski commented May 6, 2023

I also opened an issue for it here: mosaicml/llm-foundry#60

@drbh
Contributor

drbh commented May 7, 2023

@jploski thanks for starting this conversation in ggml and llm-foundry! I agree that adding Mosaic 7B is a great idea! I happened to see that you mentioned you started some work but ran into a tensor formatting issue.

Would you be open to sharing that branch of ggml? Mostly because I'm eager to learn more about the quantization process, and even if it is not the full implementation, it may be helpful to see others' starting points. Thanks!

@jploski
Contributor

jploski commented May 7, 2023

> @jploski thanks for starting this conversation in ggml and llm-foundry! I agree that adding Mosaic 7B is a great idea! I happened to see that you mentioned you started some work but ran into a tensor formatting issue.
>
> Would you be open to sharing that branch of ggml? Mostly because I'm eager to learn more about the quantization process, and even if it is not the full implementation, it may be helpful to see others' starting points. Thanks!

FWIW:

https://github.com/jploski/ggml/tree/mpt-experiment/examples/mpt

See commit comments and "TODO" in source code and README.md in examples/mpt for things that I do not understand. The main challenge seems to be that MPT uses a transformer model with customized code (found in their llm-foundry repository), so it is probably silly to expect the stablelm code to just work. All I did was some (rather uninformed) guessing.

Also note that the inference will not even start for mpt-7b-storywriter using the default context length of 65535 - it will just complain about "not enough space in the context's memory pool" and segfault. But this can be worked around by specifying a smaller n_ctx (rather than letting it load from GGML/model config).

Please do not let this first attempt raise false hopes or stand in the way of an actual implementation. I do not feel qualified enough (with respect to understanding transformers) to implement it. At best my branch can save someone some typing for the boilerplate (and at worst it can mislead, too).
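
A minimal sketch of the n_ctx workaround mentioned above, assuming hyperparameter handling along the lines of the other ggml examples (the struct and function names here are hypothetical):

```cpp
// Cap the context length read from the converted model file with a
// user-supplied value, so the KV cache and scratch buffers fit in the
// memory pool the example allocates.
#include <cstdint>

struct mpt_hparams {        // hypothetical, mirroring the other ggml examples
    int32_t n_ctx = 65535;  // value stored in the converted StoryWriter model
    // ... other hyperparameters ...
};

void apply_user_n_ctx(mpt_hparams & hparams, int32_t user_n_ctx) {
    if (user_n_ctx > 0 && user_n_ctx < hparams.n_ctx) {
        hparams.n_ctx = user_n_ctx; // e.g. 2048 instead of the full 65k
    }
}
```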

@jon-chuang
Contributor

> As someone who likes to occasionally use LLMs as an aid in writing, the idea of a 65k token context is exciting

Just be aware that your RAM may run out and even if you evict to disk, it will be extremely slow due to quadratic scaling.

@s-kostyaev

@jon-chuang I think with ALiBi it will not be quadratic scaling. Correct me if I am wrong.

@klosax
Contributor

klosax commented May 21, 2023

Generation speed for the StoryWriter model:

  • at token 1000, about 300 ms per token
  • at token 8000, about 2500 ms per token

So when the number of generated tokens increases 8 times, the generation time per token increases about 8.3 times.

@jon-chuang
Contributor

@s-kostyaev ALiBi is a positional encoding method and has nothing to do with the cost of attention.

https://paperswithcode.com/method/alibi

@klosax exactly, that is quadratic scaling: the per-token time grows roughly linearly with the number of tokens already in the context, so the total generation time grows quadratically.

Note that StoryWriter (and similarly Claude's 100K context length) is largely impractical at the claimed lengths. I am betting on next-gen models with log-linear context scaling based on long convolutions to gain prominence. See https://github.com/HazyResearch/safari
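
For a back-of-the-envelope sense of why: the fp16 KV cache grows linearly with the context, and the attention work per generated token grows linearly with the position, so total generation time grows quadratically (klosax's ~300 ms at token 1000 vs ~2500 ms at token 8000 above matches that roughly linear per-token growth). A small sketch assuming MPT-7B-like dimensions (32 layers, 4096-wide embeddings, fp16 K/V), not measured numbers:

```cpp
// Rough KV-cache size estimate at different context lengths (assumed
// MPT-7B-like dimensions; illustration only).
#include <cstdio>

int main() {
    const long long n_layer = 32;
    const long long n_embd  = 4096;
    const long long bytes_per_elem = 2; // fp16

    for (long long n_ctx : {2048LL, 8192LL, 65536LL}) {
        // K and V, per layer, per position, per embedding dimension
        const long long kv_bytes = 2 * n_layer * n_ctx * n_embd * bytes_per_elem;
        printf("n_ctx = %6lld  ->  KV cache ~ %5.1f GiB\n",
               n_ctx, kv_bytes / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}
```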

@acheong08

Any plans/updates?

@jploski
Contributor

jploski commented Jun 21, 2023

> Any plans/updates?

Maybe the integration will become easier after Falcon #1602 - because that could be the first non-LLaMA model to obtain llama.cpp support and pave the way for others.

@Green-Sky
Collaborator

Working MPT inference can be found here: ggml/examples/mpt

@tcnevin

tcnevin commented Jun 22, 2023

> Working MPT inference can be found here: ggml/examples/mpt

How close is this to building main.exe to work with mpt models?

@Jchang4

Jchang4 commented Jun 23, 2023

Just checking in as well: with the ggml example, would we be able to get an implementation? @ggerganov

@ggerganov
Owner

ggerganov commented Jun 24, 2023

I think the next big steps that need to happen are:

  • Finalize ggml : unified file format ggml#220 - this will give us a unified model format that will be more future-proof and would make sense to support long term
  • Refactor model loading llama.cpp - currently it is doing too much extra unnecessary stuff like supporting old models that no longer exist. The code needs a big refactor and simplification so that we can more easily start loading non-LLaMA models

We should put a big focus on these soon and throughout July and try to bring support for most new models (MPT, Falcon, etc.) into llama.cpp

Alternatively, a quick'n'dirty implementation of MPT in llama.cpp with tons of ifdefs and hacks can be done on a branch relatively quickly. But it is not something we want on master as it will bring further technical debt to the codebase.

@tcnevin

tcnevin commented Jul 1, 2023

> I think the next big steps that need to happen are:
>
>   • Finalize ggml : unified file format ggml#220 - this will give us a unified model format that will be more future-proof and would make sense to support long term
>   • Refactor model loading llama.cpp - currently it is doing too much extra unnecessary stuff like supporting old models that no longer exist. The code needs a big refactor and simplification so that we can more easily start loading non-LLaMA models
>
> We should put a big focus on these soon and throughout July and try to bring support for most new models (MPT, Falcon, etc.) into llama.cpp
>
> Alternatively, a quick'n'dirty implementation of MPT in llama.cpp with tons of ifdefs and hacks can be done on a branch relatively quickly. But it is not something we want on master as it will bring further technical debt to the codebase.

Are there any llama.cpp branches working on MPT implementations currently?

As far as the ggml: unified file format goes, that's really interesting and I'm trying to understand it better, but could a standard "descriptive file" be developed in conjunction with it, to support unknown formats by describing the hyperparameters of whatever ggml file is supplied with it? I'm just wondering if that even makes sense, to allow non-unified files to work with readers that may accept a second "descriptive file".

@mvsoom

mvsoom commented Sep 13, 2023

Could this be easier with the new GGUF format?
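
For what it's worth, the unified format discussed earlier in the thread did land as GGUF, and its self-describing key-value metadata covers the "descriptive file" idea above. A hedged sketch of reading that metadata with ggml's gguf API (the mpt.context_length key name follows the usual {arch}.* convention and is an assumption here):

```cpp
// Read GGUF metadata without loading tensor data. The file describes itself:
// architecture and hyperparameters are stored as key-value pairs.
#include "ggml.h" // the gguf_* API was declared here at the time
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s model.gguf\n", argv[0]);
        return 1;
    }

    struct gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ nullptr };
    struct gguf_context * ctx = gguf_init_from_file(argv[1], params);
    if (!ctx) {
        return 1;
    }

    const int arch_id = gguf_find_key(ctx, "general.architecture");
    const int nctx_id = gguf_find_key(ctx, "mpt.context_length"); // assumed key name

    if (arch_id >= 0) printf("architecture  : %s\n", gguf_get_val_str(ctx, arch_id));
    if (nctx_id >= 0) printf("context length: %u\n", gguf_get_val_u32(ctx, nctx_id));

    gguf_free(ctx);
    return 0;
}
```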

@tony352

tony352 commented Sep 16, 2023

I just wanted to see if there were any updates on this. It would be great to have MPT StoryWriter in Ollama.

@maddes8cht
Contributor

I'm also very interested in progress here 😊

@ggerganov
Owner

We now kind of have a process for adding new models to llama.cpp (see Falcon, StarCoder and Baichuan).
Looking for contributions to do something similar for Mosaic.

@jploski
Contributor

jploski commented Sep 30, 2023

Some progress, see #3417

(You can help with testing by checking out https://github.com/jploski/llama.cpp/tree/mpt)

@maddes8cht
Contributor

Maybe this is the place to note that there is a pretty complete set of GGUF-quantized MPT models available at my Hugging Face account, with a handy mpt-collection.

@Galunid
Collaborator

Galunid commented Nov 2, 2023

Implemented in #3417

@Galunid closed this as completed on Nov 2, 2023