Implement MosaicML's 7B model. #1333
Comments
Let's start with a basic inference example in the ggml repo. If it lives up to the hype, we can think about also integrating it in llama.cpp.
Licensed as Apache 2.0, and with a context length of 65k! Yes, it would be great to have this supported in llama.cpp.
That's a looooong context.
As someone who likes to occasionally use LLMs as an aid in writing, the idea of a 65k token context is exciting. Only 7B parameters is a little disappointing, but one thing we're learning is not to judge a model by its parameter count. One of the first things I did when I found this project was to hack my own custom context reset to restart on a sentence boundary and leave only ~10% free space for context generation instead of 50%, just to keep the context more relevant. It was terribly inefficient, but that's how badly I wanted a longer context length. There's really no substitute for having more (relevant) text in the context.
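Not part of the original comment and not the commenter's actual patch: below is a minimal, self-contained C++ sketch of the idea, cutting the retained history at a sentence boundary while keeping roughly 90% of n_ctx occupied. The function name and the one-string-per-token simplification are illustrative assumptions.

```cpp
// Sketch only: when the context window fills up, instead of dropping half of
// the history, keep as much as fits into ~90% of n_ctx and move the cut point
// forward to the next sentence boundary so the retained text starts cleanly.
#include <cstddef>
#include <string>
#include <vector>

// Returns the index of the first token to keep (tokens are simplified to one
// string each for illustration).
std::size_t find_reset_point(const std::vector<std::string> & tokens, std::size_t n_ctx) {
    const std::size_t target_used = n_ctx - n_ctx / 10;  // leave ~10% free for generation
    if (tokens.size() <= target_used) {
        return 0;                                         // nothing to evict yet
    }
    std::size_t first_keep = tokens.size() - target_used; // naive cut point
    // Advance the cut to just after the next '.', '!' or '?' so the kept
    // context begins at a sentence boundary.
    for (std::size_t i = first_keep; i + 1 < tokens.size(); ++i) {
        const std::string & t = tokens[i];
        if (!t.empty() && (t.back() == '.' || t.back() == '!' || t.back() == '?')) {
            return i + 1;
        }
    }
    return first_keep;                                     // no boundary found, fall back
}
```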
I also opened an issue for it here: mosaicml/llm-foundry#60
@jploski thanks for starting this conversation in the llm-foundry issue. Would you be open to sharing that branch of ggml?
FWIW: https://github.com/jploski/ggml/tree/mpt-experiment/examples/mpt

See the commit comments, the "TODO" notes in the source code, and README.md in examples/mpt for things that I do not understand. The main challenge seems to be that MPT uses a transformer model with customized code (found in their llm-foundry repository), so it is probably silly to expect the stablelm code to just work. All I did was some (rather uninformed) guessing.

Also note that the inference will not even start for mpt-7b-storywriter using the default context length of 65535 - it will just complain about "not enough space in the context's memory pool" and segfault. But this can be worked around by specifying a smaller n_ctx (rather than letting it load from the GGML/model config).

Please do not let this first attempt raise false hopes or stand in the way of an actual implementation. I do not feel qualified enough (with respect to understanding transformers) to implement it. At best my branch can save someone some typing for the boilerplate (and at worst it can mislead, too).
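An editorial sketch, not code from the branch above: the "not enough space in the context's memory pool" failure follows from how the KV cache grows with n_ctx, and the workaround is simply to clamp the context length read from the model config. The struct and function names here are made up for illustration; with MPT-7B-like dimensions and n_ctx = 65536, the fp16 KV cache alone comes to roughly 32 GiB.

```cpp
// Sketch: clamp the context length from the model file to a user-supplied
// limit before sizing the KV cache, so the 65k default does not exhaust the
// memory pool. Not the real ggml example structures.
#include <algorithm>
#include <cstdint>
#include <cstdio>

struct mpt_hparams_sketch {
    int32_t n_ctx   = 65536; // max_seq_len from the model config (illustrative)
    int32_t n_embd  = 4096;  // illustrative MPT-7B-like dimensions
    int32_t n_layer = 32;
};

std::size_t plan_kv_bytes(mpt_hparams_sketch & hp, int32_t user_n_ctx) {
    if (user_n_ctx > 0) {
        hp.n_ctx = std::min(hp.n_ctx, user_n_ctx); // override the huge default
    }
    // fp16 K and V per layer: 2 tensors * 2 bytes * n_ctx * n_embd * n_layer
    const std::size_t kv_bytes =
        2ull * 2ull * (std::size_t) hp.n_ctx * hp.n_embd * hp.n_layer;
    std::printf("KV cache: %.2f GiB for n_ctx = %d\n",
                kv_bytes / (1024.0 * 1024.0 * 1024.0), (int) hp.n_ctx);
    return kv_bytes;
}
```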
Just be aware that your RAM may run out, and even if you evict to disk, it will be extremely slow due to quadratic scaling.
@jon-chuang I think with ALiBi it will not be quadratic scaling. Correct me if I am wrong.
Generation speed for the StoryWriter model: at token 1000, about 300 ms per token. So if the number of tokens generated is increased 8 times, the generation time per token increases about 8.3 times.
@s-kostyaev ALiBi is a positional encoding method, and has nothing to do with the cost of attention. https://paperswithcode.com/method/alibi

@klosax exactly, that is quadratic scaling. Note that StoryWriter (and similarly Claude's 100K context length) are largely impractical at the claimed lengths. I am betting on the next generation of models with log-linear context scaling based on long convolutions to gain prominence. See https://github.com/HazyResearch/safari
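An editorial aside, my own back-of-the-envelope with assumed constants c0 and c1, sketching why ALiBi does not remove the quadratic scaling and why the timing reported a few comments above fits a linear-in-position per-token cost:

```latex
% Back-of-the-envelope sketch (assumed constants $c_0, c_1$), not a quote from the thread.
% With a KV cache, generating the token at position $t$ attends over $t$ cached tokens:
\[
  \mathrm{cost}(t) \approx c_0 + c_1 t
  \qquad\Longrightarrow\qquad
  \mathrm{total}(n) = \sum_{t=1}^{n} \mathrm{cost}(t) \approx c_0 n + \tfrac{c_1}{2} n^2 .
\]
% ALiBi only replaces learned positional embeddings with a distance-dependent bias on the
% attention scores; the $t$ dot products per step remain:
\[
  \mathrm{softmax}\!\left(\frac{q_t K_{1:t}^{\top}}{\sqrt{d}}
    + m \cdot \bigl[-(t-1), \dots, -1, 0\bigr]\right) V_{1:t} .
\]
% Hence per-token latency grows roughly linearly with position, consistent with the
% observation above that generating 8x more tokens gave roughly 8.3x the per-token time.
```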
Any plans/updates?
Maybe the integration will become easier after Falcon #1602, because that could be the first non-LLaMA model to obtain llama.cpp support and pave the way for others.
Working MPT inference can be found here: ggml/examples/mpt
How close is this to building a main.exe that works with MPT models?
Just checking in as well: with the ggml example, would we be able to get an implementation? @ggerganov
I think the next big steps that need to happen are:

- ggml : unified file format
- refactoring llama.cpp so that new model architectures can be added more easily

We should put a big focus on these soon and throughout July and try to bring support for most new models (MPT, Falcon, etc.) into llama.cpp.

Alternatively, a quick'n'dirty implementation of MPT in llama.cpp is also an option.
Are there any llama.cpp branches currently working on MPT implementations? As for the ggml unified file format, that's really interesting and I'm trying to understand it better, but could a standard "descriptive file" be developed alongside it to support unknown formats, by describing the hyperparameters of whatever ggml file it is supplied with? I'm just wondering if that even makes sense, to allow non-unified files to work with readers that accept a second "descriptive file".
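Purely a hypothetical illustration of the "descriptive file" idea above, not anything that exists in ggml: a plain key=value sidecar that a generic reader could consult for an otherwise opaque model file. (In practice, the GGUF format mentioned in the next comment solved this by embedding the metadata in the model file itself.)

```cpp
// Hypothetical sidecar reader: parse key=value lines describing the
// hyperparameters of an accompanying ggml file. All names and values below
// are illustrative.
#include <fstream>
#include <map>
#include <string>

std::map<std::string, std::string> load_sidecar(const std::string & path) {
    std::map<std::string, std::string> kv;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        const auto eq = line.find('=');
        if (eq == std::string::npos || line[0] == '#') continue; // skip comments/blank lines
        kv[line.substr(0, eq)] = line.substr(eq + 1);
    }
    return kv;
}

// Example sidecar contents (purely illustrative):
//   arch=mpt
//   n_vocab=50432
//   n_embd=4096
//   n_layer=32
//   n_head=32
//   max_seq_len=65536
```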
Could this be easier with the new GGUF format?
I just wanted to see if there are any updates on this. It would be great to have MPT StoryWriter in Ollama.
I'm also very interested in progress here.
We now kind of have a process for adding new models to llama.cpp.
Some progress, see #3417. (You can help with testing by checking out https://github.com/jploski/llama.cpp/tree/mpt)
Maybe this is the place to note that there is a pretty complete set of GGUF-quantized MPT models available at my Hugging Face account, with a handy mpt-collection.
Implemented in #3417
Comparable to Llama in results, I believe, and also commercially available for use!
https://huggingface.co/mosaicml/mpt-7b
https://www.mosaicml.com/blog/mpt-7b