llama : initial Mamba-2 support #9126
Conversation
* ggml : improve ggml_mul speed when masking recurrent states
* ggml : make the ggml_mul fast broadcast path more consistently formatted
Force-pushed from e9b0d19 to aff9692.
Hey @compilade, thanks for implementing this! I tried converting https://huggingface.co/mistralai/Mamba-Codestral-7B-v0.1 using
Nevertheless, I successfully converted a Mamba-Codestral model and ran it (remember to select the correct chat template, since the model does not come with one):
The result looks promising, but I have no idea why there are
Link to download GGUF: https://huggingface.co/ngxson/codestral-mamba-llamacpp-test/tree/main
The steps I took to convert Mamba-Codestral-7B-v0.1 are the following:
I did not have tokenization problems in my tests, maybe because I was using the original SentencePiece tokenizer instead of a BPE tokenizer. There are probably still problems with the SentencePiece tokenizer too, but I think it should be preferred for this model; it should be easier to handle without workarounds. I should change that in
The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly.
Thanks for the guide! I've successfully converted the original repository to GGUF by following your steps. (Also cc @Vaibhavs10 since he's the maintainer of gguf-my-repo.)
Hey @compilade / @ngxson - JFYI - the transformers weights are now merged in the main repo: https://huggingface.co/mistralai/Mamba-Codestral-7B-v0.1
If you face any issues with the conversion with this, could you open an issue on the repo for us to track? 🤗
Any updates on when Codestral Mamba will be supported?
Nice work! Just a note on the ssm_scan kernel performance: a better fused implementation from the flash-linear-attention project provides functionality equivalent to Mamba-2's original kernel (sustcsonglin/flash-linear-attention#49) and runs 2x faster (sustcsonglin/flash-linear-attention#50).
Hi @compilade! I worked on repo conversion for the transformers-compatible mamba2 version, let us know if you need anything from us to move forward with this PR :)
It sounds like having a simple fallback of expected filenames would be a reasonable thing to include here? I don't know that we want to maintain a ton of different ones, but adding a second layer of fallbacks for alternate filenames doesn't feel arduous.
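To illustrate the idea, here is a minimal sketch of such a filename fallback; the candidate names and the helper are hypothetical, not what the convert script actually does:

```python
# Hypothetical filename fallback for a conversion script: try the standard
# name first, then known alternate layouts. The candidate names below are
# illustrative assumptions.
from pathlib import Path

def find_weights(model_dir: str) -> Path:
    candidates = ["model.safetensors", "consolidated.safetensors", "pytorch_model.bin"]
    for name in candidates:
        path = Path(model_dir) / name
        if path.exists():
            return path
    raise FileNotFoundError(f"no known weight file found in {model_dir}")
```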
That's not really a problem anymore (at least for Mamba-Codestral) since the official repo was updated in https://huggingface.co/mistralai/Mamba-Codestral-7B-v0.1/commit/88085f9cdfa832c3aca8a0315a4520cf7558c947 to use more standard names. What is currently blocking this is that the Metal and CUDA kernels for
Any updates on this?
The max index is 31, so trimming the arguments is necessary.
Whoops, this is needed for the offset in the concatenated output.
This was initially added because states were masked with ggml_mul, but this is no longer done and so this "optimisation" is no longer necessary, or at least not worth the additional code complexity.
This makes the weight buft detection in src/llama.cpp simpler.
* convert : transpose Mamba-2 A, D and reshape SSM_NORM
This breaks existing conversions of Mamba-2 models to avoid some reshapes. Not sure if it's a good idea, but it makes the graph slightly cleaner.
* llama : more appropriate SSM_SCAN and SSM_CONV buft support checks
Follow-up from #8519 (comment). This should fix #7727 and fix #8519.
I've implemented the fully recurrent mode of Mamba-2, because it's very similar to Mamba-1, and also because it seems like the most appropriate mode for text generation.
This does not implement the sequentially semistructured matrix mode, because I'm not yet sure how the block decomposition would fit within the `batch` and `ubatch` framework of `llama.cpp`, and how the chunk size should be chosen. If the recurrent mode is faster at single-user auto-regressive text generation, then I'm not sure how to keep the graph node structure constant when using the most appropriate technique for the batch size.

If the sequentially semistructured matrix mode is eventually implemented, it should help with prompt processing speed for large prompts.
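For reference, a minimal sketch of the per-head recurrence that this fully recurrent mode corresponds to, written in the notation of the Mamba-2 (SSD) paper linked below rather than in terms of the ggml tensor names:

$$
h_t = \exp(\Delta_t A)\, h_{t-1} + \Delta_t\, B_t\, x_t^{\top}, \qquad
y_t^{\top} = C_t^{\top} h_t + D\, x_t^{\top}
$$

with a per-head state $h_t \in \mathbb{R}^{N \times P}$ ($N$ = `d_state`, $P$ = head size), $x_t \in \mathbb{R}^{P}$, $B_t, C_t \in \mathbb{R}^{N}$, and $A$, $D$, $\Delta_t$ scalars per head. Because $A$ is a single scalar per head, $\exp(\Delta_t A)$ scales the whole state of that head, which is why `ssm_a` can simply be broadcast in `ggml_ssm_scan` (see the summary of changes below).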
What to expect
(mostly taken from #8519 (comment))
The state in Mamba-2 is bigger than I thought; Mamba-Codestral-7B-v0.1 takes 263.5 MiB (in `F32`) per sequence (e.g. with `-np 1`), compared to 38 MiB (also in `F32`) for Falcon-Mamba-7B (which is based on Mamba-1). But that remains constant whatever the context size. Mamba-2 is easier to implement efficiently, so the bigger state does not really impede inference speed.

However, a big downside right now with recurrent models in `llama.cpp` is the lack of state rollback (which is implemented through state checkpoints in #7531, but needs to be re-adapted to #8526), so the prompt will be reprocessed a lot if using `llama-server`. I think using `llama-cli` in conversation mode does not have this problem, however (or maybe only the bare interactive mode with `--in-prefix` and `--in-suffix`, not sure).

This initial implementation is CPU-only, but it uses SIMD for the SSM scan, so even though the state is bigger than for Mamba-1 models, in my tests the speed of `Mamba2-130M` is similar to or better than `Mamba-130M` (but still not that fast compared to transformer-based models with an empty context), when both are run on CPU. The speed of Mamba-2 models seems comparable to Transformer-based models when the latter have 2k to 4k tokens in their context.
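As a rough, back-of-the-envelope check of that 263.5 MiB figure (not from the PR itself; the hyperparameters and the conv-state layout below are my assumptions about Mamba-Codestral-7B-v0.1 and about what is kept per sequence):

```python
# Hypothetical estimate of the per-sequence state size for Mamba-Codestral-7B-v0.1,
# assuming n_layer=64, d_model=4096, expand=2, d_state=128, d_conv=4, n_groups=8.
n_layer, d_model, expand, d_state, d_conv, n_groups = 64, 4096, 2, 128, 4, 8
d_inner = expand * d_model  # 8192

ssm_state = d_inner * d_state                                   # recurrent SSM state per layer
conv_state = (d_conv - 1) * (d_inner + 2 * n_groups * d_state)  # rolling conv window per layer

bytes_per_seq = 4 * n_layer * (ssm_state + conv_state)          # 4 bytes per F32 element
print(f"{bytes_per_seq / 2**20:.1f} MiB")                       # -> 263.5 MiB
```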
Summary of changes
- Support `Mamba2ForCausalLM` in the convert script (including the official Mamba-2 models, and Mamba-Codestral-7B-v0.1)
  - `config.json` needs to contain `"architectures": ["Mamba2ForCausalLM"],` for the convert script to properly detect the architecture (see the sketch after this list).
- Mamba-1 models are now handled as having `d_inner` (aka `2 * n_embd`) heads of size 1, so Mamba-1 and Mamba-2 share the same `ggml_ssm_scan` operator.
- `ggml`
  - Implement the Mamba-2 variant of the state update in `ggml_ssm_scan` (`ssm_a` is broadcast).
  - Fuse the multiplication by `ssm_d` into `ggml_ssm_scan`.
  - Implement the SSM scan with `GGML_SIMD`; this is possible because there is no `expf` in the state update, unlike with Mamba-1.
  - Profile `ggml_ssm_scan` with `perf`.
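For the `config.json` requirement above, a minimal sketch of how one might add the missing field before running the convert script (the local path and the use of `setdefault` are illustrative assumptions, not part of this PR):

```python
# Hypothetical helper: ensure config.json declares the Mamba2ForCausalLM
# architecture so the convert script can detect it. The local model directory
# name below is an assumption.
import json
from pathlib import Path

cfg_path = Path("Mamba-Codestral-7B-v0.1/config.json")
cfg = json.loads(cfg_path.read_text())
cfg.setdefault("architectures", ["Mamba2ForCausalLM"])
cfg_path.write_text(json.dumps(cfg, indent=2) + "\n")
```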
Other
Here's my favorite quote from Section 3.3 of https://arxiv.org/abs/2405.21060:
TODO
- Rebase onto `master` after merging llama : simplify Mamba with advanced batch splits #8526.
- Update the Metal and CUDA kernels of `ggml_ssm_scan` for the Mamba-2 changes.
- Remove the `GGML_MUL` fast broadcast path because it's not used anymore to mask the states.
- Maybe use a new metadata key instead of `{arch}.ssm.time_step_rank` for the number of heads of Mamba-2, because it's not really the rank of the time step (well, maybe kind of).
- Keep the multiplication by `ssm_d` in `ggml_ssm_scan`?
- Maybe split `ggml_ssm_scan` to separate the implementations for Mamba-1 and Mamba-2, although they do have a lot in common.