MPT support in llama.cpp #3417
Conversation
…odified with deltas from ggml/examples/mpt
quantize warns because it is looking for attn_k and not attn_qkv:
Now fixed as well.
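For context, the warning comes from quantize only looking for a separate attn_k tensor, while MPT stores a single fused attn_qkv tensor per layer. A minimal sketch of a name check that accepts either layout; the helper and tensor names are illustrative, not the actual quantize code:

```cpp
#include <iostream>
#include <string>
#include <vector>

// Hypothetical illustration: a sanity check that only looks for separate
// attn_k tensors warns on fused-QKV models like MPT even though the model is
// fine. Accepting either name avoids the spurious warning.
static bool has_attention_tensors(const std::vector<std::string> & names) {
    for (const auto & n : names) {
        if (n.find("attn_k.weight")   != std::string::npos ||
            n.find("attn_qkv.weight") != std::string::npos) {
            return true;
        }
    }
    return false;
}

int main() {
    std::vector<std::string> mpt = { "blk.0.attn_qkv.weight", "blk.0.ffn_up.weight" };
    std::cout << (has_attention_tensors(mpt) ? "ok" : "warn: no attention tensors") << "\n";
}
```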
…rom metadata rather than use 0.0 to indicate "no clamping" (more compliant with the current GGUF spec?)
…T_KEY macro instead of duplicate code
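As a rough sketch of what reading such an optional value from metadata could look like with the gguf C API, tracking presence instead of overloading 0.0 as "no clamping"; the key name is an assumption based on the GGUF naming convention, and this is not the PR's actual code:

```cpp
#include "ggml.h"

// Sketch only: read an optional clamping value from GGUF metadata and record
// whether it was present, instead of using 0.0f as a sentinel for "no clamping".
// The key name "mpt.attention.clamp_kqv" is illustrative.
struct clamp_setting {
    bool  present = false;
    float value   = 0.0f;
};

static clamp_setting read_clamp_kqv(const struct gguf_context * ctx) {
    clamp_setting s;
    const int key_id = gguf_find_key(ctx, "mpt.attention.clamp_kqv");
    if (key_id >= 0) {
        s.present = true;
        s.value   = gguf_get_val_f32(ctx, key_id);
    }
    return s;
}
```

Presumably the GGUF_GET_KEY-style macro mentioned above wraps this find-then-read pattern so it is not repeated for every hyperparameter.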
…nd rope_shift from build_mpt
Note that this PR does not yet include the modifications to the convert script proposed in #3252 and referred to in #3417 (comment). Since this PR is based on a pre-merge commit of #3252, it may be easier to add this change after the merge.
…nvert-gptneox-hf-to-gguf.py in pr:3252
@cebtenzzre Thanks for the merge. If anyone can give this a quick try and confirm it works, we should merge.
Works for me. The PR is now almost the same as my own previous private merge attempt. The disable-n_past-assertion changes to ggml_compute_forward_alibi_f16 and ggml_compute_forward_alibi_f32 could be made syntactically more consistent, but AFAICS they are functionally equivalent. So not a showstopper for merging into master.
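For reference, the bias those alibi kernels apply follows the standard ALiBi formulation (a per-head slope times the key/query distance). A self-contained sketch of that formula, independent of the actual ggml implementation:

```cpp
// Standalone sketch of the standard ALiBi bias (Press et al.): each head gets
// a fixed slope, and the attention score for query position i and key position
// j is biased by slope * (j - i), so more distant keys get a more negative bias.
// This mirrors the idea, not the ggml kernel code.
#include <cmath>
#include <cstdio>

int main() {
    const int   n_head   = 8;
    const float bias_max = 8.0f;   // MPT's alibi_bias_max hyperparameter

    for (int h = 0; h < n_head; ++h) {
        // per-head slope: 2^(-bias_max * (h + 1) / n_head)
        const float slope = std::pow(2.0f, -bias_max * (h + 1) / n_head);
        const int i = 10, j = 3;   // example query/key positions
        std::printf("head %d: slope %.5f, bias for (i=%d, j=%d): %.5f\n",
                    h, slope, i, j, slope * (j - i));
    }
}
```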
…g hparams["vocab_size"]
Tested this, works fine for me. The test failure in test-tokenizer-1-bpe is due to added tokens. I'll fix this in a future PR.
…example

* 'master' of github.com:ggerganov/llama.cpp: (34 commits)
  examples: support LLaVA v1.5 (multimodal model) (ggerganov#3436)
  docs : fix typo GOMP_CPU_AFFINITY (ggerganov#3597)
  cmake : fix add_compile_options on macOS
  typo : it is `--n-gpu-layers` not `--gpu-layers` (ggerganov#3592)
  ci : check if there is enough VRAM (ggerganov#3596)
  server : add completion mode (no chat) (ggerganov#3582)
  prompts : add mnemonics.txt
  server : fix kv cache management (ggerganov#3588)
  main : fix session loading bug (ggerganov#3400)
  server : add parameter -tb N, --threads-batch N (ggerganov#3584)
  common : fix mirostat state when using multiple sequences (ggerganov#3543)
  batched : add bench tool (ggerganov#3545)
  examples : add batched.swift + improve CI for swift (ggerganov#3562)
  Add MPT model to supported models in README.md (ggerganov#3574)
  Minor improvements in GPT2 tokenizer (ggerganov#3567)
  readme : add bloom (ggerganov#3570)
  llm : add bloom models (ggerganov#3553)
  swift : improvements and fixes (ggerganov#3564)
  llm : add MPT support (ggerganov#3417)
  infill. : fix tokenization (ggerganov#3508)
  ...
Co-authored-by: Cebtenzzre <cebtenzzre@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> (cherry picked from commit f5f9121)
I converted mpt-7b-chat and mpt-7b-storywriter. The conversion and quantization complete successfully and produce the .gguf files. However, the files don't work for me. When running main with them, I get an error.
For reference, here is the full output:
I have already successfully converted a bunch of falcon models that work fine, but the mpt conversion script does not work for me.
Here is a hexdump of the beginning of the files:
in comparison to the OpenBuddy falcon conversion that works fine:
What I notice is that after In contrast, the actual falcon model has a
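If it helps to compare the two files beyond a raw hexdump, a small standalone sketch like the following can decode the leading GGUF header fields; it assumes a little-endian GGUF v2-or-later file with 64-bit counts:

```cpp
// Small standalone sketch: print the GGUF header fields of a file so two model
// files can be compared field-by-field instead of eyeballing a hexdump.
// Assumes a little-endian GGUF v2-or-later header (64-bit counts).
#include <cstdint>
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) { std::fprintf(stderr, "usage: %s model.gguf\n", argv[0]); return 1; }

    FILE * f = std::fopen(argv[1], "rb");
    if (!f) { std::perror("fopen"); return 1; }

    char     magic[4];
    uint32_t version   = 0;
    uint64_t n_tensors = 0;
    uint64_t n_kv      = 0;

    if (std::fread(magic, 1, 4, f) != 4 ||
        std::fread(&version,   sizeof(version),   1, f) != 1 ||
        std::fread(&n_tensors, sizeof(n_tensors), 1, f) != 1 ||
        std::fread(&n_kv,      sizeof(n_kv),      1, f) != 1) {
        std::fprintf(stderr, "failed to read header\n");
        std::fclose(f);
        return 1;
    }
    std::fclose(f);

    std::printf("magic: %.4s  version: %u  tensors: %llu  kv pairs: %llu\n",
                magic, (unsigned) version,
                (unsigned long long) n_tensors, (unsigned long long) n_kv);
    return 0;
}
```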
As per #1333 (comment)
Some comments regarding this initial implementation: