Phi 3 medium/small support #7439
Happens with the 128k variants as well. I tried both!
Whoops, thanks, forgot and got lazy with pasting links lol
Try all models using #7225 and report any issues.
Building now, will report any updates. Also will try running quants created with those changes, just to see if it works (need these changes for imatrix though, of course).
Will llama.cpp work with blocksparse attention? These models seem to implement it.
Do we also get vision support for Phi-3-Vision? I don't know how much this diverges from other archs like LLaVA.
Could you remind me what blocksparse attention was? Edit: No, there is no API for that atm.
@qnixsynapse Is this technique actually used in practice? I don't see how one would choose the attention mask in a reasonable way without the LLM "forgetting" important bits from the context.
Normally I would not prefer it; however, I saw this, which caught my attention.
I tried it with #7225 using the 128k variants:
microsoft/Phi-3-medium-128k-instruct: […]
microsoft/Phi-3-small-128k-instruct: bf16 gguf creation still fails with: […]
I tried the dumb fix for the "Phi3SmallForCausalLM not supported" error, and now it fails with: […]
Using that PR, imatrix works, which should likely imply that generation will work. Quants created before it don't, so any that are floating out there without the PR will be broken.
Phi3-medium-128k run: […]
Try increasing the context; you are using only 512, so there are probably context shifts happening.
That did the trick.
Tried with https://github.com/ggerganov/llama.cpp/pull/7225 and it worked, but only with that version (PR 7225). If you use the latest main to run gguf files created with this version, it will show this error: llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 245, got 243
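For anyone hitting the tensor-count mismatch above, one way to see what actually differs between a gguf converted with the PR and one converted before it is to list the tensor names with the gguf Python package (gguf-py, shipped in the llama.cpp repo). A minimal sketch, assuming GGUFReader exposes a tensors list with name attributes; the file names are hypothetical:

```python
from gguf import GGUFReader  # gguf-py from the llama.cpp repository

def tensor_names(path: str) -> set[str]:
    # Collect the names of all tensors stored in a GGUF file.
    return {t.name for t in GGUFReader(path).tensors}

old = tensor_names("phi-3-medium-old.gguf")     # hypothetical: converted before PR #7225
new = tensor_names("phi-3-medium-pr7225.gguf")  # hypothetical: converted with PR #7225
print("only in new:", sorted(new - old))
print("only in old:", sorted(old - new))
```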
With what model did this work for you? With convert-hf-to-gguf.py? Probably medium? Or, if small, how did you get around tokenizer.model?
Can anyone post a working f16?
So far so good on the 4k GGUF; it's able to respond to queries, which is good enough for me lol. Uploaded here: https://huggingface.co/bartowski/Phi-3-medium-4k-instruct-GGUF
It's loading now and working great with bartowski/Phi-3-medium-4k-instruct-GGUF/Phi-3-medium-4k-instruct-Q4_K_S.gguf on LM Studio.
Looks great with medium, but small seems to need more work. In a first test, I can override some gguf_parameters and just use self._set_vocab_qwen(), and it will convert and quantize, but then it won't run and instead throws "llama_model_load: error loading model: check_tensor_dims: tensor 'output.weight' not found".
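For context, the kind of override described above would live in convert-hf-to-gguf.py. A rough, hypothetical sketch only, not a working patch: it assumes the Model.register decorator, the set_vocab hook, and the existing Phi-3 handler class in that script (the base class name below is an assumption), and as noted the resulting gguf still fails to load.

```python
# Hypothetical fragment for convert-hf-to-gguf.py: illustrates the workaround
# described above, not an actual fix.

@Model.register("Phi3SmallForCausalLM")
class Phi3SmallModel(Phi3MiniModel):       # base class name is an assumption
    model_arch = gguf.MODEL_ARCH.PHI3      # reuse the existing phi3 graph for now

    def set_vocab(self):
        # Phi-3-small ships no sentencepiece tokenizer.model, so fall back to the
        # BPE-style vocab path, as the comment above does with _set_vocab_qwen().
        self._set_vocab_qwen()
```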
Here is an issue I've been running into: https://0.0g.gg/?8b6aa2a822f73b75#6dSFckfnxCttPKUX7rX4b35WEdt6woLdK65DTpSWSZ4w The link is a paste of the model just completely imploding in on itself from a basic word problem.
Btw, I think if you're using something like LM Studio you aren't getting the right behavior. It fails the tokenizer test of 3333+7777, but using the PR, ./main gets it right. Likely need to wait for the merge and a version bump.
I am using a bleeding-edge llama.cpp commit and it's doing that, which is odd...
Something is wrong with Phi-3-medium-4k-instruct output; I am often getting a weird "B:" out of the blue. Launched via: […] using current master (9b3d833).
AFAIK the […]
@MoonRide303 @tristandruyen Just FYI - it looks like that […]
I'm actually the one working on the --chat-template issue in #7449. However, it seems like @MoonRide303's issue is more related to the web UI not using any model-specific template, not that it's using the wrong one. The fix I'm working on in #7449 aims to improve the auto-detection of the phi3 template, so users won't have to explicitly specify it using the --chat-template flag. This fix will ensure that llama.cpp automatically detects and uses the appropriate template for the model. Note that the behavior of the endpoints and the web UI will remain unchanged after my fix is merged: the web UI will still not use any model-specific template; only the auto-detection process will be more reliable.
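For reference, this is roughly the prompt layout that --chat-template phi3 is meant to produce, and what a front end that skips the template misses. A minimal sketch based on the format published on the Phi-3 model cards; the role tokens below come from there, not from llama.cpp internals:

```python
# Build a Phi-3 style prompt from a list of chat messages (illustrative sketch).
def format_phi3(messages: list[dict[str, str]]) -> str:
    parts = [f"<|{m['role']}|>\n{m['content']}<|end|>\n" for m in messages]
    parts.append("<|assistant|>\n")  # leave the assistant turn open for generation
    return "".join(parts)

print(format_phi3([{"role": "user", "content": "What is 3333+7777?"}]))
# <|user|>
# What is 3333+7777?<|end|>
# <|assistant|>
```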
@ggerganov Thanks for your interest in supporting phi-3-small. I am the author of the blocksparse attention in phi-3-small. I am not very familiar with ollama, but I can help explain the details. The kernel is implemented in Triton, but you can find the code that generates the dense version of the attention mask here. There is also a vllm paged attention version. I tested other models with ollama on my mac; it is super responsive and cool. Hope I can have our phi-3-small model on my mac as well!
@linxihui Thanks for the information. Is my understanding correct that the vllm implementation skips non-attended blocks (i.e. blocks full of -inf)? If my understanding is correct, then I think we can easily support this in the Metal backend, as the block-skipping logic is already there.
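To make the exchange above concrete, here is a toy dense block-sparse causal mask of the kind being discussed: each query block attends to its most recent blocks plus strided earlier blocks, and fully non-attended blocks end up filled with -inf, which is what a backend could skip. This is only an illustration of the general idea; the actual Phi-3-small pattern comes from its config and the Triton kernel linxihui mentions, and the parameter names below are made up:

```python
import numpy as np

def block_sparse_mask(n_tokens: int, block: int, local_blocks: int, vert_stride: int) -> np.ndarray:
    """Toy dense block-sparse causal mask: 0.0 where attended, -inf where masked."""
    n_blocks = (n_tokens + block - 1) // block
    keep = np.zeros((n_blocks, n_blocks), dtype=bool)
    for q in range(n_blocks):
        for k in range(q + 1):  # causal: only current and past blocks
            if q - k < local_blocks or (k + 1) % vert_stride == 0:
                keep[q, k] = True
    # Expand the block pattern to token resolution, then re-apply the causal mask.
    tok = keep.repeat(block, axis=0).repeat(block, axis=1)[:n_tokens, :n_tokens]
    tok &= np.tril(np.ones((n_tokens, n_tokens), dtype=bool))
    return np.where(tok, 0.0, -np.inf)

print(block_sparse_mask(8, block=2, local_blocks=1, vert_stride=2))
```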
Still can't convert phi-3-small :( Phi3SmallForCausalLM unsupported :(
This issue was closed because it has been inactive for 14 days since being marked as stale. |
Please, can you reopen this? We need phi-3 small.
I agree with that; 4k context is simply not enough.
FYI: Microsoft has just released Phi-3.5 models, with the mini version having 128k context. See https://huggingface.co/collections/microsoft/phi-3-6626e15e9585a200d2d761e3. It doesn't have GGUF quants yet, because... because of this issue. Let's get to it! 💪 Edit: Just tested https://huggingface.co/bartowski/Phi-3.5-mini-instruct-GGUF with a context size of 8192; works well, which fits my use case.
This issue was closed because it has been inactive for 14 days since being marked as stale. |
Original issue:
Two new models released from Microsoft:
https://huggingface.co/microsoft/Phi-3-medium-4k-instruct/
https://huggingface.co/microsoft/Phi-3-small-8k-instruct/
Medium uses Phi3ForCausalLM and converts without issue, but when trying to generate it fails with an invalid tensor shape (see the breakdown below):
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.attn_qkv.weight' has wrong shape; expected 5120, 15360, got 5120, 7680, 1, 1
Small, meanwhile, uses a new architecture tag, 'Phi3SmallForCausalLM'.
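The tensor-shape error for medium above is consistent with grouped-query attention: the gguf carries a fused QKV tensor sized for separate Q/K/V head counts, while the loader expects three full-width projections. A minimal sketch of the arithmetic, assuming hidden_size 5120, head_dim 128, 40 query heads, and 10 KV heads (values taken to match the published Phi-3-medium config at the time); treat it as an illustration of where the numbers come from, not a statement about where the bug actually was:

```python
# Worked arithmetic for the 'blk.0.attn_qkv.weight' shape mismatch (illustrative only).
hidden_size = 5120
head_dim = 128
n_head = 40        # query heads (assumed from the model config)
n_head_kv = 10     # key/value heads under grouped-query attention (assumed)

rows_gqa = n_head * head_dim + 2 * n_head_kv * head_dim  # fused QKV with GQA
rows_mha = 3 * hidden_size                               # fused QKV if n_head_kv == n_head

print(rows_gqa)  # 7680  -> the "got" value in the error
print(rows_mha)  # 15360 -> the "expected" value
```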