Add Phi-3.5-MoE #946
@awni ready ✅
Yes, the old SuRoPE was really slow, so we made our fast RoPE more flexible and implemented SuRoPE using it, which really speeds things up for the Phi models that use it.
Thanks for the addition!
My pleasure! I'm always happy to help :)
I'm using an M1 Ultra, and the speed discrepancy between 8-bit and 4-bit is enormous.
4bit: Prompt: 10 tokens, 13.931 tokens-per-sec
8bit: Prompt: 10 tokens, 4.455 tokens-per-sec
Is this an expected outcome? It seems disproportionately large!
That looks odd. I will investigate tomorrow 👌🏾
Can you share the link to the 8bit model you used?
This looks like an issue with swapping / too much memory use. Once that happens, the generation speed plummets. If your machine has enough RAM for the 8-bit model (likely 64GB is the minimum, maybe 48 but it's a stretch with 8-bit) then it could be related to memory wiring issues. You could check out this related issue #776. The most consistent solution has been to upgrade to Sequoia (macOS 15.0) and set For older OS sometimes setting
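As a rough sanity check on the RAM figures above, here is a back-of-the-envelope sketch of the weight footprint at each bit width. It is only an estimate under stated assumptions: the parameter count (about 42B total for Phi-3.5-MoE), the quantization group size of 64, and one 16-bit scale plus one 16-bit bias per group are all assumptions, and the helper name `quantized_weight_gb` is hypothetical. It ignores the KV cache, activations, and anything else the OS keeps resident.

```python
def quantized_weight_gb(n_params, bits, group_size=64, scale_bias_bits=32):
    """Approximate size of quantized weights in GB (1 GB = 1e9 bytes).

    Assumes every weight is stored with `bits` bits, plus `scale_bias_bits`
    of per-group metadata shared by `group_size` weights.
    """
    bits_per_weight = bits + scale_bias_bits / group_size
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: approximate total parameter count of the model.
N_PARAMS = 42e9

for bits in (4, 8):
    size_gb = quantized_weight_gb(N_PARAMS, bits)
    print(f"{bits}-bit weights: ~{size_gb:.1f} GB")
```

Under these assumptions the 8-bit weights alone come out at roughly twice the 4-bit footprint, which is in line with the suggestion that 48GB would be a stretch for the 8-bit model.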
I downloaded https://huggingface.co/mlx-community/Phi-3.5-MoE-instruct-8bit with I'm using a 128GB M1 system so it's not a swap issue.
Still WIP