
Add Phi-3.5-MoE #946

Merged
merged 6 commits into from
Aug 24, 2024

Conversation

Blaizzy
Contributor

@Blaizzy Blaizzy commented Aug 20, 2024

Still WIP

@Blaizzy Blaizzy marked this pull request as ready for review August 24, 2024 06:49
@Blaizzy
Contributor Author

Blaizzy commented Aug 24, 2024

@awni ready ✅
[Screenshot: 2024-08-24 at 8:50:53 AM]

@Blaizzy
Contributor Author

Blaizzy commented Aug 24, 2024

Generation suddenly got 2x faster 🔥

[Screenshot: 2024-08-24 at 9:28:53 AM]

@awni
Member

awni commented Aug 24, 2024

Yes, the old SuRoPE was really slow, so we made our fast RoPE more flexible and implemented SuRoPE with it, which really speeds things up for the Phi models that use it.
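For anyone curious, the core idea behind SuRoPE (the scaled rotary embedding used by the long-context Phi models) is standard RoPE with a per-dimension scaling factor applied to the rotation frequencies. This is an illustrative NumPy sketch, not the mlx implementation; the factor values below are made up, and the real models pick `short_factor` or `long_factor` based on context length:

```python
import numpy as np

def su_rope_freqs(dims, base=10000.0, factors=None):
    # Standard RoPE inverse frequencies, one per (even, odd) dimension pair.
    inv_freq = 1.0 / (base ** (np.arange(0, dims, 2) / dims))
    if factors is not None:
        # SuRoPE divides each frequency by a per-pair scaling factor.
        inv_freq = inv_freq / np.asarray(factors)
    return inv_freq

def apply_rope(x, positions, inv_freq):
    # x: (seq, dims). Rotate each (even, odd) pair by a position-dependent angle.
    angles = np.outer(positions, inv_freq)  # (seq, dims // 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

The speedup comes from running this scaled variant through the fused fast RoPE kernel instead of composing it from many elementwise ops.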


@awni awni left a comment


Thanks for the addition!

@awni awni merged commit b5e18ef into ml-explore:main Aug 24, 2024
4 checks passed
@Blaizzy
Contributor Author

Blaizzy commented Aug 24, 2024

My pleasure!

I'm always happy to help :)

@nickludlam

I'm using an M1 Ultra, and the speed discrepancy between 8-bit and 4-bit is enormous.

4bit

Prompt: 10 tokens, 13.931 tokens-per-sec
Generation: 100 tokens, 59.524 tokens-per-sec

8bit

Prompt: 10 tokens, 4.455 tokens-per-sec
Generation: 100 tokens, 0.476 tokens-per-sec

Is this an expected outcome? The gap seems disproportionately large!
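For scale: naively, 8-bit weights move roughly twice the bytes of 4-bit, so one might expect around half the generation speed. The numbers above show a far larger gap:

```python
# Generation throughput reported above (tokens per second).
tps_4bit = 59.524
tps_8bit = 0.476

slowdown = tps_4bit / tps_8bit
print(f"8-bit generation is ~{slowdown:.0f}x slower than 4-bit")  # ~125x, not ~2x
```

A two-orders-of-magnitude slowdown points to something other than quantization overhead, such as swapping or memory wiring.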

@Blaizzy
Contributor Author

Blaizzy commented Aug 31, 2024

That looks odd.

I will investigate tomorrow 👌🏾

@Blaizzy
Contributor Author

Blaizzy commented Aug 31, 2024

Can you share the link to the 8bit model you used?

@awni
Member

awni commented Sep 1, 2024

This looks like an issue with swapping / too much memory use. Once that happens the generation time plummets.

If your machine has enough RAM for the 8-bit model (64GB is likely the minimum; 48GB is a stretch at 8-bit), then it could be related to memory wiring issues. You could check out this related issue #776.

The most consistent solution has been to upgrade to Sequoia (macOS 15.0) and run sudo sysctl iogpu.disable_wired_collector=1.

On older OS versions, setting iogpu.wired_lwm_mb and iogpu.wired_limit_mb to some large value sometimes helps.
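As concrete commands, the two remedies look like this (the limit values are illustrative, not recommendations; both require sudo, and these iogpu sysctls exist only on Apple Silicon macOS):

```shell
# macOS 15 (Sequoia) and later: disable the wired-memory collector.
sudo sysctl iogpu.disable_wired_collector=1

# Older macOS: raise the wired-memory low-water mark and limit instead.
# Values are in MB; pick something comfortably below total RAM.
sudo sysctl iogpu.wired_lwm_mb=40000
sudo sysctl iogpu.wired_limit_mb=50000
```

These settings do not persist across reboots unless added to /etc/sysctl.conf.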

@nickludlam

> Can you share the link to the 8bit model you used?

I downloaded https://huggingface.co/mlx-community/Phi-3.5-MoE-instruct-8bit with
huggingface-cli download mlx-community/Phi-3.5-MoE-instruct-8bit

I'm using a 128GB M1 system, so it's not a swap issue.
