
Add Phi-3.5-MoE #946

Merged
merged 6 commits into from
Aug 24, 2024

Conversation

Blaizzy
Contributor

@Blaizzy Blaizzy commented Aug 20, 2024

Still WIP

@Blaizzy Blaizzy marked this pull request as ready for review August 24, 2024 06:49
@Blaizzy
Contributor Author

Blaizzy commented Aug 24, 2024

@awni ready ✅
[Screenshot: 2024-08-24 at 8:50:53 AM]

@Blaizzy
Contributor Author

Blaizzy commented Aug 24, 2024

Generation suddenly got 2x faster 🔥

[Screenshot: 2024-08-24 at 9:28:53 AM]

@awni
Member

awni commented Aug 24, 2024

Yes, the old SuRoPE was really slow, so we made our fast RoPE more flexible and implemented SuRoPE with it, which really speeds things up for the Phi models that use it.
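For anyone curious, the core idea behind SuRoPE (the scaled rotary embedding used by the long-context Phi models) is standard RoPE with a per-dimension scaling factor applied to the rotation frequencies. This is an illustrative NumPy sketch, not the mlx implementation; the factor values below are made up, and the real models pick `short_factor` or `long_factor` based on context length:

```python
import numpy as np

def su_rope_freqs(dims, base=10000.0, factors=None):
    # Standard RoPE inverse frequencies, one per (even, odd) dimension pair.
    inv_freq = 1.0 / (base ** (np.arange(0, dims, 2) / dims))
    if factors is not None:
        # SuRoPE divides each frequency by a per-pair scaling factor.
        inv_freq = inv_freq / np.asarray(factors)
    return inv_freq

def apply_rope(x, positions, inv_freq):
    # x: (seq, dims). Rotate each (even, odd) pair by a position-dependent angle.
    angles = np.outer(positions, inv_freq)  # (seq, dims // 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

The speedup comes from running this scaled variant through the fused fast RoPE kernel instead of composing it from many elementwise ops.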


@awni awni left a comment


Thanks for the addition!

@awni awni merged commit b5e18ef into ml-explore:main Aug 24, 2024
4 checks passed
@Blaizzy
Contributor Author

Blaizzy commented Aug 24, 2024

My pleasure!

I'm always happy to help :)

@nickludlam

I'm using an M1 Ultra, and the speed discrepancy between 8-bit and 4-bit is enormous.

4bit

Prompt: 10 tokens, 13.931 tokens-per-sec
Generation: 100 tokens, 59.524 tokens-per-sec

8bit

Prompt: 10 tokens, 4.455 tokens-per-sec
Generation: 100 tokens, 0.476 tokens-per-sec

Is this an expected outcome? The gap seems disproportionately large!
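For scale: naively, 8-bit weights move roughly twice the bytes of 4-bit, so one might expect around half the generation speed. The numbers above show a far larger gap:

```python
# Generation throughput reported above (tokens per second).
tps_4bit = 59.524
tps_8bit = 0.476

slowdown = tps_4bit / tps_8bit
print(f"8-bit generation is ~{slowdown:.0f}x slower than 4-bit")  # ~125x, not ~2x
```

A two-orders-of-magnitude slowdown points to something other than quantization overhead, such as swapping or memory wiring.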

@Blaizzy
Contributor Author

Blaizzy commented Aug 31, 2024

That looks odd.

I will investigate tomorrow 👌🏾

@Blaizzy
Contributor Author

Blaizzy commented Aug 31, 2024

Can you share the link to the 8bit model you used?

@awni
Member

awni commented Sep 1, 2024

This looks like an issue with swapping / too much memory use. Once that happens the generation time plummets.

If your machine has enough RAM for the 8-bit model (64GB is likely the minimum; 48GB is a stretch at 8-bit), then it could be related to memory wiring issues. You could check out this related issue #776.

The most consistent solution has been to upgrade to Sequoia (macOS 15.0) and run sudo sysctl iogpu.disable_wired_collector=1.

On older OS versions, setting iogpu.wired_lwm_mb and iogpu.wired_limit_mb to some large value sometimes helps.
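As concrete commands, the two remedies look like this (the limit values are illustrative, not recommendations; both require sudo, and these iogpu sysctls exist only on Apple Silicon macOS):

```shell
# macOS 15 (Sequoia) and later: disable the wired-memory collector.
sudo sysctl iogpu.disable_wired_collector=1

# Older macOS: raise the wired-memory low-water mark and limit instead.
# Values are in MB; pick something comfortably below total RAM.
sudo sysctl iogpu.wired_lwm_mb=40000
sudo sysctl iogpu.wired_limit_mb=50000
```

These settings do not persist across reboots unless added to /etc/sysctl.conf.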

@nickludlam

> Can you share the link to the 8bit model you used?

I downloaded https://huggingface.co/mlx-community/Phi-3.5-MoE-instruct-8bit with
huggingface-cli download mlx-community/Phi-3.5-MoE-instruct-8bit

I'm using a 128GB M1 system, so it's not a swap issue.
