
Integrate fused Mixtral MoE with Marlin kernels #7079

Closed

Conversation

@ElizaWszola (Contributor) commented Aug 2, 2024

This PR's functionality has been implemented through PRs #8217 and #8032. I'm closing it.


Reimplement quantized Mixtral to combine Marlin kernels with fused MoE.

This PR rewrites the Mixtral model to run a modified Marlin kernel that takes advantage of fused_moe functionality.

The C++ code takes in the data for all experts together with the topk_ids tensor. It first runs a kernel that computes the sorted_ids offsets associated with each expert, and then feeds them to the Marlin kernels. The Marlin kernels are run multiple times, once per expert, using the current expert index to determine the current position inside sorted_ids and the number of tokens to process in that particular call. The values of sorted_ids are then used to indirectly access the rows of the input (A) and output (C) tensors. If the rows of input A are identical for each of the topk experts that access them (the first matmul of fused MoE), A consists of M x K elements, with each row accessed topk times by the relevant experts. Otherwise (the second matmul of fused MoE), A consists of M x topk x K elements, with each row accessed once.
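
For illustration, here is a minimal PyTorch sketch of the sorting step described above (not the actual CUDA kernel; the helper name sort_tokens_by_expert is hypothetical):

import torch

def sort_tokens_by_expert(topk_ids: torch.Tensor, num_experts: int):
    # topk_ids: (M, topk) tensor of expert indices in [0, num_experts).
    flat = topk_ids.flatten()                      # (M * topk,) expert ids
    sorted_ids = torch.argsort(flat, stable=True)  # assignment slots grouped by expert
    counts = torch.bincount(flat, minlength=num_experts)
    # expert_offsets[e] marks where expert e's contiguous block starts inside
    # sorted_ids; the Marlin kernel is then launched once per such block.
    expert_offsets = torch.cumsum(torch.cat([counts.new_zeros(1), counts]), dim=0)
    return sorted_ids, expert_offsets

As described above, for the first matmul each row of the M x K input A is read topk times (once per expert that selected it), while for the second matmul A has M x topk rows and each row is read exactly once.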

Unit testing:

pytest tests/kernels/test_moe.py -k test_fused_marlin_moe
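
(A unit test of this kind typically checks the fused kernel against a dense reference. The sketch below is hypothetical and simplified, using a plain SiLU MLP per expert rather than Mixtral's gated MLP, and is not the code in tests/kernels/test_moe.py.)

import torch

def moe_reference(a, w1, w2, topk_weights, topk_ids):
    # a: (M, K), w1: (E, N, K), w2: (E, K, N)
    # topk_weights / topk_ids: (M, topk)
    out = torch.zeros(a.shape[0], w2.shape[1], dtype=a.dtype, device=a.device)
    for t in range(topk_ids.shape[1]):
        for e in range(w1.shape[0]):
            mask = topk_ids[:, t] == e          # tokens routed to expert e in slot t
            if mask.any():
                h = torch.nn.functional.silu(a[mask] @ w1[e].t())
                out[mask] += topk_weights[mask, t, None] * (h @ w2[e].t())
    return out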

End-to-end testing:
Run offline_inference.py with

# 4-bit quantized MoE without act_order
llm = LLM(model="TheBloke/Mixtral-8x7B-v0.1-GPTQ")
# 4-bit quantized MoE with act_order
llm = LLM(model="TheBloke/Mixtral-8x7B-v0.1-GPTQ",
          revision="gptq-4bit-128g-actorder_True")
# 8-bit quantized MoE with act_order
llm = LLM(model="TheBloke/Mixtral-8x7B-v0.1-GPTQ",
          revision="gptq-8bit-128g-actorder_True")

Sonnet benchmark results (no act order, 4-bit):

[Plots: TTFT and TPOT, Mixtral 8x7B 4-bit, three runs]

Sonnet benchmark results (with act order, 8-bit):

[Plots: TTFT and TPOT, Mixtral 8x7B 8-bit, three runs]


github-actions bot commented Aug 2, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small and essential subset of CI tests meant to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI, as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@github-actions bot added the ready (ONLY add when PR is ready to merge/full CI is needed) label Aug 2, 2024
@ElizaWszola
Contributor Author

/unready

@mgoin removed the ready (ONLY add when PR is ready to merge/full CI is needed) label Aug 2, 2024
@ElizaWszola
Contributor Author

@dsikka I've added some compressed_tensors compatibility code (tested only with offline inference, W4A16)

@ElizaWszola changed the title from "Integrate fused MoE with Marlin kernels" to "Integrate fused Mixtral MoE with Marlin kernels" Aug 29, 2024
expert_id: int,
is_gptq: bool = False,
):
if is_gptq:

We'd want to use the weight loading functionality already present.

@@ -37,7 +37,7 @@
     "LLaMAForCausalLM": ("llama", "LlamaForCausalLM"),
     "MistralForCausalLM": ("llama", "LlamaForCausalLM"),
     "MixtralForCausalLM": ("mixtral", "MixtralForCausalLM"),
-    "QuantMixtralForCausalLM": ("mixtral_quant", "MixtralForCausalLM"),
+    "QuantMixtralForCausalLM": ("mixtral", "MixtralForCausalLM"),

We'd want mixtral_quant by default.

]
return ([

Can we leverage what already exists?

@ElizaWszola
Contributor Author

Closing this PR because its entire functionality has been implemented through PRs #8217 and #8032.
