[Model] Support Llama4 in vLLM #16104
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can add 🚀.
Co-authored-by: Aston Zhang <22279212+astonzhang@users.noreply.github.com>
Co-authored-by: Chris Thi <chris.c.thi@gmail.com>
Co-authored-by: drisspg <drisspguessous@gmail.com>
Co-authored-by: Jon Swenson <jmswen@gmail.com>
Co-authored-by: Keyun Tong <tongkeyun@gmail.com>
Co-authored-by: Lu Fang <fanglu@meta.com>
Co-authored-by: Lu Fang <lufang@meta.com>
Co-authored-by: Xiaodong Wang <xdwang@meta.com>
Co-authored-by: Yang Chen <yangche@fb.com>
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com>
Co-authored-by: Yong Hoon Shin <yhshin@meta.com>
Co-authored-by: Zijing Liu <liuzijing2014@gmail.com>
Signed-off-by: Aston Zhang <22279212+astonzhang@users.noreply.github.com>
Signed-off-by: Chris Thi <chris.c.thi@gmail.com>
Signed-off-by: drisspg <drisspguessous@gmail.com>
Signed-off-by: Jon Swenson <jmswen@gmail.com>
Signed-off-by: Keyun Tong <tongkeyun@gmail.com>
Signed-off-by: Lu Fang <fanglu@meta.com>
Signed-off-by: Xiaodong Wang <xdwang@meta.com>
Signed-off-by: Yang Chen <yangche@fb.com>
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
Signed-off-by: Zijing Liu <liuzijing2014@gmail.com>
Signed-off-by: Lu Fang <lufang@fb.com>
Signed-off-by: Lu Fang <lufang@fb.com>
Signed-off-by: Lu Fang <lufang@fb.com>
Signed-off-by: Lu Fang <lufang@fb.com>
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
Signed-off-by: Lu Fang <lufang@fb.com>
This reverts commit 188bb52. Signed-off-by: Lu Fang <lufang@fb.com>
🔥
Multimodal part looks fine to me - left some nits but we can fix them later
assert topk == 1, \
    "apply_router_weight_on_input is currently only implemented for topk=1"
Should we move this assert inside the if apply_router_weight_on_input: conditional? It seems restrictive to enforce it without first checking whether apply_router_weight_on_input is true.
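A minimal sketch of the suggestion (the helper name and signature are hypothetical; in the PR the check sits inline in the MoE method):

```python
def _check_router_weight_on_input(apply_router_weight_on_input: bool,
                                  topk: int) -> None:
    # Only enforce the topk == 1 restriction when the flag is actually enabled,
    # so configurations that don't use apply_router_weight_on_input are unaffected.
    if apply_router_weight_on_input:
        assert topk == 1, \
            "apply_router_weight_on_input is currently only implemented for topk=1"
```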
topk_ids=topk_ids,
inplace=True,
activation=activation,
apply_router_weight_on_input=self.apply_router_weight_on_input,
Forgot to add the attribute here, like in the other method.
This is WIP by @luccafong
Are you guys referring to the pre-commit failure? Sorry, I think this was from my changes. @luccafong, I have a fix for this that I can push if you want; otherwise I can send you a patch (if you haven't already fixed it).
This reverts commit ee170a7. Signed-off-by: Lu Fang <lufang@fb.com>
Is it expected to get this error?

File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/registry.py", line 401, in _raise_for_unsupported
ValueError: Model architectures ['Llama4ForConditionalGeneration'] failed to be inspected. Please check the logs for more details.
* fix lint
* remove unnecessary codes
* remove apply_router_weight_on_input from abstract class and remaining unrelated moe quantized methods
@dsingal0 Which version of transformers are you using?
transformers-4.52.0.dev0
block_table: torch.tensor,
page_size: int = 0,
) -> tuple[np.ndarray, np.ndarray, np.ndarray, torch.tensor]:
    q_seqlens = query_start_loc_np[1:] - query_start_loc_np[:-1]
Should it be named q_seqlens_np?
Could be. I just dropped the _np suffixes in this function since they are all numpy arrays, but we could add them back in a future PR.
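For readers following the snippet, a tiny standalone illustration of what that line computes (the values are made up):

```python
import numpy as np

# Illustrative cumulative query-start offsets for 3 requests with
# 2, 5, and 1 query tokens respectively.
query_start_loc_np = np.array([0, 2, 7, 8])

# Consecutive differences of the cumulative offsets recover the
# per-request query lengths, which is what the quoted line does.
q_seqlens = query_start_loc_np[1:] - query_start_loc_np[:-1]
assert (q_seqlens == np.array([2, 5, 1])).all()
```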
LocalAttentionMetadata and make_local_attention_virtual_batches look good to me. BTW, has anybody profiled this? We should look at writing a "kernel" as a follow-up.
I don't believe so, at least I never did. I think as a first cut we could even just write a C++ op; this code is a lot easier to understand as a loop and would probably be faster as a loop (assuming it's a C++ loop and not a Python loop), since there are so many numpy calls in this version. I just wrote it this way assuming it would scale to larger batch sizes better than a Python loop.
I think transformers.models.llama4.image_processing_llama4 needs to be changed to transformers.models.llama4.image_processing_llama4_fast
Yea, it's been addressed in 62e9744 already.
Signed-off-by: Roger Wang <ywang@roblox.com>
        **kwargs)

    def get_supported_mm_limits(self) -> Mapping[str, Optional[int]]:
        return {"image": 10}
Why limit it to 10 images only if the model has to support way more, given its context length and the benchmark results published by Meta claiming it can process up to 20 hours of video?
I don't think video inference is in the scope of this release yet?
This PR doesn't support the video modality, so I guess it'll come in the next model update?
@AlekseyKorshuk 8-10 images is the recommended mm limit for acceptable quality, although from the infra perspective it can handle more.
Llama4's video tokenizer works slightly differently from the image one, and we'll update that once it's available.
That's a fair point, but it raises an error if the CLI argument is set to a multimodal limit above 10. Shouldn't 10 be a default value rather than a hard limit that can't be overridden without changing the code?
I think I'm okay with not capping it at 10, but setting a default value for this would be model-dependent, which we currently don't support in vLLM (and it's tricky to do, since there's no standard today for how many images a model can support), so we let the user set it by passing limit-mm-per-prompt.
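For reference, a rough sketch of how a user can raise the limit themselves (the model ID and limit value below are placeholders; the server exposes the matching --limit-mm-per-prompt flag):

```python
from vllm import LLM

# Override the per-prompt image limit instead of relying on the default.
# Model ID and limit are placeholders; adjust them to your deployment.
llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    limit_mm_per_prompt={"image": 20},
)
```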
Sounds good, just wanted to make sure that this value is easy for users to change based on their needs. Thanks for the reply, gonna resolve the conversation
Given the test failures are not particularly related to changes in this PR and are non-blocking, I think this PR is good to go! Thanks to the Meta team for this amazing contribution to vLLM!
| "role": | ||
| "user", | ||
| "content": [{ | ||
| "type": "image" |
The image content is missing; it should be:
{
"type": "image",
"image": "https://path/to/your/image.jpg"
}
The way it works with our offline inference llm.generate interface is actually a bit different from the Hugging Face interface. In this case, we're adding this chunk only so that the image placeholder token gets inserted into the prompt when we apply the chat template from the tokenizer.
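To illustrate the point, a rough sketch of the offline flow (the model ID, image path, and placeholder token below are assumptions, not taken from this PR):

```python
from vllm import LLM, SamplingParams
from PIL import Image

llm = LLM(model="meta-llama/Llama-4-Scout-17B-16E-Instruct")

# The {"type": "image"} chunk in the chat messages only makes the chat template
# emit the image placeholder token; the actual pixels are passed separately
# via multi_modal_data. The exact placeholder depends on the model's template.
prompt = "<|image|>\nDescribe this image."
image = Image.open("example.jpg")

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```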
Quantization support?
Add support for Llama4 Scout (17B x 16 Experts) and Maverick (17B x 128 Experts) in vLLM.
Using 8xH100, vLLM can serve Scout with 1M context and Maverick with about 430K.
Using 8xH200, vLLM can serve Scout with 3.6M context and Maverick with full 1M context.
Using MI300x, we can run with default settings.
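As a quick-start sketch (the model ID and settings are assumptions based on the numbers above, not verbatim from this PR):

```python
from vllm import LLM, SamplingParams

# Offline-inference sketch: Scout on 8 GPUs with a long context window.
# Adjust model, tensor_parallel_size, and max_model_len for your hardware.
llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    tensor_parallel_size=8,
    max_model_len=1_000_000,
)
outputs = llm.generate(["The capital of France is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```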
Check out the blog post [link coming soon] for performance enhancements and tips on leveraging long context.
FIX #16106