[Model] Make llama3.2 support multiple and interleaved images #9095

xiangxu-google · 2024-10-05T18:45:50Z

FILL IN THE PR DESCRIPTION HERE

FIX #xxxx (link existing issues this PR will resolve)

BEFORE SUBMITTING, PLEASE READ THE CHECKLIST BELOW AND FILL IN THE DESCRIPTION ABOVE

PR Checklist (Click to Expand)

Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:

[Bugfix] for bug fixes.
[CI/Build] for build or continuous integration improvements.
[Doc] for documentation fixes and improvements.
[Model] for adding a new model or improving an existing model. Model name should appear in the title.
[Frontend] For changes on the vLLM frontend (e.g., OpenAI API server, LLM class, etc.)
[Kernel] for changes affecting CUDA kernels or other compute kernels.
[Core] for changes in the core vLLM logic (e.g., LLMEngine, AsyncLLMEngine, Scheduler, etc.)
[Hardware][Vendor] for hardware-specific changes. Vendor name should appear in the prefix (e.g., [Hardware][AMD]).
[Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR need to meet the following code quality standards:

We adhere to Google Python style guide and Google C++ style guide.
Pass all linter checks. Please use format.sh to format your code.
The code need to be well-documented to ensure future contributors can easily understand the code.
Include sufficient tests to ensure the project to stay correct and robust. This includes both unit tests and integration tests.
Please add documentation to docs/source/ if the PR modifies the user-facing behaviors of vLLM. It helps vLLM user understand and utilize the new features or changes.

Adding or changing kernels

Each custom kernel needs a schema and one or more implementations to be registered with PyTorch.

Make sure custom ops are registered following PyTorch guidelines: Custom C++ and CUDA Operators and The Custom Operators Manual
Custom operations that return Tensors require meta-functions. Meta-functions should be implemented and registered in python so that dynamic dims can be handled automatically. See above documents for a description of meta-functions.
Use torch.libary.opcheck() to test the function registration and meta-function for any registered ops. See tests/kernels for examples.
When changing the C++ signature of an existing op, the schema must be updated to reflect the changes.
If a new custom type is needed, see the following document: Custom Class Support in PT2.

Notes for Large Changes

Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag it with rfc-required and might not go through the PR.

What to Expect for the Reviews

The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient and make sure no contributor feel confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:

After the PR is submitted, the PR will be assigned to a reviewer. Every reviewer will pick up the PRs based on their expertise and availability.
After the PR is assigned, the reviewer will provide status update every 2-3 days. If the PR is not reviewed within 7 days, please feel free to ping the reviewer or the vLLM team.
After the review, the reviewer will put an action-required label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.
Please respond to all comments within a reasonable time frame. If a comment isn't clear or you disagree with a suggestion, feel free to ask for clarification or discuss the suggestion.

Thank You

Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!

github-actions · 2024-10-05T18:46:02Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀

heheda12345

Can you add some unit tests on multiple image & interleaved image?

vllm/model_executor/models/mllama.py

heheda12345 · 2024-10-05T21:25:09Z

Based on personal discussion with @xiangxu-google , we still need the following steps:

verify the correctness
change vllm/entrypoints/chat_utils.py to support OpenAI chat template API -> huggingface chat template API conversion
add unit tests
add mllama to examples/offline_inference_vision_language_multi_image.py and update the document to show mllama supports multiple image now.

ywang96 · 2024-10-06T00:37:25Z

I talked to @heheda12345 offline about the following steps for this model:

Let's focus on offline inference with llm.generate for this PR, and leave changes on vllm/entrypoints/chat_utils.py in a separate follow-up PR. I have verified that we can leverage tokenizer to format the input prompt in the offline example for multi images.

from transformers import AutoTokenizer

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image"}
    ]},
    {"role": "assistant", "content": [
        {"type": "text", "text": "This image is xxx"}
    ]},
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "How about this image"}
    ]}
]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
print(input_text)

Output:

<|begin_of_text|><|start_header_id|>user<|end_header_id|>

<|image|>Describe this image<|eot_id|><|start_header_id|>assistant<|end_header_id|>

This image is xxx<|eot_id|><|start_header_id|>user<|end_header_id|>

<|image|>How about this image<|eot_id|><|start_header_id|>assistant<|end_header_id|>

We also need to add the logic to compute the cross-attention mask for each individual request (For this we can reference to get_cross_attention_token_mask from transformers). IMO this should be done in input_processor_for_mllama, and we can store the result mask in llm_inputs["encoder_multi_modal_data"]["per_request_cross_attention_mask"] to be retrieved later.

xiangxu-google · 2024-10-06T01:56:27Z

We also need to add the logic to compute the cross-attention mask for each individual request...

Curious why do we need the logic for each request? This PR has implemented computing masks for a batch of requests.

ywang96 · 2024-10-06T03:53:57Z

We also need to add the logic to compute the cross-attention mask for each individual request...

Curious why do we need the logic for each request? This PR has implemented computing masks for a batch of requests.

You're right, and it's really a matter of where we put that mask computation logic. IMO it's cleaner if we decouple the mask computation for each individual requests and what's needed for the batch in the forward pass, since this is more or less still a CPU operation that we should put in input_processor_mllama. WDYT?

I'm also okay with leaving this logic as a whole for the batch inside forward for now if it's too much work to decouple :)

mgoin · 2024-10-07T14:19:25Z

FYI I tried using this PR to evaluate Llama 3.2 Vision 11B on MMMU and ran into a CUDA error during attention

CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES=0 lm_eval --model vllm-vlm --model_args pretrained=meta-llama/Llama-3.2-11B-Vision-Instruct,enforce_eager=True,max_num_seqs=4,max_model_len=10000 --tasks mmmu_val --batch_size 4
2024-10-07:14:09:39,030 INFO     [__main__.py:279] Verbosity set to INFO
2024-10-07:14:09:39,061 INFO     [__init__.py:449] `group` and `group_alias` keys in TaskConfigs are deprecated and will be removed in v0.4.5 of lm_eval. The new `tag` field will be used to allow for a shortcut to a group of tasks one does not wish to aggregate metrics across. `group`s which aggregate across subtasks must be only defined in a separate group config file, which will be the official way to create groups that support cross-task aggregation as in `mmlu`. Please see the v0.4.4 patch notes and our documentation: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#advanced-group-configs for more information.
2024-10-07:14:09:43,245 INFO     [__main__.py:376] Selected Tasks: ['mmmu_val']
2024-10-07:14:09:43,246 INFO     [evaluator.py:161] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
[['pretrained', 'meta-llama/Llama-3.2-11B-Vision-Instruct'], ['enforce_eager', 'True'], ['max_num_seqs', '4'], ['max_model_len', '10000']]
2024-10-07:14:09:43,246 INFO     [evaluator.py:198] Initializing vllm-vlm model, with arguments: {'pretrained': 'meta-llama/Llama-3.2-11B-Vision-Instruct', 'enforce_eager': True, 'max_num_seqs': 4, 'max_model_len': 10000}
[['pretrained', 'meta-llama/Llama-3.2-11B-Vision-Instruct'], ['enforce_eager', 'True'], ['max_num_seqs', '4'], ['max_model_len', '10000']]
[['image', '5']]
WARNING 10-07 14:09:48 arg_utils.py:907] Block Manager v2 does not support encoder-decoder models currently. Using Block Manager v1 as fallback.
WARNING 10-07 14:09:48 config.py:365] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 10-07 14:09:48 llm_engine.py:237] Initializing an LLM engine (v0.6.3.dev22+g97c00c6e) with config: model='meta-llama/Llama-3.2-11B-Vision-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.2-11B-Vision-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=10000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=1234, served_model_name=meta-llama/Llama-3.2-11B-Vision-Instruct, use_v2_block_manager=False, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=False, mm_processor_kwargs=None)
INFO 10-07 14:09:48 enc_dec_model_runner.py:141] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers.
INFO 10-07 14:09:48 selector.py:116] Using XFormers backend.
/home/mgoin/venvs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/home/mgoin/venvs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 10-07 14:09:49 model_runner.py:1049] Starting to load model meta-llama/Llama-3.2-11B-Vision-Instruct...
INFO 10-07 14:09:49 selector.py:116] Using XFormers backend.
INFO 10-07 14:09:49 weight_utils.py:242] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:00<00:03,  1.03it/s]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:02<00:03,  1.08s/it]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:03<00:02,  1.10s/it]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:04<00:01,  1.09s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:04<00:00,  1.16it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:04<00:00,  1.05it/s]

INFO 10-07 14:09:55 model_runner.py:1060] Loading model weights took 19.9073 GB
INFO 10-07 14:09:55 enc_dec_model_runner.py:301] Starting profile run for multi-modal models.
INFO 10-07 14:10:00 gpu_executor.py:122] # GPU blocks: 12935, # CPU blocks: 1638
2024-10-07:14:10:59,014 INFO     [evaluator.py:279] Setting fewshot random generator seed to 1234
2024-10-07:14:10:59,016 WARNING  [model.py:422] model.chat_template was called with the chat_template set to False or None. Therefore no chat template will be applied. Make sure this is an intended behavior.
2024-10-07:14:11:08,650 INFO     [evaluator.py:485] Running generate_until requests
Running generate_until requests with text+image input:   0%|                                                                                                                                                      | 0/900 [00:00<?, ?it/s]WARNING 10-07 14:11:08 preprocess.py:87] Falling back on <BOS> for decoder start token id because decoder start token id is not available.
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/mgoin/venvs/vllm/bin/lm_eval", line 8, in <module>
[rank0]:     sys.exit(cli_evaluate())
[rank0]:   File "/home/mgoin/code/lm-evaluation-harness/lm_eval/__main__.py", line 382, in cli_evaluate
[rank0]:     results = evaluator.simple_evaluate(
[rank0]:   File "/home/mgoin/code/lm-evaluation-harness/lm_eval/utils.py", line 398, in _wrapper
[rank0]:     
[rank0]:   File "/home/mgoin/code/lm-evaluation-harness/lm_eval/evaluator.py", line 301, in simple_evaluate
[rank0]:     results = evaluate(
[rank0]:   File "/home/mgoin/code/lm-evaluation-harness/lm_eval/utils.py", line 398, in _wrapper
[rank0]:     
[rank0]:   File "/home/mgoin/code/lm-evaluation-harness/lm_eval/evaluator.py", line 496, in evaluate
[rank0]:     resps = getattr(lm, reqtype)(cloned_reqs)
[rank0]:   File "/home/mgoin/code/lm-evaluation-harness/lm_eval/models/vllm_vlms.py", line 266, in generate_until
[rank0]:     cont = self._model_generate(inputs, stop=until, generate=True, **kwargs)
[rank0]:   File "/home/mgoin/code/lm-evaluation-harness/lm_eval/models/vllm_vlms.py", line 123, in _model_generate
[rank0]:     outputs = self.model.generate(
[rank0]:   File "/home/mgoin/code/vllm/vllm/utils.py", line 1051, in inner
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/home/mgoin/code/vllm/vllm/entrypoints/llm.py", line 391, in generate
[rank0]:     outputs = self._run_engine(use_tqdm=use_tqdm)
[rank0]:   File "/home/mgoin/code/vllm/vllm/entrypoints/llm.py", line 899, in _run_engine
[rank0]:     step_outputs = self.llm_engine.step()
[rank0]:   File "/home/mgoin/code/vllm/vllm/engine/llm_engine.py", line 1404, in step
[rank0]:     outputs = self.model_executor.execute_model(
[rank0]:   File "/home/mgoin/code/vllm/vllm/executor/gpu_executor.py", line 130, in execute_model
[rank0]:     output = self.driver_worker.execute_model(execute_model_req)
[rank0]:   File "/home/mgoin/code/vllm/vllm/worker/worker_base.py", line 327, in execute_model
[rank0]:     output = self.model_runner.execute_model(
[rank0]:   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/mgoin/code/vllm/vllm/worker/enc_dec_model_runner.py", line 203, in execute_model
[rank0]:     hidden_or_intermediate_states = model_executable(
[rank0]:   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/mgoin/code/vllm/vllm/model_executor/models/mllama.py", line 1252, in forward
[rank0]:     outputs = self.language_model(
[rank0]:   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/mgoin/code/vllm/vllm/model_executor/models/mllama.py", line 956, in forward
[rank0]:     hidden_states = self.model(
[rank0]:   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/mgoin/code/vllm/vllm/model_executor/models/mllama.py", line 896, in forward
[rank0]:     hidden_states = decoder_layer(
[rank0]:   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/mgoin/code/vllm/vllm/model_executor/models/mllama.py", line 826, in forward
[rank0]:     hidden_states = self.cross_attn(
[rank0]:   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/mgoin/code/vllm/vllm/model_executor/models/mllama.py", line 725, in forward
[rank0]:     output = self.attention_with_mask(q, k, v, kv_cache,
[rank0]:   File "/home/mgoin/code/vllm/vllm/model_executor/models/mllama.py", line 775, in attention_with_mask
[rank0]:     output = F.scaled_dot_product_attention(q,
[rank0]: RuntimeError: CUDA error: invalid configuration argument
[rank0]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Running generate_until requests with text+image input:   0%|                                                                                                                                                      | 0/900 [00:01<?, ?it/s]

xiangxu-google · 2024-10-07T19:24:36Z

FYI I tried using this PR to evaluate Llama 3.2 Vision 11B on MMMU and ran into a CUDA error during attention

What is your eval environment, like torch version, vllm version, GPU version, CUDA version. I can't reproduce the issue on H100 CUDA 12.4 with vllm installed from source.
Since F.scaled_dot_product_attention is a torch API, users can only make sure the shapes of input tensors are correct. This PR generates correct input shapes with:

        q = q.transpose(0, 1).view(self.num_local_key_value_heads,
                                   self.num_key_value_groups, q_len,
                                   self.head_dim)
        k = k.transpose(0,
                        1)[:,
                           None, :, :].expand(self.num_local_key_value_heads,
                                              self.num_key_value_groups,
                                              kv_len, self.head_dim)
        v = v.transpose(0,
                        1)[:,
                           None, :, :].expand(self.num_local_key_value_heads,
                                              self.num_key_value_groups,
                                              kv_len, self.head_dim)
        attention_mask = attention_mask.view(1, 1, q_len, kv_len)
        output = F.scaled_dot_product_attention(q,
                                                k,
                                                v,
                                                attn_mask=attention_mask,
                                                is_causal=False)

xiangxu-google · 2024-10-07T19:40:12Z

We also need to add the logic to compute the cross-attention mask for each individual request...

Curious why do we need the logic for each request? This PR has implemented computing masks for a batch of requests.

You're right, and it's really a matter of where we put that mask computation logic. IMO it's cleaner if we decouple the mask computation for each individual requests and what's needed for the batch in the forward pass, since this is more or less still a CPU operation that we should put in input_processor_mllama. WDYT?

I'm also okay with leaving this logic as a whole for the batch inside forward for now if it's too much work to decouple :)

Thanks for the suggestion! I need to make sure the correctness of this implementation first for a controllable landing. We can refactor it in a separate PR to move the CPU logic to input_processor_mllama later.

xiangxu-google · 2024-10-09T00:15:06Z

@heheda12345 @ywang96 Unit tests and examples have been added.

heheda12345

We need to update the document to show that mllama supports multi-image now.

vllm/model_executor/models/mllama.py

examples/offline_inference_vision_language_multi_image.py

tests/models/encoder_decoder/vision_language/test_mllama.py

vllm/model_executor/models/mllama.py

heheda12345 · 2024-10-10T05:44:04Z

Some following-up of this PR:

We need to change vllm/entrypoints/chat_utils.py to provide full API Server support of mllama mulit-image.
It's better to prepare the cross-attention mask for each individual request in input_processor_for_mllama, and we can store the result mask in llm_inputs["encoder_multi_modal_data"]["per_request_cross_attention_mask"] to be retrieved later.
Take a deeper look on how to do cross attention. Currently, each text does cross attention to images in ALL batches, and the complexity is $\sum{len(q_i)} * \sum{len(k_i) * hidden}$. Another choice is to pad the matrix to $(batchsize, max(len(q_i)), hidden)$ and $(batchsize, max(len(k_i)), hidden)$ to do the cross attention.

CC @xiangxu-google @ywang96

ywang96 · 2024-10-10T06:24:00Z

Really appreciate the contribution @xiangxu-google and the review @heheda12345!

I will give this PR a final pass and some testing tomorrow, and will approve by EoD if everything looks good!

ywang96

Hello! I have tested this PR and confirmed functionally wise it's working. However, I'm getting a different result from @heheda12345 - not sure if this is an issue of my machine.

I'm giving approval for now but will let Chen verify separately the final correctness of this PR!

ywang96 · 2024-10-14T16:43:05Z

@heheda12345 As we discussed offline - please ping me for final merge once you have verified the correctness of this PR. Thank you!

heheda12345 · 2024-10-14T22:12:34Z

I've tried 2 different machines and get the same result as huggingface. Though I still fail to reproduce Roger's problem, I think this pr can be merged.

…roject#9095) Signed-off-by: charlifu <charlifu@amd.com>

…roject#9095) Signed-off-by: Alvant <alvasian@yandex.ru>

…roject#9095) Signed-off-by: Amit Garg <mitgarg17495@gmail.com>

xiangxu-google force-pushed the llama3.2 branch from c57ce74 to 64acbc5 Compare October 5, 2024 18:51

heheda12345 self-requested a review October 5, 2024 19:04

heheda12345 suggested changes Oct 5, 2024

View reviewed changes

heheda12345 mentioned this pull request Oct 5, 2024

Llama3.2 Vision Model: Guides and Issues #8826

Open

xiangxu-google force-pushed the llama3.2 branch 2 times, most recently from e9af2a8 to 6aa5a8d Compare October 5, 2024 22:25

DarkLight1337 mentioned this pull request Oct 7, 2024

[RFC]: Multi-modality Support Refactoring #4194

Open

69 tasks

xiangxu-google force-pushed the llama3.2 branch 2 times, most recently from 61d6a25 to 864c90e Compare October 8, 2024 20:53

xiangxu-google requested a review from DarkLight1337 as a code owner October 8, 2024 20:53

xiangxu-google force-pushed the llama3.2 branch from 864c90e to 463d2a5 Compare October 8, 2024 20:55

heheda12345 suggested changes Oct 9, 2024

View reviewed changes

xiangxu-google force-pushed the llama3.2 branch from 0cea2ee to a69f7f9 Compare October 10, 2024 00:50

xiangxu-google added 6 commits October 9, 2024 21:20

Make llama3.2 support multiple and interleaved iamges

6a61014

Fix attention and mask

d88aded

add unit tests

9cdef06

add comment for profiling run

1de49b2

Add multi image example

44b1e5b

resovle comments

61284da

xiangxu-google force-pushed the llama3.2 branch 2 times, most recently from adcb843 to 80916c7 Compare October 10, 2024 04:21

fix unstable prompts

9bb8b71

xiangxu-google force-pushed the llama3.2 branch from 80916c7 to 9bb8b71 Compare October 10, 2024 04:48

heheda12345 approved these changes Oct 10, 2024

View reviewed changes

ywang96 approved these changes Oct 14, 2024

View reviewed changes

ywang96 added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 14, 2024

ywang96 merged commit f0fe4fe into vllm-project:main Oct 14, 2024
69 checks passed

alex-jw-brooks mentioned this pull request Oct 15, 2024

[Frontend] Enable Online Multi-image Support for MLlama #9393

Merged

charlifu pushed a commit to charlifu/vllm that referenced this pull request Oct 23, 2024

[Model] Make llama3.2 support multiple and interleaved images (vllm-p…

3b94456

…roject#9095) Signed-off-by: charlifu <charlifu@amd.com>

Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024

[Model] Make llama3.2 support multiple and interleaved images (vllm-p…

e41c099

…roject#9095) Signed-off-by: Alvant <alvasian@yandex.ru>

garg-amit pushed a commit to garg-amit/vllm that referenced this pull request Oct 28, 2024

[Model] Make llama3.2 support multiple and interleaved images (vllm-p…

6fe2b42

…roject#9095) Signed-off-by: Amit Garg <mitgarg17495@gmail.com>

DarkLight1337 mentioned this pull request Nov 1, 2024

[Doc] Update multi-input support #9906

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Model] Make llama3.2 support multiple and interleaved images #9095

[Model] Make llama3.2 support multiple and interleaved images #9095

xiangxu-google commented Oct 5, 2024

github-actions bot commented Oct 5, 2024

heheda12345 left a comment

heheda12345 commented Oct 5, 2024 •

edited

Loading

ywang96 commented Oct 6, 2024 •

edited

Loading

xiangxu-google commented Oct 6, 2024 •

edited

Loading

ywang96 commented Oct 6, 2024

mgoin commented Oct 7, 2024

xiangxu-google commented Oct 7, 2024 •

edited

Loading

xiangxu-google commented Oct 7, 2024

xiangxu-google commented Oct 9, 2024

heheda12345 left a comment

heheda12345 commented Oct 10, 2024

ywang96 commented Oct 10, 2024

ywang96 left a comment •

edited

Loading

ywang96 commented Oct 14, 2024

heheda12345 commented Oct 14, 2024

[Model] Make llama3.2 support multiple and interleaved images #9095

[Model] Make llama3.2 support multiple and interleaved images #9095

Conversation

xiangxu-google commented Oct 5, 2024

PR Title and Classification

Code Quality

Adding or changing kernels

Notes for Large Changes

What to Expect for the Reviews

Thank You

github-actions bot commented Oct 5, 2024

heheda12345 left a comment

Choose a reason for hiding this comment

heheda12345 commented Oct 5, 2024 • edited Loading

ywang96 commented Oct 6, 2024 • edited Loading

xiangxu-google commented Oct 6, 2024 • edited Loading

ywang96 commented Oct 6, 2024

mgoin commented Oct 7, 2024

xiangxu-google commented Oct 7, 2024 • edited Loading

xiangxu-google commented Oct 7, 2024

xiangxu-google commented Oct 9, 2024

heheda12345 left a comment

Choose a reason for hiding this comment

heheda12345 commented Oct 10, 2024

ywang96 commented Oct 10, 2024

ywang96 left a comment • edited Loading

Choose a reason for hiding this comment

ywang96 commented Oct 14, 2024

heheda12345 commented Oct 14, 2024

heheda12345 commented Oct 5, 2024 •

edited

Loading

ywang96 commented Oct 6, 2024 •

edited

Loading

xiangxu-google commented Oct 6, 2024 •

edited

Loading

xiangxu-google commented Oct 7, 2024 •

edited

Loading

ywang96 left a comment •

edited

Loading