
feat: Add support for Llama 3.2-Vision models #2376

Merged
merged 10 commits on Nov 5, 2024
10 changes: 10 additions & 0 deletions doc/source/models/builtin/llm/index.rst
@@ -240,6 +240,16 @@ The following is a list of built-in LLM in Xinference:
- chat, tools
- 131072
- The Llama 3.1 instruction-tuned models are optimized for dialogue use cases and outperform many of the available open-source chat models on common industry benchmarks.

* - :ref:`llama-3.2-vision <models_llm_llama-3.2-vision>`
- generate, vision
- 131072
- The Llama 3.2-Vision collection of multimodal large language models (LLMs) comprises pretrained and instruction-tuned image reasoning generative models in 11B and 90B sizes (text + images in / text out)...

* - :ref:`llama-3.2-vision-instruct <models_llm_llama-3.2-vision-instruct>`
- chat, vision
- 131072
- The Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. The models outperform many of the available open-source and closed multimodal models on common industry benchmarks...

* - :ref:`minicpm-2b-dpo-bf16 <models_llm_minicpm-2b-dpo-bf16>`
- chat
47 changes: 47 additions & 0 deletions doc/source/models/builtin/llm/llama-3.2-vision-instruct.rst
@@ -0,0 +1,47 @@
.. _models_llm_llama-3.2-vision-instruct:

========================================
llama-3.2-vision-instruct
========================================

- **Context Length:** 131072
- **Model Name:** llama-3.2-vision-instruct
- **Languages:** en, de, fr, it, pt, hi, es, th
- **Abilities:** chat, vision
- **Description:** The Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. The models outperform many of the available open source and closed multimodal models on common industry benchmarks...

Specifications
^^^^^^^^^^^^^^

Model Spec 1 (pytorch, 11 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** pytorch
- **Model Size (in billions):** 11
- **Quantizations:** none
- **Engines**: vLLM, Transformers
- **Model ID:** meta-llama/Llama-3.2-11B-Vision-Instruct
- **Model Hubs**: `Hugging Face <https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct>`__, `ModelScope <https://modelscope.cn/models/LLM-Research/Llama-3.2-11B-Vision-Instruct>`__

Execute one of the following commands to launch the model. Remember to replace ``${quantization}`` with your
chosen quantization method from the options listed above::

xinference launch --model-engine transformers --model-name llama-3.2-vision-instruct --size-in-billions 11 --model-format pytorch --quantization ${quantization}
xinference launch --model-engine vllm --enforce_eager --max_num_seqs 16 --model-name llama-3.2-vision-instruct --size-in-billions 11 --model-format pytorch

Model Spec 2 (pytorch, 90 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** pytorch
- **Model Size (in billions):** 90
- **Quantizations:** none
- **Engines**: vLLM, Transformers
- **Model ID:** meta-llama/Llama-3.2-90B-Vision-Instruct
- **Model Hubs**: `Hugging Face <https://huggingface.co/meta-llama/Llama-3.2-90B-Vision-Instruct>`__, `ModelScope <https://modelscope.cn/models/LLM-Research/Llama-3.2-90B-Vision-Instruct>`__

Execute one of the following commands to launch the model. Remember to replace ``${quantization}`` with your
chosen quantization method from the options listed above::

xinference launch --model-engine transformers --model-name llama-3.2-vision-instruct --size-in-billions 90 --model-format pytorch --quantization ${quantization}
xinference launch --model-engine vllm --enforce_eager --max_num_seqs 16 --model-name llama-3.2-vision-instruct --size-in-billions 90 --model-format pytorch
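
Once launched, the instruct model can be queried through Xinference's OpenAI-compatible
chat API. The snippet below is a minimal sketch, assuming a local Xinference endpoint on
the default port ``9997``, authentication disabled, a model UID equal to the model name,
and a placeholder image URL::

   import openai

   # Any non-empty API key works when Xinference authentication is not enabled.
   client = openai.OpenAI(base_url="http://127.0.0.1:9997/v1", api_key="not-used")

   response = client.chat.completions.create(
       model="llama-3.2-vision-instruct",  # the model UID printed by `xinference launch`
       messages=[
           {
               "role": "user",
               "content": [
                   {"type": "text", "text": "What can you see in this image?"},
                   # Placeholder URL; replace with a reachable image.
                   {"type": "image_url", "image_url": {"url": "https://example.com/demo.jpg"}},
               ],
           }
       ],
   )
   print(response.choices[0].message.content)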

47 changes: 47 additions & 0 deletions doc/source/models/builtin/llm/llama-3.2-vision.rst
@@ -0,0 +1,47 @@
.. _models_llm_llama-3.2-vision:

================
llama-3.2-vision
================

- **Context Length:** 131072
- **Model Name:** llama-3.2-vision
- **Languages:** en, de, fr, it, pt, hi, es, th
- **Abilities:** generate, vision
- **Description:** The Llama 3.2-Vision collection of multimodal large language models (LLMs) comprises pretrained and instruction-tuned image reasoning generative models in 11B and 90B sizes (text + images in / text out)...

Specifications
^^^^^^^^^^^^^^

Model Spec 1 (pytorch, 11 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** pytorch
- **Model Size (in billions):** 11
- **Quantizations:** none
- **Engines**: vLLM, Transformers
- **Model ID:** meta-llama/Llama-3.2-11B-Vision
- **Model Hubs**: `Hugging Face <https://huggingface.co/meta-llama/Llama-3.2-11B-Vision>`__, `ModelScope <https://modelscope.cn/models/LLM-Research/Llama-3.2-11B-Vision>`__

Execute one of the following commands to launch the model. Remember to replace ``${quantization}`` with your
chosen quantization method from the options listed above::

xinference launch --model-engine transformers --model-name llama-3.2-vision --size-in-billions 11 --model-format pytorch --quantization ${quantization}
xinference launch --model-engine vllm --enforce_eager --max_num_seqs 16 --model-name llama-3.2-vision --size-in-billions 11 --model-format pytorch

Model Spec 2 (pytorch, 90 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** pytorch
- **Model Size (in billions):** 90
- **Quantizations:** none
- **Engines**: vLLM, Transformers
- **Model ID:** meta-llama/Llama-3.2-90B-Vision
- **Model Hubs**: `Hugging Face <https://huggingface.co/meta-llama/Llama-3.2-90B-Vision>`__, `ModelScope <https://modelscope.cn/models/LLM-Research/Llama-3.2-90B-Vision>`__

Execute one of the following commands to launch the model. Remember to replace ``${quantization}`` with your
chosen quantization method from the options listed above::

xinference launch --model-engine transformers --model-name llama-3.2-vision --size-in-billions 90 --model-format pytorch --quantization ${quantization}
xinference launch --model-engine vllm --enforce_eager --max_num_seqs 16 --model-name llama-3.2-vision --size-in-billions 90 --model-format pytorch

2 changes: 2 additions & 0 deletions doc/source/models/model_abilities/vision.rst
@@ -31,6 +31,8 @@ The ``vision`` ability is supported with the following models in Xinference:
* :ref:`MiniCPM-Llama3-V 2.6 <models_llm_minicpm-v-2.6>`
* :ref:`internvl2 <models_llm_internvl2>`
* :ref:`qwen2-vl-instruct <models_llm_qwen2-vl-instruct>`
* :ref:`llama-3.2-vision <models_llm_llama-3.2-vision>`
* :ref:`llama-3.2-vision-instruct <models_llm_llama-3.2-vision-instruct>`


Quickstart
87 changes: 87 additions & 0 deletions xinference/model/llm/llm_family.json
@@ -1312,6 +1312,93 @@
"<|eom_id|>"
]
},
{
"version": 1,
"context_length": 131072,
"model_name": "llama-3.2-vision-instruct",
"model_lang": [
"en",
"de",
"fr",
"it",
"pt",
"hi",
"es",
"th"
],
"model_ability": [
"chat",
"vision"
],
"model_description": "Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image...",
"model_specs": [
{
"model_format": "pytorch",
"model_size_in_billions": 11,
"quantizations": [
"none"
],
"model_id": "meta-llama/Llama-3.2-11B-Vision-Instruct"
},
{
"model_format": "pytorch",
"model_size_in_billions": 90,
"quantizations": [
"none"
],
"model_id": "meta-llama/Llama-3.2-90B-Vision-Instruct"
}
],
"chat_template": "{% for message in messages %}{% if loop.index0 == 0 %}{{ bos_token }}{% endif %}{{ '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' }}{% if message['content'] is string %}{{ message['content'] }}{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' %}{{ '<|image|>' }}{% elif content['type'] == 'text' %}{{ content['text'] }}{% endif %}{% endfor %}{% endif %}{{ '<|eot_id|>' }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}",
"stop_token_ids": [
128001,
128008,
128009
],
"stop": [
"<|end_of_text|>",
"<|eot_id|>",
"<|eom_id|>"
]
},
{
"version": 1,
"context_length": 131072,
"model_name": "llama-3.2-vision",
"model_lang": [
"en",
"de",
"fr",
"it",
"pt",
"hi",
"es",
"th"
],
"model_ability": [
"generate",
"vision"
],
"model_description": "The Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image...",
"model_specs": [
{
"model_format": "pytorch",
"model_size_in_billions": 11,
"quantizations": [
"none"
],
"model_id": "meta-llama/Meta-Llama-3.2-11B-Vision"
},
{
"model_format": "pytorch",
"model_size_in_billions": 90,
"quantizations": [
"none"
],
"model_id": "meta-llama/Meta-Llama-3.2-90B-Vision"
}
]
},
{
"version": 1,
"context_length": 2048,
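
The ``chat_template`` registered above for ``llama-3.2-vision-instruct`` interleaves ``<|image|>`` placeholders with the text parts of each message. The following sketch (not part of this PR) renders that template with Jinja2 for a single user turn containing one image, to make the resulting prompt explicit; the message content is illustrative:

   from jinja2 import Template

   # The template string below is the same one added to llm_family.json above.
   chat_template = Template(
       "{% for message in messages %}"
       "{% if loop.index0 == 0 %}{{ bos_token }}{% endif %}"
       "{{ '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' }}"
       "{% if message['content'] is string %}{{ message['content'] }}"
       "{% else %}{% for content in message['content'] %}"
       "{% if content['type'] == 'image' %}{{ '<|image|>' }}"
       "{% elif content['type'] == 'text' %}{{ content['text'] }}"
       "{% endif %}{% endfor %}{% endif %}"
       "{{ '<|eot_id|>' }}{% endfor %}"
       "{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}"
   )

   messages = [
       {
           "role": "user",
           "content": [
               {"type": "image"},
               {"type": "text", "text": "Describe this picture."},
           ],
       }
   ]

   prompt = chat_template.render(
       messages=messages,
       bos_token="<|begin_of_text|>",
       add_generation_prompt=True,
   )
   print(prompt)
   # <|begin_of_text|><|start_header_id|>user<|end_header_id|>
   #
   # <|image|>Describe this picture.<|eot_id|><|start_header_id|>assistant<|end_header_id|>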
91 changes: 91 additions & 0 deletions xinference/model/llm/llm_family_modelscope.json
@@ -363,6 +363,97 @@
"<|eom_id|>"
]
},
{
"version": 1,
"context_length": 131072,
"model_name": "llama-3.2-vision-instruct",
"model_lang": [
"en",
"de",
"fr",
"it",
"pt",
"hi",
"es",
"th"
],
"model_ability": [
"chat",
"vision"
],
"model_description": "Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image...",
"model_specs": [
{
"model_format": "pytorch",
"model_size_in_billions": 11,
"quantizations": [
"none"
],
"model_id": "LLM-Research/Llama-3.2-11B-Vision-Instruct",
"model_hub": "modelscope"
},
{
"model_format": "pytorch",
"model_size_in_billions": 90,
"quantizations": [
"none"
],
"model_id": "LLM-Research/Llama-3.2-90B-Vision-Instruct",
"model_hub": "modelscope"
}
],
"chat_template": "{% for message in messages %}{% if loop.index0 == 0 %}{{ bos_token }}{% endif %}{{ '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' }}{% if message['content'] is string %}{{ message['content'] }}{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' %}{{ '<|image|>' }}{% elif content['type'] == 'text' %}{{ content['text'] }}{% endif %}{% endfor %}{% endif %}{{ '<|eot_id|>' }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}",
"stop_token_ids": [
128001,
128008,
128009
],
"stop": [
"<|end_of_text|>",
"<|eot_id|>",
"<|eom_id|>"
]
},
{
"version": 1,
"context_length": 131072,
"model_name": "llama-3.2-vision",
"model_lang": [
"en",
"de",
"fr",
"it",
"pt",
"hi",
"es",
"th"
],
"model_ability": [
"generate",
"vision"
],
"model_description": "The Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image...",
"model_specs": [
{
"model_format": "pytorch",
"model_size_in_billions": 11,
"quantizations": [
"none"
],
"model_id": "LLM-Research/Llama-3.2-11B-Vision",
"model_hub": "modelscope"
},
{
"model_format": "pytorch",
"model_size_in_billions": 90,
"quantizations": [
"none"
],
"model_id": "LLM-Research/Llama-3.2-90B-Vision",
"model_hub": "modelscope"
}
]
},
{
"version": 1,
"context_length": 2048,
3 changes: 2 additions & 1 deletion xinference/model/llm/vllm/core.py
@@ -163,7 +163,6 @@ class VLLMGenerateConfig(TypedDict, total=False):
VLLM_SUPPORTED_CHAT_MODELS.append("deepseek-v2-chat-0628")
VLLM_SUPPORTED_CHAT_MODELS.append("deepseek-v2.5")


if VLLM_INSTALLED and vllm.__version__ >= "0.5.3":
VLLM_SUPPORTED_CHAT_MODELS.append("gemma-2-it")
VLLM_SUPPORTED_CHAT_MODELS.append("mistral-nemo-instruct")
@@ -177,6 +176,8 @@ class VLLMGenerateConfig(TypedDict, total=False):
VLLM_SUPPORTED_VISION_MODEL_LIST.append("internvl2")

if VLLM_INSTALLED and vllm.__version__ >= "0.6.3":
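# NOTE: the Llama 3.2 vision models rely on vLLM's mllama support, which only
# exists in recent vLLM releases; hence the version gate above.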
VLLM_SUPPORTED_MODELS.append("llama-3.2-vision")
VLLM_SUPPORTED_VISION_MODEL_LIST.append("llama-3.2-vision-instruct")
VLLM_SUPPORTED_VISION_MODEL_LIST.append("qwen2-vl-instruct")

