
qwen2vl notebook (#2380)
eaidova authored Sep 11, 2024
1 parent 4dc1f9f commit 09167f8
Showing 8 changed files with 1,605 additions and 74 deletions.
1 change: 1 addition & 0 deletions .ci/ignore_treon_docker.txt
@@ -78,3 +78,4 @@ notebooks/triposr-3d-reconstruction/triposr-3d-reconstruction.ipynb
notebooks/llm-agent-react/llm-agent-rag-llamaindex.ipynb
notebooks/stable-audio/stable-audio.ipynb
notebooks/internvl2/internvl2.ipynb
notebooks/qwen2-vl/qwen2-vl.ipynb
7 changes: 7 additions & 0 deletions .ci/skipped_notebooks.yml
@@ -567,3 +567,10 @@
- ubuntu-20.04
- ubuntu-22.04
- windows-2019
- notebook: notebooks/qwen2-vl/qwen2-vl.ipynb
skips:
- os:
- macos-12
- ubuntu-20.04
- ubuntu-22.04
- windows-2019
14 changes: 14 additions & 0 deletions .ci/spellcheck/.pyspelling.wordlist.txt
@@ -283,6 +283,7 @@ Gemma
gemma
genai
genAI
GenerationMixin
Girshick
Gitee
GitHub
@@ -454,6 +455,7 @@ mathbf
MathVista
MatMul
MBs
md
MediaPipe
mel
Mels
@@ -496,6 +498,7 @@ mpnet
mpt
MPT
MRPC
MTVQA
multiarchitecture
Multiclass
multiclass
@@ -573,12 +576,15 @@ Orca
OVC
overfitting
overlayed
ov
OV
OVC
OVModel
OVModelForCausalLM
OVModelForXXX
OVModelForXxx
OVMS
OVQwen
OVStableDiffusionPipeline
OVStableDiffusionInpaintPipeline
OvStableDiffusionInpaintingPipeline
@@ -616,6 +622,7 @@ PIXART
pixelwise
PIL
PNDM
png
Pointilism
PointNet
Postfuse
@@ -677,6 +684,7 @@ quantizing
QuartzNet
qwen
Qwen
QwenVL
Radiopaedia
Radosavovic
Raito
@@ -685,10 +693,12 @@ Ranftl
RASPP
rcnn
ReAct
README
RealSense
RealSR
Realtime
realtime
RealWorldQA
rebase
ReciproCAM
redistributable
@@ -776,6 +786,7 @@ softmax
softvc
SoftVC
SOTA
SoTA
Sovits
sparsity
Sparisty
@@ -920,6 +931,9 @@ VITReciproCAM
vits
VITS
vitt
VL
vl
VLModel
VM
Vladlen
VOC
79 changes: 5 additions & 74 deletions notebooks/internvl2/internvl2.ipynb
@@ -1,7 +1,6 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -36,7 +35,6 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -50,7 +48,7 @@
"metadata": {},
"outputs": [],
"source": [
"%pip install -q \"transformers>4.36\" \"torch>=2.1\" \"torchvision\" \"einops\" \"timm\" \"Pillow\" \"gradio>=4.36\" --extra-index-url https://download.pytorch.org/whl/cpu\n",
"%pip install -q \"transformers>4.36,<4.45\" \"torch>=2.1\" \"torchvision\" \"einops\" \"timm\" \"Pillow\" \"gradio>=4.36\" --extra-index-url https://download.pytorch.org/whl/cpu\n",
"%pip install -q \"openvino>=2024.3.0\" \"nncf>=2.12.0\""
]
},
@@ -81,14 +79,13 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Select model\n",
"[back to top ⬆️](#Table-of-contents:)\n",
"\n",
"There are multiple InternVL2 model available in [models collection](https://huggingface.co/collections/OpenGVLab/internvl-20-667d3961ab5eb12c7ed1463e). You can select one of them for conversion and optimization in notebook using widget bellow:"
"There are multiple InternVL2 models available in the [models collection](https://huggingface.co/collections/OpenGVLab/internvl-20-667d3961ab5eb12c7ed1463e). You can select one of them for conversion and optimization in the notebook using the widget below:"
]
},
{
@@ -147,7 +144,6 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -220,16 +216,15 @@
"import nncf\n",
"\n",
"compression_configuration = {\n",
" \"mode\": nncf.CompressWeightsMode.INT4_SYM,\n",
" \"group_size\": 64,\n",
" \"mode\": nncf.CompressWeightsMode.INT4_ASYM,\n",
" \"group_size\": 128,\n",
" \"ratio\": 1.0,\n",
"}\n",
"\n",
"convert_internvl2_model(pt_model_id, model_dir, compression_configuration)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -267,7 +262,6 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"tags": []
@@ -304,7 +298,6 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -409,7 +402,6 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -474,68 +466,7 @@
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"state": {
"05db20e075bf4842bfb0ef80736dc837": {
"model_module": "@jupyter-widgets/base",
"model_module_version": "2.0.0",
"model_name": "LayoutModel",
"state": {}
},
"1e4ea3abea4b4bc49f93b4975027c4f5": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "DescriptionStyleModel",
"state": {
"description_width": ""
}
},
"7d77646f3f6743c682d04e7eb7998211": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "DropdownModel",
"state": {
"_options_labels": [
"OpenGVLab/InternVL2-1B",
"OpenGVLab/InternVL2-2B",
"OpenGVLab/InternVL2-4B",
"OpenGVLab/InternVL2-8B"
],
"description": "Model:",
"index": 0,
"layout": "IPY_MODEL_b4db0aacc7f1481c87c41473555d3635",
"style": "IPY_MODEL_1e4ea3abea4b4bc49f93b4975027c4f5"
}
},
"b4db0aacc7f1481c87c41473555d3635": {
"model_module": "@jupyter-widgets/base",
"model_module_version": "2.0.0",
"model_name": "LayoutModel",
"state": {}
},
"e17c3b92263a488184f405054b876878": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "DropdownModel",
"state": {
"_options_labels": [
"CPU",
"AUTO"
],
"description": "Device:",
"index": 1,
"layout": "IPY_MODEL_05db20e075bf4842bfb0ef80736dc837",
"style": "IPY_MODEL_e3d29fef6d73445fbbb766698ebafca8"
}
},
"e3d29fef6d73445fbbb766698ebafca8": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "DescriptionStyleModel",
"state": {
"description_width": ""
}
}
},
"state": {},
"version_major": 2,
"version_minor": 0
}
50 changes: 50 additions & 0 deletions notebooks/qwen2-vl/README.md
@@ -0,0 +1,50 @@
# Visual-language assistant with Qwen2VL and OpenVINO

Qwen2VL is the latest addition to the QwenVL series of multimodal large language models.

**Key Enhancements of Qwen2VL:**
* **SoTA understanding of images of various resolution & ratio**: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.
* **Understanding videos of 20min+**: Qwen2-VL can understand videos over 20 minutes for high-quality video-based question answering, dialog, content creation, etc.
* **Agent that can operate your mobiles, robots, etc.:** with the abilities of complex reasoning and decision making, Qwen2-VL can be integrated with devices like mobile phones, robots, etc., for automatic operation based on visual environment and text instructions.
* **Multilingual Support:** to serve global users, besides English and Chinese, Qwen2-VL now supports the understanding of texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc.


**Model Architecture Details:**

* **Naive Dynamic Resolution**: Qwen2-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens, offering a more human-like visual processing experience.

<p align="center">
<img src="https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/Qwen2-VL/qwen2_vl.jpg" width="50%"/>
</p>
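The dynamic-resolution idea above can be sketched numerically: the number of visual tokens grows with the image's resolution rather than being fixed. In this minimal sketch the patch size (14) and the 2×2 patch-merge factor are illustrative assumptions, not confirmed Qwen2-VL internals.

```python
# Hypothetical sketch: map an image resolution to a dynamic visual-token
# count. patch=14 and merge=2 are assumptions for illustration only.

def visual_token_count(height: int, width: int,
                       patch: int = 14, merge: int = 2) -> int:
    """Tokens after patchifying the image and merging merge x merge patches."""
    h_patches = -(-height // patch)   # ceil division
    w_patches = -(-width // patch)
    return (-(-h_patches // merge)) * (-(-w_patches // merge))

# Different resolutions and aspect ratios yield different token budgets:
print(visual_token_count(448, 448))    # 32x32 patches -> 16x16 = 256 tokens
print(visual_token_count(1344, 448))   # 96x32 patches -> 48x16 = 768 tokens
```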

* **Multimodal Rotary Position Embedding (M-ROPE)**: Decomposes positional embedding into parts to capture 1D textual, 2D visual, and 3D video positional information, enhancing its multimodal processing capabilities.

<p align="center">
<img src="http://qianwen-res.oss-accelerate-overseas.aliyuncs.com/Qwen2-VL/mrope.png" width="50%"/>
</p>
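To make the M-ROPE decomposition concrete, here is a hedged toy sketch of 3-component position ids for a short text-then-image sequence: text tokens share one index across all three components, while image tokens vary the height/width components over the patch grid. The exact indexing scheme Qwen2-VL uses may differ; this only illustrates the decomposition idea.

```python
# Toy sketch of M-ROPE-style positions: each token gets a (temporal,
# height, width) triple instead of a single 1D index. Illustrative only.

def mrope_positions(n_text_before: int, grid_h: int, grid_w: int):
    positions = []  # list of (t, h, w) triples
    # 1D text: all three components advance together.
    for p in range(n_text_before):
        positions.append((p, p, p))
    # 2D image: temporal component frozen; h/w follow the patch grid.
    t = n_text_before
    for h in range(grid_h):
        for w in range(grid_w):
            positions.append((t, t + h, t + w))
    return positions

# 3 text tokens followed by a 2x2 image grid:
print(mrope_positions(3, 2, 2))
# [(0, 0, 0), (1, 1, 1), (2, 2, 2), (3, 3, 3), (3, 3, 4), (3, 4, 3), (3, 4, 4)]
```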



More details about the model can be found in the [model card](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct), [blog](https://qwenlm.github.io/blog/qwen2-vl/) and the original [repo](https://github.com/QwenLM/Qwen2-VL).

In this tutorial we consider how to convert and optimize the Qwen2VL model for creating a multimodal chatbot. Additionally, we demonstrate how to apply a stateful transformation to the LLM part and model optimization techniques such as weight compression using [NNCF](https://github.com/openvinotoolkit/nncf).
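As a rough illustration of what group-wise INT4 asymmetric weight compression means arithmetically (e.g. the `INT4_ASYM` mode with `group_size=128` used for the InternVL2 notebook in this commit), here is a pure-Python sketch: each group of 128 weights gets its own scale and zero point, and values are rounded to 16 levels. This mimics the idea only; it is not NNCF's actual implementation.

```python
import random

# Sketch of asymmetric 4-bit group quantization: per-group scale and
# zero point, values rounded to the integer range [0, 15].

def quantize_group(w, bits=4):
    levels = 2**bits - 1
    lo, hi = min(w), max(w)
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = [round((x - lo) / scale) for x in w]  # ints in [0, 15]
    return q, scale, lo

def dequantize_group(q, scale, lo):
    return [v * scale + lo for v in q]

random.seed(0)
w = [random.gauss(0.0, 1.0) for _ in range(128)]  # one group of 128 weights
q, scale, lo = quantize_group(w)
w_hat = dequantize_group(q, scale, lo)
err = max(abs(a - b) for a, b in zip(w, w_hat))
print(max(q) <= 15, err <= scale / 2 + 1e-9)  # True True
```

Smaller groups track local weight statistics more closely (lower error) at the cost of storing more scales and zero points, which is the trade-off the `group_size` parameter controls.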

## Notebook contents
The tutorial consists of the following steps:

- Install requirements
- Convert and Optimize model
- Run OpenVINO model inference
- Launch Interactive demo

In this demonstration, you'll create an interactive chatbot that can answer questions about the content of a provided image.

The image below illustrates an example of an input prompt and the model's answer.
![example.png](https://github.com/user-attachments/assets/7e12ac6c-12f8-43d8-9c0a-b63d6ecaf20b)

## Installation instructions
This is a self-contained example that relies solely on its own code.<br/>
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start.
For details, please refer to [Installation Guide](../../README.md).

<img referrerpolicy="no-referrer-when-downgrade" src="https://static.scarf.sh/a.png?x-pxid=5b5a4db0-7875-4bfb-bdbd-01698b5b1a77&file=notebooks/qwen2-vl/README.md" />
