
qwen2vl notebook (#2380)
eaidova authored Sep 11, 2024
1 parent 4dc1f9f commit 09167f8
Showing 8 changed files with 1,605 additions and 74 deletions.
1 change: 1 addition & 0 deletions .ci/ignore_treon_docker.txt
@@ -78,3 +78,4 @@ notebooks/triposr-3d-reconstruction/triposr-3d-reconstruction.ipynb
notebooks/llm-agent-react/llm-agent-rag-llamaindex.ipynb
notebooks/stable-audio/stable-audio.ipynb
notebooks/internvl2/internvl2.ipynb
notebooks/qwen2-vl/qwen2-vl.ipynb
7 changes: 7 additions & 0 deletions .ci/skipped_notebooks.yml
@@ -567,3 +567,10 @@
- ubuntu-20.04
- ubuntu-22.04
- windows-2019
- notebook: notebooks/qwen2-vl/qwen2-vl.ipynb
skips:
- os:
- macos-12
- ubuntu-20.04
- ubuntu-22.04
- windows-2019
14 changes: 14 additions & 0 deletions .ci/spellcheck/.pyspelling.wordlist.txt
@@ -283,6 +283,7 @@ Gemma
gemma
genai
genAI
GenerationMixin
Girshick
Gitee
GitHub
@@ -454,6 +455,7 @@ mathbf
MathVista
MatMul
MBs
md
MediaPipe
mel
Mels
@@ -496,6 +498,7 @@ mpnet
mpt
MPT
MRPC
MTVQA
multiarchitecture
Multiclass
multiclass
@@ -573,12 +576,15 @@ Orca
OVC
overfitting
overlayed
ov
OV
OVC
OVModel
OVModelForCausalLM
OVModelForXXX
OVModelForXxx
OVMS
OVQwen
OVStableDiffusionPipeline
OVStableDiffusionInpaintPipeline
OvStableDiffusionInpaintingPipeline
@@ -616,6 +622,7 @@ PIXART
pixelwise
PIL
PNDM
png
Pointilism
PointNet
Postfuse
@@ -677,6 +684,7 @@ quantizing
QuartzNet
qwen
Qwen
QwenVL
Radiopaedia
Radosavovic
Raito
@@ -685,10 +693,12 @@ Ranftl
RASPP
rcnn
ReAct
README
RealSense
RealSR
Realtime
realtime
RealWorldQA
rebase
ReciproCAM
redistributable
@@ -776,6 +786,7 @@ softmax
softvc
SoftVC
SOTA
SoTA
Sovits
sparsity
Sparisty
@@ -920,6 +931,9 @@ VITReciproCAM
vits
VITS
vitt
VL
vl
VLModel
VM
Vladlen
VOC
79 changes: 5 additions & 74 deletions notebooks/internvl2/internvl2.ipynb
@@ -1,7 +1,6 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -36,7 +35,6 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -50,7 +48,7 @@
"metadata": {},
"outputs": [],
"source": [
"%pip install -q \"transformers>4.36\" \"torch>=2.1\" \"torchvision\" \"einops\" \"timm\" \"Pillow\" \"gradio>=4.36\" --extra-index-url https://download.pytorch.org/whl/cpu\n",
"%pip install -q \"transformers>4.36,<4.45\" \"torch>=2.1\" \"torchvision\" \"einops\" \"timm\" \"Pillow\" \"gradio>=4.36\" --extra-index-url https://download.pytorch.org/whl/cpu\n",
"%pip install -q \"openvino>=2024.3.0\" \"nncf>=2.12.0\""
]
},
@@ -81,14 +79,13 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Select model\n",
"[back to top ⬆️](#Table-of-contents:)\n",
"\n",
"There are multiple InternVL2 model available in [models collection](https://huggingface.co/collections/OpenGVLab/internvl-20-667d3961ab5eb12c7ed1463e). You can select one of them for conversion and optimization in notebook using widget bellow:"
"There are multiple InternVL2 models available in the [models collection](https://huggingface.co/collections/OpenGVLab/internvl-20-667d3961ab5eb12c7ed1463e). You can select one of them for conversion and optimization in the notebook using the widget below:"
]
},
{
@@ -147,7 +144,6 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -220,16 +216,15 @@
"import nncf\n",
"\n",
"compression_configuration = {\n",
" \"mode\": nncf.CompressWeightsMode.INT4_SYM,\n",
" \"group_size\": 64,\n",
" \"mode\": nncf.CompressWeightsMode.INT4_ASYM,\n",
" \"group_size\": 128,\n",
" \"ratio\": 1.0,\n",
"}\n",
"\n",
"convert_internvl2_model(pt_model_id, model_dir, compression_configuration)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -267,7 +262,6 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"tags": []
@@ -304,7 +298,6 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -409,7 +402,6 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -474,68 +466,7 @@
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"state": {
"05db20e075bf4842bfb0ef80736dc837": {
"model_module": "@jupyter-widgets/base",
"model_module_version": "2.0.0",
"model_name": "LayoutModel",
"state": {}
},
"1e4ea3abea4b4bc49f93b4975027c4f5": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "DescriptionStyleModel",
"state": {
"description_width": ""
}
},
"7d77646f3f6743c682d04e7eb7998211": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "DropdownModel",
"state": {
"_options_labels": [
"OpenGVLab/InternVL2-1B",
"OpenGVLab/InternVL2-2B",
"OpenGVLab/InternVL2-4B",
"OpenGVLab/InternVL2-8B"
],
"description": "Model:",
"index": 0,
"layout": "IPY_MODEL_b4db0aacc7f1481c87c41473555d3635",
"style": "IPY_MODEL_1e4ea3abea4b4bc49f93b4975027c4f5"
}
},
"b4db0aacc7f1481c87c41473555d3635": {
"model_module": "@jupyter-widgets/base",
"model_module_version": "2.0.0",
"model_name": "LayoutModel",
"state": {}
},
"e17c3b92263a488184f405054b876878": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "DropdownModel",
"state": {
"_options_labels": [
"CPU",
"AUTO"
],
"description": "Device:",
"index": 1,
"layout": "IPY_MODEL_05db20e075bf4842bfb0ef80736dc837",
"style": "IPY_MODEL_e3d29fef6d73445fbbb766698ebafca8"
}
},
"e3d29fef6d73445fbbb766698ebafca8": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "DescriptionStyleModel",
"state": {
"description_width": ""
}
}
},
"state": {},
"version_major": 2,
"version_minor": 0
}
50 changes: 50 additions & 0 deletions notebooks/qwen2-vl/README.md
@@ -0,0 +1,50 @@
# Visual-language assistant with Qwen2VL and OpenVINO

Qwen2VL is the latest addition to the QwenVL series of multimodal large language models.

**Key Enhancements of Qwen2VL:**
* **SoTA understanding of images of various resolution & ratio**: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.
* **Understanding videos of 20min+**: Qwen2-VL can understand videos over 20 minutes for high-quality video-based question answering, dialog, content creation, etc.
* **Agent that can operate your mobiles, robots, etc.:** with the abilities of complex reasoning and decision making, Qwen2-VL can be integrated with devices like mobile phones, robots, etc., for automatic operation based on visual environment and text instructions.
* **Multilingual Support:** to serve global users, besides English and Chinese, Qwen2-VL now supports the understanding of texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc.


**Model Architecture Details:**

* **Naive Dynamic Resolution**: Qwen2-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens, offering a more human-like visual processing experience.

<p align="center">
<img src="https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/Qwen2-VL/qwen2_vl.jpg" width="50%"/>
</p>
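The dynamic-resolution idea above can be sketched numerically: the number of visual tokens grows with the image's resolution rather than being fixed. In this minimal sketch the patch size (14) and the 2×2 patch-merge factor are illustrative assumptions, not confirmed Qwen2-VL internals.

```python
# Hypothetical sketch: map an image resolution to a dynamic visual-token
# count. patch=14 and merge=2 are assumptions for illustration only.

def visual_token_count(height: int, width: int,
                       patch: int = 14, merge: int = 2) -> int:
    """Tokens after patchifying the image and merging merge x merge patches."""
    h_patches = -(-height // patch)   # ceil division
    w_patches = -(-width // patch)
    return (-(-h_patches // merge)) * (-(-w_patches // merge))

# Different resolutions and aspect ratios yield different token budgets:
print(visual_token_count(448, 448))    # 32x32 patches -> 16x16 = 256 tokens
print(visual_token_count(1344, 448))   # 96x32 patches -> 48x16 = 768 tokens
```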

* **Multimodal Rotary Position Embedding (M-ROPE)**: Decomposes positional embedding into parts to capture 1D textual, 2D visual, and 3D video positional information, enhancing its multimodal processing capabilities.

<p align="center">
<img src="http://qianwen-res.oss-accelerate-overseas.aliyuncs.com/Qwen2-VL/mrope.png" width="50%"/>
</p>
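To make the M-ROPE decomposition concrete, here is a hedged toy sketch of 3-component position ids for a short text-then-image sequence: text tokens share one index across all three components, while image tokens vary the height/width components over the patch grid. The exact indexing scheme Qwen2-VL uses may differ; this only illustrates the decomposition idea.

```python
# Toy sketch of M-ROPE-style positions: each token gets a (temporal,
# height, width) triple instead of a single 1D index. Illustrative only.

def mrope_positions(n_text_before: int, grid_h: int, grid_w: int):
    positions = []  # list of (t, h, w) triples
    # 1D text: all three components advance together.
    for p in range(n_text_before):
        positions.append((p, p, p))
    # 2D image: temporal component frozen; h/w follow the patch grid.
    t = n_text_before
    for h in range(grid_h):
        for w in range(grid_w):
            positions.append((t, t + h, t + w))
    return positions

# 3 text tokens followed by a 2x2 image grid:
print(mrope_positions(3, 2, 2))
# [(0, 0, 0), (1, 1, 1), (2, 2, 2), (3, 3, 3), (3, 3, 4), (3, 4, 3), (3, 4, 4)]
```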



More details about the model can be found in the [model card](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct), [blog](https://qwenlm.github.io/blog/qwen2-vl/) and the original [repo](https://github.com/QwenLM/Qwen2-VL).

In this tutorial we consider how to convert and optimize the Qwen2VL model for creating a multimodal chatbot. Additionally, we demonstrate how to apply a stateful transformation to the LLM part and model optimization techniques such as weight compression using [NNCF](https://github.com/openvinotoolkit/nncf).
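As a rough illustration of what group-wise INT4 asymmetric weight compression means arithmetically (e.g. the `INT4_ASYM` mode with `group_size=128` used for the InternVL2 notebook in this commit), here is a pure-Python sketch: each group of 128 weights gets its own scale and zero point, and values are rounded to 16 levels. This mimics the idea only; it is not NNCF's actual implementation.

```python
import random

# Sketch of asymmetric 4-bit group quantization: per-group scale and
# zero point, values rounded to the integer range [0, 15].

def quantize_group(w, bits=4):
    levels = 2**bits - 1
    lo, hi = min(w), max(w)
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = [round((x - lo) / scale) for x in w]  # ints in [0, 15]
    return q, scale, lo

def dequantize_group(q, scale, lo):
    return [v * scale + lo for v in q]

random.seed(0)
w = [random.gauss(0.0, 1.0) for _ in range(128)]  # one group of 128 weights
q, scale, lo = quantize_group(w)
w_hat = dequantize_group(q, scale, lo)
err = max(abs(a - b) for a, b in zip(w, w_hat))
print(max(q) <= 15, err <= scale / 2 + 1e-9)  # True True
```

Smaller groups track local weight statistics more closely (lower error) at the cost of storing more scales and zero points, which is the trade-off the `group_size` parameter controls.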

## Notebook contents
The tutorial consists of the following steps:

- Install requirements
- Convert and Optimize model
- Run OpenVINO model inference
- Launch Interactive demo

In this demonstration, you'll create an interactive chatbot that can answer questions about the content of a provided image.

The image below illustrates an example of an input prompt and the model's answer.
![example.png](https://github.com/user-attachments/assets/7e12ac6c-12f8-43d8-9c0a-b63d6ecaf20b)

## Installation instructions
This is a self-contained example that relies solely on its own code.<br/>
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start.
For details, please refer to [Installation Guide](../../README.md).

<img referrerpolicy="no-referrer-when-downgrade" src="https://static.scarf.sh/a.png?x-pxid=5b5a4db0-7875-4bfb-bdbd-01698b5b1a77&file=notebooks/qwen2-vl/README.md" />
