docs: add a colab for BiT in video search #6

Open · wants to merge 1 commit into master
450 changes: 450 additions & 0 deletions colabs/bit_video2video.ipynb
@@ -0,0 +1,450 @@
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "bit_video2video.ipynb",
"provenance": [],
"toc_visible": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "bmQ4NyuERkov",
"colab_type": "text"
},
"source": [
"# BigTransfer (BiT) for Video Semantic Search in Large Scale\n",
"\n",
"By Han Xiao\n",
"\n",
"In this colab, we will show you how to load one of the BiT models (a ResNet50 trained on ImageNet-21k) and [Jina](https://get.jina.ai) to build a video semantic search system. \n",
"\n",
"## Dataset\n",
"\n",
"The data we are using is [Tumblr GIF (TGIF) dataset](http://raingo.github.io/TGIF-Release/), which contains 100K animated GIFs and 120K sentences describing visual contents. Our problem is the following: **given a video database and a query video, find the top-k semantically related videos from the database.**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "CRfFigjWTSp3",
"colab_type": "text"
},
"source": [
"| | | |\n",
"|:---:|:---:|:---:|\n",
"| ![](https://hanxiao.io/2019/11/22/Video-Semantic-Search-in-Large-Scale-using-GNES-and-TF-2-0/tumblr_njqj3bMKQF1unc0x7o1_250.gif)| ![](https://hanxiao.io/2019/11/22/Video-Semantic-Search-in-Large-Scale-using-GNES-and-TF-2-0/tumblr_ni35trgNe41tmk5mfo1_400.gif) | ![](https://hanxiao.io/2019/11/22/Video-Semantic-Search-in-Large-Scale-using-GNES-and-TF-2-0/tumblr_nb2mucKMeU1tkz79uo1_250.gif) |\n",
"|A well-dressed young guy with gelled red hair glides across a room and scans it with his eyes. | a woman in a car is singing. | a man wearing a suit smiles at something in the distance. |"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "q7Ja3C1gUN0y",
"colab_type": "text"
},
"source": [
"## Problem Formulation \n",
"\n",
"“Semantic” is a casual and ambiguous word, I know. Depending on your applications and scenarios, it could mean motion-wise similar (sports video), emotional similar (e.g. memes), etc. Right now I will just consider semantically-related as as visually similar.\n",
"\n",
"Text descriptions of the videos, though potentially can be very useful, are ignored at the moment. We are not building a cross-modality search solution (e.g. from text to video or vice versa), we also do not leverage textual information when building the video search solution. Nonetheless, those text descriptions can be used to evaluate/compare the effectiveness of the system in a quantitative manner.\n",
"\n",
"Putting the problem into the neural search framework, this breaks down into the following steps:\n",
"\n",
"> **Index time**\n",
"1. segment each video into workable semantic units (aka [\"Chunk\"](https://github.com/jina-ai/jina/tree/master/docs/chapters/101#document--chunk));\n",
"2. encode each chunk as a fixed-length vector;\n",
"3. store all vector representations in a vector database.\n",
"\n",
"> **Query time**\n",
"1. do steps `1`,`2` in the index time for each incoming query;\n",
"2. retrieve relevant chunks from database;\n",
"3. aggregate the chunk-level score back to document-level;\n",
"4. return the top-k results to users. \n"
]
},
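{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make step 3 of the query-time pipeline concrete, here is a minimal, framework-free sketch of how chunk-level scores could be aggregated back to the document level. It is an illustration only, not Jina's actual ranker: the `chunk_matches` structure and the max-aggregation rule are assumptions made for this example.\n",
"\n",
"```python\n",
"# hypothetical chunk-level matches: (doc_id, score) pairs returned for a query\n",
"chunk_matches = [(3, 0.91), (3, 0.85), (7, 0.88), (1, 0.42), (7, 0.40)]\n",
"\n",
"def aggregate_to_docs(matches, top_k=2):\n",
"    # aggregate chunk scores per document; taking the max is just one possible choice\n",
"    doc_scores = {}\n",
"    for doc_id, score in matches:\n",
"        doc_scores[doc_id] = max(score, doc_scores.get(doc_id, 0.0))\n",
"    # sort documents by the aggregated score and return the top-k\n",
"    ranked = sorted(doc_scores.items(), key=lambda x: x[1], reverse=True)\n",
"    return ranked[:top_k]\n",
"\n",
"print(aggregate_to_docs(chunk_matches))  # [(3, 0.91), (7, 0.88)]\n",
"```"
]
},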
{
"cell_type": "code",
"metadata": {
"id": "mFZlvtyvWwvn",
"colab_type": "code",
"colab": {}
},
"source": [
"#@title Imports\n",
"import tensorflow as tf\n",
"import tensorflow_hub as hub\n",
"\n",
"import tensorflow_datasets as tfds\n",
"\n",
"import time\n",
"\n",
"from PIL import Image\n",
"import requests\n",
"from io import BytesIO\n",
"\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"\n",
"import os\n",
"import pathlib"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "0vu-Wgw8UOMP",
"colab_type": "text"
},
"source": [
"## Preprocessing Videos\n",
"\n",
"A good neural search is only possible when document and query are comparable semantic units. The preprocessor serves exactly this purpose. It segments a document into a list of semantic units, each of which is called a \"chunk\" in Jina. For video, a meaningful unary chunk could a *frame* or a *shot* (i.e. a series of frames that runs for an uninterrupted period of time). In Tumblr GIF dataset, most of the animations have less than three shots. Thus, I will simply use frame as chunk to represent document. \n",
"\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "JYXtSkRgWKYy",
"colab_type": "code",
"colab": {}
},
"source": [
"import io\n",
"from typing import Dict\n",
"\n",
"import numpy as np\n",
"from PIL import Image\n",
"from gif_reader import get_frames\n",
"from jina.executors.crafters import BaseDocCrafter\n",
"from jina.executors.crafters import BaseSegmenter\n",
"\n",
"\n",
"class GifNameRawSplit(BaseDocCrafter):\n",
"\n",
" def craft(self, raw_bytes, *args, **kwargs) -> Dict:\n",
" file_name, raw_bytes = raw_bytes.split(b'JINA_DELIM')\n",
" return dict(raw_bytes=raw_bytes, meta_info=file_name)\n",
"\n",
"\n",
"class GifPreprocessor(BaseSegmenter):\n",
" def __init__(self, img_shape: int = 96, every_k_frame: int = 1, max_frame: int = None, from_bytes: bool = False,\n",
" *args, **kwargs):\n",
" super().__init__(*args, **kwargs)\n",
" self.img_shape = img_shape\n",
" self.every_k_frame = every_k_frame\n",
" self.max_frame = max_frame\n",
" self.from_bytes = from_bytes\n",
"\n",
" def craft(self, raw_bytes, doc_id):\n",
" result = []\n",
" try:\n",
" if self.from_bytes:\n",
" im = Image.open(io.BytesIO(raw_bytes))\n",
" else:\n",
" im = Image.open(raw_bytes.decode())\n",
" idx = 0\n",
" for frame in get_frames(im):\n",
" try:\n",
"\n",
" if idx % self.every_k_frame == 0 and (\n",
" (self.max_frame is not None and idx < self.max_frame) or self.max_frame is None):\n",
" new_frame = frame.convert('RGB').resize([self.img_shape, ] * 2)\n",
" img = (np.array(new_frame) / 255).astype(np.float32)\n",
" # build chunk next, if the previous fail, then no chunk will be add\n",
" result.append(dict(doc_id=doc_id, offset=idx,\n",
" weight=1., blob=img))\n",
" except Exception as ex:\n",
" self.logger.error(ex)\n",
" finally:\n",
" idx = idx + 1\n",
"\n",
" return result\n",
"\n",
" except Exception as ex:\n",
" self.logger.error(ex)\n"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "Y3YqWUfWWP61"
},
"source": [
"This preprocessor loads the animation, reads its frames into RGB format, resizes each of them to 96x96 and stores in `doc.chunks.blob` as `numpy.ndarray`. At the moment we don't implement any keyframe detection in the preprocessor, so every chunk has a uniform weight, i.e. `c.weight=1`. \n",
"\n",
"![](https://hanxiao.io/2019/11/22/Video-Semantic-Search-in-Large-Scale-using-GNES-and-TF-2-0/tumblr_njqj3bMKQF1unc0x7o1_250.gif.jpg)\n",
"![](https://hanxiao.io/2019/11/22/Video-Semantic-Search-in-Large-Scale-using-GNES-and-TF-2-0/tumblr_ni35trgNe41tmk5mfo1_400.gif.jpg)\n",
"![](https://hanxiao.io/2019/11/22/Video-Semantic-Search-in-Large-Scale-using-GNES-and-TF-2-0/tumblr_nb2mucKMeU1tkz79uo1_250.gif.jpg)\n",
"\n",
"One may think of more sophisticated preprocessors. For example, smart sub-sampling to reduce the number of near-duplicated frames; using [seam carving](http://en.wikipedia.org/wiki/Seam_carving) for better cropping and resizing frames; or adding image effects and enhancements. Everything is possible and I will leave these possibilities to the readers."
]
},
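{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a concrete illustration of the sub-sampling idea above, the following minimal sketch extracts frames from a GIF with Pillow and keeps a frame only if it differs enough from the previously kept one. It is independent of the Jina executor above; the `diff_threshold` value and the `some.gif` path are assumptions made for this example.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from PIL import Image, ImageSequence\n",
"\n",
"def subsampled_frames(gif_path, img_shape=96, diff_threshold=0.02):\n",
"    # extract RGB frames, resize them, and keep a frame only when its mean\n",
"    # absolute pixel difference from the last kept frame exceeds the threshold\n",
"    kept, last = [], None\n",
"    with Image.open(gif_path) as im:\n",
"        for frame in ImageSequence.Iterator(im):\n",
"            arr = np.array(frame.convert('RGB').resize((img_shape, img_shape))) / 255.0\n",
"            if last is None or np.abs(arr - last).mean() > diff_threshold:\n",
"                kept.append(arr.astype(np.float32))\n",
"                last = arr\n",
"    return kept\n",
"\n",
"# frames = subsampled_frames('some.gif')  # hypothetical local GIF path\n",
"```"
]
},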
{
"cell_type": "markdown",
"metadata": {
"id": "5_W7Wf0hYx5w",
"colab_type": "text"
},
"source": [
"## Using BiT to Encode Chunks into Vectors\n",
"\n",
"In the encoding step, we want to represent each chunk by a fixed-length vector. This can be easily done with the [pretrained models provided by BiT](https://github.com/google-research/big_transfer).\n",
"\n",
"Jina already supports BiT, its implementation is as simple as below:"
]
},
{
"cell_type": "code",
"metadata": {
"id": "4O-Z4nonTng0",
"colab_type": "code",
"colab": {}
},
"source": [
"class BiTImageEncoder(BaseCVTFEncoder):\n",
" def __init__(self, model_path: str, channel_axis: int = -1, *args, **kwargs):\n",
" \"\"\"\n",
" :param model_path: the path of the model in the `SavedModel` format. `model_path` should be a directory path,\n",
" which has the following structure. The pretrained model can be downloaded at\n",
" wget https://storage.googleapis.com/bit_models/Imagenet21k/[model_name]/feature_vectors/saved_model.pb\n",
" wget https://storage.googleapis.com/bit_models/Imagenet21k/[model_name]/feature_vectors/variables/variables.data-00000-of-00001\n",
" wget https://storage.googleapis.com/bit_models/Imagenet21k/[model_name]/feature_vectors/variables/variables.index\n",
"\n",
" ``[model_name]`` includes `R50x1`, `R101x1`, `R50x3`, `R101x3`, `R152x4`\n",
"\n",
" .. highlight:: bash\n",
" .. code-block:: bash\n",
"\n",
" .\n",
" ├── saved_model.pb\n",
" └── variables\n",
" ├── variables.data-00000-of-00001\n",
" └── variables.index\n",
"\n",
" :param channel_axis: the axis id of the channel, -1 indicate the color channel info at the last axis.\n",
" If given other, then ``np.moveaxis(data, channel_axis, -1)`` is performed before :meth:`encode`.\n",
" \"\"\"\n",
" super().__init__(*args, **kwargs)\n",
" self.channel_axis = channel_axis\n",
" self.model_path = model_path\n",
"\n",
" def post_init(self):\n",
" self.to_device()\n",
" import tensorflow as tf\n",
" _model = tf.saved_model.load(self.model_path)\n",
" self.model = _model.signatures['serving_default']\n",
" self._get_input = tf.convert_to_tensor\n",
"\n",
" @batching\n",
" @as_ndarray\n",
" def encode(self, data: 'np.ndarray', *args, **kwargs) -> 'np.ndarray':\n",
" if self.channel_axis != -1:\n",
" data = np.moveaxis(data, self.channel_axis, -1)\n",
" _output = self.model(self._get_input(data.astype(np.float32)))\n",
" return _output['output_1'].numpy()\n"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "5c_k30DZc-7m",
"colab_type": "text"
},
"source": [
"The key function `encode()` is simply calling the model to extract features. The `batching` decorator is a very handy helper to control the size of the data flowing into the encoder. After all, OOM error is the last thing you want to see. "
]
},
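{
"cell_type": "markdown",
"metadata": {},
"source": [
"To give an idea of what such a helper does, below is a toy, standalone version of a batching decorator. It is not Jina's `batching` implementation; the decorator name, the fixed `batch_size`, and the `np.concatenate` merge are assumptions made for this sketch.\n",
"\n",
"```python\n",
"import functools\n",
"import numpy as np\n",
"\n",
"def toy_batching(batch_size=16):\n",
"    # split the input along axis 0, call the wrapped method on each mini-batch,\n",
"    # and concatenate the per-batch results back together\n",
"    def decorator(func):\n",
"        @functools.wraps(func)\n",
"        def wrapper(self, data, *args, **kwargs):\n",
"            results = [func(self, data[i:i + batch_size], *args, **kwargs)\n",
"                       for i in range(0, len(data), batch_size)]\n",
"            return np.concatenate(results, axis=0)\n",
"        return wrapper\n",
"    return decorator\n",
"\n",
"class ToyEncoder:\n",
"    @toy_batching(batch_size=4)\n",
"    def encode(self, data):\n",
"        return data.mean(axis=(1, 2, 3))[:, None]  # dummy one-dimensional feature\n",
"\n",
"print(ToyEncoder().encode(np.random.rand(10, 96, 96, 3)).shape)  # (10, 1)\n",
"```"
]
},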
{
"cell_type": "markdown",
"metadata": {
"id": "WS4Dp_16bsEw",
"colab_type": "text"
},
"source": [
"### Download Pre-trained BiT model\n",
"\n",
"In this tutorial, we will use a ResNet50x1 model trained on ImageNet-21k. And we are using the feature extractor version of it.\n",
"\n",
"```bash\n",
"#!/usr/bin/env bash\n",
"\n",
"MODEL_NAME=\"R50x1\"\n",
"MODEL_DIR=\"pretrained\"\n",
"MODEL_VAR_DIR=$MODEL_DIR/variables\n",
"mkdir -p ${MODEL_DIR}\n",
"mkdir -p ${MODEL_VAR_DIR}\n",
"\n",
"curl https://storage.googleapis.com/bit_models/Imagenet21k/${MODEL_NAME}/feature_vectors/saved_model.pb --output ${MODEL_DIR}/saved_model.pb\n",
"\n",
"curl https://storage.googleapis.com/bit_models/Imagenet21k/${MODEL_NAME}/feature_vectors/variables/variables.data-00000-of-00001 --output ${MODEL_VAR_DIR}/variables.data-00000-of-00001\n",
"\n",
"curl https://storage.googleapis.com/bit_models/Imagenet21k/${MODEL_NAME}/feature_vectors/variables/variables.index --output ${MODEL_VAR_DIR}/variables.index\n",
"```"
]
},
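{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check, you can load the downloaded `SavedModel` directly and push a dummy batch through it, mirroring what `BiTImageEncoder.post_init()` and `encode()` do above. The `pretrained` directory matches the download script, and `output_1` is the output key used by the encoder.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import tensorflow as tf\n",
"\n",
"# load the feature-extractor SavedModel downloaded into ./pretrained\n",
"_model = tf.saved_model.load('pretrained')\n",
"serve = _model.signatures['serving_default']\n",
"\n",
"# a dummy batch of two 96x96 RGB frames in [0, 1]\n",
"dummy = np.random.rand(2, 96, 96, 3).astype(np.float32)\n",
"features = serve(tf.convert_to_tensor(dummy))['output_1'].numpy()\n",
"print(features.shape)  # (2, feature_dim), e.g. 2048-d for R50x1\n",
"```"
]
},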
{
"cell_type": "markdown",
"metadata": {
"id": "wbx23IZRbVSS",
"colab_type": "text"
},
"source": [
"Note that now all necessary piecese are in `pretrained`. Using BiT as encoder is extremely easy, you only need to write `encode.yml` file as follows:\n",
"```yaml\n",
"!BiTImageEncoder\n",
"with:\n",
" model_path: pretrained\n",
" pool_strategy: avg\n",
"```"
]
},
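{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to smoke-test the encoder on its own, something along the following lines should work, assuming your Jina version exposes `BaseExecutor.load_config` (treat the import path as an assumption and check your installed version). The 96x96x3 input shape matches the preprocessor above.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from jina.executors import BaseExecutor  # assumed import path\n",
"\n",
"# build the encoder from the YAML config above (assumes encode.yml and ./pretrained exist)\n",
"encoder = BaseExecutor.load_config('encode.yml')\n",
"\n",
"chunks = np.random.rand(8, 96, 96, 3).astype(np.float32)  # dummy chunk blobs\n",
"vectors = encoder.encode(chunks)\n",
"print(vectors.shape)  # (8, feature_dim)\n",
"```"
]
},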
{
"cell_type": "markdown",
"metadata": {
"id": "f31Lpf7ZfPJv",
"colab_type": "text"
},
"source": [
"## Indexing Chunks and Documents\n",
"\n",
"For indexing, I will use the built-in chunk indexers and document indexers of Jina. Chunk indexing is essentially vector indexing, we need to store a map of chunk ids and their corresponding vector representations. Simply write a YAML config `chunk.yml` as follows:\n",
"```yaml\n",
"!ChunkIndexer\n",
"components:\n",
" - !NumpyIndexer\n",
" with:\n",
" index_filename: vec.gz\n",
" metas:\n",
" name: vecidx # a customized name\n",
" workspace: $TEST_WORKDIR\n",
" - !BasePbIndexer\n",
" with:\n",
" index_filename: chunk.gz\n",
" metas:\n",
" name: chunkidx\n",
" workspace: $TEST_WORKDIR\n",
"metas:\n",
" name: chunk_compound_indexer\n",
" workspace: $TEST_WORKDIR\n",
"```\n",
"\n",
"As eventually in the query time, we are interested in documents not chunks, hence the map of doc id and chunk ids should be also stored. This is essentially a key-value database, and a simple Python `Dict` structure will do the job. Again, only a YAML config `doc.yml` is required:\n",
"\n",
"```yaml\n",
"!DocPbIndexer\n",
"with:\n",
" index_filename: doc.gzip\n",
"metas:\n",
" name: doc_indexer # a customized name\n",
" workspace: $TEST_WORKDIR\n",
"```\n",
"\n",
"Note that the doc indexer does not require the encoding step, thus it can be done in parallel with the chunk indexer."
]
},
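{
"cell_type": "markdown",
"metadata": {},
"source": [
"Conceptually, the two indexers boil down to a vector store keyed by chunk id plus a key-value map between docs and chunks. The following framework-free toy sketch shows the idea with brute-force cosine search; the in-memory dictionaries and the cosine metric are illustrative assumptions, not what `NumpyIndexer` does internally.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"chunk_vectors = {}   # chunk_id -> vector    (the 'vecidx' part)\n",
"chunk_meta = {}      # chunk_id -> doc_id    (the 'chunkidx' part)\n",
"doc2chunks = {}      # doc_id -> [chunk_id]  (the 'doc_indexer' part)\n",
"\n",
"def index_doc(doc_id, vectors):\n",
"    # store one normalized vector per chunk and remember the owning doc\n",
"    for offset, vec in enumerate(vectors):\n",
"        chunk_id = (doc_id, offset)\n",
"        chunk_vectors[chunk_id] = vec / np.linalg.norm(vec)\n",
"        chunk_meta[chunk_id] = doc_id\n",
"    doc2chunks[doc_id] = [(doc_id, o) for o in range(len(vectors))]\n",
"\n",
"def query_chunks(query_vec, top_k=3):\n",
"    # brute-force cosine similarity over all stored chunk vectors\n",
"    q = query_vec / np.linalg.norm(query_vec)\n",
"    scored = [(cid, float(q @ v)) for cid, v in chunk_vectors.items()]\n",
"    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]\n",
"\n",
"index_doc('doc_a', np.random.rand(5, 2048))\n",
"index_doc('doc_b', np.random.rand(3, 2048))\n",
"print(query_chunks(np.random.rand(2048)))\n",
"```"
]
},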
{
"cell_type": "markdown",
"metadata": {
"id": "JQqXTSAAg79e",
"colab_type": "text"
},
"source": [
"## Putting Everything Together\n",
"\n",
"### Index Flow\n",
"\n",
"```yaml\n",
"!Flow\n",
"with:\n",
" logserver: true\n",
"pods:\n",
" chunk_seg:\n",
" yaml_path: craft/index-craft.yml\n",
" replicas: $REPLICAS\n",
" read_only: true\n",
" doc_idx:\n",
" yaml_path: index/doc.yml\n",
" tf_encode:\n",
" yaml_path: encode/encode.yml\n",
" needs: chunk_seg\n",
" replicas: $REPLICAS\n",
" read_only: true\n",
" chunk_idx:\n",
" yaml_path: index/chunk.yml\n",
" replicas: $SHARDS\n",
" separated_workspace: true\n",
" join_all:\n",
" yaml_path: _merge\n",
" needs: [doc_idx, chunk_idx]\n",
" read_only: true\n",
"```\n",
"\n",
"### Query Flow\n",
"\n",
"```yaml\n",
"!Flow\n",
"with:\n",
" logserver: true\n",
" read_only: true # better add this in the query time\n",
"pods:\n",
" chunk_seg:\n",
" yaml_path: craft/index-craft.yml\n",
" replicas: $REPLICAS\n",
" tf_encode:\n",
" yaml_path: encode/encode.yml\n",
" replicas: $REPLICAS\n",
" chunk_idx:\n",
" yaml_path: index/chunk.yml\n",
" replicas: $SHARDS\n",
" separated_workspace: true\n",
" polling: all\n",
" reducing_yaml_path: _merge_topk_chunks\n",
" timeout_ready: 100000 # larger timeout as in query time will read all the data\n",
" ranker:\n",
" yaml_path: BiMatchRanker\n",
" doc_idx:\n",
" yaml_path: index/doc.yml\n",
"```"
]
},
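{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same flows can also be composed programmatically. The snippet below is a rough sketch of what the index Flow above might look like in Python, assuming the Flow API of the Jina version this notebook targets (the keyword arguments simply mirror the YAML keys, and the replica/shard counts are placeholders for `$REPLICAS`/`$SHARDS`). Treat it as illustrative rather than a drop-in replacement.\n",
"\n",
"```python\n",
"from jina.flow import Flow  # assumed import path for the Flow API\n",
"\n",
"f = (Flow(logserver=True)\n",
"     .add(name='chunk_seg', yaml_path='craft/index-craft.yml', replicas=2, read_only=True)\n",
"     .add(name='doc_idx', yaml_path='index/doc.yml')\n",
"     .add(name='tf_encode', yaml_path='encode/encode.yml', needs='chunk_seg', replicas=2, read_only=True)\n",
"     .add(name='chunk_idx', yaml_path='index/chunk.yml', replicas=2, separated_workspace=True)\n",
"     .add(name='join_all', yaml_path='_merge', needs=['doc_idx', 'chunk_idx'], read_only=True))\n",
"\n",
"# with f:\n",
"#     f.index(input_fn=my_gif_bytes_generator)  # hypothetical generator of raw GIF bytes\n",
"```"
]
},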
{
"cell_type": "markdown",
"metadata": {
"id": "AUfUBnLQif3m",
"colab_type": "text"
},
"source": [
"## Full Example\n",
"\n",
"The full example and results can be [found in here](https://github.com/jina-ai/examples/tree/master/tumblr-gif-search)."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "WGf52j1Mg7Ah",
"colab_type": "text"
},
"source": [
""
]
}
]
}