Video2Token

In this directory, we provide code and instructions for preparing datasets for training. In particular, we cover the following aspects: (1) preparing and preprocessing videos and annotations with MMAction2; (2) discretizing videos using the DALL-E VQ-VAE.

Prepare and Preprocess Videos and Annotations

In this section, we go through the process of downloading videos and annotations and preprocessing them into the required formats. This section does not use any code from this repo; we rely entirely on MMAction2.

Requirements

Install MMAction2 following the official instructions. For your convenience, we summarize the procedure below. If you encounter any issues, please refer to the official instructions for additional information.

# 1, create conda environment
conda create -n open-mmlab python=3.7 -y
conda activate open-mmlab
# 2, install PyTorch 
conda install pytorch torchvision -c pytorch
# 3, install MMAction2 and its dependencies via mim package manager
# this may take a while
pip install git+https://github.com/open-mmlab/mim.git
mim install mmaction2
# 4, clone MMAction2, and cd into the project directory. 
# This would be your default working directory.
git clone https://github.com/open-mmlab/mmaction2.git
cd mmaction2
git checkout 0a6fde1abb8403f1f68b568f5b4694c6f828e27e
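
To verify the setup (an optional check on our part, not part of the official instructions), you can confirm that PyTorch and MMAction2 import correctly inside the environment:

# optional sanity check inside the open-mmlab environment
python -c "import torch; print('torch', torch.__version__)"
python -c "import mmaction; print('mmaction2', mmaction.__version__)"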

Getting Started

All the downstream datasets used in our project, including Sthv2 (SSV2), Kinetics-400, Diving48, HMDB51, and UCF101, can be found in the Supported Datasets section. Please follow these instructions to prepare the videos and annotations. Here we use UCF101 as an example to illustrate the process.

  1. Download videos and annotations

    cd tools/data/ucf101/
    bash download_annotations.sh
    bash download_videos.sh  

    You may need to install unrar first to decompress the video files: sudo apt-get install unrar. The annotations and videos will be stored at $MMAction2/data/ucf101 by default. If you prefer to store the data somewhere else, you can soft-link your directory to $MMAction2/data with ln -s /path/to/your/dir $MMAction2/data.

  2. Generate video file list

    bash generate_videos_filelist.sh

    This generates files that contain the list of video file paths and their labels (an example line is shown right after the directory structure below). Note that our video data loader reads raw videos directly, so there is no need to extract RGB frames.

  3. Check Directory Structure

    After the first two steps, you should expect to see the following folder structure:

    mmaction2
    ├── mmaction
    ├── tools
    ├── configs
    ├── data
    │   ├── ucf101
    │   │   ├── ucf101_{train,val}_split_{1,2,3}_rawframes.txt
    │   │   ├── ucf101_{train,val}_split_{1,2,3}_videos.txt
    │   │   ├── annotations
    │   │   ├── videos
    │   │   │   ├── ApplyEyeMakeup
    │   │   │   │   ├── v_ApplyEyeMakeup_g01_c01.avi
    │   │   │   ├── ...
    │   │   │   ├── YoYo
    │   │   │   │   ├── v_YoYo_g25_c05.avi
    
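
For reference, each generated *_videos.txt file lists one video per line as a path relative to the videos directory, followed by an integer class label. The exact label indices come from the annotation files; the line below is only an illustrative example:

# illustrative line from ucf101_train_split_1_videos.txt (label index is hypothetical)
ApplyEyeMakeup/v_ApplyEyeMakeup_g01_c01.avi 0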

For pretraining videos, i.e., HowTo100M videos, please follow the instructions on the official website to download them.

Discretize Videos with DALL-E VQ-VAE

In this section, we detail the process of converting raw videos into discrete tokens via DALL-E VQ-VAE.

Requirements

# 1, create conda environment
conda create -n dalle python=3.7 -y
conda activate dalle
# 2, install PyTorch 
conda install pytorch torchvision -c pytorch
# 3, install DALL-E
pip install DALL-E
# 4, install ffmpeg and other dependencies
conda install -c conda-forge ffmpeg
pip install ffmpeg-python lmdb tqdm
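
As an optional sanity check (again, not part of the original instructions), you can confirm that ffmpeg and the DALL-E package are visible from the new environment:

# optional sanity check inside the dalle environment
ffmpeg -version | head -n 1
python -c "import torch; from dall_e import load_model, map_pixels; print('DALL-E package OK')"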

Getting Started

We provide an easy-to-use script, extract_tokens.sh, to extract VQ-VAE tokens from raw videos. The extracted tokens are stored in LMDB files. Below is a template for using this script.

# at project root
export VIDEO_ROOT=/path/to/video/root # this will be $MMAction2/data
export TOKEN_ROOT=/path/to/tokens  # somewhere to save the VQ-VAE tokens
export CUDA_VISIBLE_DEVICES=0  # only a single GPU is supported.
bash video2token/scripts/extract_tokens.sh DATASET_NAME SPLIT_NAME FRAME_SHORTER_SIDE CROP_SIZE CROP_TYPE USE_HFLIP FPS

The input arguments are:

Argument           | Definition                | Common Values
DATASET_NAME       | dataset name              | kinetics400, ucf101, hmdb51, diving48, sthv2
SPLIT_NAME         | split name                | train or val for [kinetics400, sthv2]; train_val for [ucf101, hmdb51, diving48]
FRAME_SHORTER_SIDE | frame shorter-side length | 128, 160, 256, 320
CROP_SIZE          | output frame crop size    | 128 for FRAME_SHORTER_SIDE in [128, 160]; 256 for [256, 320]
CROP_TYPE          | crop location             | top, center, bottom
USE_HFLIP          | use horizontal flip       | 0, 1
FPS                | frames per second         | integer, e.g., 2 or 4

To extract VQ-VAE tokens for UCF101 videos (train_val split) at 2 FPS, using a 256x256 center crop from frames whose shorter side is resized to 320 (keeping the aspect ratio), run:

# at project root
export VIDEO_ROOT=/path/to/video/root # this will be $MMAction2/data
export TOKEN_ROOT=/path/to/tokens  # somewhere to save the VQ-VAE tokens
export CUDA_VISIBLE_DEVICES=0  # only single-GPU extraction is supported.
bash video2token/scripts/extract_tokens.sh ucf101 train_val 320 256 center 0 2

The extracted tokens will be stored in an LMDB file located at $TOKEN_ROOT/dalle_ucf101_train_val_fps2_hflip0_320center256, and they are ready to be used for training.
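
As a final check, you can list the output path. Assuming the script writes LMDB in its default subdirectory layout (an assumption on our side), you should see the standard LMDB files:

# inspect the extracted token database (directory layout is an assumption)
ls $TOKEN_ROOT/dalle_ucf101_train_val_fps2_hflip0_320center256
# expected: data.mdb  lock.mdb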

Acknowledgement

This code uses resources from MMAction2, DALL-E, video_feature_extractor, and ffmpeg-python, and is implemented in PyTorch. We thank the authors for open-sourcing their awesome projects.